WEBVTT

00:00.120 --> 00:06.510
And finally we've arrived at our last entity to store which is hash tags or tags and we save this one

00:06.510 --> 00:12.620
for last because as I mentioned in the intro video it's the hardest I guess.

00:12.750 --> 00:15.600
Well it doesn't have to be but there's a couple different solutions.

00:15.600 --> 00:20.540
We could go about and I wanted to have a discussion about which is best if there is one answer.

00:20.550 --> 00:24.960
So let's step through three potential structures that we could use.

00:25.140 --> 00:31.280
And remember the way that this works is that one photo can have you know 10 20 50 hash tags.

00:31.290 --> 00:33.680
I do think there is a limit on Instagram.

00:33.680 --> 00:41.140
I mean you can keep adding them as many as you want but only 38 believe actually will show up or it's

00:41.160 --> 00:46.520
not that they won't show up but it's that when you search for a hash tag your photo will not show up.

00:46.800 --> 00:48.150
If you use less than 30.

00:48.150 --> 00:54.330
Anyway as for getting off track here but the idea is you can add hash tags to a photo but then you can

00:54.330 --> 00:59.140
also you know go click on a hash tag and see all the associated photos with it.

00:59.370 --> 01:06.180
So here's one way we could structure it which might seem maybe the least obvious to you which would

01:06.180 --> 01:14.310
be to do it like this where we actually just have our photos table but we add a column called tags and

01:14.340 --> 01:21.560
it's a string Savarkar that just has our tags and we can separate them by the hash sign the Octa Thorpe

01:21.560 --> 01:25.730
or by spaces or dashes or something but we're just storing them together.

01:26.040 --> 01:28.790
And this could work in places to use this.

01:28.800 --> 01:34.140
I've seen it suggested at least I haven't seen it in practice but you can imagine how would work you

01:34.140 --> 01:39.570
need to add a new tag into something you'd basically just take all the other tags and then concatenate

01:39.570 --> 01:41.280
in at the end a new tack.

01:41.280 --> 01:47.070
Or at the beginning and then you could have an order to you know figure out which ones were posted first

01:47.070 --> 01:51.560
I guess based off of that order that they're in and it's pretty simple.

01:51.600 --> 01:57.800
You know if you want to find all photos that are tagged with cute you just basically have to use the

01:57.800 --> 02:04.770
like where tags like cute but then as you can see down here there are a couple problems.

02:04.770 --> 02:07.740
Let's start with an advantage which is that it's very easy to implement.

02:07.740 --> 02:11.630
You don't need any other tables but there are a couple of disadvantages.

02:11.850 --> 02:17.520
One is that there's a limit to how many tags you can have in there which is the length of you know whatever

02:17.520 --> 02:22.450
the column is of a bar char whatever the maximum length is you can only put so many in there.

02:23.430 --> 02:28.920
The other problem or any other problem actually is that you can't store additional information like

02:29.370 --> 02:32.610
you know if we want to store the first time a hash tag was used.

02:32.990 --> 02:39.960
If I want to be able to see the first person to use the hash tag whatever cat or something I'm not storing

02:39.960 --> 02:45.800
the time that something was tag because you know on Instagram you can add hash tags after the fact.

02:45.840 --> 02:51.290
Sure they're storing that when they're tagged so that they can track trend as hash tags come and go.

02:51.330 --> 02:54.180
So it doesn't allow us to store additional information like that.

02:54.450 --> 02:59.910
And then the other thing that's kind of important is that you have to be careful when you're searching

02:59.910 --> 03:00.590
for stuff.

03:00.750 --> 03:06.360
So if I was trying to find the hash tag food and wanted to see nice food beautiful foods with the hash

03:06.360 --> 03:14.400
tags just food well it will return this which has said food if I'm just doing like food so I'd have

03:14.400 --> 03:21.070
to be careful and I need to add spaces maybe to make it easier to separate things out.

03:21.090 --> 03:23.580
Basically you have to be careful in those ways around it.

03:23.580 --> 03:26.810
We've seen them before how you used search with like.

03:26.970 --> 03:32.490
But the point is this is an easy solution but it's not going to be the best and it's not the best for

03:32.490 --> 03:34.130
us right now.

03:34.140 --> 03:37.220
So another solution is to use two tables.

03:37.380 --> 03:40.810
So we have our photos as our photos table currently is now.

03:40.820 --> 03:45.360
So then we have a tag's table where we have our tag name.

03:45.360 --> 03:48.890
In this case cute and then a corresponding photo ID.

03:49.080 --> 03:54.230
So you know if this photo with idea of one is tag's cute we put it there.

03:54.690 --> 03:58.110
And then a photo with ID 3 which is this selfie here.

03:58.530 --> 04:04.860
It's also tagged cute and then photo-ID too is tagged with a microwave and three with ego through smile

04:04.900 --> 04:05.890
too with gross.

04:05.970 --> 04:12.480
So as you can see we can have a single photo with multiple tags like here where we have this selfie

04:12.810 --> 04:15.270
D-3 is tagged with the smile.

04:15.270 --> 04:21.560
Ego and cute and we can also have tags that are applied to multiple photos like cute.

04:21.580 --> 04:24.200
It's applied to the first photo and the third photo.

04:24.570 --> 04:26.590
So this works.

04:26.590 --> 04:28.380
There's unlimited number of tags.

04:28.470 --> 04:29.970
You know that's an advantage.

04:29.970 --> 04:35.880
We don't have to worry about how many can fit into the string but a disadvantage is that for one thing

04:35.970 --> 04:41.850
we are storing you know these tagged names over and over and over and these are strings and that it's

04:41.850 --> 04:46.970
maybe not ideal to store all these different strings and have duplicated data but it's not a big deal.

04:46.990 --> 04:48.600
Duplication isn't the problem.

04:48.600 --> 04:53.640
Really what it comes down to is that this is actually slower than the previous solution when it comes

04:53.640 --> 04:57.250
to things like inserting or updating or deleting.

04:57.420 --> 05:01.240
But then another problem is that it's actually slower than the previous solution.

05:01.380 --> 05:06.330
So in just a moment I can show you a comparison that I didn't do but there's a great article that I

05:06.330 --> 05:12.570
found where somebody does a speed test between these three solutions to figure out when they perform

05:12.570 --> 05:13.470
best.

05:13.470 --> 05:22.110
So this is not one we're going to use either but solution 3 is which maybe you've got it involves three

05:22.110 --> 05:23.090
tables.

05:23.220 --> 05:25.960
So we have our photos table unchanged.

05:26.280 --> 05:32.660
We have a different tag's table so this tax table is just a tag name and an ID.

05:33.300 --> 05:40.230
And then we have a middle table a join table called Photo tags or tag photos or tagging or something.

05:40.230 --> 05:45.000
Whatever you want to call it which is an instance of a hash tag being applied to a photo and all that

05:45.000 --> 05:50.430
it is a photo ID and tag ID and that's it.

05:50.430 --> 05:59.040
So in this case you know photo I.D. one is my cat is being tagged with cute and then the same photo

05:59.040 --> 06:01.560
My cat is being tagged with pet.

06:01.770 --> 06:03.810
And then we have you know we can keep going.

06:03.900 --> 06:10.230
My meal is being tagged with microwave and my meal is being tagged with gross and so on.

06:10.230 --> 06:12.330
So we have these two columns.

06:12.340 --> 06:13.940
That's pretty much it.

06:14.010 --> 06:19.410
The advantage again just like the previous one there's an unlimited number of tags and we can also add

06:19.440 --> 06:20.400
additional information.

06:20.400 --> 06:26.670
So if I wanted to you know when a tag is created the first time over here I could store the first time

06:26.670 --> 06:33.840
it was created and then over here every time it subsequently used I could store the time it was used

06:33.840 --> 06:34.730
or something.

06:35.070 --> 06:41.170
And so I'm sure there's some interesting data you could garner from that or the location of where you

06:41.170 --> 06:45.930
know where in the world it was used or something like that at the time of the day.

06:45.930 --> 06:47.860
I don't know what day of the week whatever.

06:47.880 --> 06:50.830
And we also have less duplication of the tags over here.

06:51.240 --> 06:53.150
But then there are a couple of problems.

06:53.400 --> 06:56.100
And they really come down to just more work.

06:56.100 --> 07:01.500
So when we're inserting something if it's a first time it has been used and it's not in the database

07:01.500 --> 07:07.830
yet we have to create it and then we have to associate it with a photo using photo tags.

07:07.830 --> 07:10.230
And the same thing when updating we have to

07:12.950 --> 07:16.760
what she just vomited anyway.

07:17.040 --> 07:19.080
And then we also have to worry about orphans.

07:19.080 --> 07:24.690
We should always all be worried about orphans but in particular when we're talking about these tables

07:25.020 --> 07:30.740
when we're deleting something right if we delete a tag for some reason which Instagram does you know

07:30.750 --> 07:37.710
if there's a problem something is being misused her somehow being you know I don't know controversial.

07:37.950 --> 07:39.200
They can be removed.

07:39.390 --> 07:43.440
And so then we just have to make sure when we're removing it from here we also need to remove it from

07:43.770 --> 07:47.730
all associated photo tags so it can be a bit more complicated.

07:47.850 --> 07:54.900
But basically it comes down to this the cost of having this nicer structure is that it's just a little

07:54.900 --> 08:01.890
more work and just as an end note here I did mention earlier that there is this great article online.

08:02.050 --> 08:09.660
This guy who does speed test of different ways of doing tagging and it's written up include the link

08:10.680 --> 08:12.330
along with this video if you want to read it.

08:12.480 --> 08:14.270
I'll just spend 10 seconds talking about it.

08:14.280 --> 08:22.110
But basically he does some tests here to show that depending on the size of your data set this example

08:22.110 --> 08:23.640
with a smaller data set.

08:23.730 --> 08:28.950
The second and third approaches I showed you with multiple tables are actually faster but then we get

08:29.010 --> 08:33.840
a larger dataset with thousands of tags it actually changes.

08:33.840 --> 08:35.950
So it's just interesting to note that and to read.

08:36.150 --> 08:41.370
But also he talks about that you know how much time it takes to insert compared to you know compare

08:41.370 --> 08:43.720
to cross the three different versions.

08:43.740 --> 08:49.020
And it's just an interesting article but if you actually look at his conclusion section which is probably

08:49.020 --> 08:56.930
what most of you care about where said is that he actually thinks that it would be best to have if you're

08:56.930 --> 08:59.560
working on a huge app to have two versions.

08:59.900 --> 09:02.210
The first version I sure do.

09:02.240 --> 09:04.270
And the third version I showed you.

09:04.340 --> 09:10.670
And basically they are performed best in different situations and this is getting into the woods a little

09:10.670 --> 09:10.890
bit.

09:10.910 --> 09:16.640
But basically the version that we're going to go with the third one with three different tables is faster

09:16.640 --> 09:23.390
if you're working with common tags things that you know are used often but if you're searching or you're

09:23.390 --> 09:30.320
trying to do joints with rare rare is the wrong word but uncommon tanks it just aren't used very often

09:30.680 --> 09:35.630
then it's actually slower and it would be faster to do the first version I showed you where you store

09:35.630 --> 09:37.320
them as text.

09:37.400 --> 09:44.240
So this guy is suggesting to use a combination of the two which we're not going to get into but just

09:44.300 --> 09:50.480
interesting to know that there isn't one best way and that it really comes down to your needs and what

09:50.480 --> 09:53.390
your app needs the product needs dictate.

09:53.480 --> 10:00.530
And in all likelihood as an athlete Instagram grows especially when it started out to where it is now

10:01.070 --> 10:07.640
its schema its database has undergone multiple big overhauls and changes probably not just the schema

10:07.640 --> 10:09.940
itself but the database.

10:10.280 --> 10:16.040
Whether its going from something like my ask you all to another type of database or a bigger change

10:16.040 --> 10:21.130
going from a relational database like my ice cube to a non-relational database.

10:21.200 --> 10:26.220
So its really hard to say you know what's best but it's interesting to take a look at somebody who's

10:26.240 --> 10:31.970
actually crunched the numbers and even after doing that can't necessarily give you know a definitive

10:31.970 --> 10:34.280
answer because the answer is it depends.
