WEBVTT

00:00.210 --> 00:00.690
All right.

00:00.690 --> 00:02.220
Welcome to the next section.

00:02.250 --> 00:08.010
So what we're going to do here is basically take the schema design we came up with in the last section

00:08.010 --> 00:15.000
for roughly Instagram ish application with photos and users and comments and tags and followers and

00:15.090 --> 00:19.250
all of that we're going to take that schema and implemented.

00:19.440 --> 00:23.550
And I'm going to give you a bunch of data to put in your database like a couple of thousand of things

00:23.550 --> 00:24.350
to enter.

00:24.570 --> 00:30.540
I wrote a giant file and what we'll do is put that in there so we all have the same starting place and

00:30.540 --> 00:34.440
then we're going to pretend that we're the company and we're trying to figure out certain things like

00:34.620 --> 00:42.000
Do we have too many bots on our application or their user accounts that are inactive or their user accounts

00:42.000 --> 00:43.950
that have never posted a photo.

00:43.950 --> 00:50.190
Maybe we could send them an e-mail campaign you know encouraging them to post a photo or hash tags tend

00:50.190 --> 00:51.600
to be most successful.

00:51.630 --> 00:52.490
That kind of stuff.

00:52.650 --> 00:59.100
So all of this section every video will be asking a question which is basically a nice way of saying

00:59.130 --> 01:00.880
open an exercise.

01:00.990 --> 01:06.210
But we're now posing it rather than just giving you a table which is what we've been doing where you

01:06.210 --> 01:09.000
know it's a table and I just ask you to recreate this.

01:09.060 --> 01:14.040
I'm now going to pose the question as if we work at a fictional company and we're trying to find something

01:14.040 --> 01:14.560
out.

01:14.580 --> 01:19.920
For example it might be something like we're running a new ad campaign and we need to figure out which

01:19.920 --> 01:22.600
five hashtags get the most likes.

01:22.680 --> 01:23.800
Something like that.

01:24.270 --> 01:30.220
So that's how this will work every video on the section will be a question and rather than having all

01:30.220 --> 01:35.010
of the solutions at the end making a giant video that takes like half an hour to go through.

01:35.010 --> 01:36.070
I'm just going to break them up.

01:36.090 --> 01:40.480
So every video will have the question at the beginning and then a following solution.

01:40.830 --> 01:47.280
But the first thing we need to do is insert all of our data so included in this section is this file

01:47.280 --> 01:50.170
here starter data as well.

01:50.580 --> 01:54.750
And there's two components the first one is setting up our schema.

01:55.020 --> 02:00.630
And there's nothing new here we talked about all of this in the last section but if you were following

02:00.630 --> 02:04.560
along and doing it yourself you may have named something slightly differently.

02:04.740 --> 02:07.460
Like instead of photo tags you may have called that tagging.

02:07.680 --> 02:14.280
Or instead of you know I don't know instead of follow me and follower and follows Maybe you call that

02:14.820 --> 02:17.090
following or relationship or something.

02:17.280 --> 02:23.460
So in order to give you this data here to insert I needed to make sure that we all have the same schema

02:23.790 --> 02:25.620
even if just one thing is named differently.

02:25.620 --> 02:26.840
It's a problem.

02:27.450 --> 02:30.960
So the first part of this is just a bunch of create tables.

02:30.960 --> 02:36.630
All the tables are discussed as well as dropping the existing Instagram clone database creating a new

02:36.630 --> 02:42.120
database and then using it if you want to if you want to keep your current Instagram clone database.

02:42.120 --> 02:47.820
If you've been working with that just give us a different name whatever you want but you need to do

02:47.820 --> 02:49.830
it to all three of these.

02:50.130 --> 02:50.370
OK.

02:50.370 --> 02:55.890
So we set up the schema but then the fun part which is all of these Insearch statements and it doesn't

02:55.890 --> 03:01.130
look that impressive when you look at it in cloud 9 because it doesn't wrap them onto new lines.

03:01.440 --> 03:07.190
And if I start scrolling you'll see I'm scrolling really fast but look down here at the progress bar

03:07.200 --> 03:08.760
I've barely made a dent.

03:08.940 --> 03:11.960
And this is not a great way to visualize the data.

03:11.970 --> 03:16.890
So instead here I am in Sublime Text and here's the file that I generated.

03:16.890 --> 03:18.380
It's the same exact file.

03:18.540 --> 03:21.810
It's just displayed all here rather than having to scroll across.

03:21.810 --> 03:23.070
I have to scroll down.

03:23.760 --> 03:30.060
So you know we won't do any sort of grand tour of the data but I will say this I spent a lot of time

03:30.060 --> 03:32.910
trying to figure out how I wanted this to work.

03:32.910 --> 03:34.680
Obviously I didn't type this myself.

03:34.770 --> 03:43.740
I would have taken days probably Instead I wrote some code which still took probably an entire day and

03:43.740 --> 03:48.350
a half to get it this how I wanted because I didn't want just randomized data.

03:48.630 --> 03:54.270
If I just wanted to insert a bunch of random users and posts and comments that have no meaning to it

03:54.750 --> 04:01.680
I could have done that pretty quickly but instead I tried to design this so that there is certain personas

04:02.130 --> 04:07.360
to our to our users so some users post a lot so they have a lot of comments a lot of posts.

04:07.410 --> 04:13.740
Some users are lurkers meaning that they don't really do anything except look at people maybe like certain

04:13.740 --> 04:14.430
things.

04:14.430 --> 04:16.670
At best they don't comment and they don't post.

04:16.680 --> 04:22.890
Then we have things like bots which on Instagram are a real problem where bots are accounts that exist

04:23.130 --> 04:26.220
basically to go out and like a bunch of things.

04:26.220 --> 04:32.040
Comments on things usually the same exact comment but they don't post things and then we also have like

04:32.040 --> 04:36.050
celebrity accounts who have very few people that they actually follow.

04:36.060 --> 04:42.330
But they have tons of followers and they're not very active with commentating or liking with very active

04:42.330 --> 04:44.210
with posting.

04:44.290 --> 04:48.480
So then we've got things like hashtags and I didn't want to just do a bunch of random hash tags.

04:48.490 --> 04:52.100
I tried to actually come up with combinations of hash tags.

04:52.100 --> 04:53.110
It would be common.

04:53.110 --> 05:00.880
So you know things like smile and style and fashion might be together often but then we might also have

05:01.000 --> 05:04.320
smile and party or concert.

05:04.320 --> 05:11.320
Those might go together or we might have you know landscape and sunrise and sunset and beach could be

05:11.320 --> 05:13.670
a combination for a beach photo at sunset.

05:13.840 --> 05:19.890
But then we could also have beach night with party or beach with beauty or something like that.

05:20.140 --> 05:22.590
So that's just kind of scratching the surface.

05:22.600 --> 05:26.890
But I tried to design a lot of that in there so that we had somewhat real data.

05:26.930 --> 05:34.010
With that said things like comments the text of a comment is just random Lorem Ipsum text.

05:34.210 --> 05:39.070
So there's nothing there you know that's going to be meaningful about the comment itself.

05:39.400 --> 05:43.240
What's more meaningful is the relationship between the comment and who posted it.

05:43.260 --> 05:47.320
And if they have a lot of comments or they don't have many comments and that sort of thing.

05:47.650 --> 05:49.380
OK so that's enough of that.

05:49.410 --> 05:55.100
So even with all the effort that I've put in here there's still only you know 100 or so or exactly 100

05:55.150 --> 06:00.110
users about 250 something photos.

06:00.220 --> 06:05.380
But then when we start talking about likes there's thousands and comments there's tons.

06:05.380 --> 06:11.520
And following follows whatever recalled it there's tons and tags there's only maybe 20 hashtags.

06:11.530 --> 06:18.610
But then when you apply them to all of those photos 200 photos we get thousands to thousands of instances.

06:18.610 --> 06:25.720
So all you have to do is get that into a file and then execute that file scroll back over and take too

06:25.720 --> 06:26.330
long.

06:26.410 --> 06:28.540
Oh boy almost there.

06:28.540 --> 06:29.170
Here we go.

06:29.440 --> 06:33.060
So execute this file and I'll just do that now.

06:33.310 --> 06:43.630
With source Instagram slash starter data and that should just be the path that corresponds to wherever

06:43.630 --> 06:46.050
that file is and whatever you've named it.

06:46.520 --> 06:51.030
OK so it takes a little bit compared to what we've seen so far.

06:51.030 --> 06:55.480
So if you take a look we've got you know a bunch of tables being created seven I think.

06:55.620 --> 07:02.670
And then a bunch of things being inserted 100 257 seven thousand six hundred twenty three 7400 eighty

07:02.690 --> 07:07.240
eight eight thousand seven hundred eighty two 21 and then 501.

07:07.440 --> 07:10.690
So this is still a very small dataset compared to.

07:10.860 --> 07:14.180
I mean tiny miniscule compared to Instagram.

07:14.310 --> 07:17.350
This is maybe like a minute of Internet activity.

07:17.370 --> 07:20.370
Probably less than that a couple of seconds.

07:20.370 --> 07:24.660
As far as the number of photos and stuff but it's still an order of magnitude greater than what we've

07:24.660 --> 07:26.130
been working with.

07:26.130 --> 07:31.650
One other thing that I'll show you is that we haven't really seen giant amounts of data like this.

07:31.650 --> 07:39.720
And if you try and do it to do a select star from likes which I think we have a lot of you actually

07:39.720 --> 07:46.140
if you if you start to scroll through it you'll see that we're not going to be able to see all of them.

07:46.650 --> 07:53.460
There is a number that we that the terminal or that the Seelye will print out for us and then it just

07:53.460 --> 07:56.160
gives up or it decides not to display all of them.

07:56.160 --> 07:57.910
That doesn't mean they're not there.

07:57.930 --> 07:59.940
So if he did account for example

08:03.880 --> 08:07.350
count star from like we get eight thousand.

08:07.390 --> 08:12.970
And if you you know join them with another table it will join all of them it just won't display everything.

08:12.970 --> 08:17.450
So for some reason you needed to display every single like it would be better not to do it using the

08:17.530 --> 08:22.850
CSI but to do it from you know another file or some code somewhere.

08:22.920 --> 08:26.080
Anyways so that's it we've got the data set up if you'd like.

08:26.080 --> 08:30.440
Play around that explored a bit and then we'll start with our fun challenges in the next video.
