WEBVTT

00:00.220 --> 00:06.900
OK so next up we're talking about something called GROUP BY AND group is going to be essential to everything

00:06.900 --> 00:09.250
we do in the rest of this section.

00:09.480 --> 00:14.210
And I'll warn you right now take a deep breath that it's a little bit weird.

00:14.220 --> 00:16.520
It's different than the things that we've seen so far.

00:16.680 --> 00:24.730
And in fact for me personally it's a challenging one to teach because like like I just said it is different.

00:24.780 --> 00:29.520
It's not just a single little function that we can throw in there like concat or upper where we get

00:29.520 --> 00:36.180
a nice immediate impact where we can see OK that was upper case or that was concatenated or that's account

00:36.900 --> 00:41.850
rather we have to use it alongside some other functions.

00:41.850 --> 00:46.410
So it's going to take a little bit to explain so hang in there and just know that we're going to get

00:46.410 --> 00:48.410
there there is a payoff and it is very useful.

00:48.420 --> 00:50.800
It's kind of at the core of aggregating our data.

00:51.060 --> 00:57.720
It's how we can do things eventually like average or some or find the minimum or the maximum and so

00:57.720 --> 00:58.320
on.

00:58.320 --> 01:07.210
So here we go this is a little blurb that I wrote about group by group by summarizes or aggregates identical

01:07.210 --> 01:09.200
data into single rows.

01:09.280 --> 01:15.010
So it's actually pretty self-explanatory the named group by we can say things like take all of our books

01:15.190 --> 01:21.880
and group them by the year they were released or group them by the author's last name or take all of

01:21.880 --> 01:28.360
our t's or actually say take all the movies in our database and group them by genre and that on its

01:28.420 --> 01:34.240
own is not that useful but then we can say things like group them by genre and tell me how many each

01:34.240 --> 01:42.600
genre has or group are Arty's by a variety like green tea white tea black tea and so on group are teased

01:42.610 --> 01:49.510
by Variety and find the average sales price for each one for each variety that is or add together all

01:49.510 --> 01:54.480
the sales for each variety and figure out which the type is selling best.

01:54.490 --> 01:55.840
So we'll get there in time.

01:55.840 --> 01:59.540
But for now let's start with a little example again.

01:59.560 --> 02:00.260
Here's a reminder.

02:00.280 --> 02:01.530
Stay with me please.

02:02.910 --> 02:06.580
Let's say we're working with a very small subset of our book data.

02:06.780 --> 02:08.720
So I just grabbed five books.

02:08.790 --> 02:14.520
So when we do select the title and author L name from books right now we'll get 19 in our actual database

02:14.520 --> 02:18.480
but this is a fictional database where we get five books.

02:18.480 --> 02:24.570
We've got the namesake by the Henry Norse mythology by Neil Gaiman's American Gods final game an interpreter

02:24.570 --> 02:28.810
of maladies the Harry and a hologram for the King by Dave Eggers.

02:28.810 --> 02:33.780
All right so we get these five books out when we just select just like that we don't do anything fancy.

02:33.780 --> 02:35.980
We're just getting title An author last name.

02:36.360 --> 02:38.390
So then we introduce group.

02:39.690 --> 02:44.940
And we'll talk about the syntax and a little bit but all that we've done this is the exact same lines

02:44.970 --> 02:45.770
before.

02:46.100 --> 02:50.480
So like title An author last name from books and we've added on at the end.

02:50.480 --> 02:52.310
Group by author Al-Naimi.

02:52.650 --> 02:56.240
And I've just done it on the second line just to make it so that this is easier to read.

02:56.240 --> 03:01.170
Again it doesn't matter so group by author last name and what we get.

03:01.400 --> 03:05.430
Books like this which is bizarre.

03:05.940 --> 03:12.110
It's probably not what you would expect it's not what to me makes sense on a visual level at least.

03:12.210 --> 03:17.580
It does make sense when you learn what's happening but on a visual level it looks like basically it's

03:17.580 --> 03:24.570
only just giving us the unique authors like a hero game in and Eggers and then just picking their first

03:24.570 --> 03:31.950
book title right to we're getting the namesake for the theory Norse mythology for game and the hologram

03:31.950 --> 03:33.050
for the king for Aggers.

03:33.060 --> 03:37.600
But then we're missing these two books and that's that is what's happening.

03:37.730 --> 03:40.140
What we're seeing here which isn't really useful.

03:40.140 --> 03:41.750
Why would you want that.

03:41.790 --> 03:48.640
So what's actually happening is that even though it just prints out the namesake by Lahiri it's basically

03:48.640 --> 03:51.650
a grouping based off of last name.

03:52.080 --> 04:00.870
And so it creates one big let's call it a super hero with all of his information or all of the books

04:01.080 --> 04:03.020
written by Lahiri.

04:03.090 --> 04:10.290
So that's the namesake and interpreter of maladies and it treats that as one row and it does the same

04:10.290 --> 04:12.080
thing for Neil Gaiman.

04:12.900 --> 04:17.630
And then the same thing for Aggers and there's only one Dave Eggers book in our dataset.

04:17.760 --> 04:24.020
So even though this is all that we see behind the scenes they are grouped together in these super mega

04:24.030 --> 04:30.150
whatever we want to call them rose and that's important because the data is now grouped and we can do

04:30.150 --> 04:32.260
things like this.

04:32.460 --> 04:35.020
Count how many books each author has written.

04:37.490 --> 04:39.680
So our query would look something like this.

04:39.830 --> 04:42.260
Select author last name comma.

04:42.320 --> 04:45.190
We don't have to have this here either but it's like the author's last name.

04:45.190 --> 04:48.900
Comma count star from books.

04:49.220 --> 04:50.980
Groups by author last name.

04:51.260 --> 04:54.860
So it's going to group them like we've already seen.

04:54.860 --> 05:01.880
And then when we do account star it's not going to count every row in our entire table but it's counting

05:02.320 --> 05:04.980
is for each of these mega rows.

05:05.180 --> 05:07.340
How many sub rows are in there.

05:07.340 --> 05:11.980
So basically how many things are under the here and in this case there's two.

05:12.170 --> 05:16.420
How many things how many books are grouped under Geymann two.

05:17.420 --> 05:20.480
And how many are under Eggers 1.

05:20.680 --> 05:26.710
So then what we get spit out when we're actually selecting the author's last name and the count is this

05:27.830 --> 05:31.770
Lahiri to Geymann to Eggers.

05:32.260 --> 05:34.430
So we're only getting started here with group by.

05:34.630 --> 05:37.630
I'm going to now dive into cloud 9 and do some more.

05:37.990 --> 05:42.570
But just know that we will be using Group By to do things like average.

05:42.670 --> 05:46.930
So once we have that say group by authors we could average the page count.

05:46.930 --> 05:51.550
Like I said or the average year that they wrote a book or we could find that the first year they wrote

05:51.550 --> 05:56.920
a book using men which you will see in just a moment or you could sum the total page count for instance

05:57.250 --> 05:58.490
for every author.

05:58.930 --> 06:01.300
So group is very important and versatile.

06:01.360 --> 06:05.120
For now we're only using it in conjunction with count.

06:05.170 --> 06:08.800
So now let's dive into cloud 9 and do just that.

06:08.800 --> 06:13.000
So let's start with something simple like select title comma.

06:13.600 --> 06:24.310
Author L name you actually first name and author last name from books so no count or anything like that

06:24.310 --> 06:25.230
yet.

06:25.300 --> 06:27.940
So we have a much bigger dataset right.

06:27.940 --> 06:30.430
Let's do a group by now.

06:32.850 --> 06:33.420
Group By.

06:33.440 --> 06:34.170
And will do.

06:34.170 --> 06:36.030
Author last name.

06:36.030 --> 06:37.760
And we're going to just do the exact same thing.

06:37.770 --> 06:43.380
I want to know how many books each author has written and we will run into problems and just telling

06:43.380 --> 06:44.070
you now with this.

06:44.070 --> 06:44.820
HARRIS Right.

06:44.820 --> 06:49.100
Because we have two different Harris's Frieda Harris and Dan Harris.

06:49.170 --> 06:51.780
But when we grouped by last name it just treats them as one.

06:51.810 --> 06:57.770
So it will say Harris has two books which technically is true the last name Harris has two books.

06:58.080 --> 07:02.940
OK so if we do this and just hit enter it's not useful.

07:02.940 --> 07:04.760
You pretty I don't know why you would ever do that.

07:04.770 --> 07:10.610
Honestly it just is giving you sort of a preview of the mega rows if you will.

07:10.650 --> 07:16.270
Basically it's just taking the first book by Carver the first book by Harris but we don't see how they're

07:16.290 --> 07:17.550
actually grouped.

07:17.550 --> 07:20.760
So it's not that useful it's not useful at all honestly.

07:20.760 --> 07:29.550
So what we want to do is rather than just title author first name author last name let's do count star

07:29.880 --> 07:35.010
and remember Count star in this case is referring to the group rose.

07:35.430 --> 07:40.870
So it's going to go to each of those group froze and count how many rows have been groups together.

07:41.680 --> 07:43.380
Oh and this is kind of a mess to look at.

07:43.380 --> 07:45.600
So let's actually get rid of title

07:48.770 --> 07:51.550
and just to author first name and last name.

07:51.550 --> 07:52.600
Now we can see.

07:52.720 --> 07:54.160
So Raymond Carver has two.

07:54.190 --> 07:55.600
Don DeLillo has one.

07:55.600 --> 07:57.170
Dave Eggers has three.

07:57.190 --> 07:58.620
David Foster Wallace has two.

07:58.870 --> 08:04.300
And then as I mentioned we have this problem with Harris because it says Dan Harris has too but that's

08:04.300 --> 08:04.770
not true.

08:04.780 --> 08:07.690
Dan Harris has one book Frida Harris has one book.

08:07.960 --> 08:13.780
So to get around that of course we can do the same thing where rather than just doing group by author

08:13.930 --> 08:19.310
name hopefully you've seen that this is something that's kind of standardized across my ass.

08:19.330 --> 08:20.380
Q All we can do.

08:20.380 --> 08:28.050
Author f name as well so it will group where both of these are unique.

08:28.240 --> 08:31.520
So then we do that and everything else is the same.

08:31.540 --> 08:32.880
Raymond Carver has two.

08:32.890 --> 08:36.490
Dave Eggers has three foster Wallace has to go down here.

08:36.490 --> 08:41.210
We now have Dan Harris with one and Frieda Harris with one.

08:41.560 --> 08:42.170
All right.

08:42.310 --> 08:48.280
So just to show you another thing that we could do we could also Group By released here so we could

08:48.280 --> 08:52.330
regroup by title of course but we don't have any books that have the same title but we do have a couple

08:52.330 --> 08:54.310
of books that have the same year.

08:54.310 --> 08:58.810
So let's say we wanted to get a nice printout that shows us in 2017.

08:58.990 --> 09:02.100
One book was released in 2000 16 two books.

09:02.190 --> 09:08.300
Or how many books are released so we can do that pretty easily with just a select.

09:08.650 --> 09:15.730
And let's do the least year comma count star.

09:15.940 --> 09:24.670
So if we just do select release year from books you can see that as an example I think 2003 we've got

09:25.150 --> 09:31.040
at least two 2003's and I know there's a couple more I hope there's a couple more.

09:31.210 --> 09:38.620
So let's say we wanted to condense them altogether and then put a count next to it as well.

09:38.830 --> 09:39.980
We would do that just like this.

09:40.060 --> 09:42.030
So release your from books.

09:42.340 --> 09:52.510
We would group by released year and rather than only selecting the release year we also want to select

09:52.600 --> 09:57.010
count star hit enter.

09:57.020 --> 10:01.170
Now you can see 1945 1 book 2001.

10:01.280 --> 10:03.650
There were actually three releases 2003.

10:03.680 --> 10:05.730
There were two releases.

10:05.730 --> 10:07.250
So you can see that it's working.

10:07.340 --> 10:09.220
And if you want to get really fancy.

10:09.470 --> 10:10.130
Go ahead.

10:10.130 --> 10:11.580
Move on to the next video if you'd like.

10:11.590 --> 10:17.430
But I'm just going to show you how to get fancy and printed out nicely if we can combine it with the

10:17.600 --> 10:18.740
concatenation.

10:18.800 --> 10:20.430
We could do something like this.

10:20.680 --> 10:26.330
So rather than just selecting the count we could do.

10:26.950 --> 10:38.970
Let's see a select can cat and we're going to concatenate the release here comma so as to actually add

10:38.970 --> 10:46.260
a string before so I'd like it to say like in 2010 two books released in one book I guess in 2001 three

10:46.260 --> 10:53.160
books release actually want that string so we can do in the release here comma

10:56.370 --> 10:57.230
the count

11:01.850 --> 11:02.390
comma

11:05.100 --> 11:12.970
books released and technically just to make sure the parentheses are on the SB kissers.

11:13.020 --> 11:17.380
And a lot of cases one book really so I don't want to say one books anyway.

11:17.940 --> 11:21.450
So let's make sure we have the correct premise here.

11:21.660 --> 11:26.890
Select the concatenation of in the release year.

11:27.630 --> 11:29.880
We probably want to space there don't we.

11:29.880 --> 11:33.700
This is where it gets kind of annoying that we're not using a separate file.

11:33.810 --> 11:40.750
So in 2001 say space three space books are released.

11:40.800 --> 11:45.160
And then lastly to clean this whole thing up let's add an as.

11:45.510 --> 11:50.620
And we'll just call this as you say year or something silly.

11:50.940 --> 11:54.930
We hit enter and we get this nice little table in 1945.

11:54.960 --> 12:00.550
One book released in 2003 two books released in 2001 three books released.

12:00.870 --> 12:01.490
OK.

12:02.220 --> 12:04.050
So that's it for now.

12:04.050 --> 12:05.750
At least with a group by.

12:05.760 --> 12:08.970
We're going to be working with it in every video from here on out in this section.

12:08.970 --> 12:14.790
Like I said we're going to move on to things like min and max average some fun things hopefully sort

12:14.790 --> 12:15.750
of fun.

12:15.750 --> 12:18.120
Congratulations if you made it through this video.

12:18.120 --> 12:21.460
Like I said it's a little bit weird it's different than what we've done before.

12:21.600 --> 12:22.650
Hopefully it wasn't too boring.
