Open In Colab

2. Learn to quack SQL with DuckDB: Group by, Joins and Subqueries#

To start off, install the latest version of duckdb and magic-duckdb to run this notebook.

!pip install --upgrade duckdb magic-duckdb --quiet
%load_ext magic_duckdb

Example Tables#

Let’s start with two datasets:

  • birds.csv: a list of measurements of individuals of different bird species

  • ducks.csv: a list of scientific names of bird species that are ducks

To download the datasets directly from GitHub, run:

!wget https://raw.githubusercontent.com/MotherDuck-Open-Source/sql-tutorial/main/data/birds.csv -q
!wget https://raw.githubusercontent.com/MotherDuck-Open-Source/sql-tutorial/main/data/ducks.csv -q
!wget https://raw.githubusercontent.com/MotherDuck-Open-Source/sql-tutorial/main/answers/answers_2.zip -q 
!unzip -o answers_2.zip -d answers 
Archive:  answers_2.zip
  inflating: answers/answer_2.01.sql  
  inflating: answers/answer_2.02.sql  
  inflating: answers/answer_2.03.sql  
  inflating: answers/answer_2.04.sql  
  inflating: answers/answer_2.05.sql  
  inflating: answers/answer_2.06.sql  
  inflating: answers/answer_2.07.sql  
  inflating: answers/answer_2.08.sql  
  inflating: answers/answer_2.09.sql  
  inflating: answers/answer_2.10.sql  
  inflating: answers/answer_2.11.sql  
  inflating: answers/answer_2.12.sql  
  inflating: answers/answer_2.13.sql  
  inflating: answers/answer_2.14.sql  

To create the tables in your database, run:

%%dql
CREATE TABLE birds AS SELECT * FROM read_csv('birds.csv');
CREATE TABLE ducks AS SELECT * FROM read_csv('ducks.csv');
Count
0 139

To begin understanding the data contained in these tables, you can run:

%%dql
SUMMARIZE birds;
column_name column_type min max approx_unique avg std q25 q50 q75 count null_percentage
0 column00 BIGINT 0 90370 90212 45185.0 26088.004925635843 22592 45185 67778 90371 0.00
1 Avibase_ID VARCHAR AVIBASE-000E026B AVIBASE-FFF63987 11331 None None None None None 90371 0.00
2 Species_Common_Name VARCHAR Abbott's babbler ʻōʻū 7843 None None None None None 90371 19.24
3 Species1_BirdLife VARCHAR Abeillia abeillei Zosterornis whiteheadi 10928 None None None None None 90371 0.37
4 Species2_eBird VARCHAR Abeillia abeillei Zosterornis whiteheadi 10422 None None None None None 90371 0.40
5 eBird_species_group VARCHAR Abeillia abeillei Zosterornis whiteheadi 10996 None None None None None 90371 0.20
6 Species3_BirdTree VARCHAR Abeillia abeillei Zosterops xanthochroa 9808 None None None None None 90371 0.08
7 Data_type DOUBLE 1.0 2.0 2 1.1574872250610975 0.3642615447315967 1.0 1.0 1.0 90371 0.39
8 Source VARCHAR AIM ZRC 80 None None None None None 90371 0.40
9 Specimen_number VARCHAR "1925.12.24.246B" y3.5.12.1435 72640 None None None None None 90371 18.34
10 Sex VARCHAR F U 3 None None None None None 90371 0.39
11 Age DOUBLE 0.0 1.0 2 0.014096867362808265 0.11789105175801462 0.0 0.0 0.0 90371 0.39
12 Locality VARCHAR None None 0 None None None None None 90371 100.00
13 Country_WRI VARCHAR Afghanistan Zimbabwe 209 None None None None None 90371 14.52
14 Country VARCHAR None None 0 None None None None None 90371 100.00
15 Beak_Length_Culmen DOUBLE 3.8 455.0 1483 25.33171431651031 22.381254702846245 14.355561717044468 19.32543897495656 27.623699301844592 90371 17.87
16 Beak_Length_Nares DOUBLE 1.3 428.0 1248 16.286273278552965 18.57503539116434 8.469626187869864 11.449578893557915 17.14992962004607 90371 38.32
17 Beak_Width DOUBLE 0.7 94.2 493 6.3637648597917655 4.861109150585658 3.59853210080108 4.898176459753179 7.303471551218731 90371 22.18
18 Beak_Depth DOUBLE 0.7 132.8 681 7.431216081735518 6.896127808541489 3.697628589530138 5.417165073816974 8.454719279852593 90371 16.60
19 Tarsus_Length DOUBLE 1.9 490.0 1410 26.910134505756478 20.971245785879187 17.297410931321547 21.611251274328783 29.204797845191624 90371 10.90
20 Wing_Length DOUBLE 0.1 838.0 2249 111.78041893027141 82.12554804390615 63.8613742389585 83.7726839803396 124.9753250951792 90371 1.21
21 Kipps_Distance DOUBLE 0.1 450.0 1678 34.17952516104371 41.72670850086505 11.04372502503744 18.642134166584025 38.707493045910375 90371 30.60
22 Secondary1 DOUBLE 0.1 729.6 2849 84.01929174323386 53.30644704168306 52.8451414597508 67.80888888544403 95.15018564006331 90371 28.94
23 Hand_wing_Index DOUBLE 0.1 95.9 751 24.923443357848146 14.298359758953382 15.219163142564692 20.86252740290153 30.33337642738868 90371 30.64
24 Tail_Length DOUBLE 0.1 1415.0 2208 81.53134389053731 58.00735637295096 48.42788217905411 65.92166351759666 92.66079973786391 90371 7.48
25 Measurer VARCHAR AC YW 152 None None None None None 90371 0.00
26 Protocol DOUBLE 0.0 1.0 2 0.5704732281715175 0.4950113595980929 0.0 1.0 1.0 90371 0.39
27 Publication VARCHAR Cardona-Salazar et al. 2020 https://esajournal... Collar 2014 2 None None None None None 90371 96.34
%%dql
SUMMARIZE ducks;
column_name column_type min max approx_unique avg std q25 q50 q75 count null_percentage
0 name VARCHAR African black duck Yellow-billed teal 134 None None None None None 139 0.0
1 genus VARCHAR Aix galericulata Thalassornis leuconotus 139 None None None None None 139 0.0
2 author VARCHAR A. Wilson Wied-Neuwied 57 None None None None None 139 0.0
3 year VARCHAR 1758 H. Fleming 56 None None None None None 139 0.0
4 extinct BIGINT 0 1935 3 13.971223021582734 164.12040218163463 0 0 0 139 0.0

Exercise 2.01

Create a new table birds_measurements from the file birds.csv (this file contains the names and measurements of individuals from over 10k bird species).

# Uncomment and run to show solution
# !cat ./answers/answer_2.01.sql

Exercise 2.02

Create a new table ducks_species from the file ducks.csv (this file contains species names and common names of ducks).

# Uncomment and run to show solution
# !cat ./answers/answer_2.02.sql

1. Aggregate Functions#

The functions we saw previously when building calculated columns operated on each row of the table individually. In contrast, aggregate functions summarize many rows of the table. By default, they will summarize all rows (stay tuned though!). For example, let’s find the minimum and maximum Beak_Width of any bird in the dataset.

%%dql
SELECT 
    MIN(Beak_Width) AS Min_Beak_Width,
    MAX(Beak_Width) AS Max_Beak_Width
FROM birds;
Min_Beak_Width Max_Beak_Width
0 0.7 94.2

However, aggregating an entire table all the way up to just a single row isn’t always what we are looking for. Next, we will use the GROUP BY clause to apply aggregate functions to groups of rows instead of all rows.

2. Group Rows (GROUP BY Clause)#

To group the rows based on a specific column (or columns) and perform aggregate functions, you can use the GROUP BY clause. For example, if you want to group the birds by their species name and calculate the average Beak_Width, Beak_Depth and Beak_Length_Culmen for each group, you can run this query:

%%dql
SELECT
    Species_Common_Name,
    AVG(Beak_Width) AS Avg_Beak_Width,
    AVG(Beak_Depth) AS Avg_Beak_Depth,
    AVG(Beak_Length_Culmen) AS Avg_Beak_Length_Culmen
FROM birds
GROUP BY Species_Common_Name;
Species_Common_Name Avg_Beak_Width Avg_Beak_Depth Avg_Beak_Length_Culmen
0 Yellow-bellied warbler 3.120000 2.840000 12.320000
1 Rifleman 2.383333 2.216667 12.116667
2 Inland thornbill 2.170000 2.630000 11.010526
3 Yellow-rumped thornbill 2.138462 2.500000 11.700000
4 Slender-billed thornbill 1.961538 2.323077 9.153846
... ... ... ... ...
7988 Santa Cruz white-eye 3.250000 4.200000 14.950000
7989 Malaita white-eye 3.475000 3.900000 17.125000
7990 Slender-billed white-eye 2.980000 3.800000 17.300000
7991 Golden-bellied white-eye 3.333333 4.066667 14.233333
7992 Pemba white-eye 2.980000 3.040000 11.460000

7993 rows × 4 columns

This command groups the rows by the Species_Common_Name column and calculates the average Beak_Width, Beak_Depth and Beak_Length_Culmen for the individuals in each bird species group.

Multiple columns can be included within a GROUP BY clause, separated by commas. In this example, we measure the maximum wing_length by Country_WRI and Source. This example shows that these columns do not have to be hierarchically related - the GROUP BY will show all combinations of data in those columns.

%%dql 
SELECT 
    Country_WRI,
    Source,
    MAX(wing_length)
FROM birds 
GROUP BY
    Country_WRI,
    Source
Country_WRI Source max(wing_length)
0 India NHMUK 787.0
1 Brazil MPEG 602.0
2 Venezuela NHMUK 497.0
3 Australia ANWC 258.0
4 Costa Rica NHMUK 668.0
... ... ... ...
1059 Japan NHMD 129.8
1060 Portugal ZFMK 124.0
1061 Nigeria UMMZ 68.2
1062 Jamaica LSUMZ 77.0
1063 Ecuador FLMNH 64.0

1064 rows × 3 columns

Exercise 2.03

Run a query that gets the average Beak_Length_Culmen, Wing_Length and Tail_Length for all birds.

# Uncomment and run to show solution
# !cat ./answers/answer_2.03.sql

Exercise 2.04

Run a query that finds the average Tail_Length by Species_Common_Name and by Country_WRI.

# Uncomment and run to show solution
# !cat ./answers/answer_2.04.sql

Getting the 95th percentile of a column value#

We’ve used GROUP BY to group by a certain column, and used an aggregate function to combine other columns in our query, for instance, by taking the average. But, what if we want to get the 95th percentile of a column value? DuckDB has a built-in aggregate function for that too! For instance, to get the 95th percentile value of the Beak_Length_Culmen of all birds, run:

%%dql
SELECT 
    QUANTILE_CONT(Beak_Length_Culmen, 0.95)
FROM birds;
quantile_cont(Beak_Length_Culmen, 0.95)
0 59.3

Exercise 2.05

Run a query that gets the 95th percentile and 99th percentile of Beak_Length_Culmen for all birds.

# Uncomment and run to show solution
# !cat ./answers/answer_2.05.sql

Exercise 2.06

Run a query that gets the 99th percentile of Wing_Length by Species_Common_Name.

# Uncomment and run to show solution
# !cat ./answers/answer_2.06.sql

3. Understanding SQL Joins#

INNER JOIN (JOIN)#

In SQL, a Join operation allows you to combine rows from two or more tables based on a related column between them. This is incredibly useful when you need to pull together related information that is stored in different tables.

Let’s combine the birds and ducks tables to find the Beak_Length_Culmen of all birds that are ducks. To do this, we’ll use a SQL Join operation. Specifically, we’ll use an INNER JOIN, which combines rows from both tables only when there is a match in the Species_Common_Name column.

%%dql
SELECT
    birds.column00 as id,
    birds.Species_Common_Name,
    birds.Beak_Length_Culmen,
    ducks.author
FROM birds
INNER JOIN ducks ON birds.Species_Common_Name = ducks.name;
id Species_Common_Name Beak_Length_Culmen author
0 2656 Egyptian goose 53.0 Linnaeus
1 2657 Egyptian goose 45.2 Linnaeus
2 2658 Egyptian goose 45.2 Linnaeus
3 2659 Egyptian goose 48.0 Linnaeus
4 2660 Egyptian goose 62.7 Linnaeus
... ... ... ... ...
657 81213 White-backed duck 45.0 Eyton
658 81214 White-backed duck 43.6 Eyton
659 81215 White-backed duck 41.0 Eyton
660 81216 White-backed duck 40.4 Eyton
661 81217 White-backed duck 30.2 Eyton

662 rows × 4 columns

Step-by-Step Explanation#

Let’s break down the SQL query step by step:

SELECT birds.column00 as id, birds.Species_Common_Name, birds.Beak_Length_Culmen, ducks.author: We’re selecting the species id, name and beak length from the birds table, and the duck species author from the ducks table.

Up until now, we haven’t needed to specify which table a column is coming from since we have been working with just one table! Now we prefix column names with the name of the table they come from. As a note, this is not required if the column names in the two tables are completely different from one another, but it is a good best practice.

FROM birds: We’re starting with the birds table.

INNER JOIN ducks ON birds.Species_Common_Name = ducks.name: We’re joining the birds table to the ducks table where the species’ common name matches in both tables. We are using table prefixes again for clarity.

INNER JOIN Gotchas#

NOTE: When using an INNER JOIN, we only show output rows where there are matching values in both tables. This has dramatically reduced the number of output rows since now we are only looking at ducks!

NOTE: If a join between 2 tables results in multiple matches, all matches will be returned. This can mean that your result can actually return more rows after a join, in some cases! (Imagine that we had messy data in ducks.csv, and one species mistakenly had multiple authors. We would have 1 row in our result for each author.)

NOTE: INNER JOIN is the default kind of join in SQL. So if you see a query that just says ... table1 JOIN table2 ..., then it is using an INNER JOIN! It is common practice to omit INNER.

NOTE: It is possible to join on multiple columns. For example, imagine wanting to connect two tables by matching both a first name column and last name column. Inequality conditions are also possible (as we will see later!).

Exercise 2.07

Run a query that gets the name, Beak_Length_Culmen, Wing_Length and Tail_Length of birds that are ducks.

# Uncomment and run to show solution
# !cat ./answers/answer_2.07.sql

Exercise 2.08

Let’s run a similar query, but group the ducks by species. Run a query that gets the Species_Common_Name, average Beak_Length_Culmen, Wing_Length and Tail_Length of birds that are ducks, and sort the results by Species_Common_Name.

# Uncomment and run to show solution
# !cat ./answers/answer_2.08.sql

LEFT OUTER JOIN (LEFT JOIN)#

A LEFT OUTER JOIN will keep all rows from the LEFT table in the join (the table before the LEFT JOIN keywords), even if there is not a match in the table on the right. Any rows that do not have a match in the right table will have the value NULL for all columns from the right table. NULL is the missing value indicator in SQL.

This can be useful when adding optional details. For example, in our situtation, ducks will have an author, but all other birds will not.

%%dql
SELECT
    birds.column00 as id,
    birds.Species_Common_Name,
    birds.Beak_Length_Culmen,
    ducks.author
FROM birds
LEFT JOIN ducks ON birds.Species_Common_Name = ducks.name;
id Species_Common_Name Beak_Length_Culmen author
0 0 Emerald-chinned hummingbird 14.8 None
1 1 Emerald-chinned hummingbird 13.8 None
2 2 Emerald-chinned hummingbird 11.8 None
3 3 Emerald-chinned hummingbird 11.8 None
4 4 Emerald-chinned hummingbird 12.8 None
... ... ... ... ...
90366 81213 White-backed duck 45.0 Eyton
90367 81214 White-backed duck 43.6 Eyton
90368 81215 White-backed duck 41.0 Eyton
90369 81216 White-backed duck 40.4 Eyton
90370 81217 White-backed duck 30.2 Eyton

90371 rows × 4 columns

Notice how the LEFT JOIN query has 90371 rows in the result (the same number of rows as the birds table), and the INNER JOIN query only had 662 rows.

The author column contains the Python missing value indicator of None, which is equivalent to SQL’s NULL.

LEFT JOIN Gotchas#

NOTE: A LEFT JOIN usually, but not always, will result in the same number of rows as the left table. Cases where this is not true include:

  • Duplicates in the columns that are being joined in the right table

  • A WHERE clause that filters the result

Exercise 2.09

Modify the LEFT JOIN query above to filter to only rows that are NOT ducks.

Hint: In Python (like in SQL), nothing equals None! Just like in Python, we use the IS keyword to check if a value is missing.

# Uncomment and run to show solution
# !cat ./answers/answer_2.09.sql

3. Subqueries#

What is a Subquery?#

A subquery, also known as a nested query, is a query within another SQL query. It’s like a query inside a query! Subqueries are used to perform operations that require multiple steps, such as filtering data based on a complex condition or aggregating data before using it in the main query. In other words, instead of creating multiple new tables as intermediate steps, you can define these steps within the scope of a larger query.

Types of Subquery#

A Subquery can return a single value (one row and one column), an entire column of values, or a table of values. These each can be used in the location within a query where a static value, column, or table would otherwise be.

Using Subqueries in DuckDB#

Let’s start by looking at our previously example query to understand how subqueries work in DuckDB.

Finding the top Beak_Length_Culmen#

Suppose we want to find the individual ducks with the largest Beak_Length_Culmen. We can use a subquery to calculate the 99th percentile of Beak_Length_Culmen first, and then use that result in our main query:

%%dql
SELECT
    birds.column00 as id,
    birds.Species_Common_Name,
    birds.Beak_Length_Culmen
FROM birds
INNER JOIN ducks ON birds.Species_Common_Name = ducks.name
WHERE birds.Beak_Length_Culmen > (
    SELECT QUANTILE_CONT(birds.Beak_Length_Culmen, 0.99)
    FROM birds 
    INNER JOIN ducks ON birds.Species_Common_Name = ducks.name
)
ORDER BY birds.Beak_Length_Culmen DESC;
id Species_Common_Name Beak_Length_Culmen
0 48763 Common merganser 79.0
1 48764 Common merganser 78.7
2 54085 Knob-billed duck 74.5
3 10784 Muscovy duck 74.0
4 46501 Pink-eared duck 73.4
5 75575 Common eider 73.1
6 46502 Pink-eared duck 72.2

In this example, the subquery (SELECT QUANTILE_CONT(birds.Beak_Length_Culmen, 0.99) FROM birds INNER JOIN ducks ON birds.Species_Common_Name = ducks.name) calculates the 99th percentile of beak length for all birds that are ducks. The main query then selects the names and beak measurements of individual ducks who have a beak length above this value.

Exercise 2.10

Find the duck species that have a Wing_Length larger than the 99th percentile of all ducks.

# Uncomment and run to show solution
# !cat ./answers/answer_2.10.sql

Exercise 2.11

Can you find any duck species that have both a Wing_Length and Beak_Length_Culmen larger than the 95th percentile of all duck species?

# Uncomment and run to show solution
# !cat ./answers/answer_2.11.sql

Exercise 2.12

NOTE: This is extra credit!

Instead of individual ducks, find the duck species that on average have a measured beak size that is larger than the 95th percentile of all ducks.

# Uncomment and run to show solution
# !cat ./answers/answer_2.12.sql

Using the WITH Clause#

The WITH clause is an alternative to a subquery that has 2 key advantages: it can increase readability, and it allows for reusability. The technical term for a WITH clause is a Common Table Expression (abbreviated CTE), which describes how it can be reusable.

Now, let’s see how we can use the WITH clause to make our queries more readable. Suppose we want to find the individual ducks that have a beak length above the 99th percentile of duck beaks. Here’s how we can do it using the WITH clause:

%%dql
WITH
    duck_beaks AS (
        SELECT
            column00 as id,
            Species_Common_Name,
            Beak_Length_Culmen
        FROM birds
        INNER JOIN ducks ON name = Species_Common_Name
    ),

    pc99_beak_len AS (
        SELECT QUANTILE_CONT(Beak_Length_Culmen, 0.99) AS Top_Beak_Length 
        FROM duck_beaks
    )

SELECT
    duck_beaks.id,
    duck_beaks.Species_Common_Name,
    duck_beaks.Beak_Length_Culmen
FROM duck_beaks
INNER JOIN pc99_beak_len ON duck_beaks.Beak_Length_Culmen > pc99_beak_len.Top_Beak_Length
ORDER BY duck_beaks.Beak_Length_Culmen DESC;
id Species_Common_Name Beak_Length_Culmen
0 48763 Common merganser 79.0
1 48764 Common merganser 78.7
2 54085 Knob-billed duck 74.5
3 10784 Muscovy duck 74.0
4 46501 Pink-eared duck 73.4
5 75575 Common eider 73.1
6 46502 Pink-eared duck 72.2

In this example, the WITH clause creates two temporary result sets called duck_beaks and pc99_beak_len. The main query then selects the names and beak measurements of ducks with Beak_Length_Culmen above the top 99th percentile beak length.

Exercise 2.13

Find the duck species that have an average Wing_Length larger than the 95th percentile of all duck species.

# Uncomment and run to show solution
# !cat ./answers/answer_2.13.sql

Exercise 2.14

What about the duck species that have both a Wing_Length or Beak_Length_Culmen larger than the 95sup>th percentile of all duck species?

# Uncomment and run to show solution
# !cat ./answers/answer_2.14.sql