1. Learn to quack SQL with DuckDB: The Basics#
Today, we’ll cover some basic operations in DuckDB SQL. DuckDB is a high-performance analytical database system designed for easy integration with data science workflows. It is particularly well-suited for in-memory processing and can handle complex queries efficiently. Learning SQL is a great skill because it allows you to manage and analyze large datasets quickly and effectively.
Let’s dive in and learn how to load data, grab a whole table, pick specific columns, add a calculated column, and filter rows using the WHERE
clause.
To start off, install the latest version of duckdb
and magic-duckdb
to run this notebook.
!pip install --upgrade duckdb magic-duckdb --quiet
%load_ext magic_duckdb
Note
If you are in Google Collab or a Jupyter notebook, remember to put the magic %%dql
directive in the beginning of each SQL cell.
Basic operations#
Let’s get started with inspecting some data! We’ll use the washington_weather.csv
dataset.
To download the dataset directly from GitHub, run:
!wget https://raw.githubusercontent.com/MotherDuck-Open-Source/sql-tutorial/main/data/washington_weather.csv -q
!wget https://raw.githubusercontent.com/MotherDuck-Open-Source/sql-tutorial/main/answers/answers_1.zip -q
!unzip -o answers_1.zip -d answers
Archive: answers_1.zip
inflating: answers/answer_1.01.sql
inflating: answers/answer_1.02.sql
inflating: answers/answer_1.03.sql
inflating: answers/answer_1.04.sql
inflating: answers/answer_1.05.sql
inflating: answers/answer_1.06.sql
inflating: answers/answer_1.07.sql
inflating: answers/answer_1.08.sql
Create a new table from a CSV file#
DuckDB makes it very easy to load data from a CSV file. To create a new table from a file, run:
%%dql
CREATE TABLE weather AS SELECT * FROM read_csv('washington_weather.csv');
Count | |
---|---|
0 | 2640 |
In general, it’s easy to create a new table! The syntax CREATE TABLE <name> AS ...
lets you create a new table using any query. If you want to overwrite an existing table, you can use the CREATE OR REPLACE TABLE <name> AS ...
syntax. For more information about the CREATE TABLE
syntax, see the docs.
Exercise 1.01
Recreate the table called weather
by selecting all columns in the washington_weather.csv
file.
# Uncomment and run to show solution
# !cat ./answers/answer_1.01.sql
Describe the table#
You can now describe the table to learn its structure:
%%dql
DESCRIBE weather;
column_name | column_type | null | key | default | extra | |
---|---|---|---|---|---|---|
0 | station | VARCHAR | YES | None | None | None |
1 | name | VARCHAR | YES | None | None | None |
2 | latitude | DOUBLE | YES | None | None | None |
3 | longitude | DOUBLE | YES | None | None | None |
4 | elevation | DOUBLE | YES | None | None | None |
5 | date | DATE | YES | None | None | None |
6 | precipitation | DOUBLE | YES | None | None | None |
7 | temperature_max | BIGINT | YES | None | None | None |
8 | temperature_min | BIGINT | YES | None | None | None |
9 | temperature_obs | BIGINT | YES | None | None | None |
This returns a table that shows you details about the columns, such as the column name and its type.
Grab the Whole Table#
To see all the data in the weather
table, you can use the following SQL query:
%%dql
SELECT * FROM weather;
station | name | latitude | longitude | elevation | date | precipitation | temperature_max | temperature_min | temperature_obs | |
---|---|---|---|---|---|---|---|---|---|---|
0 | USC00458278 | TACOMA NUMBER 1, WA US | 47.24720 | -122.41220 | 7.6 | 2023-01-01 | 0.04 | 47.0 | 38.0 | 38 |
1 | USC00458278 | TACOMA NUMBER 1, WA US | 47.24720 | -122.41220 | 7.6 | 2023-01-02 | 0.00 | 47.0 | 32.0 | 36 |
2 | USC00458278 | TACOMA NUMBER 1, WA US | 47.24720 | -122.41220 | 7.6 | 2023-01-03 | 0.05 | 46.0 | 36.0 | 38 |
3 | USC00458278 | TACOMA NUMBER 1, WA US | 47.24720 | -122.41220 | 7.6 | 2023-01-04 | 0.10 | 50.0 | 34.0 | 45 |
4 | USC00458278 | TACOMA NUMBER 1, WA US | 47.24720 | -122.41220 | 7.6 | 2023-01-05 | 0.05 | 61.0 | 45.0 | 55 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2635 | USC00456295 | PALMER 3 ESE, WA US | 47.30889 | -121.85556 | 268.8 | 2023-12-27 | 0.03 | 48.0 | 42.0 | 46 |
2636 | USC00456295 | PALMER 3 ESE, WA US | 47.30889 | -121.85556 | 268.8 | 2023-12-28 | 0.01 | 51.0 | 41.0 | 50 |
2637 | USC00456295 | PALMER 3 ESE, WA US | 47.30889 | -121.85556 | 268.8 | 2023-12-29 | 0.00 | 55.0 | 41.0 | 50 |
2638 | USC00456295 | PALMER 3 ESE, WA US | 47.30889 | -121.85556 | 268.8 | 2023-12-30 | 0.86 | 51.0 | 41.0 | 42 |
2639 | USC00456295 | PALMER 3 ESE, WA US | 47.30889 | -121.85556 | 268.8 | 2023-12-31 | 0.10 | 44.0 | 35.0 | 35 |
2640 rows × 10 columns
This query selects all columns and rows from the weather
table.
Filter Rows (WHERE Clause)#
To filter rows based on certain conditions, you can use the WHERE
clause. For example, if you only want to see the dates where a temperature higher than 82 was observed, you can run this query:
%%dql
SELECT * FROM weather WHERE temperature_obs > 82;
station | name | latitude | longitude | elevation | date | precipitation | temperature_max | temperature_min | temperature_obs | |
---|---|---|---|---|---|---|---|---|---|---|
0 | USC00458278 | TACOMA NUMBER 1, WA US | 47.2472 | -122.41220 | 7.6 | 2023-08-14 | 0.0 | 87 | 56 | 84 |
1 | USC00458278 | TACOMA NUMBER 1, WA US | 47.2472 | -122.41220 | 7.6 | 2023-08-15 | 0.0 | 90 | 65 | 84 |
2 | USC00454486 | LANDSBURG, WA US | 47.3766 | -121.96130 | 163.1 | 2023-05-13 | 0.0 | 91 | 50 | 83 |
3 | USC00454486 | LANDSBURG, WA US | 47.3766 | -121.96130 | 163.1 | 2023-05-14 | 0.0 | 90 | 46 | 85 |
4 | USC00455525 | MONROE, WA US | 47.8450 | -121.99556 | 38.1 | 2023-05-14 | 0.0 | 90 | 48 | 85 |
5 | USC00455525 | MONROE, WA US | 47.8450 | -121.99556 | 38.1 | 2023-08-14 | 0.0 | 91 | 60 | 84 |
6 | USC00455525 | MONROE, WA US | 47.8450 | -121.99556 | 38.1 | 2023-08-15 | 0.0 | 93 | 52 | 85 |
7 | USC00458508 | TOLT SOUTH FORK RESERVOIR, WA US | 47.7000 | -121.69080 | 609.6 | 2023-08-14 | 0.0 | 88 | 67 | 85 |
This command selects all columns from the weather table, but only includes rows where the observed temperature is greater than 82°F.
To combine filters for two or more different columns, you can use AND
or OR
:
%%dql
SELECT * FROM weather WHERE precipitation > 2.5 OR elevation > 600;
station | name | latitude | longitude | elevation | date | precipitation | temperature_max | temperature_min | temperature_obs | |
---|---|---|---|---|---|---|---|---|---|---|
0 | USC00451233 | CEDAR LAKE, WA US | 47.4144 | -121.75610 | 480.4 | 2023-12-05 | 2.98 | 52 | 43 | 51 |
1 | USC00450872 | BREMERTON, WA US | 47.5688 | -122.68270 | 36.6 | 2023-12-05 | 5.00 | 59 | 44 | 53 |
2 | USC00455525 | MONROE, WA US | 47.8450 | -121.99556 | 38.1 | 2023-12-05 | 2.68 | 59 | 49 | 50 |
3 | USC00458508 | TOLT SOUTH FORK RESERVOIR, WA US | 47.7000 | -121.69080 | 609.6 | 2023-01-01 | 0.24 | 41 | 33 | 34 |
4 | USC00458508 | TOLT SOUTH FORK RESERVOIR, WA US | 47.7000 | -121.69080 | 609.6 | 2023-01-02 | 0.00 | 39 | 33 | 37 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
361 | USC00458508 | TOLT SOUTH FORK RESERVOIR, WA US | 47.7000 | -121.69080 | 609.6 | 2023-12-27 | 0.03 | 45 | 38 | 44 |
362 | USC00458508 | TOLT SOUTH FORK RESERVOIR, WA US | 47.7000 | -121.69080 | 609.6 | 2023-12-28 | 0.36 | 44 | 42 | 44 |
363 | USC00458508 | TOLT SOUTH FORK RESERVOIR, WA US | 47.7000 | -121.69080 | 609.6 | 2023-12-29 | 0.02 | 54 | 44 | 54 |
364 | USC00458508 | TOLT SOUTH FORK RESERVOIR, WA US | 47.7000 | -121.69080 | 609.6 | 2023-12-30 | 0.00 | 54 | 44 | 45 |
365 | USC00458508 | TOLT SOUTH FORK RESERVOIR, WA US | 47.7000 | -121.69080 | 609.6 | 2023-12-31 | 0.62 | 47 | 40 | 41 |
366 rows × 10 columns
Note
In DuckDB, strings are indicated with single quotes, like so: 'my string value'
, and column names with double quotes, like so: "my column name"
. You’ll only need to use double quotes for your column names if they contain spaces or special characters.
Exercise 1.02
Filter rows where the station name is 'TACOMA NUMBER 1, WA US'
.
# Uncomment and run to show solution
# !cat ./answers/answer_1.02.sql
Pick the Columns that You Want#
Sometimes, you may only want to see specific columns. For example, if you only want to see the temperature_max
and temperature_min
columns, you can run this query:
%%dql
SELECT name, date, temperature_min, temperature_max FROM weather;
name | date | temperature_min | temperature_max | |
---|---|---|---|---|
0 | TACOMA NUMBER 1, WA US | 2023-01-01 | 38.0 | 47.0 |
1 | TACOMA NUMBER 1, WA US | 2023-01-02 | 32.0 | 47.0 |
2 | TACOMA NUMBER 1, WA US | 2023-01-03 | 36.0 | 46.0 |
3 | TACOMA NUMBER 1, WA US | 2023-01-04 | 34.0 | 50.0 |
4 | TACOMA NUMBER 1, WA US | 2023-01-05 | 45.0 | 61.0 |
... | ... | ... | ... | ... |
2635 | PALMER 3 ESE, WA US | 2023-12-27 | 42.0 | 48.0 |
2636 | PALMER 3 ESE, WA US | 2023-12-28 | 41.0 | 51.0 |
2637 | PALMER 3 ESE, WA US | 2023-12-29 | 41.0 | 55.0 |
2638 | PALMER 3 ESE, WA US | 2023-12-30 | 41.0 | 51.0 |
2639 | PALMER 3 ESE, WA US | 2023-12-31 | 35.0 | 44.0 |
2640 rows × 4 columns
Exercise 1.03
Run a DESCRIBE
query on the weather
table to inspect the column names, and try selecting a few different ones! For example, select the name
, date
, elevation
, precipitation
, and/or temperature_obs
columns.
# Uncomment and run to show solution
# !cat ./answers/answer_1.03.sql
Exercise 1.04
Select the temperature_max
and temperature_min
columns, and filter down to only see the rows where both of those values are under 60 and above 50.
# Uncomment and run to show solution
# !cat ./answers/answer_1.04.sql
Add a calculated Column#
You can also add a calculated column to your results. For example, if you want to calculate the average of two columns, temperature_max
and temperature_min
, you can do this:
%%dql
SELECT name, date, (temperature_max + temperature_min) / 2 AS mean_temperature
FROM weather;
name | date | mean_temperature | |
---|---|---|---|
0 | TACOMA NUMBER 1, WA US | 2023-01-01 | 42.5 |
1 | TACOMA NUMBER 1, WA US | 2023-01-02 | 39.5 |
2 | TACOMA NUMBER 1, WA US | 2023-01-03 | 41.0 |
3 | TACOMA NUMBER 1, WA US | 2023-01-04 | 42.0 |
4 | TACOMA NUMBER 1, WA US | 2023-01-05 | 53.0 |
... | ... | ... | ... |
2635 | PALMER 3 ESE, WA US | 2023-12-27 | 45.0 |
2636 | PALMER 3 ESE, WA US | 2023-12-28 | 46.0 |
2637 | PALMER 3 ESE, WA US | 2023-12-29 | 48.0 |
2638 | PALMER 3 ESE, WA US | 2023-12-30 | 46.0 |
2639 | PALMER 3 ESE, WA US | 2023-12-31 | 39.5 |
2640 rows × 3 columns
This command creates a new column called mean_temperature
that contains the average of temperature_min
and temperature_max
.
Exercise 1.05
Add a new calculated column called temperature_range
that gets the difference between temperature_max
and temperature_min
columns.
# Uncomment and run to show solution
# !cat ./answers/answer_1.05.sql
Exercise 1.06
Create a new calculated column, temperature_obs_celcius
, that converts the observed temperature to °C using the equation: (32°F − 32) × 5/9 = 0°C
.
# Uncomment and run to show solution
# !cat ./answers/answer_1.06.sql
Order Rows (ORDER BY Clause)#
To sort the rows based on a specific column, you can use the ORDER BY clause. For example, if you want to order the observations by the amount of precipitation with the rainiest days on top, you can run this query:
%%dql
SELECT name, date, precipitation, (temperature_max + temperature_min) / 2 AS mean_temperature
FROM weather
ORDER BY precipitation DESC;
name | date | precipitation | mean_temperature | |
---|---|---|---|---|
0 | BREMERTON, WA US | 2023-12-05 | 5.00 | 51.5 |
1 | PALMER 3 ESE, WA US | 2023-12-05 | 3.10 | 48.0 |
2 | CEDAR LAKE, WA US | 2023-12-05 | 2.98 | 47.5 |
3 | TOLT SOUTH FORK RESERVOIR, WA US | 2023-12-06 | 2.96 | 49.5 |
4 | MONROE, WA US | 2023-12-05 | 2.68 | 54.0 |
... | ... | ... | ... | ... |
2635 | PALMER 3 ESE, WA US | 2023-12-20 | 0.00 | 42.0 |
2636 | PALMER 3 ESE, WA US | 2023-12-21 | 0.00 | 45.5 |
2637 | PALMER 3 ESE, WA US | 2023-12-23 | 0.00 | 33.0 |
2638 | PALMER 3 ESE, WA US | 2023-12-24 | 0.00 | 38.0 |
2639 | PALMER 3 ESE, WA US | 2023-12-29 | 0.00 | 48.0 |
2640 rows × 4 columns
This command sorts the rows by the precipitation
column in descending order.
Exercise 1.07
Use the query you created in the previous exercise and order the rows by precipitation
in ascending order.
# Uncomment and run to show solution
# !cat ./answers/answer_1.07.sql
Exercise 1.08
Get the station name
, date
, temperature_obs
and precipitation
, and sort the table such that the row with the lowest temperature observed is at the top of the result table.
# Uncomment and run to show solution
# !cat ./answers/answer_1.08.sql