1. Learn to quack SQL with DuckDB: The Basics

1. Learn to quack SQL with DuckDB: The Basics#

Today, we’ll cover some basic operations in DuckDB SQL. DuckDB is a high-performance analytical database system designed for easy integration with data science workflows. It is particularly well-suited for in-memory processing and can handle complex queries efficiently. Learning SQL is a great skill because it allows you to manage and analyze large datasets quickly and effectively.

Let’s dive in and learn how to load data, grab a whole table, pick specific columns, add a calculated column, and filter rows using the WHERE clause.

To start off, install the latest version of duckdb and magic-duckdb to run this notebook.

!pip install --upgrade duckdb magic-duckdb --quiet
%load_ext magic_duckdb

Note

If you are in Google Collab or a Jupyter notebook, remember to put the magic %%dql directive in the beginning of each SQL cell.

Basic operations#

Let’s get started with inspecting some data! We’ll use the washington_weather.csv dataset.

To download the dataset directly from GitHub, run:

!wget https://raw.githubusercontent.com/motherduckdb/sql-tutorial/main/data/washington_weather.csv -q
!wget https://raw.githubusercontent.com/motherduckdb/sql-tutorial/main/answers/answers_1.zip -q 
!unzip -o answers_1.zip -d answers 

Archive:  answers_1.zip
  inflating: answers/answer_1.01.sql  
  inflating: answers/answer_1.02.sql  
  inflating: answers/answer_1.03.sql  
  inflating: answers/answer_1.04.sql  
  inflating: answers/answer_1.05.sql  
  inflating: answers/answer_1.06.sql  
  inflating: answers/answer_1.07.sql  
  inflating: answers/answer_1.08.sql  

Create a new table from a CSV file#

DuckDB makes it very easy to load data from a CSV file. To create a new table from a file, run:

%%dql
CREATE TABLE weather AS SELECT * FROM read_csv('washington_weather.csv');

	Count
0	2640

In general, it’s easy to create a new table! The syntax CREATE TABLE <name> AS ... lets you create a new table using any query. If you want to overwrite an existing table, you can use the CREATE OR REPLACE TABLE <name> AS ... syntax. For more information about the CREATE TABLE syntax, see the docs.

Exercise 1.01

Recreate the table called weather by selecting all columns in the washington_weather.csv file.

# Uncomment and run to show solution
# !cat ./answers/answer_1.01.sql

Describe the table#

You can now describe the table to learn its structure:

%%dql
DESCRIBE weather;

	column_name	column_type	null	key	default	extra
0	station	VARCHAR	YES	None	None	None
1	name	VARCHAR	YES	None	None	None
2	latitude	DOUBLE	YES	None	None	None
3	longitude	DOUBLE	YES	None	None	None
4	elevation	DOUBLE	YES	None	None	None
5	date	DATE	YES	None	None	None
6	precipitation	DOUBLE	YES	None	None	None
7	temperature_max	BIGINT	YES	None	None	None
8	temperature_min	BIGINT	YES	None	None	None
9	temperature_obs	BIGINT	YES	None	None	None

This returns a table that shows you details about the columns, such as the column name and its type.

Grab the Whole Table#

To see all the data in the weather table, you can use the following SQL query:

%%dql
SELECT * FROM weather;

	station	name	latitude	longitude	elevation	date	precipitation	temperature_max	temperature_min	temperature_obs
0	USC00458278	TACOMA NUMBER 1, WA US	47.24720	-122.41220	7.6	2023-01-01	0.04	47	38	38
1	USC00458278	TACOMA NUMBER 1, WA US	47.24720	-122.41220	7.6	2023-01-02	0.00	47	32	36
2	USC00458278	TACOMA NUMBER 1, WA US	47.24720	-122.41220	7.6	2023-01-03	0.05	46	36	38
3	USC00458278	TACOMA NUMBER 1, WA US	47.24720	-122.41220	7.6	2023-01-04	0.10	50	34	45
4	USC00458278	TACOMA NUMBER 1, WA US	47.24720	-122.41220	7.6	2023-01-05	0.05	61	45	55
...	...	...	...	...	...	...	...	...	...	...
2635	USC00456295	PALMER 3 ESE, WA US	47.30889	-121.85556	268.8	2023-12-27	0.03	48	42	46
2636	USC00456295	PALMER 3 ESE, WA US	47.30889	-121.85556	268.8	2023-12-28	0.01	51	41	50
2637	USC00456295	PALMER 3 ESE, WA US	47.30889	-121.85556	268.8	2023-12-29	0.00	55	41	50
2638	USC00456295	PALMER 3 ESE, WA US	47.30889	-121.85556	268.8	2023-12-30	0.86	51	41	42
2639	USC00456295	PALMER 3 ESE, WA US	47.30889	-121.85556	268.8	2023-12-31	0.10	44	35	35

2640 rows × 10 columns

This query selects all columns and rows from the weather table.

Filter Rows (WHERE Clause)#

To filter rows based on certain conditions, you can use the WHERE clause. For example, if you only want to see the dates where a temperature higher than 82 was observed, you can run this query:

%%dql
SELECT * FROM weather WHERE temperature_obs > 82;

	station	name	latitude	longitude	elevation	date	temperature_max	temperature_min	temperature_obs
0	USC00458278	TACOMA NUMBER 1, WA US	47.2472	-122.41220	7.6	2023-08-14	87	56	84
1	USC00458278	TACOMA NUMBER 1, WA US	47.2472	-122.41220	7.6	2023-08-15	90	65	84
2	USC00454486	LANDSBURG, WA US	47.3766	-121.96130	163.1	2023-05-13	91	50	83
3	USC00454486	LANDSBURG, WA US	47.3766	-121.96130	163.1	2023-05-14	90	46	85
4	USC00455525	MONROE, WA US	47.8450	-121.99556	38.1	2023-05-14	90	48	85
5	USC00455525	MONROE, WA US	47.8450	-121.99556	38.1	2023-08-14	91	60	84
6	USC00455525	MONROE, WA US	47.8450	-121.99556	38.1	2023-08-15	93	52	85
7	USC00458508	TOLT SOUTH FORK RESERVOIR, WA US	47.7000	-121.69080	609.6	2023-08-14	88	67	85

This command selects all columns from the weather table, but only includes rows where the observed temperature is greater than 82°F.

To combine filters for two or more different columns, you can use AND or OR:

%%dql
SELECT * FROM weather WHERE precipitation > 2.5 OR elevation > 600;

	station	name	latitude	longitude	elevation	date	precipitation	temperature_max	temperature_min	temperature_obs
0	USC00451233	CEDAR LAKE, WA US	47.41440	-121.75610	480.4	2023-12-05	2.98	52	43	51
1	USC00450872	BREMERTON, WA US	47.56880	-122.68270	36.6	2023-12-05	5.00	59	44	53
2	USC00455525	MONROE, WA US	47.84500	-121.99556	38.1	2023-12-05	2.68	59	49	50
3	USC00458508	TOLT SOUTH FORK RESERVOIR, WA US	47.70000	-121.69080	609.6	2023-01-01	0.24	41	33	34
4	USC00458508	TOLT SOUTH FORK RESERVOIR, WA US	47.70000	-121.69080	609.6	2023-01-02	0.00	39	33	37
...	...	...	...	...	...	...	...	...	...	...
361	USC00458508	TOLT SOUTH FORK RESERVOIR, WA US	47.70000	-121.69080	609.6	2023-12-28	0.36	44	42	44
362	USC00458508	TOLT SOUTH FORK RESERVOIR, WA US	47.70000	-121.69080	609.6	2023-12-29	0.02	54	44	54
363	USC00458508	TOLT SOUTH FORK RESERVOIR, WA US	47.70000	-121.69080	609.6	2023-12-30	0.00	54	44	45
364	USC00458508	TOLT SOUTH FORK RESERVOIR, WA US	47.70000	-121.69080	609.6	2023-12-31	0.62	47	40	41
365	USC00456295	PALMER 3 ESE, WA US	47.30889	-121.85556	268.8	2023-12-05	3.10	50	46	46

366 rows × 10 columns

Note

In DuckDB, strings are indicated with single quotes, like so: 'my string value', and column names with double quotes, like so: "my column name". You’ll only need to use double quotes for your column names if they contain spaces or special characters.

Exercise 1.02

Filter rows where the station name is 'TACOMA NUMBER 1, WA US'.

# Uncomment and run to show solution
# !cat ./answers/answer_1.02.sql

Pick the Columns that You Want#

Sometimes, you may only want to see specific columns. For example, if you only want to see the temperature_max and temperature_min columns, you can run this query:

%%dql
SELECT name, date, temperature_min, temperature_max FROM weather;

	name	date	temperature_min	temperature_max
0	TACOMA NUMBER 1, WA US	2023-01-01	38	47
1	TACOMA NUMBER 1, WA US	2023-01-02	32	47
2	TACOMA NUMBER 1, WA US	2023-01-03	36	46
3	TACOMA NUMBER 1, WA US	2023-01-04	34	50
4	TACOMA NUMBER 1, WA US	2023-01-05	45	61
...	...	...	...	...
2635	PALMER 3 ESE, WA US	2023-12-27	42	48
2636	PALMER 3 ESE, WA US	2023-12-28	41	51
2637	PALMER 3 ESE, WA US	2023-12-29	41	55
2638	PALMER 3 ESE, WA US	2023-12-30	41	51
2639	PALMER 3 ESE, WA US	2023-12-31	35	44

2640 rows × 4 columns

Exercise 1.03

Run a DESCRIBE query on the weather table to inspect the column names, and try selecting a few different ones! For example, select the name, date, elevation, precipitation, and/or temperature_obs columns.

# Uncomment and run to show solution
# !cat ./answers/answer_1.03.sql

Exercise 1.04

Select the temperature_max and temperature_min columns, and filter down to only see the rows where both of those values are under 60 and above 50.

# Uncomment and run to show solution
# !cat ./answers/answer_1.04.sql

Add a calculated Column#

You can also add a calculated column to your results. For example, if you want to calculate the average of two columns, temperature_max and temperature_min, you can do this:

%%dql
SELECT name, date, (temperature_max + temperature_min) / 2 AS mean_temperature
FROM weather;

	name	date	mean_temperature
0	TACOMA NUMBER 1, WA US	2023-01-01	42.5
1	TACOMA NUMBER 1, WA US	2023-01-02	39.5
2	TACOMA NUMBER 1, WA US	2023-01-03	41.0
3	TACOMA NUMBER 1, WA US	2023-01-04	42.0
4	TACOMA NUMBER 1, WA US	2023-01-05	53.0
...	...	...	...
2635	PALMER 3 ESE, WA US	2023-12-27	45.0
2636	PALMER 3 ESE, WA US	2023-12-28	46.0
2637	PALMER 3 ESE, WA US	2023-12-29	48.0
2638	PALMER 3 ESE, WA US	2023-12-30	46.0
2639	PALMER 3 ESE, WA US	2023-12-31	39.5

2640 rows × 3 columns

This command creates a new column called mean_temperature that contains the average of temperature_min and temperature_max.

Exercise 1.05

Add a new calculated column called temperature_range that gets the difference between temperature_max and temperature_min columns.

# Uncomment and run to show solution
# !cat ./answers/answer_1.05.sql

Exercise 1.06

Create a new calculated column, temperature_obs_celcius, that converts the observed temperature to °C using the equation: (32°F − 32) × 5/9 = 0°C.

# Uncomment and run to show solution
# !cat ./answers/answer_1.06.sql

Order Rows (ORDER BY Clause)#

To sort the rows based on a specific column, you can use the ORDER BY clause. For example, if you want to order the observations by the amount of precipitation with the rainiest days on top, you can run this query:

%%dql
SELECT name, date, precipitation, (temperature_max + temperature_min) / 2 AS mean_temperature
FROM weather
ORDER BY precipitation DESC;

	name	date	precipitation	mean_temperature
0	BREMERTON, WA US	2023-12-05	5.00	51.5
1	PALMER 3 ESE, WA US	2023-12-05	3.10	48.0
2	CEDAR LAKE, WA US	2023-12-05	2.98	47.5
3	TOLT SOUTH FORK RESERVOIR, WA US	2023-12-06	2.96	49.5
4	MONROE, WA US	2023-12-05	2.68	54.0
...	...	...	...	...
2635	PALMER 3 ESE, WA US	2023-12-20	0.00	42.0
2636	PALMER 3 ESE, WA US	2023-12-21	0.00	45.5
2637	PALMER 3 ESE, WA US	2023-12-23	0.00	33.0
2638	PALMER 3 ESE, WA US	2023-12-24	0.00	38.0
2639	PALMER 3 ESE, WA US	2023-12-29	0.00	48.0

2640 rows × 4 columns

This command sorts the rows by the precipitation column in descending order.

Exercise 1.07

Use the query you created in the previous exercise and order the rows by precipitation in ascending order.

# Uncomment and run to show solution
# !cat ./answers/answer_1.07.sql

Exercise 1.08

Get the station name, date, temperature_obs and precipitation, and sort the table such that the row with the lowest temperature observed is at the top of the result table.

# Uncomment and run to show solution
# !cat ./answers/answer_1.08.sql