Open In Colab

1. Learn to quack SQL with DuckDB: The Basics#

Today, we’ll cover some basic operations in DuckDB SQL. DuckDB is a high-performance analytical database system designed for easy integration with data science workflows. It is particularly well-suited for in-memory processing and can handle complex queries efficiently. Learning SQL is a great skill because it allows you to manage and analyze large datasets quickly and effectively.

Let’s dive in and learn how to load data, grab a whole table, pick specific columns, add a calculated column, and filter rows using the WHERE clause.

To start off, install the latest version of duckdb and magic-duckdb to run this notebook.

!pip install --upgrade duckdb magic-duckdb --quiet
%load_ext magic_duckdb

Note

If you are in Google Collab or a Jupyter notebook, remember to put the magic %%dql directive in the beginning of each SQL cell.

Basic operations#

Let’s get started with inspecting some data! We’ll use the washington_weather.csv dataset.

To download the dataset directly from GitHub, run:

!wget https://raw.githubusercontent.com/MotherDuck-Open-Source/sql-tutorial/main/data/washington_weather.csv -q
!wget https://raw.githubusercontent.com/MotherDuck-Open-Source/sql-tutorial/main/answers/answers_1.zip -q 
!unzip -o answers_1.zip -d answers 
Archive:  answers_1.zip
  inflating: answers/answer_1.01.sql  
  inflating: answers/answer_1.02.sql  
  inflating: answers/answer_1.03.sql  
  inflating: answers/answer_1.04.sql  
  inflating: answers/answer_1.05.sql  
  inflating: answers/answer_1.06.sql  
  inflating: answers/answer_1.07.sql  
  inflating: answers/answer_1.08.sql  

Create a new table from a CSV file#

DuckDB makes it very easy to load data from a CSV file. To create a new table from a file, run:

%%dql
CREATE TABLE weather AS SELECT * FROM read_csv('washington_weather.csv');
Count
0 2640

In general, it’s easy to create a new table! The syntax CREATE TABLE <name> AS ... lets you create a new table using any query. If you want to overwrite an existing table, you can use the CREATE OR REPLACE TABLE <name> AS ... syntax. For more information about the CREATE TABLE syntax, see the docs.

Exercise 1.01

Recreate the table called weather by selecting all columns in the washington_weather.csv file.

# Uncomment and run to show solution
# !cat ./answers/answer_1.01.sql

Describe the table#

You can now describe the table to learn its structure:

%%dql
DESCRIBE weather;
column_name column_type null key default extra
0 station VARCHAR YES None None None
1 name VARCHAR YES None None None
2 latitude DOUBLE YES None None None
3 longitude DOUBLE YES None None None
4 elevation DOUBLE YES None None None
5 date DATE YES None None None
6 precipitation DOUBLE YES None None None
7 temperature_max BIGINT YES None None None
8 temperature_min BIGINT YES None None None
9 temperature_obs BIGINT YES None None None

This returns a table that shows you details about the columns, such as the column name and its type.

Grab the Whole Table#

To see all the data in the weather table, you can use the following SQL query:

%%dql
SELECT * FROM weather;
station name latitude longitude elevation date precipitation temperature_max temperature_min temperature_obs
0 USC00458278 TACOMA NUMBER 1, WA US 47.24720 -122.41220 7.6 2023-01-01 0.04 47.0 38.0 38
1 USC00458278 TACOMA NUMBER 1, WA US 47.24720 -122.41220 7.6 2023-01-02 0.00 47.0 32.0 36
2 USC00458278 TACOMA NUMBER 1, WA US 47.24720 -122.41220 7.6 2023-01-03 0.05 46.0 36.0 38
3 USC00458278 TACOMA NUMBER 1, WA US 47.24720 -122.41220 7.6 2023-01-04 0.10 50.0 34.0 45
4 USC00458278 TACOMA NUMBER 1, WA US 47.24720 -122.41220 7.6 2023-01-05 0.05 61.0 45.0 55
... ... ... ... ... ... ... ... ... ... ...
2635 USC00456295 PALMER 3 ESE, WA US 47.30889 -121.85556 268.8 2023-12-27 0.03 48.0 42.0 46
2636 USC00456295 PALMER 3 ESE, WA US 47.30889 -121.85556 268.8 2023-12-28 0.01 51.0 41.0 50
2637 USC00456295 PALMER 3 ESE, WA US 47.30889 -121.85556 268.8 2023-12-29 0.00 55.0 41.0 50
2638 USC00456295 PALMER 3 ESE, WA US 47.30889 -121.85556 268.8 2023-12-30 0.86 51.0 41.0 42
2639 USC00456295 PALMER 3 ESE, WA US 47.30889 -121.85556 268.8 2023-12-31 0.10 44.0 35.0 35

2640 rows × 10 columns

This query selects all columns and rows from the weather table.

Filter Rows (WHERE Clause)#

To filter rows based on certain conditions, you can use the WHERE clause. For example, if you only want to see the dates where a temperature higher than 82 was observed, you can run this query:

%%dql
SELECT * FROM weather WHERE temperature_obs > 82;
station name latitude longitude elevation date precipitation temperature_max temperature_min temperature_obs
0 USC00458278 TACOMA NUMBER 1, WA US 47.2472 -122.41220 7.6 2023-08-14 0.0 87 56 84
1 USC00458278 TACOMA NUMBER 1, WA US 47.2472 -122.41220 7.6 2023-08-15 0.0 90 65 84
2 USC00454486 LANDSBURG, WA US 47.3766 -121.96130 163.1 2023-05-13 0.0 91 50 83
3 USC00454486 LANDSBURG, WA US 47.3766 -121.96130 163.1 2023-05-14 0.0 90 46 85
4 USC00455525 MONROE, WA US 47.8450 -121.99556 38.1 2023-05-14 0.0 90 48 85
5 USC00455525 MONROE, WA US 47.8450 -121.99556 38.1 2023-08-14 0.0 91 60 84
6 USC00455525 MONROE, WA US 47.8450 -121.99556 38.1 2023-08-15 0.0 93 52 85
7 USC00458508 TOLT SOUTH FORK RESERVOIR, WA US 47.7000 -121.69080 609.6 2023-08-14 0.0 88 67 85

This command selects all columns from the weather table, but only includes rows where the observed temperature is greater than 82°F.

To combine filters for two or more different columns, you can use AND or OR:

%%dql
SELECT * FROM weather WHERE precipitation > 2.5 OR elevation > 600;
station name latitude longitude elevation date precipitation temperature_max temperature_min temperature_obs
0 USC00451233 CEDAR LAKE, WA US 47.4144 -121.75610 480.4 2023-12-05 2.98 52 43 51
1 USC00450872 BREMERTON, WA US 47.5688 -122.68270 36.6 2023-12-05 5.00 59 44 53
2 USC00455525 MONROE, WA US 47.8450 -121.99556 38.1 2023-12-05 2.68 59 49 50
3 USC00458508 TOLT SOUTH FORK RESERVOIR, WA US 47.7000 -121.69080 609.6 2023-01-01 0.24 41 33 34
4 USC00458508 TOLT SOUTH FORK RESERVOIR, WA US 47.7000 -121.69080 609.6 2023-01-02 0.00 39 33 37
... ... ... ... ... ... ... ... ... ... ...
361 USC00458508 TOLT SOUTH FORK RESERVOIR, WA US 47.7000 -121.69080 609.6 2023-12-27 0.03 45 38 44
362 USC00458508 TOLT SOUTH FORK RESERVOIR, WA US 47.7000 -121.69080 609.6 2023-12-28 0.36 44 42 44
363 USC00458508 TOLT SOUTH FORK RESERVOIR, WA US 47.7000 -121.69080 609.6 2023-12-29 0.02 54 44 54
364 USC00458508 TOLT SOUTH FORK RESERVOIR, WA US 47.7000 -121.69080 609.6 2023-12-30 0.00 54 44 45
365 USC00458508 TOLT SOUTH FORK RESERVOIR, WA US 47.7000 -121.69080 609.6 2023-12-31 0.62 47 40 41

366 rows × 10 columns

Note

In DuckDB, strings are indicated with single quotes, like so: 'my string value', and column names with double quotes, like so: "my column name". You’ll only need to use double quotes for your column names if they contain spaces or special characters.

Exercise 1.02

Filter rows where the station name is 'TACOMA NUMBER 1, WA US'.

# Uncomment and run to show solution
# !cat ./answers/answer_1.02.sql

Pick the Columns that You Want#

Sometimes, you may only want to see specific columns. For example, if you only want to see the temperature_max and temperature_min columns, you can run this query:

%%dql
SELECT name, date, temperature_min, temperature_max FROM weather;
name date temperature_min temperature_max
0 TACOMA NUMBER 1, WA US 2023-01-01 38.0 47.0
1 TACOMA NUMBER 1, WA US 2023-01-02 32.0 47.0
2 TACOMA NUMBER 1, WA US 2023-01-03 36.0 46.0
3 TACOMA NUMBER 1, WA US 2023-01-04 34.0 50.0
4 TACOMA NUMBER 1, WA US 2023-01-05 45.0 61.0
... ... ... ... ...
2635 PALMER 3 ESE, WA US 2023-12-27 42.0 48.0
2636 PALMER 3 ESE, WA US 2023-12-28 41.0 51.0
2637 PALMER 3 ESE, WA US 2023-12-29 41.0 55.0
2638 PALMER 3 ESE, WA US 2023-12-30 41.0 51.0
2639 PALMER 3 ESE, WA US 2023-12-31 35.0 44.0

2640 rows × 4 columns

Exercise 1.03

Run a DESCRIBE query on the weather table to inspect the column names, and try selecting a few different ones! For example, select the name, date, elevation, precipitation, and/or temperature_obs columns.

# Uncomment and run to show solution
# !cat ./answers/answer_1.03.sql

Exercise 1.04

Select the temperature_max and temperature_min columns, and filter down to only see the rows where both of those values are under 60 and above 50.

# Uncomment and run to show solution
# !cat ./answers/answer_1.04.sql

Add a calculated Column#

You can also add a calculated column to your results. For example, if you want to calculate the average of two columns, temperature_max and temperature_min, you can do this:

%%dql
SELECT name, date, (temperature_max + temperature_min) / 2 AS mean_temperature
FROM weather;
name date mean_temperature
0 TACOMA NUMBER 1, WA US 2023-01-01 42.5
1 TACOMA NUMBER 1, WA US 2023-01-02 39.5
2 TACOMA NUMBER 1, WA US 2023-01-03 41.0
3 TACOMA NUMBER 1, WA US 2023-01-04 42.0
4 TACOMA NUMBER 1, WA US 2023-01-05 53.0
... ... ... ...
2635 PALMER 3 ESE, WA US 2023-12-27 45.0
2636 PALMER 3 ESE, WA US 2023-12-28 46.0
2637 PALMER 3 ESE, WA US 2023-12-29 48.0
2638 PALMER 3 ESE, WA US 2023-12-30 46.0
2639 PALMER 3 ESE, WA US 2023-12-31 39.5

2640 rows × 3 columns

This command creates a new column called mean_temperature that contains the average of temperature_min and temperature_max.

Exercise 1.05

Add a new calculated column called temperature_range that gets the difference between temperature_max and temperature_min columns.

# Uncomment and run to show solution
# !cat ./answers/answer_1.05.sql

Exercise 1.06

Create a new calculated column, temperature_obs_celcius, that converts the observed temperature to °C using the equation: (32°F 32) × 5/9 = 0°C.

# Uncomment and run to show solution
# !cat ./answers/answer_1.06.sql

Order Rows (ORDER BY Clause)#

To sort the rows based on a specific column, you can use the ORDER BY clause. For example, if you want to order the observations by the amount of precipitation with the rainiest days on top, you can run this query:

%%dql
SELECT name, date, precipitation, (temperature_max + temperature_min) / 2 AS mean_temperature
FROM weather
ORDER BY precipitation DESC;
name date precipitation mean_temperature
0 BREMERTON, WA US 2023-12-05 5.00 51.5
1 PALMER 3 ESE, WA US 2023-12-05 3.10 48.0
2 CEDAR LAKE, WA US 2023-12-05 2.98 47.5
3 TOLT SOUTH FORK RESERVOIR, WA US 2023-12-06 2.96 49.5
4 MONROE, WA US 2023-12-05 2.68 54.0
... ... ... ... ...
2635 PALMER 3 ESE, WA US 2023-12-20 0.00 42.0
2636 PALMER 3 ESE, WA US 2023-12-21 0.00 45.5
2637 PALMER 3 ESE, WA US 2023-12-23 0.00 33.0
2638 PALMER 3 ESE, WA US 2023-12-24 0.00 38.0
2639 PALMER 3 ESE, WA US 2023-12-29 0.00 48.0

2640 rows × 4 columns

This command sorts the rows by the precipitation column in descending order.

Exercise 1.07

Use the query you created in the previous exercise and order the rows by precipitation in ascending order.

# Uncomment and run to show solution
# !cat ./answers/answer_1.07.sql

Exercise 1.08

Get the station name, date, temperature_obs and precipitation, and sort the table such that the row with the lowest temperature observed is at the top of the result table.

# Uncomment and run to show solution
# !cat ./answers/answer_1.08.sql