4. Collaborating with data in the Cloud#
To start off, install the latest versions of duckdb, magic-duckdb and dash to run this notebook.
!pip install --upgrade duckdb magic-duckdb dash -q
%load_ext magic_duckdb
We’re also going to create a helper variable IN_COLAB to check whether we’re running in Google Colab. This will come in handy later.
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False
This cell downloads the answers for the exercises.
!wget https://raw.githubusercontent.com/MotherDuck-Open-Source/sql-tutorial/main/answers/answers_4.zip -q
!unzip -o answers_4.zip -d answers
Archive: answers_4.zip
inflating: answers/answer_4.02.sql
inflating: answers/answer_4.03.sql
inflating: answers/answer_4.04.sql
inflating: answers/answer_4.05.sql
Sign up for MotherDuck#
If you haven’t already done it, sign up for MotherDuck.
To connect to MotherDuck, all you need to do is connect to a duckdb database! Your MotherDuck databases will be accessible with the md: prefix. For example, to connect to the sample_data database and show the tables, uncomment the following lines and run:
# import duckdb
# con = duckdb.connect("md:sample_data")
However, this will throw an error! You actually need to specify your authentication token to connect to MotherDuck.
To do so, you can copy your token from MotherDuck and add it to your notebook’s “Secrets”.
If you are using Google Colab, you can click on the “Secrets” tab and add a new “token” secret there. See how to do that in the screenshot below.

Now, you can get your token from the secrets manager and load it into an environment variable. After this, you can connect to MotherDuck without any extra authentication steps!
import os

if IN_COLAB:
    from google.colab import userdata
    os.environ["motherduck_token"] = userdata.get('token')
If you’re running in a Jupyter Notebook elsewhere, you can uncomment and run the following, and paste your token in the input field:
# import getpass
# os.environ["motherduck_token"] = getpass.getpass(prompt='MotherDuck token: ')
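As an aside, MotherDuck also accepts the token directly in the connection string via a motherduck_token query parameter. The sketch below (not required by the tutorial) builds such a string and only connects when a token is actually set, so it runs safely in any environment:

```python
import os

# build the connection string; the query parameter is redundant when the
# motherduck_token environment variable is already set, but makes the
# token explicit
token = os.environ.get("motherduck_token")
conn_str = f"md:my_db?motherduck_token={token}" if token else "md:my_db"

# connecting requires a valid token, so guard it for token-less environments
if token:
    import duckdb
    con = duckdb.connect(conn_str)
```
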
Exercise 4.01
Create a connection to MotherDuck and show all tables in your sample_data database. You can use the SHOW TABLES command that is documented here.
Run a query against DuckDB in the Cloud#
You’re now all set to query your cloud data warehouse! One example in the sample_data database is the service_requests table, which contains New York City 311 Service Requests, i.e. complaints submitted to the city’s service line from 2010 to the present.
To query the data, you’ll want to fully specify the table name with the following format:
<database name>.<schema>.<table name>
For example, you can run the cell below to get the service requests between March 27th and 31st of 2022:
%%dql -co con
SELECT
created_date, agency_name, complaint_type,
descriptor, incident_address, resolution_description
FROM
sample_data.nyc.service_requests
WHERE
created_date >= '2022-03-27' AND
created_date <= '2022-03-31';
| | created_date | agency_name | complaint_type | descriptor | incident_address | resolution_description |
|---|---|---|---|---|---|---|
| 0 | 2022-03-29 09:07:58 | Department of Sanitation | Graffiti | Graffiti | 80-06 101 AVENUE | The City has removed the graffiti from this pr... |
| 1 | 2022-03-30 09:59:06 | Department of Sanitation | Graffiti | Graffiti | 180-09 LIBERTY AVENUE | The City has removed the graffiti from this pr... |
| 2 | 2022-03-27 09:02:19 | Department of Sanitation | Illegal Dumping | Removal Request | EAST 241 STREET | The Department of Sanitation cleaned the locat... |
| 3 | 2022-03-27 12:56:16 | Department of Transportation | Street Condition | Pothole | EAST 77 STREET | The Department of Transportation inspected thi... |
| 4 | 2022-03-27 18:10:00 | Department of Transportation | Traffic Signal Condition | Controller | None | Service Request status for this request is ava... |
| ... | ... | ... | ... | ... | ... | ... |
| 33510 | 2022-03-27 01:48:27 | Department of Buildings | Building/Use | Illegal Conversion Of Residential Building/Space | 30 VAN BUREN STREET | The Department of Buildings attempted to inves... |
| 33511 | 2022-03-30 17:30:09 | Department of Buildings | Building/Use | Illegal Conversion Of Residential Building/Space | 28 CURTIS PLACE | The Department of Buildings attempted to inves... |
| 33512 | 2022-03-30 17:53:28 | Department of Health and Mental Hygiene | Indoor Air Quality | Other (Explain Below) | 177-49 105 AVENUE | None |
| 33513 | 2022-03-28 18:03:05 | Department of Buildings | Building/Use | Illegal Conversion Of Residential Building/Space | 119-20 194 STREET | The Department of Buildings attempted to inves... |
| 33514 | 2022-03-30 08:19:49 | Department of Sanitation | Graffiti | Graffiti | 105-01 JAMAICA AVENUE | The City has removed the graffiti from this pr... |

33515 rows × 6 columns
Exercise 4.02
Run DESCRIBE on the sample_data.who.ambient_air_quality table to inspect the column names. Write a query that gets the average concentrations of PM10 and PM2.5 particles for the 'United States of America', for the last 10 years, grouped and ordered by year.
# Uncomment and run to show solution
# !cat ./answers/answer_4.02.sql
Load data from Huggingface#
Now, let’s try to load some data from an external source into MotherDuck. HuggingFace recently released an extension for DuckDB that lets you access and query their entire datasets library!
To query a HuggingFace dataset, you can run:
%%dql -co con
SELECT * FROM read_parquet('hf://datasets/datonic/threatened_animal_species/data/threatened_animal_species.parquet');
| | taxonid | kingdom_name | phylum_name | class_name | order_name | family_name | genus_name | scientific_name | taxonomic_authority | infra_rank | infra_name | population | category | main_common_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31630 | PLANTAE | TRACHEOPHYTA | MAGNOLIOPSIDA | CARYOPHYLLALES | NYCTAGINACEAE | Pisonia | Pisonia ekmanii | Heimerl | None | None | None | EN | chicharron sapo |
| 1 | 31631 | PLANTAE | TRACHEOPHYTA | MAGNOLIOPSIDA | MALPIGHIALES | SALICACEAE | Lunania | Lunania dodecandra | C.Wright ex Griseb. | None | None | None | EN | None |
| 2 | 31632 | PLANTAE | TRACHEOPHYTA | MAGNOLIOPSIDA | MALPIGHIALES | SALICACEAE | Lunania | Lunania elongata | Britton & P.Wilson | None | None | None | EN | None |
| 3 | 31633 | PLANTAE | TRACHEOPHYTA | MAGNOLIOPSIDA | ERICALES | SAPOTACEAE | Chrysophyllum | Chrysophyllum claraense | Urban | None | None | None | CR | None |
| 4 | 31635 | PLANTAE | TRACHEOPHYTA | MAGNOLIOPSIDA | MALVALES | MALVACEAE | Carpodiptera | Carpodiptera ophiticola | Bisse | None | None | None | EN | None |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 156498 | 258432990 | FUNGI | BASIDIOMYCOTA | AGARICOMYCETES | BOLETALES | BOLETACEAE | Durianella | Durianella echinulata | (Corner & Hawker) Desjardin, A.W. Wilson &... | None | None | None | VU | None |
| 156499 | 258721221 | PLANTAE | TRACHEOPHYTA | MAGNOLIOPSIDA | MAGNOLIALES | ANNONACEAE | Uvariodendron | Uvariodendron fuscum | (Benth.) R.E.Fr. | None | None | None | LC | None |
| 156500 | 258981334 | PLANTAE | TRACHEOPHYTA | MAGNOLIOPSIDA | SAPINDALES | ANACARDIACEAE | Mangifera | Mangifera similis | Blume | None | None | None | LC | None |
| 156501 | 259047450 | PLANTAE | TRACHEOPHYTA | MAGNOLIOPSIDA | FABALES | FABACEAE | Ateleia | Ateleia gummifera | (Bertero ex DC.) D.Dietr. | None | None | None | LC | Barbinegra |
| 156502 | 259121878 | ANIMALIA | CHORDATA | ACTINOPTERYGII | CYPRINIFORMES | NEMACHEILIDAE | Barbatula | Barbatula barbatula | (Linnaeus, 1758) | None | None | None | LC | European Stone Loach |

156503 rows × 14 columns
Before we create a new table with this data, let’s first switch to a different database. You can do so by creating a new DuckDB connection, or by changing the database with the USE statement. For example, to connect to your default database, my_db, run:
%%dql -co con
USE my_db;
| Success |
|---|
Exercise 4.03
Create a new table called animals in your MotherDuck database md:my_db based on the datonic/threatened_animal_species dataset.
# Uncomment and run to show solution
# !cat ./answers/answer_4.03.sql
Exercise 4.04
DuckDB releases are each named after a duck! Let’s load this data into a new table called duckdb_ducks. You can use read_csv to load the data directly from the HTTP URL: https://duckdb.org/data/duckdb-releases.csv.
# Uncomment and run to show solution
# !cat ./answers/answer_4.04.sql
Visualize your data#
Now that your data is in the cloud and easy to share, you can also create simple web apps that load and plot the data! Here is an example Dash app that you can run to plot data in the sample_data database.
import duckdb
from dash import Dash, html, dcc, callback, Output, Input
import plotly.express as px

con = duckdb.connect("md:sample_data")
countries = con.sql('SELECT DISTINCT country_name AS countries FROM who.ambient_air_quality ORDER BY country_name')

app = Dash()

app.layout = [
    html.H1(children='Air quality by country', style={'textAlign': 'center', 'font-family': 'monospace'}),
    dcc.Dropdown(countries.df().countries.values.tolist(), 'Canada', id='dropdown-selection'),
    dcc.Graph(id='graph-content')
]

@callback(
    Output('graph-content', 'figure'),
    Input('dropdown-selection', 'value')
)
def update_graph(value):
    # parameterized query: the selected country is bound via the ? placeholder
    sql = "SELECT year, avg(pm25_concentration) AS avg_pm25 FROM who.ambient_air_quality WHERE country_name = ? GROUP BY year ORDER BY year"
    result = con.execute(sql, [value]).df()
    return px.line(result, x='year', y='avg_pm25')

if __name__ == '__main__':
    app.run(debug=True)
Exercise (bonus)
Create a dashboard that plots the data your neighbor shared with you. Share it in the #scipy-2024 Slack channel so others can give it a try!