
4. Collaborating with data in the Cloud#

To start off, install the latest versions of duckdb, magic-duckdb, and dash to run this notebook.

!pip install --upgrade duckdb magic-duckdb dash -q
%load_ext magic_duckdb

We’re also going to create a helper variable IN_COLAB to check whether we’re running in Google Colab. This will come in handy later.

try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

This cell downloads the answers for the exercises.

!wget https://raw.githubusercontent.com/MotherDuck-Open-Source/sql-tutorial/main/answers/answers_4.zip -q 
!unzip -o answers_4.zip -d answers 
Archive:  answers_4.zip
  inflating: answers/answer_4.02.sql  
  inflating: answers/answer_4.03.sql  
  inflating: answers/answer_4.04.sql  
  inflating: answers/answer_4.05.sql  

Sign up for MotherDuck#

If you haven’t already done so, sign up for MotherDuck.

To connect to MotherDuck, all you need to do is connect to a DuckDB database! Your MotherDuck databases will be accessible with the md: prefix. For example, to connect to the sample_data database, run the following:

import duckdb
con = duckdb.connect("md:sample_data")

However, this will throw an error! You actually need to specify your authentication token to connect to MotherDuck.

To do so, you can copy your token from MotherDuck and add it to your notebook’s “Secrets”.

If you are using Google Colab, you can click on the “Secrets” tab in the left sidebar and add a new secret named token there.

Now, you can get your token from the secrets manager and load it into an environment variable. After this, you can connect to MotherDuck without any extra authentication steps!

import os

if IN_COLAB:
  from google.colab import userdata
  os.environ["motherduck_token"] = userdata.get('token')

If you’re running in a Jupyter Notebook elsewhere, you can uncomment and run the following, and paste your token in the input field:

# import getpass
# os.environ["motherduck_token"] = getpass.getpass(prompt='MotherDuck token: ')

Exercise 4.01

Create a connection to MotherDuck and show all tables in your sample_data database. You can use the SHOW TABLES command that is documented here.
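
If you want to check your approach, here is a minimal sketch (it assumes the motherduck_token environment variable was set in the cells above):

con = duckdb.connect("md:sample_data")  # picks up the token from the environment
con.sql("SHOW TABLES").show()           # list the tables in the sample_data database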

Run a query against DuckDB in the Cloud#

You are now all ready to go and query your Cloud data warehouse! One example in the sample_data database is the service_requests table, which contains New York City 311 Service Requests: requests made to the city’s complaint service from 2010 to the present.

To query the data, you’ll want to fully specify the table name with the following format:

<database name>.<schema>.<table name>

For example, you can run the cell below to get the service requests between March 27th and 31st of 2022:

%%dql -co con
SELECT
  created_date, agency_name, complaint_type,
  descriptor, incident_address, resolution_description
FROM
  sample_data.nyc.service_requests
WHERE
  created_date >= '2022-03-27' AND
  created_date <= '2022-03-31';
created_date agency_name complaint_type descriptor incident_address resolution_description
0 2022-03-29 09:07:58 Department of Sanitation Graffiti Graffiti 80-06 101 AVENUE The City has removed the graffiti from this pr...
1 2022-03-30 09:59:06 Department of Sanitation Graffiti Graffiti 180-09 LIBERTY AVENUE The City has removed the graffiti from this pr...
2 2022-03-27 09:02:19 Department of Sanitation Illegal Dumping Removal Request EAST 241 STREET The Department of Sanitation cleaned the locat...
3 2022-03-27 12:56:16 Department of Transportation Street Condition Pothole EAST 77 STREET The Department of Transportation inspected thi...
4 2022-03-27 18:10:00 Department of Transportation Traffic Signal Condition Controller None Service Request status for this request is ava...
... ... ... ... ... ... ...
33510 2022-03-27 01:48:27 Department of Buildings Building/Use Illegal Conversion Of Residential Building/Space 30 VAN BUREN STREET The Department of Buildings attempted to inves...
33511 2022-03-30 17:30:09 Department of Buildings Building/Use Illegal Conversion Of Residential Building/Space 28 CURTIS PLACE The Department of Buildings attempted to inves...
33512 2022-03-30 17:53:28 Department of Health and Mental Hygiene Indoor Air Quality Other (Explain Below) 177-49 105 AVENUE None
33513 2022-03-28 18:03:05 Department of Buildings Building/Use Illegal Conversion Of Residential Building/Space 119-20 194 STREET The Department of Buildings attempted to inves...
33514 2022-03-30 08:19:49 Department of Sanitation Graffiti Graffiti 105-01 JAMAICA AVENUE The City has removed the graffiti from this pr...

33515 rows × 6 columns

Exercise 4.02

Run DESCRIBE on the sample_data.who.ambient_air_quality table to inspect the column names. Write a query that gets the average concentrations of PM10 and PM2.5 particles for the 'United States of America', for the last 10 years, grouped and ordered by year.

# Uncomment and run to show solution
# !cat ./answers/answer_4.02.sql
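
For comparison, here is one possible sketch of the query. It assumes the relevant columns are named country_name, year, pm10_concentration and pm25_concentration (the DESCRIBE output will confirm this), and reads “the last 10 years” as the 10 most recent years in the data:

%%dql -co con
SELECT
  year,
  avg(pm10_concentration) AS avg_pm10,
  avg(pm25_concentration) AS avg_pm25
FROM sample_data.who.ambient_air_quality
WHERE country_name = 'United States of America'
GROUP BY year
ORDER BY year DESC
LIMIT 10;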

Load data from Hugging Face#

Now, let’s try to load some data from a data source into MotherDuck. Hugging Face has recently released an extension for DuckDB that lets you access and query their entire datasets library!

To query a HuggingFace dataset, you can run:

%%dql -co con
SELECT * FROM read_parquet('hf://datasets/datonic/threatened_animal_species/data/threatened_animal_species.parquet');
taxonid kingdom_name phylum_name class_name order_name family_name genus_name scientific_name taxonomic_authority infra_rank infra_name population category main_common_name
0 31630 PLANTAE TRACHEOPHYTA MAGNOLIOPSIDA CARYOPHYLLALES NYCTAGINACEAE Pisonia Pisonia ekmanii Heimerl None None None EN chicharron sapo
1 31631 PLANTAE TRACHEOPHYTA MAGNOLIOPSIDA MALPIGHIALES SALICACEAE Lunania Lunania dodecandra C.Wright ex Griseb. None None None EN None
2 31632 PLANTAE TRACHEOPHYTA MAGNOLIOPSIDA MALPIGHIALES SALICACEAE Lunania Lunania elongata Britton & P.Wilson None None None EN None
3 31633 PLANTAE TRACHEOPHYTA MAGNOLIOPSIDA ERICALES SAPOTACEAE Chrysophyllum Chrysophyllum claraense Urban None None None CR None
4 31635 PLANTAE TRACHEOPHYTA MAGNOLIOPSIDA MALVALES MALVACEAE Carpodiptera Carpodiptera ophiticola Bisse None None None EN None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
156498 258432990 FUNGI BASIDIOMYCOTA AGARICOMYCETES BOLETALES BOLETACEAE Durianella Durianella echinulata (Corner & Hawker) Desjardin, A.W. Wilson &... None None None VU None
156499 258721221 PLANTAE TRACHEOPHYTA MAGNOLIOPSIDA MAGNOLIALES ANNONACEAE Uvariodendron Uvariodendron fuscum (Benth.) R.E.Fr. None None None LC None
156500 258981334 PLANTAE TRACHEOPHYTA MAGNOLIOPSIDA SAPINDALES ANACARDIACEAE Mangifera Mangifera similis Blume None None None LC None
156501 259047450 PLANTAE TRACHEOPHYTA MAGNOLIOPSIDA FABALES FABACEAE Ateleia Ateleia gummifera (Bertero ex DC.) D.Dietr. None None None LC Barbinegra
156502 259121878 ANIMALIA CHORDATA ACTINOPTERYGII CYPRINIFORMES NEMACHEILIDAE Barbatula Barbatula barbatula (Linnaeus, 1758) None None None LC European Stone Loach

156503 rows × 14 columns

Before we create a new table with this data, let’s first switch to a different database. You can do so by creating a new DuckDB connection, or by changing the database with the USE statement. For example, to connect to your default database, my_db, run:

%%dql -co con
USE my_db;
Success

Exercise 4.03

Create a new table called animals in your MotherDuck database md:my_db based on the datonic/threatened_animal_species dataset.

# Uncomment and run to show solution
# !cat ./answers/answer_4.03.sql
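
For comparison, a minimal sketch of one way to do it (assuming you have already switched to my_db with the USE statement above):

%%dql -co con
CREATE TABLE animals AS
  SELECT * FROM read_parquet('hf://datasets/datonic/threatened_animal_species/data/threatened_animal_species.parquet');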

Exercise 4.04

DuckDB releases are each named after a duck! Let’s load this data into a new table called duckdb_ducks. You can use read_csv to load the data directly from the HTTP URL: https://duckdb.org/data/duckdb-releases.csv.

# Uncomment and run to show solution
# !cat ./answers/answer_4.04.sql
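
A sketch of one possible approach, letting read_csv detect the schema directly from the HTTP URL:

%%dql -co con
CREATE TABLE duckdb_ducks AS
  SELECT * FROM read_csv('https://duckdb.org/data/duckdb-releases.csv');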

Sharing is caring: Teach your data to fly!#

Now, we have two tables that we can join together and share with our colleagues!

Let’s inspect them and take a look at the columns we have available.

%%dql -co con
DESCRIBE animals;
column_name column_type null key default extra
0 taxonid BIGINT YES None None None
1 kingdom_name VARCHAR YES None None None
2 phylum_name VARCHAR YES None None None
3 class_name VARCHAR YES None None None
4 order_name VARCHAR YES None None None
5 family_name VARCHAR YES None None None
6 genus_name VARCHAR YES None None None
7 scientific_name VARCHAR YES None None None
8 taxonomic_authority VARCHAR YES None None None
9 infra_rank VARCHAR YES None None None
10 infra_name VARCHAR YES None None None
11 population VARCHAR YES None None None
12 category VARCHAR YES None None None
13 main_common_name VARCHAR YES None None None

%%dql -co con
DESCRIBE duckdb_ducks;
column_name column_type null key default extra
0 release_date DATE YES None None None
1 version_number VARCHAR YES None None None
2 codename VARCHAR YES None None None
3 duck_species_primary VARCHAR YES None None None
4 duck_species_secondary VARCHAR YES None None None
5 duck_wikipage VARCHAR YES None None None
6 blog_post VARCHAR YES None None None

Now, we can get the endangered species status of all DuckDB ducks by joining the two.

Exercise 4.05

Create a new table called duckdb_species that joins the duckdb_ducks and animals tables on the scientific name.

# Uncomment and run to show solution
# !cat ./answers/answer_4.05.sql
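
A sketch of one possible join, assuming the duck’s scientific name is stored in the duck_species_primary column of duckdb_ducks (as the DESCRIBE output above suggests):

%%dql -co con
CREATE TABLE duckdb_species AS
  SELECT *
  FROM duckdb_ducks
  JOIN animals ON duckdb_ducks.duck_species_primary = animals.scientific_name;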

To share your database, you can run:

%%dql -co con -o df
CREATE SHARE duck_share FROM my_db (ACCESS UNRESTRICTED);
share_url
0 md:_share/my_db/9955b23e-1bff-49c5-90bd-a3d069...

Now you can print the share URL:

print(df.share_url.iloc[0])
md:_share/my_db/9955b23e-1bff-49c5-90bd-a3d0697e754d

Exercise 4.06

Check out these datasets from Hugging Face: https://huggingface.co/datasets. Pick one, create a share, and send it to your neighbor!
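
For the loading and sharing steps, a minimal sketch (the dataset path is a placeholder for whichever Parquet file you pick; the share statement mirrors the one shown above):

%%dql -co con
-- placeholder path: replace with the Parquet file of the dataset you picked
CREATE TABLE my_dataset AS
  SELECT * FROM read_parquet('hf://datasets/<user>/<dataset>/<file>.parquet');
CREATE SHARE my_dataset_share FROM my_db (ACCESS UNRESTRICTED);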

To attach a share into your Cloud data warehouse, run:

ATTACH '<share_url>';

For example, to load the Mosaic example datasets, run:

%%dql -co con
ATTACH 'md:_share/mosaic_examples/b01cfda8-239e-4148-a228-054b94cdc3b4';
Success

You can then inspect the database and query the data like so:

%%dql -co con
USE mosaic_examples;
SHOW TABLES;
name
0 complaints
1 earthquakes
2 flights_10m
3 flights_200k
4 gaia_5m
5 md
6 seattle_weather

%%dql -co con
SELECT * FROM seattle_weather;
date precipitation temp_max temp_min wind weather
0 2012-01-01 0.0 12.8 5.0 4.7 drizzle
1 2012-01-02 10.9 10.6 2.8 4.5 rain
2 2012-01-03 0.8 11.7 7.2 2.3 rain
3 2012-01-04 20.3 12.2 5.6 4.7 rain
4 2012-01-05 1.3 8.9 2.8 6.1 rain
... ... ... ... ... ... ...
1456 2015-12-27 8.6 4.4 1.7 2.9 rain
1457 2015-12-28 1.5 5.0 1.7 1.3 rain
1458 2015-12-29 0.0 7.2 0.6 2.6 fog
1459 2015-12-30 0.0 5.6 -1.0 3.4 sun
1460 2015-12-31 0.0 5.6 -2.1 3.5 sun

1461 rows × 6 columns

Exercise 4.07

Attach the share you received from your neighbor and inspect the tables.

Detaching and removing your shares#

To detach a database someone shared with you, make sure it’s not your currently selected database (switch back to my_db first), and run DETACH:

%%dql -co con
USE my_db;
DETACH mosaic_examples;
Success

To drop the share you created, simply run:

%%dql -co con
DROP SHARE duck_share;
Success
0 True

Visualize your data#

Now that your data is in the Cloud and easy to share, you can also create simple web apps that load and plot the data!

Here is an example Dash app that you can run to plot data in the sample_data database.

from dash import Dash, html, dcc, callback, Output, Input
import plotly.express as px
import duckdb

# Connect to the sample_data database in MotherDuck
con = duckdb.connect("md:sample_data")

# Fetch the list of countries for the dropdown
countries = con.sql('SELECT DISTINCT country_name AS countries FROM who.ambient_air_quality ORDER BY country_name')

app = Dash()

app.layout = [
    html.H1(children='Air quality by country', style={'textAlign': 'center', 'fontFamily': 'monospace'}),
    dcc.Dropdown(countries.df().countries.values.tolist(), 'Canada', id='dropdown-selection'),
    dcc.Graph(id='graph-content')
]

@callback(
    Output('graph-content', 'figure'),
    Input('dropdown-selection', 'value')
)
def update_graph(value):
    # Average PM2.5 concentration per year for the selected country
    sql = "SELECT year, avg(pm25_concentration) AS avg_pm25 FROM who.ambient_air_quality WHERE country_name=? GROUP BY year ORDER BY year"
    result = con.execute(sql, [value]).df()
    return px.line(result, x='year', y='avg_pm25')

if __name__ == '__main__':
    app.run(debug=True)

Exercise (bonus)

Create a dashboard that plots the data your neighbor shared with you. Share it in the #scipy-2024 Slack channel so others can give it a try!