4. Collaborating with data in the Cloud#
This cell downloads the answers for the exercises.
# run this in your Vs Code terminal
wget https://raw.githubusercontent.com/motherduckdb/sql-tutorial/main/answers/answers_4.zip -q
unzip -o answers_4.zip -d answers
Sign up for MotherDuck#
If you haven’t already done it, sign up for MotherDuck.
To connect to MotherDuck, all you need to do is connect to a duckdb
database! Your MotherDuck databases will be accessible with the md:
prefix. For example, to connect to the sample_data
database and show the tables, uncomment the following lines and run:
import duckdb
con = duckdb.connect("md:sample_data")
However, this will throw an error! You actually need to specify your authentication token to connect to MotherDuck.
To do so, you can copy your token from Motherduck and add it to your notebook “Secrets”.
Now, you can get your token from the secrets manager and load it into an environment variable. After this, you can connect to MotherDuck without any extra authentication steps!
BUT if you want to use a Marimo SQL Cell, this will simply work:
ATTACH 'md:'
Then you can click-through the authentication steps and get your token into your Marimo session.
Exercise 4.01
Create a connection to MotherDuck and show all tables in your sample_data
database. You can use the SHOW TABLES
command that is documented here.
Run a query against DuckDB in the Cloud#
You are now all ready to go and query your Cloud data warehouse! One example in the sample_data
database is the service_requests
table, which contains New York City 311 Service Requests with requests to the city’s complaint service from 2010 to the present.
To query the data, you’ll want to fully specify the table name with the following format:
<database name>.<schema>.<table name>
For example, you can run the below cell to get the service requests between March 27th and 31st of 2022:
SELECT
created_date, agency_name, complaint_type,
descriptor, incident_address, resolution_description
FROM
sample_data.nyc.service_requests
WHERE
created_date >= '2022-03-27' AND
created_date <= '2022-03-31';
Exercise 4.02
Run DESCRIBE
on the sample_data.who.ambient_air_quality
table to inspect the column names. Write a query that gets the average concentrations of PM1.0 and PM2.5 particles for the 'United States of America'
, for the last 10 years, grouped and ordered by year.
# Uncomment and run to show solution
# !cat ./answers/answer_4.02.py
Load data from Huggingface#
Now, let’s try to load some data from a data source into MotherDuck. HuggingFace has recently released an extension for DuckDB, that lets you access and query their entire datasets library!
To query a HuggingFace dataset, you can run:
SELECT * FROM read_parquet('hf://datasets/datonic/threatened_animal_species/data/threatened_animal_species.parquet');
Cell In[3], line 1
SELECT * FROM read_parquet('hf://datasets/datonic/threatened_animal_species/data/threatened_animal_species.parquet');
^
SyntaxError: invalid syntax
Before we create a new table with this data, let’s first swap to a different database. You can do so by creating a new DuckDB connection, or by changing the database with the USE
statement. For example, to connect to your default database, my_db
, run:
USE my_db;
Exercise 4.03
Create a new table called animals
in your MotherDuck database md:my_db
based on the datonic/threatened_animal_species
dataset.
# Uncomment and run to show solution
# !cat ./answers/answer_4.03.py
Exercise 4.04
DuckDB releases are each named after a duck! Let’s load this data into a new table called duckdb_ducks
. You can use read_csv
to load the data directly from the HTTP URL: https://duckdb.org/data/duckdb-releases.csv
.
# Uncomment and run to show solution
# !cat ./answers/answer_4.04.py
How do we fit AI into this?#
MotherDuck contains a set of useful AI functions that you can use interrogate your data.
A particularly useful one is PRAGMA prompt_query('<natural language question>')
- which we can use interrogate our datasets. Recall the exercise from part 3 - getting the bird with the maximum wing length? Lets do this with a bit of AI in MotherDuck.
The first step is that it must understand the data, so we can simply create a table using CTAS from our local file:
CREATE OR REPLACE TABLE birds AS
FROM 'birds.csv'
Then we can ask a question about it.
PRAGMA prompt_query('which bird has the largest wing length?')
This should return the right answer. But how can we validate it? Lets use CALL prompt_sql()
to do so!
CALL prompt_sql('which bird has the largest wing length?')
This will return the SQL query that is associated to this question, which can then be inspected and run by the user!
Further Reading#
We have written extensively about using AI with SQL. Hopefully these links will help you understand how you can better these types of capabilities into your own workflow.