Open In Colab

3. Combining SQL and Python#

A. Using duckdb from Python#

DuckDB is released with a native Python client. You can install it with a simple pip install, and there are no dependencies required.

Google Collab even has duckdb pre-installed!

We will also install a few dataframe libraries, but these are optional unless you would like to do some of your analysis outside of DuckDB!

!pip install --upgrade duckdb pandas polars pyarrow -q
!wget https://raw.githubusercontent.com/MotherDuck-Open-Source/sql-tutorial/main/data/ducks.csv -q --show-progress
!wget https://raw.githubusercontent.com/MotherDuck-Open-Source/sql-tutorial/main/data/birds.csv -q --show-progress
!wget https://raw.githubusercontent.com/MotherDuck-Open-Source/sql-tutorial/main/answers/answers_3.zip -q 
!unzip -o answers_3.zip -d answers 
ducks.csv.1           0%[                    ]       0  --.-KB/s               
ducks.csv.1         100%[===================>]   6.94K  --.-KB/s    in 0s      
birds.csv.1           0%[                    ]       0  --.-KB/s               
birds.csv.1         100%[===================>]  18.03M  --.-KB/s    in 0.06s   
Archive:  answers_3.zip
  inflating: answers/answer_3.01.py  
  inflating: answers/answer_3.02.py  
  inflating: answers/answer_3.03.py  
  inflating: answers/answer_3.04.py  

DuckDB follows the Python DB API spec, so you can use it the same way you would use another database. fetchall() returns a list of tuples.

import duckdb

duckdb.execute("SELECT 42 as hello_world").fetchall()
[(42,)]

DuckDB also has a .sql method that has some convenience features beyond .execute. We recommend using .sql!

duckdb.sql("SELECT 42 as hello_world").fetchall()
[(42,)]

B. Writing Pandas DataFrames with DuckDB#

DuckDB can also return a DataFrame directly using .df(), instead of a list of tuples!

This is much faster for large datasets, and fits nicely into existing dataframe workflows like charting (which we will see later) or machine learning.

duckdb.sql("SELECT 42 as hello_world").df()
hello_world
0 42

If that output looks familiar, it’s because we have been using Pandas DataFrames the entire time we have been using duckdb_magic! duckdb_magic returns a dataframe as the result of each SQL query.

C. Reading Pandas DataFrames#

Not only can DuckDB write dataframes, but it can read them as if they were a table!

No copying is required - DuckDB will read the existing Pandas object by scanning the C++ objects underneath Pandas’ Python objects.

For example, to create a Pandas dataframe and access it from DuckDB, you can run:

import pandas as pd
ducks_pandas = pd.read_csv('ducks.csv')

duckdb.sql("SELECT * FROM ducks_pandas").df()
name genus author year extinct
0 Mandarin duck Aix galericulata Linnaeus 1758 0
1 Wood duck Aix sponsa Linnaeus 1758 0
2 Egyptian goose Alopochen aegyptiaca Linnaeus 1766 0
3 Brazilian teal Amazonetta brasiliensis J. F. Gmelin 1789 0
4 Northern pintail Anas acuta Linnaeus 1758 0
... ... ... ... ... ...
134 Ruddy shelduck Tadorna ferruginea Pallas 1764 0
135 Common shelduck Tadorna tadorna Linnaeus 1758 0
136 Australian shelduck Tadorna tadornoides Jardine & Selby 1828 0
137 Paradise shelduck Tadorna variegata J. F. Gmelin 1789 0
138 White-backed duck Thalassornis leuconotus Eyton 1838 0

139 rows × 5 columns

When to use pd.read_csv?#

How would you decide whether to use Pandas or DuckDB to read a CSV file? There are pros to each!

D. Reading and Writing Polars and Apache Arrow#

In addition to Pandas, DuckDB is also fully interoperable with Polars and Apache Arrow.

Polars is a faster and more modern alternative to Pandas, and has a much smaller API to learn.

Apache Arrow is the industry standard tabular data transfer format. Polars is actually built on top of Apache Arrow data types. Apache Arrow and DuckDB types are highly compatible. Apache Arrow has also taken inspiration from DuckDB’s VARCHAR data type with their new STRING_VIEW type.

import polars as pl
import pyarrow as pa
import pyarrow.csv as pa_csv
ducks_polars = pl.read_csv('ducks.csv')
duckdb.sql("""SELECT * FROM ducks_polars""").pl()
shape: (139, 5)
namegenusauthoryearextinct
strstrstrstri64
"Mandarin duck""Aix galericulata""Linnaeus""1758"0
"Wood duck""Aix sponsa""Linnaeus""1758"0
"Egyptian goose""Alopochen aegyptiaca""Linnaeus""1766"0
"Brazilian teal""Amazonetta brasiliensis""J. F. Gmelin""1789"0
"Northern pintail""Anas acuta""Linnaeus""1758"0
"Ruddy shelduck""Tadorna ferruginea""Pallas""1764"0
"Common shelduck""Tadorna tadorna""Linnaeus""1758"0
"Australian shelduck""Tadorna tadornoides""Jardine & Selby""1828"0
"Paradise shelduck""Tadorna variegata""J. F. Gmelin""1789"0
"White-backed duck""Thalassornis leuconotus""Eyton""1838"0
ducks_arrow = pa_csv.read_csv('ducks.csv')
duckdb.sql("""SELECT * FROM ducks_arrow""").arrow()
pyarrow.Table
name: string
genus: string
author: string
year: string
extinct: int64
----
name: [["Mandarin duck","Wood duck","Egyptian goose","Brazilian teal","Northern pintail",...,"Ruddy shelduck","Common shelduck","Australian shelduck","Paradise shelduck","White-backed duck"]]
genus: [["Aix galericulata","Aix sponsa","Alopochen aegyptiaca","Amazonetta brasiliensis","Anas acuta",...,"Tadorna ferruginea","Tadorna tadorna","Tadorna tadornoides","Tadorna variegata","Thalassornis leuconotus"]]
author: [["Linnaeus","Linnaeus","Linnaeus","J. F. Gmelin","Linnaeus",...,"Pallas","Linnaeus","Jardine & Selby","J. F. Gmelin","Eyton"]]
year: [["1758","1758","1766","1789","1758",...,"1764","1758","1828","1789","1838"]]
extinct: [[0,0,0,0,0,...,0,0,0,0,0]]

Exercise 3.01

Read in the birds.csv file using Apache Arrow, then use the DuckDB Python library to execute a SQL statement on that Apache Arrow table to find the maximum wing_length in the dataset. Output that result as an Apache Arrow table.

# Uncomment and run to show solution
# !cat ./answers/answer_3.01.py

Exercise 3.02

Use the DuckDB Python client to return these results as a Polars dataframe.

SELECT
    Species_Common_Name,
    AVG(Beak_Width) AS Avg_Beak_Width,
    AVG(Beak_Depth) AS Avg_Beak_Depth,
    AVG(Beak_Length_Culmen) AS Avg_Beak_Length_Culmen
FROM 'birds.csv'
GROUP BY Species_Common_Name
# Uncomment and run to show solution
# !cat ./answers/answer_3.02.py

2. Using ibis with a DuckDB backend#

A. Introduction to Ibis and DuckDB#

We’ll show you how to leverage the power of DuckDB without even needing to write a single line of SQL. Instead, we’ll use Ibis, a powerful Python library that allows you to interact with databases using a DataFrame-like syntax. We’ll also show you how to combine the two so you can get the best of both worlds.

First, let’s make sure you have the necessary packages installed. You can install DuckDB and Ibis using pip:

!pip install --upgrade --quiet 'ibis-framework[duckdb]'

We are using Ibis in interactive mode for demo purposes. This converts Ibis expressions from lazily evaluated to eagerly evaluated, so it is easier to see what is happening at each step. It also converts Ibis results into Pandas dataframes for nice formatting in Jupyter.

For performance and memory reasons, we recommend not using interactive mode in production!

We can connect to a file-based DuckDB database by specifying a file path.

import ibis
from ibis import _
ibis.options.interactive = True

con = ibis.duckdb.connect(database='whats_quackalackin.duckdb')

We can read in a CSV using Ibis, and it will use the DuckDB read_csv_auto function under the hood. This way we get both DuckDB’s performance, and clean Python syntax.

ducks_ibis = ibis.read_csv('ducks.csv')
ducks_ibis
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ name                   genus                    author                  year    extinct ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ stringstringstringstringint64   │
├───────────────────────┼─────────────────────────┼────────────────────────┼────────┼─────────┤
│ Mandarin duck        Aix galericulata       Linnaeus              1758  0 │
│ Wood duck            Aix sponsa             Linnaeus              1758  0 │
│ Egyptian goose       Alopochen aegyptiaca   Linnaeus              1766  0 │
│ Brazilian teal       Amazonetta brasiliensisJ. F. Gmelin          1789  0 │
│ Northern pintail     Anas acuta             Linnaeus              1758  0 │
│ Andaman teal         Anas albogularis       Hume                  1873  0 │
│ Andean teal          Anas andium            P. L. Sclater & Salvin1873  0 │
│ Auckland teal        Anas aucklandica       G. R. Gray            1844  0 │
│ White-cheeked pintailAnas bahamensis        Linnaeus              1758  0 │
│ Bernier's teal       Anas bernieri          Hartlaub              1860  0 │
│  │
└───────────────────────┴─────────────────────────┴────────────────────────┴────────┴─────────┘

The result of the prior read_csv operation is an Ibis object. It is similar to the result of a SQL query - it is not saved into the database automatically.

To save the result of our read_csv into the DuckDB file, we create a table.

persistent_ducks = con.create_table(name='persistent_ducks', obj=ducks_ibis.to_pyarrow(), overwrite=True)
persistent_ducks
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ name                   genus                    author                  year    extinct ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ stringstringstringstringint64   │
├───────────────────────┼─────────────────────────┼────────────────────────┼────────┼─────────┤
│ Mandarin duck        Aix galericulata       Linnaeus              1758  0 │
│ Wood duck            Aix sponsa             Linnaeus              1758  0 │
│ Egyptian goose       Alopochen aegyptiaca   Linnaeus              1766  0 │
│ Brazilian teal       Amazonetta brasiliensisJ. F. Gmelin          1789  0 │
│ Northern pintail     Anas acuta             Linnaeus              1758  0 │
│ Andaman teal         Anas albogularis       Hume                  1873  0 │
│ Andean teal          Anas andium            P. L. Sclater & Salvin1873  0 │
│ Auckland teal        Anas aucklandica       G. R. Gray            1844  0 │
│ White-cheeked pintailAnas bahamensis        Linnaeus              1758  0 │
│ Bernier's teal       Anas bernieri          Hartlaub              1860  0 │
│  │
└───────────────────────┴─────────────────────────┴────────────────────────┴────────┴─────────┘

Now that we have a table set up, let’s see how we can query this data using Ibis. With Ibis, you can perform operations on your data without writing SQL. Let’s see how similar it feels…

The question we will build up towards answering is, “Who were the most prolific people at finding many new species of non-extinct ducks, and when did they get started finding ducks?”

Use a the filter function instead of a where clause to choose the rows you are interested in.

persistent_ducks.filter(persistent_ducks.extinct == 0)
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
┃ name                   genus                    author                  year    extinct ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
│ stringstringstringstringint64   │
├───────────────────────┼─────────────────────────┼────────────────────────┼────────┼─────────┤
│ Mandarin duck        Aix galericulata       Linnaeus              1758  0 │
│ Wood duck            Aix sponsa             Linnaeus              1758  0 │
│ Egyptian goose       Alopochen aegyptiaca   Linnaeus              1766  0 │
│ Brazilian teal       Amazonetta brasiliensisJ. F. Gmelin          1789  0 │
│ Northern pintail     Anas acuta             Linnaeus              1758  0 │
│ Andaman teal         Anas albogularis       Hume                  1873  0 │
│ Andean teal          Anas andium            P. L. Sclater & Salvin1873  0 │
│ Auckland teal        Anas aucklandica       G. R. Gray            1844  0 │
│ White-cheeked pintailAnas bahamensis        Linnaeus              1758  0 │
│ Bernier's teal       Anas bernieri          Hartlaub              1860  0 │
│  │
└───────────────────────┴─────────────────────────┴────────────────────────┴────────┴─────────┘

Pick your columns using the conveniently named select function!

(persistent_ducks
  .filter(persistent_ducks.extinct == 0)
  .select("name", "author", "year")
)
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ name                   author                  year   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ stringstringstring │
├───────────────────────┼────────────────────────┼────────┤
│ Mandarin duck        Linnaeus              1758   │
│ Wood duck            Linnaeus              1758   │
│ Egyptian goose       Linnaeus              1766   │
│ Brazilian teal       J. F. Gmelin          1789   │
│ Northern pintail     Linnaeus              1758   │
│ Andaman teal         Hume                  1873   │
│ Andean teal          P. L. Sclater & Salvin1873   │
│ Auckland teal        G. R. Gray            1844   │
│ White-cheeked pintailLinnaeus              1758   │
│ Bernier's teal       Hartlaub              1860   │
│       │
└───────────────────────┴────────────────────────┴────────┘

The group_by functions matches well with the group by clause.

However, Ibis splits the select clause into the select function and the aggregate function when working with a group by. This aligns with the SQL best practice to organize your select clause with non-aggregate expressions first, then aggregate expressions.

duck_legends = (persistent_ducks
  .filter(persistent_ducks.extinct == 0)
  .select("name", "author", "year")
  .group_by("author")
  .aggregate([persistent_ducks.name.count(), persistent_ducks.year.min()])
  .order_by([ibis.desc("Count(name)")])
)
duck_legends
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ author        Count(name)  Min(year) ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ stringint64string    │
├──────────────┼─────────────┼───────────┤
│ Linnaeus    351758      │
│ J. F. Gmelin131789      │
│ Eyton       101838      │
│ Vieillot    81816      │
│ Latham      41790      │
│ Gould       41837      │
│ Ridgway     31874      │
│ Pallas      31764      │
│ P. P. King  31828      │
│ G. R. Gray  21844      │
│          │
└──────────────┴─────────────┴───────────┘
ibis.to_sql(duck_legends)
SELECT
  *
FROM (
  SELECT
    "t1"."author",
    COUNT("t1"."name") AS "Count(name)",
    MIN("t1"."year") AS "Min(year)"
  FROM (
    SELECT
      "t0"."name",
      "t0"."author",
      "t0"."year"
    FROM "whats_quackalackin"."main"."persistent_ducks" AS "t0"
    WHERE
      "t0"."extinct" = CAST(0 AS TINYINT)
  ) AS "t1"
  GROUP BY
    1
) AS "t2"
ORDER BY
  "t2"."Count(name)" DESC

B. Mixing and matching SQL and Ibis#

If you have existing SQL queries, or want to use dialect-specific features of a specific SQL database, Ibis allows you to use SQL directly!

If you want to begin your Ibis query with SQL, you can use Table.sql directly.

However, we can no longer refer directly to the persistent_ducks object later in the expression. We instead need to use the _ (which we imported earlier with from ibis import _), which is a way to build expressions using Ibis’s deferred expression API. So instead of persistent_ducks.column.function(), we can say _.column.function()

duck_legends = (persistent_ducks
  .sql("""SELECT name, author, year FROM persistent_ducks WHERE extinct = 0""")
  .group_by("author")
  .aggregate([_.name.count(), _.year.min()]) # Use _ instead of persistent_ducks
  .order_by([ibis.desc("Count(name)")])
)
duck_legends
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ author        Count(name)  Min(year) ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ stringint64string    │
├──────────────┼─────────────┼───────────┤
│ Linnaeus    351758      │
│ J. F. Gmelin131789      │
│ Eyton       101838      │
│ Vieillot    81816      │
│ Latham      41790      │
│ Gould       41837      │
│ P. P. King  31828      │
│ Ridgway     31874      │
│ Pallas      31764      │
│ G. R. Gray  21844      │
│          │
└──────────────┴─────────────┴───────────┘

If you want to begin with Ibis, but transition to SQL, first give the Ibis expression a name using the alias function. Then you can refer to that as a table in your Table.sql call.

duck_legends = (persistent_ducks
  .filter(persistent_ducks.extinct == 0)
  .select("name", "author", "year")
  .group_by("author")
  .aggregate([persistent_ducks.name.count(), persistent_ducks.year.min()])
  .alias('ibis_duck') # Rename the result of all Ibis expressions up to this point
  .sql("""SELECT * from ibis_duck ORDER BY "Count(name)" desc""")
)
duck_legends
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ author        Count(name)  Min(year) ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ stringint64string    │
├──────────────┼─────────────┼───────────┤
│ Linnaeus    351758      │
│ J. F. Gmelin131789      │
│ Eyton       101838      │
│ Vieillot    81816      │
│ Latham      41790      │
│ Gould       41837      │
│ P. P. King  31828      │
│ Ridgway     31874      │
│ Pallas      31764      │
│ G. R. Gray  21844      │
│          │
└──────────────┴─────────────┴───────────┘

And there you go! You’ve learned:

  • How to read and write Pandas, Polars, and Apache Arrow with DuckDB

  • How to use Ibis to run dataframe queries on top of DuckDB

  • How to see the SQL that Ibis is running on your behalf

  • How to mix and match SQL and Ibis

Exercise 3.03

The SQL that Ibis generated to find the people who discovered the most duck species is not the most concise. Can you re-write the Ibis SQL (listed below) to its simplest possible form, using DuckDB’s Python client?

Hint: as a first step, connect to the same database that Ibis connected to.

SELECT
  *
FROM (
  SELECT
    "t1"."author",
    COUNT("t1"."name") AS "Count(name)",
    MIN("t1"."year") AS "Min(year)"
  FROM (
    SELECT
      "t0"."name",
      "t0"."author",
      "t0"."year"
    FROM "whats_quackalackin"."main"."persistent_ducks" AS "t0"
    WHERE
      "t0"."extinct" = CAST(0 AS TINYINT)
  ) AS "t1"
  GROUP BY
    1
) AS "t2"
ORDER BY
  "t2"."Count(name)" DESC
# Uncomment and run to show solution
# !cat ./answers/answer_3.03.py

Exercise 3.04

Convert the SQL query below into an Ibis expression. You are welcome to ignore the column renaming - think of it as a “stretch-goal” if you have time! We did not cover how to do that yet.

SELECT
    Species_Common_Name,
    AVG(Beak_Width) AS Avg_Beak_Width,
    AVG(Beak_Depth) AS Avg_Beak_Depth,
    AVG(Beak_Length_Culmen) AS Avg_Beak_Length_Culmen
FROM 'birds.csv'
GROUP BY Species_Common_Name

Exercise 3.04

Hint: Read directly from a csv file - no need to create a persistent table!

Hint 2: Ibis uses mean instead of avg!

Hint 3: Ibis aggregate documentation: https://ibis-project.org/reference/expression-tables#ibis.expr.types.relations.Table.aggregate

# Uncomment and run to show solution
# !cat ./answers/answer_3.04.py