What is Ibis and Why Would You Use It?
Ibis is a backend-agnostic query builder / dataframe API for Python. Think of it as an interface for dynamically generating and executing SQL queries from Python. Ibis provides a unified API for a variety of backends such as Snowflake, BigQuery, DuckDB, Polars or PySpark.
But why would this be useful? I mainly use it for:
- Generating complex dynamic queries where (Jinja-)templating would become too messy
- Writing SQL generation as self-contained, easily testable Python functions
- Switching out a production OLAP database (like Snowflake) at test time with a local, version-controlled DuckDB instance containing test data
If you know SQL you already know Ibis.
For those who want to skip ahead directly to the code example: you can find the corresponding gist here. You may run this example directly using uv with:
uv run https://gist.githubusercontent.com/layandreas/d2050e3113c4cbd1627a3cca58040eb3/raw/1acc913dd1042a06017fce16cffa94c4b67d870b/ibis-penguins.py
Query Builder vs. Object-relational Mapper (ORM)
If you're coming from a web development background you're probably familiar with ORMs such as SQLAlchemy or the Django ORM. Ibis is not an ORM but a query builder. While an ORM maps your database objects to objects in your program, Ibis does no such thing! Think of using Ibis as "writing raw SQL queries, but in Python".
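To make the distinction concrete, here is a minimal sketch (not from the original post; the users table and its columns are hypothetical): an Ibis expression is only a description of a query, and nothing is executed or mapped to Python objects until you explicitly ask for results.

import ibis

con = ibis.duckdb.connect()  # in-memory DuckDB instance
users = con.create_table(
    "users", schema=ibis.schema({"id": "int64", "name": "string"})
)

expr = users.filter(users.id > 100)  # builds a query, executes nothing
print(ibis.to_sql(expr))  # inspect the generated SQL
df = expr.to_pandas()  # only now is the query actually run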
An Example using the DuckDB Backend
Create a New DuckDB Table
Let’s use the DuckDB backend and load an example dataset into a table:
import ibis

# This will create a new DuckDB database in your working directory
# if it doesn't exist yet
con = ibis.connect("duckdb://penguins.ddb")
con.create_table(
    "penguins", ibis.examples.penguins.fetch().to_pyarrow(), overwrite=True
)

print("Tables in duckdb:")
print("\n")
print(con.list_tables())
print("\n")

penguins = con.table("penguins")

print("Initial penguins table:")
print("\n")
print(penguins.to_pandas())
print("\n")
Output:
Tables in duckdb:
['penguins']
Initial penguins table:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN None 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
.. ... ... ... ... ... ... ... ...
339 Chinstrap Dream 55.8 19.8 207.0 4000.0 male 2009
340 Chinstrap Dream 43.5 18.1 202.0 3400.0 female 2009
341 Chinstrap Dream 49.6 18.2 193.0 3775.0 male 2009
342 Chinstrap Dream 50.8 19.0 210.0 4100.0 male 2009
343 Chinstrap Dream 50.2 18.7 198.0 3775.0 female 2009
[344 rows x 8 columns]
Window Functions: Row Number Over Partition
My most-used SQL feature is window functions, in particular row_number() over (partition by ... order by ...) to select one observation within a group. Now we are going to select the penguin with the longest bill length for each species:
penguins = (
    penguins.mutate(
        rn=ibis.row_number().over(
            group_by=[ibis._["species"]],
            order_by=ibis.desc(ibis._["bill_length_mm"]),
        ),
    )
    .filter(ibis._["rn"] == 0)
    .drop("rn")
    .alias("select_longest_bill_by_species")
)
print("Penguins with longest bill length by species:")
print("\n")
print(penguins.to_pandas())
print("\n")
Note: This code snippet uses the deferred expression / underscore API (_) to access the table's columns, including columns added in previous steps.
Output:
Penguins with longest bill length by species:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Gentoo Biscoe 59.6 17.0 230 6050 male 2007
1 Chinstrap Dream 58.0 17.8 181 3700 female 2007
2 Adelie Torgersen 46.0 21.5 194 4200 male 2007
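As a side note, here is a small sketch of how the deferred API behaves (the bill_ratio column is a hypothetical example, not part of the post's pipeline): _ always stands in for the table at the current step of the chain, so it can reference columns created just before.

from ibis import _

expr = (
    penguins.mutate(bill_ratio=_.bill_length_mm / _.bill_depth_mm)
    # _.bill_ratio resolves against the mutated table, so the column
    # added in the previous step is visible here
    .filter(_.bill_ratio > 2.5)
)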
Add a JSON Column
You might need to prepare your data for publishing via an API endpoint or for producing a Kafka message. Usually this data is served as JSON. Let's add a new JSON column to our table:
penguins = penguins.mutate(
    json_payload=ibis.struct(
        [("species", ibis._["species"]), ("island", ibis._["island"])]
    )
).alias("add_json_payload_column")

print("Penguins data with JSON payload column:")
print("\n")
print(penguins.to_pandas())
print("\n")
Output:
Penguins data with JSON payload column:
species island bill_length_mm bill_depth_mm ... body_mass_g sex year json_payload
0 Gentoo Biscoe 59.6 17.0 ... 6050 male 2007 {'species': 'Gentoo', 'island': 'Biscoe'}
1 Chinstrap Dream 58.0 17.8 ... 3700 female 2007 {'species': 'Chinstrap', 'island': 'Dream'}
2 Adelie Torgersen 46.0 21.5 ... 4200 male 2007 {'species': 'Adelie', 'island': 'Torgersen'}
[3 rows x 9 columns]
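If you later need individual fields back out of the struct column, you can index into it. A small sketch, not part of the original example:

# Read a single field out of the struct column
species_from_payload = penguins.json_payload["species"]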
The Compiled SQL
So what’s happening under the hood? Ibis
uses SQLGlot to compile your expressions into SQL. Running print(ibis.to_sql(penguins))
will give you the compiled raw query string. The compiled SQL for our example looks like this:
WITH "select_longest_bill_by_species" AS (
SELECT
"t2".*
EXCLUDE ("rn")
FROM (
SELECT
*
FROM (
SELECT
"t0"."species",
"t0"."island",
"t0"."bill_length_mm",
"t0"."bill_depth_mm",
"t0"."flipper_length_mm",
"t0"."body_mass_g",
"t0"."sex",
"t0"."year",
ROW_NUMBER() OVER (PARTITION BY "t0"."species" ORDER BY "t0"."bill_length_mm" DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) - 1 AS "rn"
FROM "penguins" AS "t0"
) AS "t1"
WHERE
"t1"."rn" = 0
) AS "t2"
), "add_json_payload_column" AS (
SELECT
"t4"."species",
"t4"."island",
"t4"."bill_length_mm",
"t4"."bill_depth_mm",
"t4"."flipper_length_mm",
"t4"."body_mass_g",
"t4"."sex",
"t4"."year",
{'species': "t4"."species", 'island': "t4"."island"} AS "json_payload"
FROM "select_longest_bill_by_species" AS "t4"
)
SELECT
*
FROM "add_json_payload_column"
Note: The CTEs ("select_longest_bill_by_species", "add_json_payload_column") are compiled because we used the alias method. This was done mostly for readability and comes in handy when you're looking at the generated SQL for debugging purposes. Otherwise Ibis would have just generated nested subqueries.
Now the nice thing is that you could switch out the DuckDB backend for any of the other supported backends without any code changes. The generated SQL will be valid for the chosen backend. In our example in particular, the generation of the JSON column {'species': "t4"."species", 'island': "t4"."island"} AS "json_payload" will most likely compile to a different expression on other backends.
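You don't even need a live connection to see this: ibis.to_sql accepts a dialect argument, so you can inspect how the same expression tree would compile for another backend. A quick sketch:

# Compile the same expression for a different SQL dialect
print(ibis.to_sql(penguins, dialect="snowflake"))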
Full Example
Here is the full example. To make the data pipeline more composable I wrapped the transformations in the functions _select_longest_bill_by_species and _add_json_payload_column. The _pipe helper function was added purely for convenience and readability.
Tip: Using uv you can run the script below with:
uv run https://gist.githubusercontent.com/layandreas/d2050e3113c4cbd1627a3cca58040eb3/raw/1acc913dd1042a06017fce16cffa94c4b67d870b/ibis-penguins.py
# /// script
# requires-python = ">=3.13"
# dependencies = [
#     "duckdb==1.2.1",
#     "gcsfs>=2025.3.0",
#     "ibis-framework[duckdb]==10.4.0",
#     "pins==0.8.7",
# ]
# ///
import ibis
from typing import Callable


def main() -> None:
    con = ibis.connect("duckdb://penguins.ddb")
    con.create_table(
        "penguins", ibis.examples.penguins.fetch().to_pyarrow(), overwrite=True
    )

    print("Tables in duckdb:")
    print("\n")
    print(con.list_tables())
    print("\n")

    penguins = con.table("penguins")

    print("Initial penguins table:")
    print("\n")
    print(penguins.to_pandas())
    print("\n")

    penguins = _pipe(
        initial_table=penguins,
        funcs=[_select_longest_bill_by_species, _add_json_payload_column],
    )

    print("Compiled query:")
    print("\n")
    print(ibis.to_sql(penguins))
    print("\n")

    con.disconnect()


def _select_longest_bill_by_species(penguins: ibis.Table) -> ibis.Table:
    penguins = (
        penguins.mutate(
            rn=ibis.row_number().over(
                group_by=[ibis._["species"]],
                order_by=ibis.desc(ibis._["bill_length_mm"]),
            ),
        )
        .filter(ibis._["rn"] == 0)
        .drop("rn")
        .alias("select_longest_bill_by_species")
    )

    print("Penguins with longest bill length by species:")
    print("\n")
    print(penguins.to_pandas())
    print("\n")

    return penguins


def _add_json_payload_column(penguins: ibis.Table) -> ibis.Table:
    penguins = penguins.mutate(
        json_payload=ibis.struct(
            [("species", ibis._["species"]), ("island", ibis._["island"])]
        )
    ).alias("add_json_payload_column")

    print("Penguins data with JSON payload column:")
    print("\n")
    print(penguins.to_pandas())
    print("\n")

    return penguins


def _pipe(
    initial_table: ibis.Table, funcs: list[Callable[[ibis.Table], ibis.Table]]
) -> ibis.Table:
    processed_table = initial_table
    for func in funcs:
        processed_table = func(processed_table)
    return processed_table


if __name__ == "__main__":
    main()
Usage Patterns
Ibis vs. dbt?
For most of the pre-processing of our data I generally stick to raw SQL with dbt and some minimal Jinja templating. Ibis then comes in at the "last mile": some final preparations before loading data into memory, e.g. for model fitting or dashboarding.
Especially for dashboarding / frontend use cases where a user might have a lot of input options, it's extremely useful to be able to programmatically generate queries. While Jinja templating is an option, I find that Ibis makes it much easier to write a testable & maintainable code base.
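As a rough sketch of what such "last mile" query building can look like (the function and filter options here are hypothetical, not from the original post):

def apply_user_filters(
    t: ibis.Table,
    species: list[str] | None = None,
    min_body_mass_g: float | None = None,
) -> ibis.Table:
    # Each dashboard input option conditionally adds a filter
    if species:
        t = t.filter(t.species.isin(species))
    if min_body_mass_g is not None:
        t = t.filter(t.body_mass_g >= min_body_mass_g)
    return t


# e.g. a user selected two species and a minimum body mass in the UI
filtered = apply_user_filters(
    penguins, species=["Adelie", "Gentoo"], min_body_mass_g=3500
)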
Testing
Last but not least, Ibis has proven extremely valuable to me for testing use cases:
- I create a local DuckDB database and ingest test data corresponding to our production database (like Snowflake or BigQuery).
- This database is version controlled and checked into our repository.
- At test time we replace the production database connection with a DuckDB connection and run our tests against the DuckDB test data.
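A minimal sketch of how this can look with pytest (the fixture, the file path and the assertion are illustrative assumptions, not from the original post):

import ibis
import pytest


@pytest.fixture
def con():
    # tests/data/penguins_test.ddb is a version-controlled DuckDB file
    # standing in for the production Snowflake/BigQuery connection
    connection = ibis.connect("duckdb://tests/data/penguins_test.ddb")
    yield connection
    connection.disconnect()


def test_select_longest_bill_by_species(con):
    penguins = con.table("penguins")
    result = _select_longest_bill_by_species(penguins).to_pandas()
    # Expect one row per species in the test data
    assert result["species"].is_unique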
Considerations Before Using Ibis
Some final remarks before I leave you alone:
- While the official documentation is good, it can lack depth in some parts. You may occasionally need to look into GitHub issues or the source code to find a solution to your problem.
- You should be familiar with SQL before jumping into Ibis, or at least learn SQL while using it. Its syntax might otherwise feel unfamiliar.
- LLMs aren't that helpful with Ibis. If your workflow strongly relies on LLMs for coding you might get frustrated with hallucinations. I assume there isn't a ton of Ibis code in the LLM training corpus.
That said, for my use cases Ibis solves real problems, like faking our production database at test time and avoiding Jinja templating hell.