Lazy Queries and Pipelines

Build Ibis pipelines with filter, select, mutate, group_by, aggregate, and join to concept.

You will learn

The difference between lazy expressions and execution (collect / execute)
How to chain filter, select, mutate, group_by, aggregate, order_by, limit
How to join to the concept table for human-readable names
How to inspect compiled SQL with .compile()

Story question

“What does the data look like when we filter persons, add demographics, and join to concept names?”

Setup

GiBleed dataset; connect and obtain cdm.

from pathlib import Path
import cdmconnector as cc
import ibis

path = cc.eunomia_dir("GiBleed", cdm_version="5.3")
con = ibis.duckdb.connect(path)
cdm = cc.cdm_from_con(con, cdm_schema="main", write_schema="main", cdm_name="eunomia")

Explore: Lazy vs execution

Building a pipeline returns an Ibis expression; nothing runs until we call cc.collect(expr) or expr.execute().

person = cdm.person
# Lazy pipeline: filter, select, limit (no execution yet)
expr = (
    person
    .filter(person.year_of_birth >= 1950)
    .select("person_id", "gender_concept_id", "year_of_birth")
    .limit(10)
)
# Execute to get a DataFrame
cc.collect(expr)

Build: Join to concept for human-readable names

Join person to concept on gender_concept_id so we can show concept names instead of IDs.

person = cdm.person
concept = cdm.concept
joined = person.join(
    concept,
    person.gender_concept_id == concept.concept_id,
    how="left",
)
expr = (
    joined
    .filter(person.year_of_birth >= 1950)
    .select(
        person.person_id,
        person.year_of_birth,
        concept.concept_name.name("gender"),
    )
    .order_by(person.year_of_birth.desc())
    .limit(10)
)
cc.collect(expr)

Interpret: Compiled SQL

Show the SQL that would be run (database pushdown). To use this query on a different backend (e.g. Postgres), use SQLGlot to translate: sqlglot.parse_one(compiled, dialect="duckdb").sql(dialect="postgres"). Prefer Ibis for most queries; use SQLGlot when you need to translate existing SQL strings.

# Compile to SQL (backend-specific)
compiled = expr.compile()
print(compiled)

Exercises

Stratify person count by sex: group_by gender_concept_id, aggregate count, then join to concept for labels.
Add age bands: mutate a column age_band (e.g. 0–17, 18–64, 65+) from year_of_birth and summarise counts by band.
Build a pipeline that filters to one gender_concept_id, selects three columns, and limits to 5; show its .compile().
Use order_by with two columns (e.g. year_of_birth, person_id) and execute.

What we learned

Pipelines are lazy: filter/select/mutate/group_by/aggregate/order_by/limit return expressions; execute only at the end via cc.collect(expr).
Join to concept to get human-readable names for concept_id columns.
expr.compile() shows the SQL sent to the database (pushdown). Use SQLGlot to translate that SQL to another dialect when needed.

Cleanup

cdm.disconnect()