from pathlib import Path
import cdmconnector as cc
import ibis
path = cc.eunomia_dir("GiBleed", cdm_version="5.3")
con = ibis.duckdb.connect(path)
cdm = cc.cdm_from_con(con, cdm_schema="main", write_schema="main", cdm_name="eunomia")Lazy Queries and Pipelines
Build Ibis pipelines with filter, select, mutate, group_by, aggregate, and join to concept.
You will learn
- The difference between lazy expressions and execution (
collect/execute) - How to chain filter, select, mutate, group_by, aggregate, order_by, limit
- How to join to the concept table for human-readable names
- How to inspect compiled SQL with
.compile()
Story question
“What does the data look like when we filter persons, add demographics, and join to concept names?”
Setup
GiBleed dataset; connect and obtain cdm.
Explore: Lazy vs execution
Building a pipeline returns an Ibis expression; nothing runs until we call cc.collect(expr) or expr.execute().
person = cdm.person
# Lazy pipeline: filter, select, limit (no execution yet)
expr = (
person
.filter(person.year_of_birth >= 1950)
.select("person_id", "gender_concept_id", "year_of_birth")
.limit(10)
)
# Execute to get a DataFrame
cc.collect(expr)Build: Join to concept for human-readable names
Join person to concept on gender_concept_id so we can show concept names instead of IDs.
person = cdm.person
concept = cdm.concept
joined = person.join(
concept,
person.gender_concept_id == concept.concept_id,
how="left",
)
expr = (
joined
.filter(person.year_of_birth >= 1950)
.select(
person.person_id,
person.year_of_birth,
concept.concept_name.name("gender"),
)
.order_by(person.year_of_birth.desc())
.limit(10)
)
cc.collect(expr)Interpret: Compiled SQL
Show the SQL that would be run (database pushdown). To use this query on a different backend (e.g. Postgres), use SQLGlot to translate: sqlglot.parse_one(compiled, dialect="duckdb").sql(dialect="postgres"). Prefer Ibis for most queries; use SQLGlot when you need to translate existing SQL strings.
# Compile to SQL (backend-specific)
compiled = expr.compile()
print(compiled)Exercises
- Stratify person count by sex: group_by
gender_concept_id, aggregate count, then join to concept for labels. - Add age bands: mutate a column
age_band(e.g. 0–17, 18–64, 65+) fromyear_of_birthand summarise counts by band. - Build a pipeline that filters to one
gender_concept_id, selects three columns, and limits to 5; show its.compile(). - Use
order_bywith two columns (e.g. year_of_birth, person_id) and execute.
What we learned
- Pipelines are lazy: filter/select/mutate/group_by/aggregate/order_by/limit return expressions; execute only at the end via
cc.collect(expr). - Join to concept to get human-readable names for concept_id columns.
- expr.compile() shows the SQL sent to the database (pushdown). Use SQLGlot to translate that SQL to another dialect when needed.
Cleanup
cdm.disconnect()