Lazy Queries and Pipelines

Build Ibis pipelines with filter, select, mutate, group_by, aggregate, and join to concept.

You will learn

  • The difference between lazy expressions and execution (collect / execute)
  • How to chain filter, select, mutate, group_by, aggregate, order_by, limit
  • How to join to the concept table for human-readable names
  • How to inspect compiled SQL with .compile()

Story question

“What does the data look like when we filter persons, add demographics, and join to concept names?”


Setup

GiBleed dataset; connect and obtain cdm.

from pathlib import Path
import cdmconnector as cc
import ibis

path = cc.eunomia_dir("GiBleed", cdm_version="5.3")
con = ibis.duckdb.connect(path)
cdm = cc.cdm_from_con(con, cdm_schema="main", write_schema="main", cdm_name="eunomia")

Explore: Lazy vs execution

Building a pipeline returns an Ibis expression; nothing runs until we call cc.collect(expr) or expr.execute().

person = cdm.person
# Lazy pipeline: filter, select, limit (no execution yet)
expr = (
    person
    .filter(person.year_of_birth >= 1950)
    .select("person_id", "gender_concept_id", "year_of_birth")
    .limit(10)
)
# Execute to get a DataFrame
cc.collect(expr)

Build: Join to concept for human-readable names

Join person to concept on gender_concept_id so we can show concept names instead of IDs.

person = cdm.person
concept = cdm.concept
joined = person.join(
    concept,
    person.gender_concept_id == concept.concept_id,
    how="left",
)
expr = (
    joined
    .filter(person.year_of_birth >= 1950)
    .select(
        person.person_id,
        person.year_of_birth,
        concept.concept_name.name("gender"),
    )
    .order_by(person.year_of_birth.desc())
    .limit(10)
)
cc.collect(expr)

Interpret: Compiled SQL

Show the SQL that would be run (database pushdown). To use this query on a different backend (e.g. Postgres), use SQLGlot to translate: sqlglot.parse_one(compiled, dialect="duckdb").sql(dialect="postgres"). Prefer Ibis for most queries; use SQLGlot when you need to translate existing SQL strings.

# Compile to SQL (backend-specific)
compiled = expr.compile()
print(compiled)

Exercises

  • Stratify person count by sex: group_by gender_concept_id, aggregate count, then join to concept for labels.
  • Add age bands: mutate a column age_band (e.g. 0–17, 18–64, 65+) from year_of_birth and summarise counts by band.
  • Build a pipeline that filters to one gender_concept_id, selects three columns, and limits to 5; show its .compile().
  • Use order_by with two columns (e.g. year_of_birth, person_id) and execute.

What we learned

  • Pipelines are lazy: filter/select/mutate/group_by/aggregate/order_by/limit return expressions; execute only at the end via cc.collect(expr).
  • Join to concept to get human-readable names for concept_id columns.
  • expr.compile() shows the SQL sent to the database (pushdown). Use SQLGlot to translate that SQL to another dialect when needed.

Cleanup

cdm.disconnect()