---
title: "Pedagogical OMOP in Your Browser (GiBleed / Eunomia)"
description: "Mini-course: observational data analysis on the GiBleed Eunomia OMOP CDM using Ibis and DuckDB in-browser—joins, cohort construction, time-at-risk, and outcome counting."
format:
  html:
    toc: true
    toc-depth: 3
    number-sections: true
    code-copy: true
    code-tools: true
    include-in-header: wasm_debug_header.html
filters:
  - pyodide
pyodide:
  packages: ["ibis-framework[duckdb]"]
execute:
  echo: true
---
::: {.callout-note}
## View over HTTP
Serve the **built `docs`** folder (e.g. `cd docs && python -m http.server 4522`), then open **http://localhost:4522/tutorials/wasm_example_analysis.html**. Run each code cell in order.
:::
This page is a **pedagogical mini-course** in doing observational data analysis on an OMOP CDM. Code runs in your browser (Pyodide + Ibis + DuckDB + pandas). We use the real **GiBleed Eunomia** example dataset so you can work with an actual OMOP CDM without installing anything locally.
---
# 1. What you'll learn
This tutorial teaches:
- **Ibis + DuckDB**: lazy, backend-agnostic table operations that compile to SQL when you execute.
- **Core OMOP tables**: `person`, `visit_occurrence`, `condition_occurrence`, `drug_exposure`, `measurement`, and `concept`.
- **Observational workflow**: define a cohort (e.g. GI bleed), time-at-risk, baseline characteristics, outcome counts, and rates.
- **Patterns**: joins to `concept`, cohort construction, time-window joins, and a small "study report" function.
- **SQL when needed**: using **SQLGlot** to translate SQL between dialects (e.g. DuckDB → Postgres).
# 2. Prerequisites
You should be comfortable with:
- basic Python
- basic pandas or Ibis for tables and joins
Everything else is taught as we go.
# 3. Install and load packages
In the browser we use **Pyodide** plus **micropip** to install Ibis (not included in the default Pyodide bundle). Run the cell below once; it installs `ibis-framework[duckdb]` then imports the tools we need to download and analyze the GiBleed Eunomia example data in-browser.
```{pyodide-python}
import micropip
await micropip.install("ibis-framework[duckdb]")
import ibis
import pandas as pd
import io
import zipfile
from pathlib import Path
from pyodide.http import pyfetch
```
# 4. Get the GiBleed Eunomia example and connect
Eunomia provides synthetic OMOP CDM datasets for teaching and testing. In this browser demo we download the real **GiBleed 5.3** Eunomia archive, cache its parquet files in Pyodide's local filesystem, and register the core OMOP tables in DuckDB.
```{pyodide-python}
DATASET_URL = "https://cdmconnectordata.blob.core.windows.net/cdmconnector-example-data/GiBleed_5.3.zip"
WORK_DIR = Path("/tmp/eunomia_gibleed")
WORK_DIR.mkdir(parents=True, exist_ok=True)
parquet_root = WORK_DIR / "GiBleed"

if not parquet_root.exists() or not any(parquet_root.glob("*.parquet")):
    print("Downloading GiBleed Eunomia data...")
    response = await pyfetch(DATASET_URL)
    if not response.ok:
        raise RuntimeError(f"Failed to download {DATASET_URL} (status {response.status})")
    zip_bytes = await response.bytes()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        zf.extractall(WORK_DIR)
else:
    print("Using cached GiBleed Eunomia parquet files.")

if not parquet_root.exists():
    parquet_root = WORK_DIR

TABLES_TO_LOAD = [
    "person",
    "concept",
    "visit_occurrence",
    "condition_occurrence",
    "drug_exposure",
    "measurement",
    "death",
]

con = ibis.duckdb.connect()
loaded_tables = []
for table_name in TABLES_TO_LOAD:
    parquet_path = parquet_root / f"{table_name}.parquet"
    if parquet_path.exists():
        con.raw_sql(
            f"CREATE OR REPLACE TABLE {table_name} AS SELECT * FROM read_parquet('{parquet_path.as_posix()}')"
        )
        loaded_tables.append(table_name)
print("Loaded tables:", loaded_tables)
```
## 4.1 What is a "CDM reference"?
In CDMConnector you get a single object (for example `cdm`) whose elements are lazy Ibis table references. Here we use a plain **Ibis connection** and the GiBleed OMOP tables loaded from Eunomia parquet files; we query them with the same Ibis expressions.
```{pyodide-python}
print("Tables:", loaded_tables)
```
# 5. First contact with OMOP data
## 5.1 person: who are the individuals?
```{pyodide-python}
person_t = con.table("person")
out = person_t.select(
    "person_id", "gender_concept_id", "year_of_birth", "month_of_birth", "day_of_birth"
).limit(10)
print(con.to_pandas(out).to_string(index=False))
```
### Key OMOP idea: "concept_id everywhere"
OMOP stores vocabulary as integer IDs. `gender_concept_id` points to the `concept` table. Decode gender by joining to `concept`.
```{pyodide-python}
concept_t = con.table("concept")
joined = person_t.join(
    concept_t.select(concept_t.concept_id, concept_t.concept_name),
    person_t.gender_concept_id == concept_t.concept_id,
    how="left",
)
gender_counts = (
    joined.group_by("concept_name")
    .aggregate(n=joined.person_id.count())
    .order_by(ibis.desc("n"))
)
print(con.to_pandas(gender_counts).to_string(index=False))
```
## 5.2 Visits: where do records cluster in time?
```{pyodide-python}
visit_t = con.table("visit_occurrence")
visit_summary = visit_t.aggregate(
    n_visits=visit_t.count(),
    n_people=visit_t.person_id.nunique(),
    min_date=visit_t.visit_start_date.min(),
    max_date=visit_t.visit_end_date.max(),
)
print(con.to_pandas(visit_summary).to_string(index=False))
```
# 6. Ibis on a database: mental model
When you build an Ibis expression, nothing runs until you execute it (e.g. `con.to_pandas(expr)`). In a real CDM you call `collect()` or `execute()`. Same idea: **query first, execute when needed.** You can inspect the compiled SQL; here we show a typical filter.
```{pyodide-python}
cond_t = con.table("condition_occurrence")
q = (
    cond_t.filter(cond_t.condition_start_date >= "2000-01-01")
    .select("person_id", "condition_concept_id", "condition_start_date")
)
print("Compiled SQL (DuckDB):")
print(q.compile())
out = con.to_pandas(q)
print("Rows:", len(out))
```
# 7. A first descriptive analysis: "Top conditions"
Most frequent condition concepts, then decode to names.
```{pyodide-python}
cond_t = con.table("condition_occurrence")
concept_t = con.table("concept")
by_concept = (
    cond_t.group_by("condition_concept_id")
    .aggregate(n=cond_t.count())
    .order_by(ibis.desc("n"))
)
top_conditions = (
    by_concept.join(
        concept_t.select("concept_id", "concept_name", "domain_id", "vocabulary_id"),
        by_concept.condition_concept_id == concept_t.concept_id,
        how="left",
    )
    .select("condition_concept_id", "concept_name", "domain_id", "vocabulary_id", "n")
    .limit(15)
)
print(con.to_pandas(top_conditions).to_string(index=False))
```
# 8. The observational workflow: define a cohort, then analyze
Common pattern:
1. Define an **index event** (cohort entry).
2. Optional inclusion/exclusion.
3. Define **time-at-risk** (TAR).
4. Estimate outcomes, covariates, rates.
We build a **GI bleed** cohort from condition concept **192671** (Gastrointestinal hemorrhage): first occurrence per person, cohort end = start + 10 days.
## 8.1 Define a "GI bleed" cohort
```{pyodide-python}
cond_t = con.table("condition_occurrence")
first_date = (
    cond_t.filter(cond_t.condition_concept_id == 192671)
    .group_by("person_id")
    .aggregate(first_date=cond_t.condition_start_date.min())
)
gibleed_cohort_expr = first_date.select(
    cohort_definition_id=ibis.literal(1, type="int64"),
    subject_id=first_date.person_id,
    cohort_start_date=first_date.first_date.cast("date"),
    cohort_end_date=(first_date.first_date.cast("date") + ibis.interval(days=10)),
)
con.create_table("gibleed_cohort", gibleed_cohort_expr, overwrite=True)
print("Cohort table created.")
```
A cohort table has at minimum: `cohort_definition_id`, `subject_id`, `cohort_start_date`, `cohort_end_date`.
```{pyodide-python}
gibleed_t = con.table("gibleed_cohort")
peek = con.to_pandas(gibleed_t.limit(10))
print(peek.to_string(index=False))
```
## 8.2 Cohort size and person-time
```{pyodide-python}
gibleed_t = con.table("gibleed_cohort")
cohort_summary = gibleed_t.aggregate(
    n_entries=gibleed_t.count(),
    n_people=gibleed_t.subject_id.nunique(),
    min_start=gibleed_t.cohort_start_date.min(),
    max_start=gibleed_t.cohort_start_date.max(),
)
print(con.to_pandas(cohort_summary).to_string(index=False))
```
```{pyodide-python}
gibleed_t = con.table("gibleed_cohort")
tar_days = (gibleed_t.cohort_end_date - gibleed_t.cohort_start_date).cast("int64")
with_tar = gibleed_t.mutate(tar_days=tar_days)
pt = with_tar.aggregate(
    total_tar_days=with_tar.tar_days.sum(),
    mean_tar_days=with_tar.tar_days.mean(),
)
print("Person-time (days):")
print(con.to_pandas(pt).to_string(index=False))
```
# 9. Baseline characteristics at cohort entry
Age and sex at `cohort_start_date`. We approximate age as cohort start year minus year of birth; month and day of birth are ignored, so estimates can be off by up to one year.
```{pyodide-python}
gibleed_t = con.table("gibleed_cohort")
person_t = con.table("person")
joined = gibleed_t.join(person_t, gibleed_t.subject_id == person_t.person_id)
baseline = joined.select(
    joined.subject_id,
    joined.cohort_start_date,
    joined.gender_concept_id,
    age=joined.cohort_start_date.year() - joined.year_of_birth,
)
print(con.to_pandas(baseline.limit(10)).to_string(index=False))
```
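When exact age matters, subtract one year whenever the cohort entry falls before the person's birthday in that year. Here is a small pure-Python sketch of the rule on toy dates (`exact_age` is an illustrative helper, not part of the tutorial's tables); the Ibis equivalent would use `ibis.ifelse` on the same month/day comparison.

```{pyodide-python}
from datetime import date

def exact_age(index_date, birth_year, birth_month, birth_day):
    # Subtract 1 if the birthday has not yet occurred in the index year.
    had_birthday = (index_date.month, index_date.day) >= (birth_month, birth_day)
    return index_date.year - birth_year - (0 if had_birthday else 1)

print(exact_age(date(2005, 3, 1), 1960, 6, 15))  # 44: birthday not yet reached
print(exact_age(date(2005, 7, 1), 1960, 6, 15))  # 45
```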
```{pyodide-python}
gibleed_t = con.table("gibleed_cohort")
person_t = con.table("person")
concept_t = con.table("concept")
joined = gibleed_t.join(person_t, gibleed_t.subject_id == person_t.person_id)
age_expr = joined.cohort_start_date.year() - joined.year_of_birth
baseline_summary = joined.aggregate(
    n_people=joined.subject_id.nunique(),
    age_mean=age_expr.mean(),
    age_sd=age_expr.std(),
)
print(con.to_pandas(baseline_summary).to_string(index=False))

joined_sex = joined.join(
    concept_t.select(concept_t.concept_id, concept_t.concept_name),
    person_t.gender_concept_id == concept_t.concept_id,
    how="left",
)
sex_summary = (
    joined_sex.group_by("concept_name")
    .aggregate(n=joined_sex.subject_id.count())
    .order_by(ibis.desc("n"))
)
print("Sex breakdown:")
print(con.to_pandas(sex_summary).to_string(index=False))
```
# 10. Outcomes: count events during time-at-risk
For each cohort entry, look for records between `cohort_start_date` and `cohort_end_date`.
## 10.1 Death during cohort window (if available)
```{pyodide-python}
gibleed_t = con.table("gibleed_cohort")
death_t = con.table("death")
joined = gibleed_t.join(death_t, gibleed_t.subject_id == death_t.person_id, how="left")
in_window = (
    joined.death_date.notnull()
    & (joined.death_date >= joined.cohort_start_date)
    & (joined.death_date <= joined.cohort_end_date)
)
death_in_tar = joined.group_by(joined.cohort_definition_id).aggregate(
    n_entries=joined.count(),
    n_deaths_in_window=ibis.ifelse(in_window, 1, 0).sum(),
)
print(con.to_pandas(death_in_tar).to_string(index=False))
```
## 10.2 Condition outcome during cohort window
Join cohort to `condition_occurrence`, filter condition dates within TAR, then summarise. (Here we use the same concept 192671 as a toy outcome.)
```{pyodide-python}
gibleed_t = con.table("gibleed_cohort")
cond_t = con.table("condition_occurrence")
concept_outcome = 192671
joined = gibleed_t.join(
    cond_t,
    (gibleed_t.subject_id == cond_t.person_id)
    & (cond_t.condition_concept_id == concept_outcome)
    & (cond_t.condition_start_date >= gibleed_t.cohort_start_date)
    & (cond_t.condition_start_date <= gibleed_t.cohort_end_date),
)
outcome_counts = joined.aggregate(
    n_events=joined.count(),
    n_people_with_event=joined.subject_id.nunique(),
)
print(con.to_pandas(outcome_counts).to_string(index=False))
```
# 11. Rates: events per person-time
Crude incidence rate = events / total time-at-risk (e.g. per person-year).
$$\text{rate} = \frac{\text{events}}{\text{total time-at-risk (years)}}$$
```{pyodide-python}
gibleed_t = con.table("gibleed_cohort")
cond_t = con.table("condition_occurrence")
tar_days_expr = (gibleed_t.cohort_end_date - gibleed_t.cohort_start_date).cast("float64")
with_tar = gibleed_t.mutate(tar_days=tar_days_expr)
total_tar = con.to_pandas(with_tar.aggregate(total_tar_days=with_tar.tar_days.sum()))
joined = gibleed_t.join(
    cond_t,
    (gibleed_t.subject_id == cond_t.person_id)
    & (cond_t.condition_concept_id == 192671)
    & (cond_t.condition_start_date >= gibleed_t.cohort_start_date)
    & (cond_t.condition_start_date <= gibleed_t.cohort_end_date),
)
events_df = con.to_pandas(joined.aggregate(n_events=joined.count()))
total_tar_days = total_tar["total_tar_days"].iloc[0]
n_events = events_df["n_events"].iloc[0]
total_tar_years = total_tar_days / 365.25
rate_per_py = n_events / total_tar_years if total_tar_years > 0 else 0
print(f"Total TAR (days): {total_tar_days:.0f}, TAR (years): {total_tar_years:.2f}")
print(f"Events: {n_events}, Rate per person-year: {rate_per_py:.4f}")
```
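A point rate without uncertainty is hard to interpret. One common approach, assuming the event count is Poisson, is a log-normal approximate confidence interval. The helper below (`rate_ci` is illustrative, not part of the tutorial's pipeline) sketches this with plain `math`:

```{pyodide-python}
import math

def rate_ci(events, tar_years, z=1.96):
    """Approximate 95% CI for a crude rate, assuming Poisson event counts."""
    rate = events / tar_years
    if events == 0:
        # With zero events, the exact one-sided Poisson upper bound is -ln(0.025) ~ 3.69 events
        return rate, 0.0, -math.log(0.025) / tar_years
    se_log = 1.0 / math.sqrt(events)  # SE of log(rate) under Poisson counts
    return rate, rate * math.exp(-z * se_log), rate * math.exp(z * se_log)

r, lo, hi = rate_ci(events=30, tar_years=13.0)
print(f"Rate {r:.3f} per person-year (approx. 95% CI {lo:.3f} to {hi:.3f})")
```

Plug in `n_events` and `total_tar_years` from the cell above to interval-ize the GiBleed rate.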
# 12. Practical Ibis-on-CDM skills
## 12.1 semi_join: restrict to cohort members
Filter any CDM table to **people in your cohort**. Example: drug exposures among cohort members (any time). Here we emulate a semi join by inner-joining against the distinct subject list, which keeps only drug rows with a matching cohort member.
```{pyodide-python}
gibleed_t = con.table("gibleed_cohort")
drug_t = con.table("drug_exposure")
concept_t = con.table("concept")
cohort_subjects = gibleed_t.select("subject_id").distinct()
joined = drug_t.join(cohort_subjects, drug_t.person_id == cohort_subjects.subject_id)
joined = joined.join(
    concept_t.select(concept_t.concept_id, concept_t.concept_name),
    joined.drug_concept_id == concept_t.concept_id,
    how="left",
)
drugs_in_cohort = (
    joined.group_by("drug_concept_id", "concept_name")
    .aggregate(n=joined.count())
    .order_by(ibis.desc("n"))
    .limit(10)
)
print(con.to_pandas(drugs_in_cohort).to_string(index=False))
```
## 12.2 Time-window joins
Restrict drug exposures to the cohort TAR window: `drug_exposure_start_date` between `cohort_start_date` and `cohort_end_date`.
```{pyodide-python}
gibleed_t = con.table("gibleed_cohort")
drug_t = con.table("drug_exposure")
concept_t = con.table("concept")
joined = gibleed_t.join(
    drug_t,
    (gibleed_t.subject_id == drug_t.person_id)
    & (drug_t.drug_exposure_start_date >= gibleed_t.cohort_start_date)
    & (drug_t.drug_exposure_start_date <= gibleed_t.cohort_end_date),
)
joined = joined.join(
    concept_t.select(concept_t.concept_id, concept_t.concept_name),
    joined.drug_concept_id == concept_t.concept_id,
    how="left",
)
drugs_in_tar = (
    joined.group_by("drug_concept_id", "concept_name")
    .aggregate(n=joined.count())
    .order_by(ibis.desc("n"))
    .limit(10)
)
print(con.to_pandas(drugs_in_tar).to_string(index=False))
```
# 13. When you have SQL: translate with SQLGlot
When you do have raw SQL (e.g. from CIRCE cohort definitions or legacy scripts), use **SQLGlot** to keep it generic and translate between dialects (DuckDB, Postgres, Spark, etc.). That way one SQL source can run on different backends.
```{pyodide-python}
# If sqlglot is available in your environment: pip install sqlglot
try:
    import sqlglot
except ImportError:
    sqlglot = None

if sqlglot is not None:
    # Example: generic SQL (DuckDB-style here)
    generic_sql = """
    SELECT person_id, condition_concept_id, condition_start_date
    FROM condition_occurrence
    WHERE condition_start_date >= DATE '2000-01-01'
    """
    parsed = sqlglot.parse_one(generic_sql, dialect="duckdb")
    # Translate to the Postgres dialect for use on PostgreSQL
    postgres_sql = parsed.sql(dialect="postgres")
    print("Original (DuckDB):")
    print(generic_sql.strip())
    print("\nTranslated (Postgres):")
    print(postgres_sql)
else:
    print("SQLGlot not installed. Install with: pip install sqlglot")
    print("Then you can parse SQL and translate: sqlglot.parse_one(sql, dialect='duckdb').sql(dialect='postgres')")
```
Use **Ibis** as the primary way to express queries (backend-agnostic, composable). Use **SQLGlot** when you need to work with existing SQL strings and run them on another dialect.
# 14. Materializing intermediate results
On large CDMs you keep queries lazy and **materialize** only when reusing. Here we create a table from an Ibis expression for reuse.
```{pyodide-python}
gibleed_t = con.table("gibleed_cohort")
drug_t = con.table("drug_exposure")
joined = gibleed_t.join(
    drug_t,
    (gibleed_t.subject_id == drug_t.person_id)
    & (drug_t.drug_exposure_start_date >= gibleed_t.cohort_start_date)
    & (drug_t.drug_exposure_start_date <= gibleed_t.cohort_end_date),
)
tar_drugs_expr = joined.group_by("drug_concept_id").aggregate(n=joined.count())
con.create_table("tmp_tar_drugs", tar_drugs_expr, overwrite=True)
tar_drugs_mat = con.to_pandas(con.table("tmp_tar_drugs").limit(5))
print(tar_drugs_mat.to_string(index=False))
```
# 15. Putting it together: a tiny "study report" function
A realistic workflow: a function that takes the connection and cohort table name and returns a summary (using Ibis).
```{pyodide-python}
def summarise_cohort_basics(con, cohort_table_name):
    cohort_t = con.table(cohort_table_name)
    person_t = con.table("person")
    joined = cohort_t.join(person_t, cohort_t.subject_id == person_t.person_id)
    age_expr = joined.cohort_start_date.year() - joined.year_of_birth
    return con.to_pandas(
        joined.aggregate(
            n_entries=joined.count(),
            n_people=joined.subject_id.nunique(),
            age_mean=age_expr.mean(),
            age_p50=age_expr.median(),
        )
    )

report = summarise_cohort_basics(con, "gibleed_cohort")
print(report.to_string(index=False))
```
# 16. Next steps
To grow this into a full observational analysis course:
1. **Cohort design rigor**: washout periods, prior observation, inclusion criteria, censoring.
2. **Confounding + covariates**: baseline covariate construction from condition/drug/measurement history.
3. **Comparative effectiveness**: target trial emulation, exposure cohorts, propensity scores.
4. **Outcome modeling**: incidence rates, Kaplan–Meier, Cox models (with careful design).
5. **Reusable pipelines**: parameterized functions producing standardized outputs.
# 17. Clean up
```{pyodide-python}
con.disconnect()
print("Disconnected.")
```
---
**Provenance**: This in-browser lesson uses the real **GiBleed Eunomia** example dataset and mirrors the [Pedagogical OMOP (GiBleed)](../notebooks/10_pedagogical_omop_gibleed.qmd) tutorial. For local work, use [CDMConnector](https://github.com/darwin-eu/CDMConnector) with [Eunomia](https://darwin-eu.github.io/CDMConnector/reference/eunomiaDir.html) (for example `eunomia_dir("GiBleed")`) and run the same workflow with Ibis + DuckDB or your backend. Prefer **Ibis** for queries; use **SQLGlot** to translate existing SQL between dialects when needed.