from pathlib import Path
import cdmconnector as cc
import ibis
path = cc.eunomia_dir("synpuf-1k", cdm_version="5.3")
con = ibis.duckdb.connect(path)
cdm = cc.cdm_from_con(con, cdm_schema="main", write_schema="main", cdm_name="eunomia")Measurement Distributions with Guardrails
Summarise numeric measurements: missingness, outliers, summary stats, and optional unit handling.
You will learn
- How to focus on numeric-only measurements (value_as_number)
- How to summarise missingness and outliers
- Summary stats and quantiles for one measurement concept
- Optional unit handling (placeholder if units absent)
Story question
“What is the distribution of a key measurement (e.g. a lab) in this population, and how much is missing or extreme?”
Setup
synpuf-1k for richer measurement data.
Explore: Measurement table and one concept
Pick one measurement_concept_id (e.g. a common lab). Filter to rows with non-null value_as_number.
meas = cdm.measurement
# Discover a common measurement_concept_id (top by count with non-null value_as_number)
top_concepts = (
meas.filter(meas.value_as_number.notnull())
.group_by(meas.measurement_concept_id)
.aggregate(n=meas.measurement_id.count())
.order_by(ibis.desc("n"))
.limit(1)
)
measurement_concept_id = cc.collect(top_concepts)["measurement_concept_id"].iloc[0]
# Or use a fixed concept_id (e.g. 3004249 = Serum glucose): measurement_concept_id = 3004249
meas_one = meas.filter(
meas.measurement_concept_id == measurement_concept_id,
meas.value_as_number.notnull(),
)Build: Missingness and summary stats
Count rows with/without value_as_number; then summary stats and quantiles for numeric values.
# All rows for this concept (for missingness)
meas_all = meas.filter(meas.measurement_concept_id == measurement_concept_id)
n_total = meas_all.measurement_id.count()
n_with_value = meas_all.filter(meas_all.value_as_number.notnull()).measurement_id.count()
# Execute and show n_total, n_with_value (missingness)
cc.collect(meas_all.aggregate(n_total=n_total, n_with_value=n_with_value))# Summary stats and quantiles (numeric only)
summary = meas_one.aggregate(
n=meas_one.measurement_id.count(),
mean_val=meas_one.value_as_number.mean(),
std_val=meas_one.value_as_number.std(),
min_val=meas_one.value_as_number.min(),
q25=meas_one.value_as_number.quantile(0.25),
median_val=meas_one.value_as_number.median(),
q75=meas_one.value_as_number.quantile(0.75),
max_val=meas_one.value_as_number.max(),
)
cc.collect(summary)Interpret: Outliers and optional unit handling
Outliers: e.g. filter to IQR-based range or flag. Unit handling: placeholder if unit_concept_id / value_as_concept_id absent.
# Optional unit handling: filter by unit_concept_id or join to concept for unit names
# If unit_concept_id is present, you can restrict to a specific unit (e.g. mg/dL):
# meas_one_unit = meas_one.filter(meas_one.unit_concept_id == 8718) # 8718 = mg/dL
# Or join to concept to show unit names: meas_one.join(cdm.concept, meas_one.unit_concept_id == cdm.concept.concept_id)
concept = cdm.concept
meas_with_unit_name = meas_one.join(
concept, meas_one.unit_concept_id == concept.concept_id, how="left"
).select("measurement_id", "value_as_number", concept.concept_name.name("unit_name"))
cc.collect(meas_with_unit_name.limit(5))
# Outliers: exclude outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (e.g. compute q25/q75 from summary, then filter in Ibis or post-collect)Exercises
- Restrict to a population subset (e.g. persons with at least one condition_occurrence in a given year) and recompute summary stats.
- Per-person summaries: mean value_as_number per person_id for this measurement concept; then summarise that distribution.
- Add a second measurement concept and compare distributions (side-by-side or separate tables).
What we learned
- Numeric-only measurements: filter to non-null value_as_number for stats.
- Missingness: count rows with/without value; summary stats and quantiles describe distribution.
- Outliers and units: handle with IQR or domain rules; unit_concept_id when available.
Cleanup
cdm.disconnect()