Understanding OMOP: People, Time, and Domains

Summarise person demographics, observation period coverage, and visits by type using synpuf-1k.

You will learn

  • How to summarise person demographics (counts by sex, age bands)
  • How to summarise observation period coverage (number of periods, mean and median duration)
  • How to count visits by type and join to concept for readable visit type names
  • The idea of “domains” and concept joins in the OMOP CDM

Story question

“Who is in this cohort (demographics), how much observation time do we have, and what visit types do we see?”


Setup

We use synpuf-1k for richer, more realistic distributions.

import cdmconnector as cc
import ibis

path = cc.eunomia_dir("synpuf-1k", cdm_version="5.3")
con = ibis.duckdb.connect(path)
cdm = cc.cdm_from_con(con, cdm_schema="main", write_schema="main", cdm_name="eunomia")

Explore: Person demographics summary

Count persons by sex (join to concept for labels) and by age band.

person = cdm.person
concept = cdm.concept
# Age is approximated as ref_year - year_of_birth (month and day are ignored).
# today().year changes between runs; set a fixed ref_year for reproducible bands.
import datetime
ref_year = datetime.date.today().year
person_with_age = person.mutate(age = ref_year - person.year_of_birth)
person_with_band = person_with_age.mutate(
    age_band = ibis.case().when(person_with_age.age < 18, "0-17")
    .when(person_with_age.age < 65, "18-64")
    .else_("65+").end()
)
by_band = person_with_band.group_by("age_band").aggregate(n=person_with_band.person_id.count())
cc.collect(by_band.order_by("age_band"))
# Count by sex: left join to concept so gender_concept_id gets a readable name
joined = person.join(concept, person.gender_concept_id == concept.concept_id, how="left")
by_sex = joined.group_by(sex=joined.concept_name).aggregate(n=joined.person_id.count())
cc.collect(by_sex)
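
The banding rule above can be checked outside the database. The following is a minimal plain-Python sketch, not part of cdmconnector; the function name age_band, the reference year 2010, and the birth years are assumptions chosen for illustration. Fixing the reference year keeps the bands stable between runs.

```python
def age_band(year_of_birth: int, ref_year: int = 2010) -> str:
    """Return the tutorial's age band for one birth year (hypothetical helper)."""
    # Approximate age: only the birth year is used, as in the ibis expression
    age = ref_year - year_of_birth
    if age < 18:
        return "0-17"
    if age < 65:
        return "18-64"
    return "65+"


# Hypothetical birth years around the band boundaries, not taken from synpuf-1k
for yob in (1993, 1992, 1946, 1945):
    print(yob, age_band(yob))
```

Testing the boundary years (17 vs 18, 64 vs 65) is a quick way to confirm that the chained conditions in the ibis case expression are ordered correctly.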

Build: Observation period coverage

Summarise observation period duration across all periods (mean and median days per period, plus the number of periods).

op = cdm.observation_period
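# Duration in whole days. In ibis, subtracting two dates gives an interval
# rather than a number, so use delta(..., "day") to get an integer day count
# that mean() and median() can aggregate.
duration_days = op.observation_period_end_date.delta(
    op.observation_period_start_date, "day"
)
op_with_dur = op.mutate(duration_days=duration_days)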
summary = op_with_dur.aggregate(
    mean_days=op_with_dur.duration_days.mean(),
    median_days=op_with_dur.duration_days.median(),
    n_periods=op_with_dur.observation_period_id.count(),
)
cc.collect(summary)
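
To make the arithmetic concrete, here is a small standard-library sketch of the same summary on three made-up periods. The dates are hypothetical and not taken from synpuf-1k; the sketch only illustrates what the day counts mean.

```python
from datetime import date
from statistics import mean, median

# Hypothetical (start, end) observation periods, not taken from synpuf-1k
periods = [
    (date(2008, 1, 1), date(2010, 12, 31)),
    (date(2008, 6, 1), date(2009, 5, 31)),
    (date(2009, 1, 1), date(2009, 1, 31)),
]

# Subtracting two dates gives a timedelta; .days is the whole-day difference
duration_days = [(end - start).days for start, end in periods]

summary = {
    "mean_days": mean(duration_days),
    "median_days": median(duration_days),
    "n_periods": len(duration_days),
}
print(duration_days)
print(summary)
```

Note that end minus start excludes the end date itself; add 1 to each duration if both endpoints should count as observed days. With skewed durations like these, the mean and median can differ substantially, which is why the summary reports both.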

Interpret: Visits by type

Count visits by visit_concept_id and join to concept to show visit type names. The same pattern (a *_concept_id column joined to concept.concept_id) applies to the other domain tables.

visit = cdm.visit_occurrence
concept = cdm.concept
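visit_with_name = visit.join(
    concept,
    visit.visit_concept_id == concept.concept_id,
    how="left",  # keep visits whose concept id has no match in concept
)
# Group on the joined table's columns, then keep the 10 most frequent types
by_visit_type = (
    visit_with_name.group_by(visit_type=visit_with_name.concept_name)
    .aggregate(n=visit_with_name.visit_occurrence_id.count())
    .order_by(ibis.desc("n"))
    .limit(10)
)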
cc.collect(by_visit_type)
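
The join above is easier to reason about on a toy example. The sketch below uses the standard-library sqlite3 module with two tiny hand-made tables; the rows are hypothetical, toy_con is a name chosen for this sketch, and concept id 0 is deliberately left out of the toy concept table to show what the left join does with an unmatched id.

```python
import sqlite3

# In-memory toy database; table and column names follow the OMOP CDM,
# but the rows are made up for illustration
toy_con = sqlite3.connect(":memory:")
toy_con.executescript("""
CREATE TABLE concept (concept_id INTEGER PRIMARY KEY, concept_name TEXT);
CREATE TABLE visit_occurrence (
    visit_occurrence_id INTEGER PRIMARY KEY,
    visit_concept_id INTEGER
);
INSERT INTO concept VALUES (9201, 'Inpatient Visit'), (9202, 'Outpatient Visit');
INSERT INTO visit_occurrence VALUES (1, 9202), (2, 9202), (3, 9201), (4, 0);
""")

# LEFT JOIN keeps visit 4 even though its concept id has no row in concept
rows = toy_con.execute("""
SELECT c.concept_name AS visit_type, COUNT(v.visit_occurrence_id) AS n
FROM visit_occurrence AS v
LEFT JOIN concept AS c ON v.visit_concept_id = c.concept_id
GROUP BY c.concept_name
ORDER BY n DESC
""").fetchall()
toy_con.close()

counts = dict(rows)
print(counts)
```

With an inner join the unmatched visit would silently disappear from the counts; with the left join it appears under a missing (None) visit type, so the totals still add up to the number of visits.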

Exercises

  • Add a demographics summary that combines sex and age_band (two-way table).
  • Compute the distribution of observation period count per person (how many people have 1, 2, 3+ periods).
  • List top 5 visit types by count and show their concept_id and concept_name.
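  • Explore “domains”: condition_occurrence, drug_exposure, and measurement each link to concept through their own concept id column (condition_concept_id, drug_concept_id, measurement_concept_id); repeat the visit-type join for one of these tables.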

What we learned

  • Person and observation_period define who is in the CDM and their at-risk time.
  • Observation period duration summarises coverage; aggregate with mean/median/count.
  • Visit type and other domain tables use concept_id; join to concept for human-readable names and domain understanding.
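
The last point can be summarised as a lookup from domain table to its concept id column. The sketch below is illustrative only: the column names are the standard OMOP CDM v5.3 ones, while DOMAIN_CONCEPT_COLUMNS and concept_join_sql are hypothetical helpers written for this tutorial, not part of cdmconnector.

```python
# Domain table -> column holding that domain's main concept id (OMOP CDM v5.3)
DOMAIN_CONCEPT_COLUMNS = {
    "visit_occurrence": "visit_concept_id",
    "condition_occurrence": "condition_concept_id",
    "drug_exposure": "drug_concept_id",
    "measurement": "measurement_concept_id",
}


def concept_join_sql(table: str) -> str:
    """Build a counting query that labels one domain table with concept names."""
    column = DOMAIN_CONCEPT_COLUMNS[table]  # KeyError for tables not listed above
    return (
        f"SELECT c.concept_name, COUNT(*) AS n "
        f"FROM {table} AS t "
        f"LEFT JOIN concept AS c ON t.{column} = c.concept_id "
        f"GROUP BY c.concept_name ORDER BY n DESC"
    )


print(concept_join_sql("drug_exposure"))
```

Only the table name and the concept id column change between domains; the join to concept stays the same.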

Cleanup

cdm.disconnect()