Cohort Characterization: Table 1

Compute a Table 1 (demographics + baseline comorbidities) from a cohort table using synpuf-1k.

You will learn

  • How to take a cohort table / membership set and join to person and observation_period
  • How to compute Table 1: sex, age, observation period duration
  • How to add baseline comorbidities (top 10 conditions in baseline window)
  • Optional: pretty output via great-tables (guard with try/except and fall back to a plain DataFrame)

Story question

“What do cohort members look like (demographics) and what are the top baseline conditions?”


Setup

This tutorial uses the synpuf-1k dataset and assumes a cohort table already exists (e.g. from generate_cohort_set or a pre-built cohort). If none is available, the skeleton below builds a demo “cohort” from the first 100 persons that have an observation period.

from pathlib import Path
import cdmconnector as cc
import ibis

path = cc.eunomia_dir("synpuf-1k", cdm_version="5.3")
con = ibis.duckdb.connect(path)
cdm = cc.cdm_from_con(con, cdm_schema="main", write_schema="main", cdm_name="eunomia")

Explore: Cohort membership

If a cohort table exists (e.g. “cohort”), use it. Otherwise, define a minimal demo cohort: the first 100 person_ids that have an observation_period, with the index date set to the earliest observation period start.

# Option A: use an existing cohort table if present
# cohort_members = cdm["cohort"]
# Option B: demo cohort = first 100 persons (by person_id) with at least one observation_period
person = cdm.person
op = cdm.observation_period
# index_date = earliest observation start; obs_end = latest observation end
cohort_demo = (
    op.group_by(op.person_id)
    .aggregate(
        index_date=op.observation_period_start_date.min(),
        obs_end=op.observation_period_end_date.max(),
    )
    .order_by("person_id")  # without an ordering, limit(100) picks arbitrary rows
    .limit(100)
)
cc.collect(cohort_demo.limit(5))
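Option A needs one extra step before it can stand in for the demo cohort. In the OMOP CDM, a cohort table carries the columns cohort_definition_id, subject_id, cohort_start_date and cohort_end_date, whereas the rest of this tutorial expects person_id and index_date. The sketch below shows that mapping in plain Python on made-up rows. The helper name first_entry_per_person is hypothetical, and keeping only each subject's earliest entry is an assumption of this sketch, not a rule of the cohort table.

```python
from datetime import date

# Made-up rows shaped like an OMOP CDM cohort table (illustrative only)
cohort_rows = [
    {"cohort_definition_id": 1, "subject_id": 7,
     "cohort_start_date": date(2009, 3, 1), "cohort_end_date": date(2009, 9, 1)},
    {"cohort_definition_id": 1, "subject_id": 7,
     "cohort_start_date": date(2008, 1, 15), "cohort_end_date": date(2008, 2, 15)},
    {"cohort_definition_id": 1, "subject_id": 9,
     "cohort_start_date": date(2010, 5, 20), "cohort_end_date": date(2010, 12, 31)},
    {"cohort_definition_id": 2, "subject_id": 9,
     "cohort_start_date": date(2007, 1, 1), "cohort_end_date": date(2007, 6, 1)},
]


def first_entry_per_person(rows, cohort_definition_id):
    """Return {person_id: index_date}, keeping each subject's earliest entry."""
    index_dates = {}
    for row in rows:
        if row["cohort_definition_id"] != cohort_definition_id:
            continue  # row belongs to a different cohort definition
        pid = row["subject_id"]  # subject_id plays the role of person_id
        start = row["cohort_start_date"]
        if pid not in index_dates or start < index_dates[pid]:
            index_dates[pid] = start
    return index_dates


index_by_person = first_entry_per_person(cohort_rows, cohort_definition_id=1)
```

With ibis the same idea would be a filter on cohort_definition_id, a rename of the two columns, and a group-by on person_id.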

Build: Table 1 — sex, age, observation period duration

Join the cohort to person for year of birth and sex (resolved through concept). Age at index and observation period duration use the index_date and obs_end already derived from observation_period.

person = cdm.person
concept = cdm.concept
cohort_with_person = cohort_demo.join(person, cohort_demo.person_id == person.person_id, how="left")
# Age at index: year of the index date minus year of birth (year-level precision)
cohort_with_person = cohort_with_person.mutate(
    age_at_index=cohort_with_person.index_date.year() - cohort_with_person.year_of_birth
)
cohort_with_sex = cohort_with_person.join(
    concept,
    cohort_with_person.gender_concept_id == concept.concept_id,
    how="left",
)
table1_expr = cohort_with_sex.aggregate(
    n=cohort_with_sex.person_id.count(),
    mean_age=cohort_with_sex.age_at_index.mean(),
    median_age=cohort_with_sex.age_at_index.median(),
)
# Sex: count cohort members per gender concept name
sex_counts = cohort_with_sex.group_by("concept_name").aggregate(
    n=cohort_with_sex.person_id.count()
)
# Duration: whole days from index_date to obs_end (delta needs a recent ibis release)
cohort_duration = cohort_demo.mutate(
    duration_days=cohort_demo.obs_end.delta(cohort_demo.index_date, "day")
)
duration_summary = cohort_duration.aggregate(
    mean_duration_days=cohort_duration.duration_days.mean(),
)
cc.collect(table1_expr)
cc.collect(sex_counts)
cc.collect(duration_summary)
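The age calculation above works at year precision. The OMOP person table also has optional month_of_birth and day_of_birth columns, so age in completed years can be computed when they are populated. The following is a minimal sketch in plain Python. The helper names are hypothetical, and defaulting a missing month or day to 1 is an assumption of this sketch rather than a convention taken from the source.

```python
from datetime import date


def age_at_index(index_date, year_of_birth, month_of_birth=None, day_of_birth=None):
    """Age in completed years at index_date.

    A missing month or day of birth defaults to 1 (assumption of this sketch).
    """
    month = month_of_birth or 1
    day = day_of_birth or 1
    age = index_date.year - year_of_birth
    # Subtract one year if the birthday has not yet occurred in the index year
    if (index_date.month, index_date.day) < (month, day):
        age -= 1
    return age


def duration_days(index_date, obs_end):
    """Whole days from index_date to obs_end."""
    return (obs_end - index_date).days


# Made-up values for illustration
example_age_exact = age_at_index(date(2010, 6, 15), 1950, 7, 1)
example_age_year_only = age_at_index(date(2010, 6, 15), 1950)
example_duration = duration_days(date(2008, 1, 1), date(2010, 12, 31))
```

The two age results differ by one year for the same person, which is the size of the error that year-only arithmetic can introduce.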

Interpret: Baseline comorbidities (top 10 conditions)

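Count conditions recorded in the baseline window (here, any time before index_date) and join to concept for names. The demo cohort sets index_date to the first observation period start, so this window can contain few or no records; a cohort with a later index date gives a more meaningful baseline.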

cond = cdm.condition_occurrence
concept = cdm.concept
# Conditions that start before the cohort index_date (baseline)
baseline_cond = cond.join(cohort_demo, cond.person_id == cohort_demo.person_id, how="inner")
baseline_cond = baseline_cond.filter(
    baseline_cond.condition_start_date < baseline_cond.index_date
)
# n counts condition records, not distinct persons
top_conditions = (
    baseline_cond
    .group_by("condition_concept_id")
    .aggregate(n=baseline_cond.condition_occurrence_id.count())
    .order_by(ibis.desc("n"))
    .limit(10)
)
top_with_name = top_conditions.join(
    concept,
    top_conditions.condition_concept_id == concept.concept_id,
    how="left",
)
# Re-apply the ordering: a join does not guarantee row order
cc.collect(
    top_with_name.select("condition_concept_id", "concept_name", "n").order_by(ibis.desc("n"))
)
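Two choices shape a baseline comorbidity table: how far back the window reaches, and whether to count records or distinct persons. The sketch below makes both explicit in plain Python on made-up records. The function name top_baseline_conditions is hypothetical, the 365-day lookback is only an example, and treating the window as half-open (index day excluded) is an assumption of this sketch.

```python
from collections import defaultdict
from datetime import date, timedelta


def top_baseline_conditions(conditions, index_dates, lookback_days=365, top_n=10):
    """Rank condition concepts by number of distinct persons in the baseline window.

    conditions:  iterable of (person_id, condition_concept_id, condition_start_date)
    index_dates: dict {person_id: index_date}
    Window: index_date - lookback_days <= start < index_date.
    """
    persons_by_concept = defaultdict(set)
    for person_id, concept_id, start in conditions:
        index_date = index_dates.get(person_id)
        if index_date is None:
            continue  # not a cohort member
        window_start = index_date - timedelta(days=lookback_days)
        if window_start <= start < index_date:
            persons_by_concept[concept_id].add(person_id)
    counts = [(cid, len(pids)) for cid, pids in persons_by_concept.items()]
    # Sort by descending person count, then concept id for a stable order
    counts.sort(key=lambda pair: (-pair[1], pair[0]))
    return counts[:top_n]


# Made-up records; the concept ids are placeholders, not real vocabulary ids
index_dates = {1: date(2010, 1, 1), 2: date(2010, 6, 1)}
conditions = [
    (1, 111, date(2009, 6, 1)),   # inside the window
    (1, 111, date(2009, 7, 1)),   # same person and concept: counted once
    (2, 111, date(2010, 1, 15)),  # inside the window
    (2, 222, date(2008, 1, 1)),   # older than the lookback
    (1, 333, date(2010, 1, 1)),   # on the index date: excluded
    (3, 111, date(2009, 6, 1)),   # person is not in the cohort
]
top = top_baseline_conditions(conditions, index_dates)
```

Counting distinct persons gives the prevalence-style numbers usually reported in a Table 1, whereas counting records favours conditions that are coded repeatedly for the same person.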

Optional: Pretty output with great-tables

If great-tables is installed, render Table 1 as a styled table; otherwise show plain DataFrame.

df = cc.collect(table1_expr)  # or combine demographics into one DataFrame
try:
    from great_tables import GT
    table1_out = GT(df)  # renders as an HTML table in notebooks
except ImportError:
    table1_out = df  # fallback: plain pandas DataFrame display
table1_out
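Whichever renderer is used, Table 1 is usually easier to style and export once the separate summaries are combined into one long characteristic/value layout. The sketch below builds such a layout and serializes it to CSV text using only the standard library. The function names, row labels and input numbers are all made up for illustration.

```python
import csv
import io


def table1_long(n, mean_age, median_age, sex_counts):
    """Build Table 1 rows as (characteristic, value) pairs."""
    rows = [
        ("N", str(n)),
        ("Mean age", f"{mean_age:.1f}"),
        ("Median age", f"{median_age:.1f}"),
    ]
    for label, count in sorted(sex_counts.items()):
        pct = 100.0 * count / n if n else 0.0
        rows.append((f"Sex: {label}", f"{count} ({pct:.1f}%)"))
    return rows


def to_csv_text(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["characteristic", "value"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up summary numbers for illustration only
rows = table1_long(n=100, mean_age=71.3, median_age=72.0,
                   sex_counts={"FEMALE": 55, "MALE": 45})
csv_text = to_csv_text(rows)
```

The same list of rows can be loaded into a DataFrame and passed to great-tables when it is installed, so the fallback and the styled path share one input.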

Exercises

  • Add more Table 1 columns: race, ethnicity (join to concept), and year of birth distribution.
  • Define baseline as -365 to 0 days before index and recompute top conditions.
  • Export Table 1 to CSV and (if great-tables) to HTML.

What we learned

  • Table 1: join cohort to person and observation_period; compute sex (concept), age at index, obs period duration.
  • Baseline comorbidities: filter condition_occurrence to before index_date; aggregate by condition_concept_id; join to concept for names.
  • great-tables: optional pretty tables; guard with try/except and fall back to plain DataFrame.

Cleanup

cdm.disconnect()