Welcome: Your First CDM

Connect to an OMOP CDM and inspect tables and row counts.

You will learn

  • What the CDM reference object (cdm) is and how it holds OMOP tables
  • How to list CDM tables and access them by name
  • How to get person count and preview a few rows
  • How to get a quick snapshot of your CDM
  • How to subset a cdm reference object to one or more person_ids
  • How to clean up the connection when done

Motivating question

“How many people are in this database, and what do the first few person rows look like?”


Setup

We will use the GiBleed Eunomia example dataset and connect via DuckDB.

import cdmconnector as cc
import ibis as ib

print(cc.example_datasets())

path = cc.eunomia_dir("synpuf-1k", cdm_version="5.3")
con = ib.duckdb.connect(path)
cdm = cc.cdm_from_con(con, cdm_schema="main", write_schema="main")
print(cdm)
('GiBleed', 'synthea-allergies-10k', 'synthea-anemia-10k', 'synthea-breast_cancer-10k', 'synthea-contraceptives-10k', 'synthea-covid19-10k', 'synthea-covid19-200k', 'synthea-dermatitis-10k', 'synthea-heart-10k', 'synthea-hiv-10k', 'synthea-lung_cancer-10k', 'synthea-medications-10k', 'synthea-metabolic_syndrome-10k', 'synthea-opioid_addiction-10k', 'synthea-rheumatoid_arthritis-10k', 'synthea-snf-10k', 'synthea-surgery-10k', 'synthea-total_joint_replacement-10k', 'synthea-veteran_prostate_cancer-10k', 'synthea-veterans-10k', 'synthea-weight_loss-10k', 'synpuf-1k', 'synpuf-110k', 'empty_cdm', 'Synthea27NjParquet')
eunomia_dir: data folder = /Users/ablack/eunomia_data_py
('GiBleed', 'synthea-allergies-10k', 'synthea-anemia-10k', 'synthea-breast_cancer-10k', 'synthea-contraceptives-10k', 'synthea-covid19-10k', 'synthea-covid19-200k', 'synthea-dermatitis-10k', 'synthea-heart-10k', 'synthea-hiv-10k', 'synthea-lung_cancer-10k', 'synthea-medications-10k', 'synthea-metabolic_syndrome-10k', 'synthea-opioid_addiction-10k', 'synthea-rheumatoid_arthritis-10k', 'synthea-snf-10k', 'synthea-surgery-10k', 'synthea-total_joint_replacement-10k', 'synthea-veteran_prostate_cancer-10k', 'synthea-veterans-10k', 'synthea-weight_loss-10k', 'synpuf-1k', 'synpuf-110k', 'empty_cdm', 'Synthea27NjParquet')
eunomia_dir: data folder = /Users/ablack/eunomia_data_py
# OMOP CDM reference (duckdb) of Synpuf

  omop tables: care_site, cdm_source, concept, concept_ancestor, concept_relationship, concept_synonym, condition_era, condition_occurrence, cost, death, device_exposure, dose_era, drug_era, drug_exposure, drug_strength, fact_relationship, location, measurement, note, note_nlp, observation, observation_period, payer_plan_period, person, procedure_occurrence, provider, specimen, visit_detail, visit_occurrence, vocabulary
  cohort tables: -
  achilles tables: -
  other tables: -
# OMOP CDM reference (duckdb) of Synpuf

  omop tables: care_site, cdm_source, concept, concept_ancestor, concept_relationship, concept_synonym, condition_era, condition_occurrence, cost, death, device_exposure, dose_era, drug_era, drug_exposure, drug_strength, fact_relationship, location, measurement, note, note_nlp, observation, observation_period, payer_plan_period, person, procedure_occurrence, provider, specimen, visit_detail, visit_occurrence, vocabulary
  cohort tables: -
  achilles tables: -
  other tables: -

Explore: CDM reference and table listing

The cdm object holds a mapping of table names to Ibis table expressions. List available tables.

# List CDM table names
cdm.tables
['care_site',
 'cdm_source',
 'concept',
 'concept_ancestor',
 'concept_relationship',
 'concept_synonym',
 'condition_era',
 'condition_occurrence',
 'cost',
 'death',
 'device_exposure',
 'dose_era',
 'drug_era',
 'drug_exposure',
 'drug_strength',
 'fact_relationship',
 'location',
 'measurement',
 'note',
 'note_nlp',
 'observation',
 'observation_period',
 'payer_plan_period',
 'person',
 'procedure_occurrence',
 'provider',
 'specimen',
 'visit_detail',
 'visit_occurrence',
 'vocabulary']
cdm.person.schema()
ibis.Schema {
  person_id                    int32
  gender_concept_id            int32
  year_of_birth                int32
  month_of_birth               int32
  day_of_birth                 int32
  birth_datetime               timestamp(6)
  race_concept_id              int32
  ethnicity_concept_id         int32
  location_id                  int32
  provider_id                  int32
  care_site_id                 int32
  person_source_value          string
  gender_source_value          string
  gender_source_concept_id     int32
  race_source_value            string
  race_source_concept_id       int32
  ethnicity_source_value       string
  ethnicity_source_concept_id  int32
}

Person count

Use count and execute to execute an expression and get a pandas DataFrame. For a scalar (e.g. count), the result is a small DataFrame.

cdm.person.count().execute()
1000
# Preview a few rows (limit to avoid large output)
cdm.person.head(10).execute()
person_id gender_concept_id year_of_birth month_of_birth day_of_birth birth_datetime race_concept_id ethnicity_concept_id location_id provider_id care_site_id person_source_value gender_source_value gender_source_concept_id race_source_value race_source_concept_id ethnicity_source_value ethnicity_source_concept_id
0 0 8507 1932 7 1 NaT 8527 38003564 321 None None 408666C393995BA0 Male 0 White 0 Not Hispanic or Latino 0
1 1 8532 1957 9 1 NaT 0 0 92 None None DAD9002C13FC4826 Female 0 Others 0 Others 0
2 2 8532 1934 11 1 NaT 8527 38003564 510 None None 9CFF524AE82DAB2C Female 0 White 0 Not Hispanic or Latino 0
3 3 8507 1942 9 1 NaT 8527 38003564 63 None None E8CBE18D26D1E3A4 Male 0 White 0 Not Hispanic or Latino 0
4 4 8507 1934 7 1 NaT 8527 38003564 509 None None EB7323CB1826E3D0 Male 0 White 0 Not Hispanic or Latino 0
5 5 8532 1930 12 1 NaT 8527 38003564 298 None None 739975F3C289B660 Female 0 White 0 Not Hispanic or Latino 0
6 6 8532 1942 5 1 NaT 8527 38003564 305 None None 302538B9B3A20925 Female 0 White 0 Not Hispanic or Latino 0
7 7 8507 1943 4 1 NaT 8516 38003564 378 None None 891B7BF6BAF5D198 Male 0 Black 0 Not Hispanic or Latino 0
8 8 8532 1951 12 1 NaT 8527 38003564 684 None None 793DEC6E5095CBF7 Female 0 White 0 Not Hispanic or Latino 0
9 9 8532 1942 2 1 NaT 8527 38003564 148 None None ADE4E01133FF6E70 Female 0 White 0 Not Hispanic or Latino 0

Behind the scenes Ibis sends a SQL query to our database

cdm.person.head(5).compile()
'SELECT * FROM "main"."person" AS "t0" LIMIT 5'

The cdm reference stores the cdm name and cdm version which are often helpful. The cdm name is set by looking it up in the cdm_source table of the OMOP CDM database. However it can be overridden by the cdm_name argument in cdm_from_con.

print("cdm.name:", cdm.name)
print("cdm.version:", cdm.version)
cdm.name: Synpuf
cdm.version: 5.3
cdm.name: Synpuf
cdm.version: 5.3

CDMConnector has a snapshot function that will return a dataframe with some summary information about the database. It’s a quick way to get some helpful metadata.

cdm.snapshot()
cdm_name cdm_source_name cdm_description cdm_documentation_reference cdm_version cdm_holder cdm_release_date vocabulary_version person_count observation_period_count earliest_observation_period_start_date latest_observation_period_end_date snapshot_date cdm_data_hash
0 Synpuf Synpuf 5.3 ohdsi 2018-03-15 v5.0 06-AUG-21 1000 1048 2008-01-01 00:00:00 2010-12-31 00:00:00 2026-02-01

Flattening is a CDM transformation that puts all of the CDM’s clinical data into a single table. It can be helpful for getting a timeline view of a person’s medical history.

cdm.flatten().execute().sort_values(by=["person_id", "start_date"])
person_id observation_concept_id start_date end_date type_concept_id domain observation_concept_id_right observation_concept_name type_concept_id_right type_concept_name
58521 0 374372 2008-02-19 2008-02-19 45756835 condition_occurrence 374372 Otogenic otalgia 45756835 Carrier claim header - 1st position
92392 0 374375 2008-02-19 2008-02-19 45756843 condition_occurrence 374375 Impacted cerumen 45756843 Carrier claim detail - 1st position
209028 0 2111199 2008-02-19 2008-02-19 45756900 procedure_occurrence 2111199 Removal impacted cerumen requiring instrumenta... 45756900 Carrier claim detail - 1st position
77318 0 2102077 2008-02-23 2008-02-23 45756901 procedure_occurrence 2102077 Debridement of nail(s) by any method(s); 6 or ... 45756901 Carrier claim detail - 2nd position
119492 0 1506430 2008-02-23 2008-02-23 38000179 drug_exposure 1506430 methylprednisolone 40 MG Injection 38000179 Physician administered drug (identified as pro...
... ... ... ... ... ... ... ... ... ... ...
189748 1494 135214 2010-11-29 2010-11-29 45756843 condition_occurrence 135214 Polycythemia vera (clinical) 45756843 Carrier claim detail - 1st position
34969 1494 2414395 2010-12-01 2010-12-01 45756900 procedure_occurrence 2414395 Office or other outpatient visit for the evalu... 45756900 Carrier claim detail - 1st position
102005 1494 437530 2010-12-01 2010-12-01 45756836 condition_occurrence 437530 Disorder of lipid metabolism 45756836 Carrier claim header - 2nd position
153212 1494 443861 2010-12-01 2010-12-01 45756843 condition_occurrence 443861 Open wound of foot except toes with complication 45756843 Carrier claim detail - 1st position
237358 1494 321052 2010-12-01 2010-12-01 45756837 condition_occurrence 321052 Peripheral vascular disease 45756837 Carrier claim header - 3rd position

271901 rows × 10 columns

Often CDMs are too big to bring into python so normally we would filter the cdm first and then flatten it. We can filter an entire CDM for a one or more specific person_ids with subset.

In case you’re wondering if we copy the entire cdm when we create a subset you can be rest assured we do not. cdm.subset(person_id = 0) creates SQL views for each of the cdm tables with a WHERE person_id = 0 clause applied to each table.

cdm = cdm.subset(person_id=[0])
cdm.flatten().execute()
person_id observation_concept_id start_date end_date type_concept_id domain observation_concept_id_right observation_concept_name type_concept_id_right type_concept_name
0 0 2211767 2010-07-27 2010-07-27 38000267 procedure_occurrence 2211767 Ultrasound, pelvic (nonobstetric), real time w... 38000267 Outpatient detail - 1st position
1 0 2313647 2008-05-20 2008-05-20 45756901 procedure_occurrence 2313647 Visual field examination, unilateral or bilate... 45756901 Carrier claim detail - 2nd position
2 0 2314331 2009-03-24 2009-03-24 45756901 procedure_occurrence 2314331 Chiropractic manipulative treatment (CMT); spi... 45756901 Carrier claim detail - 2nd position
3 0 2617289 2008-09-10 2008-09-10 38000267 procedure_occurrence 2617289 Screening mammography, bilateral (2-view study... 38000267 Outpatient detail - 1st position
4 0 1506430 2008-02-23 2008-02-23 38000179 drug_exposure 1506430 methylprednisolone 40 MG Injection 38000179 Physician administered drug (identified as pro...
... ... ... ... ... ... ... ... ... ... ...
271 0 317302 2008-11-25 2008-11-25 45756835 condition_occurrence 317302 Sinus node dysfunction 45756835 Carrier claim header - 1st position
272 0 317893 2008-11-25 2008-11-25 45756843 condition_occurrence 317893 Paroxysmal supraventricular tachycardia 45756843 Carrier claim detail - 1st position
273 0 375552 2008-06-24 2008-06-24 45756836 condition_occurrence 375552 Pterygium 45756836 Carrier claim header - 2nd position
274 0 432867 2008-11-11 2008-11-11 45756837 condition_occurrence 432867 Hyperlipidemia 45756837 Carrier claim header - 3rd position
275 0 432867 2009-05-15 2009-05-15 45756844 condition_occurrence 432867 Hyperlipidemia 45756844 Carrier claim detail - 2nd position

276 rows × 10 columns

You now have the total number of persons and a small preview. Use this pattern for any CDM table with cdm["table_name"] or cdm.table_name.

Cleanup

Disconnect the CDM reference when done (releases resources).

cdm.disconnect()