Welcome: Your First CDM

Connect to an OMOP CDM and inspect tables and row counts.

You will learn

What the CDM reference object (cdm) is and how it holds OMOP tables
How to list CDM tables and access them by name
How to get person count and preview a few rows
How to get a quick snapshot of your CDM
How to subset a cdm reference object to one or more person_ids
How to clean up the connection when done

Motivating question

“How many people are in this database, and what do the first few person rows look like?”

Setup

We will use the GiBleed Eunomia example dataset and connect via DuckDB.

import cdmconnector as cc
import ibis as ib

print(cc.example_datasets())

path = cc.eunomia_dir("synpuf-1k", cdm_version="5.3")
con = ib.duckdb.connect(path)
cdm = cc.cdm_from_con(con, cdm_schema="main", write_schema="main")
print(cdm)

('GiBleed', 'synthea-allergies-10k', 'synthea-anemia-10k', 'synthea-breast_cancer-10k', 'synthea-contraceptives-10k', 'synthea-covid19-10k', 'synthea-covid19-200k', 'synthea-dermatitis-10k', 'synthea-heart-10k', 'synthea-hiv-10k', 'synthea-lung_cancer-10k', 'synthea-medications-10k', 'synthea-metabolic_syndrome-10k', 'synthea-opioid_addiction-10k', 'synthea-rheumatoid_arthritis-10k', 'synthea-snf-10k', 'synthea-surgery-10k', 'synthea-total_joint_replacement-10k', 'synthea-veteran_prostate_cancer-10k', 'synthea-veterans-10k', 'synthea-weight_loss-10k', 'synpuf-1k', 'synpuf-110k', 'empty_cdm', 'Synthea27NjParquet')
eunomia_dir: data folder = /Users/ablack/eunomia_data_py
('GiBleed', 'synthea-allergies-10k', 'synthea-anemia-10k', 'synthea-breast_cancer-10k', 'synthea-contraceptives-10k', 'synthea-covid19-10k', 'synthea-covid19-200k', 'synthea-dermatitis-10k', 'synthea-heart-10k', 'synthea-hiv-10k', 'synthea-lung_cancer-10k', 'synthea-medications-10k', 'synthea-metabolic_syndrome-10k', 'synthea-opioid_addiction-10k', 'synthea-rheumatoid_arthritis-10k', 'synthea-snf-10k', 'synthea-surgery-10k', 'synthea-total_joint_replacement-10k', 'synthea-veteran_prostate_cancer-10k', 'synthea-veterans-10k', 'synthea-weight_loss-10k', 'synpuf-1k', 'synpuf-110k', 'empty_cdm', 'Synthea27NjParquet')
eunomia_dir: data folder = /Users/ablack/eunomia_data_py
# OMOP CDM reference (duckdb) of Synpuf

  omop tables: care_site, cdm_source, concept, concept_ancestor, concept_relationship, concept_synonym, condition_era, condition_occurrence, cost, death, device_exposure, dose_era, drug_era, drug_exposure, drug_strength, fact_relationship, location, measurement, note, note_nlp, observation, observation_period, payer_plan_period, person, procedure_occurrence, provider, specimen, visit_detail, visit_occurrence, vocabulary
  cohort tables: -
  achilles tables: -
  other tables: -
# OMOP CDM reference (duckdb) of Synpuf

  omop tables: care_site, cdm_source, concept, concept_ancestor, concept_relationship, concept_synonym, condition_era, condition_occurrence, cost, death, device_exposure, dose_era, drug_era, drug_exposure, drug_strength, fact_relationship, location, measurement, note, note_nlp, observation, observation_period, payer_plan_period, person, procedure_occurrence, provider, specimen, visit_detail, visit_occurrence, vocabulary
  cohort tables: -
  achilles tables: -
  other tables: -

Explore: CDM reference and table listing

The cdm object holds a mapping of table names to Ibis table expressions. List available tables.

# List CDM table names
cdm.tables

['care_site',
 'cdm_source',
 'concept',
 'concept_ancestor',
 'concept_relationship',
 'concept_synonym',
 'condition_era',
 'condition_occurrence',
 'cost',
 'death',
 'device_exposure',
 'dose_era',
 'drug_era',
 'drug_exposure',
 'drug_strength',
 'fact_relationship',
 'location',
 'measurement',
 'note',
 'note_nlp',
 'observation',
 'observation_period',
 'payer_plan_period',
 'person',
 'procedure_occurrence',
 'provider',
 'specimen',
 'visit_detail',
 'visit_occurrence',
 'vocabulary']

cdm.person.schema()

ibis.Schema {
  person_id                    int32
  gender_concept_id            int32
  year_of_birth                int32
  month_of_birth               int32
  day_of_birth                 int32
  birth_datetime               timestamp(6)
  race_concept_id              int32
  ethnicity_concept_id         int32
  location_id                  int32
  provider_id                  int32
  care_site_id                 int32
  person_source_value          string
  gender_source_value          string
  gender_source_concept_id     int32
  race_source_value            string
  race_source_concept_id       int32
  ethnicity_source_value       string
  ethnicity_source_concept_id  int32
}

Person count

Use count and execute to execute an expression and get a pandas DataFrame. For a scalar (e.g. count), the result is a small DataFrame.

cdm.person.count().execute()

# Preview a few rows (limit to avoid large output)
cdm.person.head(10).execute()

	person_id	gender_concept_id	year_of_birth	month_of_birth	day_of_birth	birth_datetime	race_concept_id	ethnicity_concept_id	location_id	provider_id	care_site_id	person_source_value	gender_source_value	race_source_value	ethnicity_source_value
0	0	8507	1932	7	1	NaT	8527	38003564	321	None	None	408666C393995BA0	Male	White	Not Hispanic or Latino
1	1	8532	1957	9	1	NaT	0	0	92	None	None	DAD9002C13FC4826	Female	Others	Others
2	2	8532	1934	11	1	NaT	8527	38003564	510	None	None	9CFF524AE82DAB2C	Female	White	Not Hispanic or Latino
3	3	8507	1942	9	1	NaT	8527	38003564	63	None	None	E8CBE18D26D1E3A4	Male	White	Not Hispanic or Latino
4	4	8507	1934	7	1	NaT	8527	38003564	509	None	None	EB7323CB1826E3D0	Male	White	Not Hispanic or Latino
5	5	8532	1930	12	1	NaT	8527	38003564	298	None	None	739975F3C289B660	Female	White	Not Hispanic or Latino
6	6	8532	1942	5	1	NaT	8527	38003564	305	None	None	302538B9B3A20925	Female	White	Not Hispanic or Latino
7	7	8507	1943	4	1	NaT	8516	38003564	378	None	None	891B7BF6BAF5D198	Male	Black	Not Hispanic or Latino
8	8	8532	1951	12	1	NaT	8527	38003564	684	None	None	793DEC6E5095CBF7	Female	White	Not Hispanic or Latino
9	9	8532	1942	2	1	NaT	8527	38003564	148	None	None	ADE4E01133FF6E70	Female	White	Not Hispanic or Latino

Behind the scenes Ibis sends a SQL query to our database

cdm.person.head(5).compile()

'SELECT * FROM "main"."person" AS "t0" LIMIT 5'

The cdm reference stores the cdm name and cdm version which are often helpful. The cdm name is set by looking it up in the cdm_source table of the OMOP CDM database. However it can be overridden by the cdm_name argument in cdm_from_con.

print("cdm.name:", cdm.name)
print("cdm.version:", cdm.version)

cdm.name: Synpuf
cdm.version: 5.3
cdm.name: Synpuf
cdm.version: 5.3

CDMConnector has a snapshot function that will return a dataframe with some summary information about the database. It’s a quick way to get some helpful metadata.

cdm.snapshot()

	cdm_name	cdm_source_name	cdm_description	cdm_documentation_reference	cdm_version	cdm_holder	cdm_release_date	vocabulary_version	person_count	observation_period_count	earliest_observation_period_start_date	latest_observation_period_end_date	snapshot_date	cdm_data_hash
0	Synpuf	Synpuf			5.3	ohdsi	2018-03-15	v5.0 06-AUG-21	1000	1048	2008-01-01 00:00:00	2010-12-31 00:00:00	2026-02-01

Flattening is a CDM transformation that puts all of the CDM’s clinical data into a single table. It can be helpful for getting a timeline view of a person’s medical history.

cdm.flatten().execute().sort_values(by=["person_id", "start_date"])

	person_id	observation_concept_id	start_date	end_date	type_concept_id	domain	observation_concept_id_right	observation_concept_name	type_concept_id_right	type_concept_name
58521	0	374372	2008-02-19	2008-02-19	45756835	condition_occurrence	374372	Otogenic otalgia	45756835	Carrier claim header - 1st position
92392	0	374375	2008-02-19	2008-02-19	45756843	condition_occurrence	374375	Impacted cerumen	45756843	Carrier claim detail - 1st position
209028	0	2111199	2008-02-19	2008-02-19	45756900	procedure_occurrence	2111199	Removal impacted cerumen requiring instrumenta...	45756900	Carrier claim detail - 1st position
77318	0	2102077	2008-02-23	2008-02-23	45756901	procedure_occurrence	2102077	Debridement of nail(s) by any method(s); 6 or ...	45756901	Carrier claim detail - 2nd position
119492	0	1506430	2008-02-23	2008-02-23	38000179	drug_exposure	1506430	methylprednisolone 40 MG Injection	38000179	Physician administered drug (identified as pro...
...	...	...	...	...	...	...	...	...	...	...
189748	1494	135214	2010-11-29	2010-11-29	45756843	condition_occurrence	135214	Polycythemia vera (clinical)	45756843	Carrier claim detail - 1st position
34969	1494	2414395	2010-12-01	2010-12-01	45756900	procedure_occurrence	2414395	Office or other outpatient visit for the evalu...	45756900	Carrier claim detail - 1st position
102005	1494	437530	2010-12-01	2010-12-01	45756836	condition_occurrence	437530	Disorder of lipid metabolism	45756836	Carrier claim header - 2nd position
153212	1494	443861	2010-12-01	2010-12-01	45756843	condition_occurrence	443861	Open wound of foot except toes with complication	45756843	Carrier claim detail - 1st position
237358	1494	321052	2010-12-01	2010-12-01	45756837	condition_occurrence	321052	Peripheral vascular disease	45756837	Carrier claim header - 3rd position

271901 rows × 10 columns

Often CDMs are too big to bring into python so normally we would filter the cdm first and then flatten it. We can filter an entire CDM for a one or more specific person_ids with subset.

In case you’re wondering if we copy the entire cdm when we create a subset you can be rest assured we do not. cdm.subset(person_id = 0) creates SQL views for each of the cdm tables with a WHERE person_id = 0 clause applied to each table.

cdm = cdm.subset(person_id=[0])
cdm.flatten().execute()

	person_id	observation_concept_id	start_date	end_date	type_concept_id	domain	observation_concept_id_right	observation_concept_name	type_concept_id_right	type_concept_name
0	0	2211767	2010-07-27	2010-07-27	38000267	procedure_occurrence	2211767	Ultrasound, pelvic (nonobstetric), real time w...	38000267	Outpatient detail - 1st position
1	0	2313647	2008-05-20	2008-05-20	45756901	procedure_occurrence	2313647	Visual field examination, unilateral or bilate...	45756901	Carrier claim detail - 2nd position
2	0	2314331	2009-03-24	2009-03-24	45756901	procedure_occurrence	2314331	Chiropractic manipulative treatment (CMT); spi...	45756901	Carrier claim detail - 2nd position
3	0	2617289	2008-09-10	2008-09-10	38000267	procedure_occurrence	2617289	Screening mammography, bilateral (2-view study...	38000267	Outpatient detail - 1st position
4	0	1506430	2008-02-23	2008-02-23	38000179	drug_exposure	1506430	methylprednisolone 40 MG Injection	38000179	Physician administered drug (identified as pro...
...	...	...	...	...	...	...	...	...	...	...
271	0	317302	2008-11-25	2008-11-25	45756835	condition_occurrence	317302	Sinus node dysfunction	45756835	Carrier claim header - 1st position
272	0	317893	2008-11-25	2008-11-25	45756843	condition_occurrence	317893	Paroxysmal supraventricular tachycardia	45756843	Carrier claim detail - 1st position
273	0	375552	2008-06-24	2008-06-24	45756836	condition_occurrence	375552	Pterygium	45756836	Carrier claim header - 2nd position
274	0	432867	2008-11-11	2008-11-11	45756837	condition_occurrence	432867	Hyperlipidemia	45756837	Carrier claim header - 3rd position
275	0	432867	2009-05-15	2009-05-15	45756844	condition_occurrence	432867	Hyperlipidemia	45756844	Carrier claim detail - 2nd position

276 rows × 10 columns

You now have the total number of persons and a small preview. Use this pattern for any CDM table with cdm["table_name"] or cdm.table_name.

Cleanup

Disconnect the CDM reference when done (releases resources).

cdm.disconnect()