Summarise clinical tables records • OmopSketch

Introduction

In this vignette, we will explore the OmopSketch functions designed to provide an overview of the clinical tables within a CDM object (e.g. visit_occurrence, condition_occurrence, drug_exposure, procedure_occurrence, device_exposure, measurement, observation, and death). Specifically, there are two key functions that facilitate this:

summariseClinicalRecords(): creates summary statistics with key information about the clinical table (e.g., number of records, records per person), some quality checks (e.g., missingness, correct filling of date columns) and a summary of the concepts used in the table (domains, source vocabularies, etc.).
tableClinicalRecords(): helps visualise the results in a formatted table.

Create a mock cdm

Let’s see an example of their functionality. To start, we load the essential packages and create a mock CDM using the R package omock.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(OmopSketch)
library(omock)

# Connect to mock database
cdm <- mockCdmFromDataset(datasetName = "GiBleed", source = "duckdb")
#> ℹ Reading GiBleed tables.
#> ℹ Adding drug_strength table.
#> ℹ Creating local <cdm_reference> object.
#> ℹ Inserting <cdm_reference> into duckdb.

cdm
#> 
#> ── # OMOP CDM reference (duckdb) of GiBleed ────────────────────────────────────
#> • omop tables: care_site, cdm_source, concept, concept_ancestor, concept_class,
#> concept_relationship, concept_synonym, condition_era, condition_occurrence,
#> cost, death, device_exposure, domain, dose_era, drug_era, drug_exposure,
#> drug_strength, fact_relationship, location, measurement, metadata, note,
#> note_nlp, observation, observation_period, payer_plan_period, person,
#> procedure_occurrence, provider, relationship, source_to_concept_map, specimen,
#> visit_detail, visit_occurrence, vocabulary
#> • cohort tables: -
#> • achilles tables: -
#> • other tables: -

Summarise clinical tables

Let’s now use summariseClinicalRecords() from the OmopSketch package to get an overview of one of the clinical tables of the cdm (here, condition_occurrence).

summarisedResult <- summariseClinicalRecords(
  cdm = cdm, 
  omopTableName = "condition_occurrence"
)
#> ℹ Adding variables of interest to condition_occurrence.
#> ℹ Summarising records per person in condition_occurrence.
#> ℹ Summarising subjects not in person table in condition_occurrence.
#> ℹ Summarising records in observation in condition_occurrence.
#> ℹ Summarising records with start before birth date in condition_occurrence.
#> ℹ Summarising records with end date before start date in condition_occurrence.
#> ℹ Summarising domains in condition_occurrence.
#> ℹ Summarising standard concepts in condition_occurrence.
#> ℹ Summarising source vocabularies in condition_occurrence.
#> ℹ Summarising concept types in condition_occurrence.
#> ℹ Summarising missing data in condition_occurrence.

summarisedResult
#> # A tibble: 82 × 13
#>    result_id cdm_name group_name group_level          strata_name strata_level
#>        <int> <chr>    <chr>      <chr>                <chr>       <chr>       
#>  1         1 GiBleed  omop_table condition_occurrence overall     overall     
#>  2         1 GiBleed  omop_table condition_occurrence overall     overall     
#>  3         1 GiBleed  omop_table condition_occurrence overall     overall     
#>  4         1 GiBleed  omop_table condition_occurrence overall     overall     
#>  5         1 GiBleed  omop_table condition_occurrence overall     overall     
#>  6         1 GiBleed  omop_table condition_occurrence overall     overall     
#>  7         1 GiBleed  omop_table condition_occurrence overall     overall     
#>  8         1 GiBleed  omop_table condition_occurrence overall     overall     
#>  9         1 GiBleed  omop_table condition_occurrence overall     overall     
#> 10         1 GiBleed  omop_table condition_occurrence overall     overall     
#> # ℹ 72 more rows
#> # ℹ 7 more variables: variable_name <chr>, variable_level <chr>,
#> #   estimate_name <chr>, estimate_type <chr>, estimate_value <chr>,
#> #   additional_name <chr>, additional_level <chr>
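
As the column types above show, estimate_value is stored as character, so numeric work on the result needs an explicit conversion first. The following is a minimal sketch on a hand-built stand-in data frame (not a real summarised result object); the values are copied from outputs shown in this vignette and the object name res is hypothetical.

```r
# Hand-built stand-in for a few columns of a summarised result
# (illustrative only; values copied from outputs in this vignette)
res <- data.frame(
  variable_name = c("Number records", "Records per person", "Records per person"),
  estimate_name = c("count", "mean", "sd"),
  estimate_value = c("65332", "24.2509", "7.4065"),
  stringsAsFactors = FALSE
)

# estimate_value is character, so convert before doing arithmetic
res$estimate_numeric <- as.numeric(res$estimate_value)

# Example: coefficient of variation of records per person
cv <- res$estimate_numeric[res$estimate_name == "sd"] /
  res$estimate_numeric[res$estimate_name == "mean"]
cv
```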

Notice that the output is in the summarised result format.

Records per person

We can use the arguments to specify which statistics we want to compute. For example, use the argument recordsPerPerson to indicate which estimates you are interested in regarding the number of records per person.

summarisedResult <- summariseClinicalRecords(
  cdm = cdm,
  omopTableName = "condition_occurrence",
  recordsPerPerson = c("mean", "sd", "q05", "q95")
)
#> ℹ Adding variables of interest to condition_occurrence.
#> ℹ Summarising records per person in condition_occurrence.
#> ℹ Summarising subjects not in person table in condition_occurrence.
#> ℹ Summarising records in observation in condition_occurrence.
#> ℹ Summarising records with start before birth date in condition_occurrence.
#> ℹ Summarising records with end date before start date in condition_occurrence.
#> ℹ Summarising domains in condition_occurrence.
#> ℹ Summarising standard concepts in condition_occurrence.
#> ℹ Summarising source vocabularies in condition_occurrence.
#> ℹ Summarising concept types in condition_occurrence.
#> ℹ Summarising missing data in condition_occurrence.

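summarisedResult |>
  filter(variable_name == "Records per person") |>
  select(variable_name, estimate_name, estimate_value)
#> # A tibble: 4 × 3
#>   variable_name      estimate_name estimate_value
#>   <chr>              <chr>         <chr>         
#> 1 Records per person mean          24.2509       
#> 2 Records per person sd            7.4065        
#> 3 Records per person q05           14            
#> 4 Records per person q95           38            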
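
To make the meaning of these estimates concrete, here is a minimal base R sketch that computes the same kind of statistics on a hypothetical toy table. It is not how OmopSketch computes them: it only counts persons with at least one record, and R's default quantile interpolation may differ from what a database backend returns.

```r
# Hypothetical toy clinical table: one row per record
records <- data.frame(person_id = c(1, 1, 1, 2, 2, 3))

# Number of records for each person that appears in the table
perPerson <- as.vector(table(records$person_id))

# Same estimate names as requested through recordsPerPerson
estimates <- c(
  mean = mean(perPerson),
  sd = sd(perPerson),
  q05 = unname(quantile(perPerson, 0.05)),
  q95 = unname(quantile(perPerson, 0.95))
)
estimates
```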

Quality

When the argument quality = TRUE is set, the results will include a quality assessment of the data.
This assessment provides information such as:

The proportion of records that fall outside the subjects’ observation periods.
Issues with date columns (e.g., start dates occurring after end dates, or dates preceding a subject’s birth date).
The presence of person_id values that do not exist in the person table.

summarisedResult <- summariseClinicalRecords(
  cdm = cdm,
  omopTableName = "condition_occurrence",
  recordsPerPerson = NULL, 
  conceptSummary = FALSE,
  missingData = FALSE,
  quality = TRUE
)
#> ℹ Adding variables of interest to condition_occurrence.
#> ℹ Summarising records per person in condition_occurrence.
#> ℹ Summarising subjects not in person table in condition_occurrence.
#> ℹ Summarising records in observation in condition_occurrence.
#> ℹ Summarising records with start before birth date in condition_occurrence.
#> ℹ Summarising records with end date before start date in condition_occurrence.

summarisedResult |>
  select(variable_name, estimate_name, estimate_value) 
#> # A tibble: 13 × 3
#>    variable_name                estimate_name estimate_value
#>    <chr>                        <chr>         <chr>         
#>  1 Number subjects              count         2694          
#>  2 Number subjects              percentage    100           
#>  3 Number records               count         65332         
#>  4 Subjects not in person table count         0             
#>  5 Subjects not in person table percentage    0.00          
#>  6 In observation               count         450           
#>  7 In observation               count         64882         
#>  8 Start date before birth date count         0             
#>  9 End date before start date   count         0             
#> 10 In observation               percentage    0.69          
#> 11 In observation               percentage    99.31         
#> 12 Start date before birth date percentage    0.00          
#> 13 End date before start date   percentage    0.00
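
The checks listed above can be illustrated with a small base R sketch on hypothetical toy tables. Column names are simplified, and treating a record as in observation when its start date falls inside the observation period is an assumption made for this example, not a statement about the package internals.

```r
# Hypothetical toy tables; column names are simplified for illustration
person <- data.frame(
  person_id = c(1, 2),
  birth_date = as.Date(c("1980-05-01", "1990-01-15"))
)
observation_period <- data.frame(
  person_id = c(1, 2),
  obs_start = as.Date(c("2000-01-01", "2005-01-01")),
  obs_end = as.Date(c("2010-12-31", "2015-12-31"))
)
condition <- data.frame(
  person_id = c(1, 1, 2, 3),
  start_date = as.Date(c("2001-03-01", "2012-06-01", "1985-02-01", "2008-01-01")),
  end_date = as.Date(c("2001-03-10", "2012-05-01", "1985-02-05", "2008-01-02"))
)

# Left joins keep every clinical record; unmatched persons get NA
x <- merge(condition, person, by = "person_id", all.x = TRUE)
x <- merge(x, observation_period, by = "person_id", all.x = TRUE)

# Assumption: "in observation" means the start date lies inside the period
inObservation <- !is.na(x$obs_start) &
  x$start_date >= x$obs_start & x$start_date <= x$obs_end

checks <- c(
  subjects_not_in_person = length(setdiff(unique(x$person_id), person$person_id)),
  records_not_in_observation = sum(!inObservation),
  start_before_birth = sum(x$start_date < x$birth_date, na.rm = TRUE),
  end_before_start = sum(x$end_date < x$start_date, na.rm = TRUE)
)
checks
```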

Concept Summary

When the argument conceptSummary = TRUE is set, the results will also include information about the concepts contained in the table, such as:

The domain to which each concept belongs.
Whether each concept is a standard concept.
The type and source vocabulary associated with each concept.

summarisedResult <- summariseClinicalRecords(
  cdm = cdm,
  omopTableName = "drug_exposure",
  recordsPerPerson = NULL, 
  conceptSummary = TRUE,
  missingData = FALSE,
  quality = FALSE
)
#> ℹ Adding variables of interest to drug_exposure.
#> ℹ Summarising records per person in drug_exposure.
#> ℹ Summarising domains in drug_exposure.
#> ℹ Summarising standard concepts in drug_exposure.
#> ℹ Summarising source vocabularies in drug_exposure.
#> ℹ Summarising concept types in drug_exposure.
#> ℹ Summarising concept class in drug_exposure.

summarisedResult |>
  select(variable_name, variable_level, estimate_name, estimate_value) 
#> # A tibble: 37 × 4
#>    variable_name     variable_level                 estimate_name estimate_value
#>    <chr>             <chr>                          <chr>         <chr>         
#>  1 Number subjects   NA                             count         2694          
#>  2 Number subjects   NA                             percentage    100           
#>  3 Number records    NA                             count         67707         
#>  4 Domain            Drug                           count         67707         
#>  5 Standard concept  S                              count         67707         
#>  6 Source vocabulary No matching concept            count         35            
#>  7 Source vocabulary CVX                            count         25710         
#>  8 Source vocabulary NDC                            count         2694          
#>  9 Source vocabulary RxNorm                         count         39268         
#> 10 Type concept id   Dispensed in Outpatient office count         25710         
#> # ℹ 27 more rows
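
Conceptually, a summary like this comes from joining the concept column of the clinical table to the vocabulary and counting records per level. The sketch below shows the idea for domains only, on hypothetical toy tables; the label "Unmapped" is invented for this example and is not a label used by OmopSketch.

```r
# Hypothetical toy vocabulary and clinical table
concept <- data.frame(
  concept_id = c(101, 102, 103),
  domain_id = c("Drug", "Drug", "Condition"),
  stringsAsFactors = FALSE
)
drug_exposure <- data.frame(
  drug_concept_id = c(101, 101, 102, 0, 103)
)

# Left join so that records without a matching concept are kept
x <- merge(
  drug_exposure, concept,
  by.x = "drug_concept_id", by.y = "concept_id", all.x = TRUE
)

# "Unmapped" is a label made up for this sketch
x$domain_id[is.na(x$domain_id)] <- "Unmapped"

# Records per domain, as counts and percentages
domainCounts <- table(x$domain_id)
domainPercentages <- 100 * domainCounts / nrow(x)
domainCounts
```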

Missingness

When the argument missingData = TRUE is set, the results will include a summary of missing data in the table, including the number of 0s in the concept columns.
This output is analogous to the results produced by the OmopSketch function summariseMissingData().

summarisedResult <- summariseClinicalRecords(
  cdm = cdm,
  omopTableName = "condition_occurrence",
  recordsPerPerson = NULL, 
  conceptSummary = FALSE,
  missingData = TRUE,
  quality = FALSE
)
#> ℹ Adding variables of interest to condition_occurrence.
#> ℹ Summarising records per person in condition_occurrence.
#> ℹ Summarising missing data in condition_occurrence.

summarisedResult |>
  select(variable_name, variable_level, estimate_name, estimate_value) 
#> # A tibble: 53 × 4
#>    variable_name   variable_level          estimate_name   estimate_value
#>    <chr>           <chr>                   <chr>           <chr>         
#>  1 Number subjects NA                      count           2694          
#>  2 Number subjects NA                      percentage      100           
#>  3 Number records  NA                      count           65332         
#>  4 Column name     condition_occurrence_id na_count        0             
#>  5 Column name     condition_occurrence_id na_percentage   0.00          
#>  6 Column name     condition_occurrence_id zero_count      0             
#>  7 Column name     condition_occurrence_id zero_percentage 0.00          
#>  8 Column name     person_id               na_count        0             
#>  9 Column name     person_id               na_percentage   0.00          
#> 10 Column name     person_id               zero_count      0             
#> # ℹ 43 more rows
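
The four estimates shown per column (na_count, na_percentage, zero_count, zero_percentage) can be reproduced in spirit with a few lines of base R. This is a sketch on a hypothetical toy table, and it computes zeros for every column, whereas the text above describes zeros for the concept columns.

```r
# Hypothetical toy clinical table with some missing values and zeros
condition <- data.frame(
  condition_occurrence_id = c(1, 2, 3, 4),
  condition_status_concept_id = c(0, 0, 0, 0),
  provider_id = rep(NA_real_, 4),
  visit_occurrence_id = c(10, NA, 12, 13)
)

# Count NA values and zeros in every column
naCount <- vapply(condition, function(col) sum(is.na(col)), integer(1))
zeroCount <- vapply(condition, function(col) sum(col == 0, na.rm = TRUE), integer(1))

missingness <- data.frame(
  column = names(condition),
  na_count = naCount,
  na_percentage = 100 * naCount / nrow(condition),
  zero_count = zeroCount,
  zero_percentage = 100 * zeroCount / nrow(condition),
  row.names = NULL
)
missingness
```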

Strata

It is also possible to stratify the results by sex and age groups:

summarisedResult <- summariseClinicalRecords(
  cdm = cdm,
  omopTableName = "condition_occurrence",
  recordsPerPerson = c("mean", "sd", "q05", "q95"),
  quality = TRUE,
  conceptSummary = TRUE,
  sex = TRUE,
  ageGroup = list("<35" = c(0, 34), ">=35" = c(35, Inf))
)
#> ℹ Adding variables of interest to condition_occurrence.
#> ℹ Summarising records per person in condition_occurrence.
#> ℹ Summarising subjects not in person table in condition_occurrence.
#> ℹ Summarising records in observation in condition_occurrence.
#> ℹ Summarising records with start before birth date in condition_occurrence.
#> ℹ Summarising records with end date before start date in condition_occurrence.
#> ℹ Summarising domains in condition_occurrence.
#> ℹ Summarising standard concepts in condition_occurrence.
#> ℹ Summarising source vocabularies in condition_occurrence.
#> ℹ Summarising concept types in condition_occurrence.
#> ℹ Summarising missing data in condition_occurrence.

summarisedResult |>
  select(variable_name, strata_level, estimate_name, estimate_value) 
#> # A tibble: 663 × 4
#>    variable_name      strata_level estimate_name estimate_value
#>    <chr>              <chr>        <chr>         <chr>         
#>  1 Number subjects    overall      count         2694          
#>  2 Number subjects    overall      percentage    100           
#>  3 Records per person overall      mean          24.2509       
#>  4 Records per person overall      sd            7.4065        
#>  5 Records per person overall      q05           14            
#>  6 Records per person overall      q95           38            
#>  7 Number records     overall      count         65332         
#>  8 Number subjects    <35          count         2694          
#>  9 Number subjects    >=35         count         2656          
#> 10 Number subjects    <35          percentage    100           
#> # ℹ 653 more rows

Notice that, by default, the “overall” group is also included, as well as crossed strata (for example, sex == "Female" and ageGroup == ">=35").
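
To illustrate how an ageGroup list maps ages to strata labels, here is a minimal base R sketch. The helper assignAgeGroup() is hypothetical and the bounds are assumed to be inclusive on both sides. In the output above the subject counts for "<35" and ">=35" add up to more than the overall count, which suggests age is evaluated per record rather than once per person, but that is an inference from the numbers and not something this sketch reproduces.

```r
# Same age group definition as in the call above
ageGroup <- list("<35" = c(0, 34), ">=35" = c(35, Inf))

# Hypothetical helper: label each age with the group whose bounds contain it
# (assumption: lower and upper bounds are both inclusive)
assignAgeGroup <- function(age, groups) {
  out <- rep(NA_character_, length(age))
  for (nm in names(groups)) {
    bounds <- groups[[nm]]
    out[age >= bounds[1] & age <= bounds[2]] <- nm
  }
  out
}

# Toy records with sex and age at record start
records <- data.frame(
  sex = c("Female", "Male", "Female", "Male"),
  age = c(12, 34, 35, 80),
  stringsAsFactors = FALSE
)
records$age_group <- assignAgeGroup(records$age, ageGroup)

# Crossed strata: number of records per sex and age group
crossed <- table(records$sex, records$age_group)
crossed
```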

Note also that the analysis can be run on multiple OMOP tables at once:

summarisedResult <- summariseClinicalRecords(
  cdm = cdm,
  omopTableName = c("visit_occurrence", "drug_exposure"),
  recordsPerPerson = c("mean", "sd"),
  quality = FALSE,
  conceptSummary = FALSE,
  missingData = FALSE
)
#> ℹ Adding variables of interest to visit_occurrence.
#> ℹ Summarising records per person in visit_occurrence.
#> ℹ Adding variables of interest to drug_exposure.
#> ℹ Summarising records per person in drug_exposure.

summarisedResult |>
  select(group_level, variable_name, estimate_name, estimate_value)
#> # A tibble: 10 × 4
#>    group_level      variable_name      estimate_name estimate_value
#>    <chr>            <chr>              <chr>         <chr>         
#>  1 visit_occurrence Number subjects    count         890           
#>  2 visit_occurrence Number subjects    percentage    100           
#>  3 visit_occurrence Records per person mean          1.1652        
#>  4 visit_occurrence Records per person sd            0.4145        
#>  5 visit_occurrence Number records     count         1037          
#>  6 drug_exposure    Number subjects    count         2694          
#>  7 drug_exposure    Number subjects    percentage    100           
#>  8 drug_exposure    Records per person mean          25.1325       
#>  9 drug_exposure    Records per person sd            5.2457        
#> 10 drug_exposure    Number records     count         67707

Date Range

We can also filter the clinical table to a specific time window by setting the dateRange argument.


summarisedResult <- summariseClinicalRecords(
  cdm = cdm,
  omopTableName = "drug_exposure",
  dateRange = as.Date(c("1990-01-01", "2010-01-01"))
)
#> ℹ Adding variables of interest to drug_exposure.
#> ℹ Summarising records per person in drug_exposure.
#> ℹ Summarising subjects not in person table in drug_exposure.
#> ℹ Summarising records in observation in drug_exposure.
#> ℹ Summarising records with start before birth date in drug_exposure.
#> ℹ Summarising records with end date before start date in drug_exposure.
#> ℹ Summarising domains in drug_exposure.
#> ℹ Summarising standard concepts in drug_exposure.
#> ℹ Summarising source vocabularies in drug_exposure.
#> ℹ Summarising concept types in drug_exposure.
#> ℹ Summarising concept class in drug_exposure.
#> ℹ Summarising missing data in drug_exposure.

summarisedResult |>
  settings() |>
  glimpse()
#> Rows: 1
#> Columns: 10
#> $ result_id          <int> 1
#> $ result_type        <chr> "summarise_clinical_records"
#> $ package_name       <chr> "OmopSketch"
#> $ package_version    <chr> "1.0.0.900"
#> $ group              <chr> "omop_table"
#> $ strata             <chr> ""
#> $ additional         <chr> ""
#> $ min_cell_count     <chr> "0"
#> $ study_period_end   <chr> "2010-01-01"
#> $ study_period_start <chr> "1990-01-01"
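
The effect of a date window can be pictured as a simple filter on the table's date column. The sketch below uses a hypothetical toy table and assumes that the window is applied to the start date with both bounds inclusive; the package may handle the boundaries differently.

```r
# Hypothetical toy drug_exposure table
drug_exposure <- data.frame(
  drug_exposure_id = 1:4,
  drug_exposure_start_date = as.Date(
    c("1989-12-31", "1990-01-01", "2005-06-15", "2010-01-02")
  )
)

# Same window as in the call above
dateRange <- as.Date(c("1990-01-01", "2010-01-01"))

# Assumption: keep records whose start date lies inside the window (inclusive)
inRange <- drug_exposure$drug_exposure_start_date >= dateRange[1] &
  drug_exposure$drug_exposure_start_date <= dateRange[2]
filtered <- drug_exposure[inRange, ]
filtered
```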

Tidy the summarised object

tableClinicalRecords() will help you to tidy the previous results and create a formatted table of type gt, reactable or datatable. By default it creates a gt table.

summarisedResult <- summariseClinicalRecords(
  cdm = cdm,
  omopTableName = "condition_occurrence",
  recordsPerPerson = c("mean", "sd", "q05", "q95"),
  quality = TRUE,
  conceptSummary = TRUE,
  sex = TRUE
)
#> ℹ Adding variables of interest to condition_occurrence.
#> ℹ Summarising records per person in condition_occurrence.
#> ℹ Summarising subjects not in person table in condition_occurrence.
#> ℹ Summarising records in observation in condition_occurrence.
#> ℹ Summarising records with start before birth date in condition_occurrence.
#> ℹ Summarising records with end date before start date in condition_occurrence.
#> ℹ Summarising domains in condition_occurrence.
#> ℹ Summarising standard concepts in condition_occurrence.
#> ℹ Summarising source vocabularies in condition_occurrence.
#> ℹ Summarising concept types in condition_occurrence.
#> ℹ Summarising missing data in condition_occurrence.

tableClinicalRecords(result = summarisedResult, type = "gt")

Summary of condition_occurrence table
Variable name	Variable level	Estimate name	Database name: GiBleed
condition_occurrence; overall
Number records	–	N	65,332.00
Number subjects	–	N (%)	2,694 (100.00%)
Subjects not in person table	–	N (%)	0 (0.00%)
Records per person	–	Mean (SD)	24.25 (7.41)
		q05	14.00
		q95	38.00
In observation	No	N (%)	450 (0.69%)
	Yes	N (%)	64,882 (99.31%)
Domain	Condition	N (%)	65,332 (100.00%)
Source vocabulary	Icd10cm	N (%)	479 (0.73%)
	No matching concept	N (%)	27 (0.04%)
	Snomed	N (%)	64,826 (99.23%)
Standard concept	S	N (%)	65,332 (100.00%)
Type concept id	Ehr encounter diagnosis	N (%)	65,332 (100.00%)
Start date before birth date	–	N (%)	0 (0.00%)
End date before start date	–	N (%)	0 (0.00%)
Column name	Condition concept id	N missing data (%)	0 (0.00%)
		N zeros (%)	0 (0.00%)
	Condition end date	N missing data (%)	8,652 (13.24%)
	Condition end datetime	N missing data (%)	8,652 (13.24%)
	Condition occurrence id	N missing data (%)	0 (0.00%)
		N zeros (%)	0 (0.00%)
	Condition source concept id	N missing data (%)	0 (0.00%)
		N zeros (%)	0 (0.00%)
	Condition source value	N missing data (%)	0 (0.00%)
	Condition start date	N missing data (%)	0 (0.00%)
	Condition start datetime	N missing data (%)	0 (0.00%)
	Condition status concept id	N missing data (%)	0 (0.00%)
		N zeros (%)	65,332 (100.00%)
	Condition status source value	N missing data (%)	65,332 (100.00%)
	Condition type concept id	N missing data (%)	0 (0.00%)
		N zeros (%)	0 (0.00%)
	Person id	N missing data (%)	0 (0.00%)
		N zeros (%)	0 (0.00%)
	Provider id	N missing data (%)	65,332 (100.00%)
		N zeros (%)	0 (0.00%)
	Stop reason	N missing data (%)	65,332 (100.00%)
	Visit detail id	N missing data (%)	0 (0.00%)
		N zeros (%)	65,332 (100.00%)
	Visit occurrence id	N missing data (%)	64 (0.10%)
		N zeros (%)	0 (0.00%)
condition_occurrence; Female
Number records	–	N	33,744.00
Number subjects	–	N (%)	1,373 (100.00%)
Records per person	–	Mean (SD)	24.58 (7.59)
		q05	14.00
		q95	38.00
In observation	No	N (%)	227 (0.67%)
	Yes	N (%)	33,517 (99.33%)
Domain	Condition	N (%)	33,744 (100.00%)
Source vocabulary	Icd10cm	N (%)	242 (0.72%)
	No matching concept	N (%)	15 (0.04%)
	Snomed	N (%)	33,487 (99.24%)
Standard concept	S	N (%)	33,744 (100.00%)
Type concept id	Ehr encounter diagnosis	N (%)	33,744 (100.00%)
Column name	Condition concept id	N missing data (%)	0 (0.00%)
		N zeros (%)	0 (0.00%)
	Condition end date	N missing data (%)	4,397 (13.03%)
	Condition end datetime	N missing data (%)	4,397 (13.03%)
	Condition occurrence id	N missing data (%)	0 (0.00%)
		N zeros (%)	0 (0.00%)
	Condition source concept id	N missing data (%)	0 (0.00%)
		N zeros (%)	0 (0.00%)
	Condition source value	N missing data (%)	0 (0.00%)
	Condition start date	N missing data (%)	0 (0.00%)
	Condition start datetime	N missing data (%)	0 (0.00%)
	Condition status concept id	N missing data (%)	0 (0.00%)
		N zeros (%)	33,744 (100.00%)
	Condition status source value	N missing data (%)	33,744 (100.00%)
	Condition type concept id	N missing data (%)	0 (0.00%)
		N zeros (%)	0 (0.00%)
	Person id	N missing data (%)	0 (0.00%)
		N zeros (%)	0 (0.00%)
	Provider id	N missing data (%)	33,744 (100.00%)
		N zeros (%)	0 (0.00%)
	Stop reason	N missing data (%)	33,744 (100.00%)
	Visit detail id	N missing data (%)	0 (0.00%)
		N zeros (%)	33,744 (100.00%)
	Visit occurrence id	N missing data (%)	24 (0.07%)
		N zeros (%)	0 (0.00%)
condition_occurrence; Male
Number records	–	N	31,588.00
Number subjects	–	N (%)	1,321 (100.00%)
Records per person	–	Mean (SD)	23.91 (7.20)
		q05	13.00
		q95	37.00
In observation	No	N (%)	223 (0.71%)
	Yes	N (%)	31,365 (99.29%)
Domain	Condition	N (%)	31,588 (100.00%)
Source vocabulary	Icd10cm	N (%)	237 (0.75%)
	No matching concept	N (%)	12 (0.04%)
	Snomed	N (%)	31,339 (99.21%)
Standard concept	S	N (%)	31,588 (100.00%)
Type concept id	Ehr encounter diagnosis	N (%)	31,588 (100.00%)
Column name	Condition concept id	N missing data (%)	0 (0.00%)
		N zeros (%)	0 (0.00%)
	Condition end date	N missing data (%)	4,255 (13.47%)
	Condition end datetime	N missing data (%)	4,255 (13.47%)
	Condition occurrence id	N missing data (%)	0 (0.00%)
		N zeros (%)	0 (0.00%)
	Condition source concept id	N missing data (%)	0 (0.00%)
		N zeros (%)	0 (0.00%)
	Condition source value	N missing data (%)	0 (0.00%)
	Condition start date	N missing data (%)	0 (0.00%)
	Condition start datetime	N missing data (%)	0 (0.00%)
	Condition status concept id	N missing data (%)	0 (0.00%)
		N zeros (%)	31,588 (100.00%)
	Condition status source value	N missing data (%)	31,588 (100.00%)
	Condition type concept id	N missing data (%)	0 (0.00%)
		N zeros (%)	0 (0.00%)
	Person id	N missing data (%)	0 (0.00%)
		N zeros (%)	0 (0.00%)
	Provider id	N missing data (%)	31,588 (100.00%)
		N zeros (%)	0 (0.00%)
	Stop reason	N missing data (%)	31,588 (100.00%)
	Visit detail id	N missing data (%)	0 (0.00%)
		N zeros (%)	31,588 (100.00%)
	Visit occurrence id	N missing data (%)	40 (0.13%)
		N zeros (%)	0 (0.00%)
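
The "N (%)" cells in the table combine a count and a percentage into one string. As a rough illustration of that formatting step, the hypothetical helper below rebuilds one of the cells shown above from its raw numbers; it is a sketch in base R and not the formatting code used by tableClinicalRecords().

```r
# Hypothetical helper: combine a count and its percentage into "N (%)" text
formatCountPct <- function(count, total) {
  sprintf(
    "%s (%.2f%%)",
    format(count, big.mark = ",", trim = TRUE),
    100 * count / total
  )
}

# Records in observation out of all condition_occurrence records
cell <- formatCountPct(64882, 65332)
cell
```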

Disconnect from CDM

Finally, disconnect from the mock CDM.

cdmDisconnect(cdm = cdm)