Introduction

In this vignette, we explore the OmopSketch functions that describe individuals' characteristics at specific points in time. We use summarisePopulationCharacteristics() to summarise the demographics of the database population, and then tidy and present the results with tablePopulationCharacteristics(), which can format the output as either a gt or a flextable table.

Create a mock CDM

Before we dive into the OmopSketch functions, we first need to load the essential packages and create a mock CDM using the Eunomia database.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(CDMConnector)
library(DBI)
library(duckdb)
library(OmopSketch)

# Connect to Eunomia database
con <- DBI::dbConnect(duckdb::duckdb(), CDMConnector::eunomia_dir())
#> Creating CDM database /tmp/RtmpGWnfvO/GiBleed_5.3.zip
cdm <- CDMConnector::cdmFromCon(
  con = con, cdmSchema = "main", writeSchema = "main"
)
#> Note: method with signature 'DBIConnection#Id' chosen for function 'dbExistsTable',
#>  target signature 'duckdb_connection#Id'.
#>  "duckdb_connection#ANY" would also be valid

cdm 
#> 
#> ── # OMOP CDM reference (duckdb) of Synthea synthetic health database ──────────
#> • omop tables: person, observation_period, visit_occurrence, visit_detail,
#> condition_occurrence, drug_exposure, procedure_occurrence, device_exposure,
#> measurement, observation, death, note, note_nlp, specimen, fact_relationship,
#> location, care_site, provider, payer_plan_period, cost, drug_era, dose_era,
#> condition_era, metadata, cdm_source, concept, vocabulary, domain,
#> concept_class, concept_relationship, relationship, concept_synonym,
#> concept_ancestor, source_to_concept_map, drug_strength
#> • cohort tables: -
#> • achilles tables: -
#> • other tables: -

Summarise population characteristics

To start, we will use the summarisePopulationCharacteristics() function to generate a summarised result object that captures demographic characteristics at both observation_period_start_date and observation_period_end_date.

summarisedResult <- summarisePopulationCharacteristics(cdm)
#> ! cohort columns will be reordered to match the expected order:
#>   cohort_definition_id, subject_id, cohort_start_date, and cohort_end_date.
#>  Building new trimmed cohort
#> Creating initial cohort
#>  Cohort trimmed
#>  adding demographics columns
#> 
#>  summarising data
#> 
#>  summariseCharacteristics finished!
#> 
#> ! The following column type were changed:
#>  variable_name: from integer to character

summarisedResult |> glimpse()
#> Rows: 42
#> Columns: 13
#> $ result_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name         <chr> "Synthea synthetic health database", "Synthea synthet…
#> $ group_name       <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_…
#> $ group_level      <chr> "demographics", "demographics", "demographics", "demo…
#> $ strata_name      <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_level     <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_name    <chr> "Number records", "Number subjects", "Cohort start da…
#> $ variable_level   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name    <chr> "count", "count", "min", "q25", "median", "q75", "max…
#> $ estimate_type    <chr> "integer", "integer", "date", "date", "date", "date",…
#> $ estimate_value   <chr> "2694", "2694", "1908-09-22", "1950-07-13", "1961-03-…
#> $ additional_name  <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…
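
Because the result is a long-format table with one row per estimate, individual estimates can be pulled out with ordinary data-frame filtering. The sketch below is only an illustration: `excerpt` is a hand-copied stand-in for a few rows of summarisedResult (values as printed in this vignette), built with base R so that it runs without a database connection. On the real object the same filtering could be written with dplyr.

```r
# Hand-copied stand-in for a few rows of `summarisedResult`
# (values as printed in this vignette; not produced by OmopSketch here).
excerpt <- data.frame(
  variable_name  = c("Number records", "Number subjects",
                     rep("Cohort start date", 5)),
  estimate_name  = c("count", "count", "min", "q25", "median", "q75", "max"),
  estimate_type  = c("integer", "integer", rep("date", 5)),
  estimate_value = c("2694", "2694", "1908-09-22", "1950-07-13",
                     "1961-03-18", "1970-08-29", "1986-11-03"),
  stringsAsFactors = FALSE
)

# estimate_value is stored as character; estimate_type says how to read it.
start_dates <- subset(excerpt, variable_name == "Cohort start date")
start_range <- as.Date(
  start_dates$estimate_value[start_dates$estimate_name %in% c("min", "max")]
)

n_records <- as.integer(
  excerpt$estimate_value[excerpt$variable_name == "Number records"]
)

print(start_range)
print(n_records)
```

Keeping every estimate as a character value with a separate estimate_type column is what lets counts, dates, and numeric summaries share a single column.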

To tidy and display the summarised result as a formatted table, we can use the tablePopulationCharacteristics() function. Here we request a flextable by setting type = "flextable".

summarisedResult |>
  tablePopulationCharacteristics(type = "flextable")
#> ! Results have not been suppressed.
| Variable name | Variable level | Estimate name | Database name: Synthea synthetic health database |
|---|---|---|---|
| Number records | - | N | 2,694 |
| Number subjects | - | N | 2,694 |
| Cohort start date | - | Median [Q25 - Q75] | 1961-03-18 [1950-07-13 - 1970-08-29] |
| | | Range | 1908-09-22 to 1986-11-03 |
| Cohort end date | - | Median [Q25 - Q75] | 2018-12-14 [2018-08-02 - 2019-04-06] |
| | | Range | 1945-07-20 to 2019-07-03 |
| Age at start | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 0.00 (0.00) |
| | | Range | 0 to 0 |
| Age at end | - | Median [Q25 - Q75] | 57 [47 - 67] |
| | | Range | 31 to 110 |
| Sex | Female | N% | 1,373 (50.97) |
| | Male | N% | 1,321 (49.03) |
| Prior observation | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 0.00 (0.00) |
| | | Range | 0 to 0 |
| Future observation | - | Median [Q25 - Q75] | 20,870 [17,494 - 24,701] |
| | | Mean (SD) | 21,601.60 (5,460.69) |
| | | Range | 11,396 to 40,348 |

The type argument selects the output format: "flextable" (used above) or "gt". Note that age at start, prior observation, and future observation are calculated at the start date (here, each individual's observation_period_start_date), whereas age at end is calculated at the end date (here, each individual's observation_period_end_date).
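
As a rough sanity check on the table above, the median future observation can be converted to years and compared with the median age at end. Every individual here starts observation at age 0, so the two should roughly agree. The vignette does not state the units of future observation; the sketch assumes days.

```r
# Assumption: future observation is measured in days (units are not stated above).
median_future_observation <- 20870  # median from the table above
median_age_at_end <- 57             # median from the table above

# Convert days to whole years, using 365.25 days per year as an approximation.
years_observed <- floor(median_future_observation / 365.25)

print(years_observed)
```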

Trim study period

Rather than analysing each individual's entire observation period, we can focus on a specific period by using the studyPeriod argument. This lets us analyse the demographic metrics within a defined time range instead of between the default observation start and end dates.

summarisePopulationCharacteristics(cdm,
                                   studyPeriod = c("1950-01-01", "1999-12-31")) |>
  tablePopulationCharacteristics()
#> ! cohort columns will be reordered to match the expected order:
#>   cohort_definition_id, subject_id, cohort_start_date, and cohort_end_date.
#>  Building new trimmed cohort
#> Creating initial cohort
#>  Cohort trimmed
#>  adding demographics columns
#> 
#>  summarising data
#> 
#>  summariseCharacteristics finished!
#> 
#> ! The following column type were changed:
#>  variable_name: from integer to character
#> ! Results have not been suppressed.
| Variable name | Variable level | Estimate name | Database name: Synthea synthetic health database |
|---|---|---|---|
| Number records | - | N | 2,693 |
| Number subjects | - | N | 2,693 |
| Cohort start date | - | Median [Q25 - Q75] | 1961-03-19 [1950-07-22 - 1970-08-30] |
| | | Range | 1950-01-01 to 1986-11-03 |
| Cohort end date | - | Median [Q25 - Q75] | 1999-12-31 [1999-12-31 - 1999-12-31] |
| | | Range | 1961-02-26 to 1999-12-31 |
| Age at start | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 3.31 (8.32) |
| | | Range | 0 to 41 |
| Age at end | - | Median [Q25 - Q75] | 38 [29 - 49] |
| | | Range | 13 to 91 |
| Sex | Female | N% | 1,372 (50.95) |
| | Male | N% | 1,321 (49.05) |
| Prior observation | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 1,252.46 (3,094.66) |
| | | Range | 0 to 15,076 |
| Future observation | - | Median [Q25 - Q75] | 20,489 [17,290 - 23,961] |
| | | Mean (SD) | 20,352.30 (3,799.83) |
| | | Range | 4,074 to 25,383 |

However, if you are interested in analysing the demographic characteristics from a specific date onwards without restricting the study end, you can define just the start of the study period. When the end date is not defined, the summarisePopulationCharacteristics() function uses observation_period_end_date to calculate the end-point statistics.

summarisePopulationCharacteristics(cdm,
                                   studyPeriod = c("1950-01-01", NA)) |>
  tablePopulationCharacteristics()
#> ! cohort columns will be reordered to match the expected order:
#>   cohort_definition_id, subject_id, cohort_start_date, and cohort_end_date.
#>  Building new trimmed cohort
#> Creating initial cohort
#>  Cohort trimmed
#>  adding demographics columns
#> 
#>  summarising data
#> 
#>  summariseCharacteristics finished!
#> 
#> ! The following column type were changed:
#>  variable_name: from integer to character
#> ! Results have not been suppressed.
| Variable name | Variable level | Estimate name | Database name: Synthea synthetic health database |
|---|---|---|---|
| Number records | - | N | 2,693 |
| Number subjects | - | N | 2,693 |
| Cohort start date | - | Median [Q25 - Q75] | 1961-03-19 [1950-07-22 - 1970-08-30] |
| | | Range | 1950-01-01 to 1986-11-03 |
| Cohort end date | - | Median [Q25 - Q75] | 2018-12-14 [2018-08-03 - 2019-04-06] |
| | | Range | 1961-02-26 to 2019-07-03 |
| Age at start | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 3.31 (8.32) |
| | | Range | 0 to 41 |
| Age at end | - | Median [Q25 - Q75] | 57 [47 - 67] |
| | | Range | 31 to 110 |
| Sex | Female | N% | 1,372 (50.95) |
| | Male | N% | 1,321 (49.05) |
| Prior observation | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | Mean (SD) | 1,252.46 (3,094.66) |
| | | Range | 0 to 15,076 |
| Future observation | - | Median [Q25 - Q75] | 20,489 [17,290 - 23,961] |
| | | Mean (SD) | 20,352.30 (3,799.83) |
| | | Range | 4,074 to 25,383 |

Similarly, if you are only interested in analysing the population characteristics up to a specific end date, you can define only the end date and set the start to NA, for example studyPeriod = c(NA, "1999-12-31"). In that case observation_period_start_date is used as the start.
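
The trimming behaviour described in this section can be sketched as a simple interval intersection. This is not OmopSketch's implementation, only a rule that appears consistent with the outputs above (the earliest cohort start becomes 1950-01-01, and one record whose observation ended before 1950 disappears). The helper name trim_period and the example dates are hypothetical.

```r
# Hypothetical helper: intersect one observation period with a study period.
# NA in either position of `studyPeriod` means "do not trim on that side",
# so the observation period date is kept.
trim_period <- function(obs_start, obs_end, studyPeriod) {
  study <- as.Date(studyPeriod)
  start <- if (is.na(study[1])) obs_start else max(obs_start, study[1])
  end   <- if (is.na(study[2])) obs_end   else min(obs_end, study[2])
  if (start > end) return(NULL)  # no overlap: the record is dropped
  list(start = start, end = end)
}

# Made-up observation period, for illustration only.
obs_start <- as.Date("1940-05-01")
obs_end   <- as.Date("2010-03-15")

both_sides <- trim_period(obs_start, obs_end, c("1950-01-01", "1999-12-31"))
start_only <- trim_period(obs_start, obs_end, c("1950-01-01", NA))
end_only   <- trim_period(obs_start, obs_end, c(NA, "1999-12-31"))

# An observation period that ends before the study starts has no overlap.
no_overlap <- trim_period(as.Date("1930-01-01"), as.Date("1945-07-20"),
                          c("1950-01-01", NA))

print(both_sides)
print(is.null(no_overlap))
```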

Stratify by age groups and sex

Population characteristics can also be stratified by age group and sex using the ageGroup and sex arguments.

summarisePopulationCharacteristics(cdm,
                                   sex = TRUE,
                                   ageGroup = list("<60" = c(0,59), ">=60" = c(60, Inf))) |>
  tablePopulationCharacteristics()
#> ! cohort columns will be reordered to match the expected order:
#>   cohort_definition_id, subject_id, cohort_start_date, and cohort_end_date.
#>  Building new trimmed cohort
#> Creating initial cohort
#>  Cohort trimmed
#>  adding demographics columns
#> 
#>  summarising data
#> 
#>  summariseCharacteristics finished!
#> 
#> ! The following column type were changed:
#>  variable_name: from integer to character
#> ! Results have not been suppressed.
| Strata | Variable name | Variable level | Estimate name | Database name: Synthea synthetic health database |
|---|---|---|---|---|
| overall; overall | Number records | - | N | 2,694 |
| | Number subjects | - | N | 2,694 |
| | Cohort start date | - | Median [Q25 - Q75] | 1961-03-18 [1950-07-13 - 1970-08-29] |
| | | | Range | 1908-09-22 to 1986-11-03 |
| | Cohort end date | - | Median [Q25 - Q75] | 2018-12-14 [2018-08-02 - 2019-04-06] |
| | | | Range | 1945-07-20 to 2019-07-03 |
| | Age at start | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | | Mean (SD) | 0.00 (0.00) |
| | | | Range | 0 to 0 |
| | Age at end | - | Median [Q25 - Q75] | 57 [47 - 67] |
| | | | Range | 31 to 110 |
| | Sex | Female | N% | 1,373 (50.97) |
| | | Male | N% | 1,321 (49.03) |
| | Prior observation | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | | Mean (SD) | 0.00 (0.00) |
| | | | Range | 0 to 0 |
| | Future observation | - | Median [Q25 - Q75] | 20,870 [17,494 - 24,701] |
| | | | Mean (SD) | 21,601.60 (5,460.69) |
| | | | Range | 11,396 to 40,348 |
| <60; overall | Number records | - | N | 2,694 |
| | Number subjects | - | N | 2,694 |
| | Cohort start date | - | Median [Q25 - Q75] | 1961-03-18 [1950-07-13 - 1970-08-29] |
| | | | Range | 1908-09-22 to 1986-11-03 |
| | Cohort end date | - | Median [Q25 - Q75] | 2018-12-14 [2018-08-02 - 2019-04-06] |
| | | | Range | 1945-07-20 to 2019-07-03 |
| | Age at start | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | | Mean (SD) | 0.00 (0.00) |
| | | | Range | 0 to 0 |
| | Age at end | - | Median [Q25 - Q75] | 57 [47 - 67] |
| | | | Range | 31 to 110 |
| | Sex | Female | N% | 1,373 (50.97) |
| | | Male | N% | 1,321 (49.03) |
| | Prior observation | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | | Mean (SD) | 0.00 (0.00) |
| | | | Range | 0 to 0 |
| | Future observation | - | Median [Q25 - Q75] | 20,870 [17,494 - 24,701] |
| | | | Mean (SD) | 21,601.60 (5,460.69) |
| | | | Range | 11,396 to 40,348 |
| overall; Female | Number records | - | N | 1,373 |
| | Number subjects | - | N | 1,373 |
| | Cohort start date | - | Median [Q25 - Q75] | 1961-05-13 [1950-08-09 - 1971-01-04] |
| | | | Range | 1908-09-22 to 1986-04-17 |
| | Cohort end date | - | Median [Q25 - Q75] | 2018-12-18 [2018-08-12 - 2019-04-07] |
| | | | Range | 1945-07-20 to 2019-07-01 |
| | Age at start | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | | Mean (SD) | 0.00 (0.00) |
| | | | Range | 0 to 0 |
| | Age at end | - | Median [Q25 - Q75] | 57 [47 - 67] |
| | | | Range | 31 to 110 |
| | Sex | Female | N% | 1,373 (100.00) |
| | Prior observation | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | | Mean (SD) | 0.00 (0.00) |
| | | | Range | 0 to 0 |
| | Future observation | - | Median [Q25 - Q75] | 20,860 [17,381 - 24,682] |
| | | | Mean (SD) | 21,665.77 (5,623.53) |
| | | | Range | 11,396 to 40,348 |
| overall; Male | Number records | - | N | 1,321 |
| | Number subjects | - | N | 1,321 |
| | Cohort start date | - | Median [Q25 - Q75] | 1961-01-23 [1950-04-13 - 1970-04-19] |
| | | | Range | 1909-02-14 to 1986-11-03 |
| | Cohort end date | - | Median [Q25 - Q75] | 2018-12-09 [2018-07-26 - 2019-04-03] |
| | | | Range | 1967-02-18 to 2019-07-03 |
| | Age at start | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | | Mean (SD) | 0.00 (0.00) |
| | | | Range | 0 to 0 |
| | Age at end | - | Median [Q25 - Q75] | 57 [48 - 67] |
| | | | Range | 31 to 109 |
| | Sex | Male | N% | 1,321 (100.00) |
| | Prior observation | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | | Mean (SD) | 0.00 (0.00) |
| | | | Range | 0 to 0 |
| | Future observation | - | Median [Q25 - Q75] | 20,972 [17,556 - 24,703] |
| | | | Mean (SD) | 21,534.91 (5,287.44) |
| | | | Range | 11,438 to 40,005 |
| <60; Female | Number records | - | N | 1,373 |
| | Number subjects | - | N | 1,373 |
| | Cohort start date | - | Median [Q25 - Q75] | 1961-05-13 [1950-08-09 - 1971-01-04] |
| | | | Range | 1908-09-22 to 1986-04-17 |
| | Cohort end date | - | Median [Q25 - Q75] | 2018-12-18 [2018-08-12 - 2019-04-07] |
| | | | Range | 1945-07-20 to 2019-07-01 |
| | Age at start | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | | Mean (SD) | 0.00 (0.00) |
| | | | Range | 0 to 0 |
| | Age at end | - | Median [Q25 - Q75] | 57 [47 - 67] |
| | | | Range | 31 to 110 |
| | Sex | Female | N% | 1,373 (100.00) |
| | Prior observation | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | | Mean (SD) | 0.00 (0.00) |
| | | | Range | 0 to 0 |
| | Future observation | - | Median [Q25 - Q75] | 20,860 [17,381 - 24,682] |
| | | | Mean (SD) | 21,665.77 (5,623.53) |
| | | | Range | 11,396 to 40,348 |
| <60; Male | Number records | - | N | 1,321 |
| | Number subjects | - | N | 1,321 |
| | Cohort start date | - | Median [Q25 - Q75] | 1961-01-23 [1950-04-13 - 1970-04-19] |
| | | | Range | 1909-02-14 to 1986-11-03 |
| | Cohort end date | - | Median [Q25 - Q75] | 2018-12-09 [2018-07-26 - 2019-04-03] |
| | | | Range | 1967-02-18 to 2019-07-03 |
| | Age at start | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | | Mean (SD) | 0.00 (0.00) |
| | | | Range | 0 to 0 |
| | Age at end | - | Median [Q25 - Q75] | 57 [48 - 67] |
| | | | Range | 31 to 109 |
| | Sex | Male | N% | 1,321 (100.00) |
| | Prior observation | - | Median [Q25 - Q75] | 0 [0 - 0] |
| | | | Mean (SD) | 0.00 (0.00) |
| | | | Range | 0 to 0 |
| | Future observation | - | Median [Q25 - Q75] | 20,972 [17,556 - 24,703] |
| | | | Mean (SD) | 21,534.91 (5,287.44) |
| | | | Range | 11,438 to 40,005 |
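
In the output above only the "<60" strata appear, and they repeat the overall counts. This is consistent with the age group being assigned from age at start, which is 0 for every individual in this dataset, although the vignette does not state this explicitly. The sketch below shows how a named list of age bands like the one passed to ageGroup could map ages to labels. The helper assign_age_group is hypothetical, and it assumes both bounds of each band are inclusive.

```r
# Same band definition as passed to the ageGroup argument above.
ageGroup <- list("<60" = c(0, 59), ">=60" = c(60, Inf))

# Hypothetical helper: label each age with the first band whose
# [lower, upper] bounds contain it (bounds assumed inclusive).
assign_age_group <- function(age, ageGroup) {
  vapply(age, function(a) {
    hit <- vapply(ageGroup, function(b) a >= b[1] && a <= b[2], logical(1))
    if (any(hit)) names(ageGroup)[which(hit)[1]] else NA_character_
  }, character(1))
}

# Ages at the band edges, plus the extremes seen in the tables above.
labels <- assign_age_group(c(0, 59, 60, 110), ageGroup)
print(labels)

# With every age at start equal to 0, only one band is ever populated.
print(unique(assign_age_group(rep(0, 5), ageGroup)))
```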