Summarise population characteristics

Introduction

In this vignette, we explore the OmopSketch functions that describe individuals' characteristics at specific points in time. We use summarisePopulationCharacteristics() to summarise the demographics of the database population, and tablePopulationCharacteristics() to tidy and present the results as either a gt or a flextable table.

Create a mock cdm

Before we dive into the OmopSketch functions, we first need to load the essential packages and create a mock CDM using the Eunomia database.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(CDMConnector)
library(DBI)
library(duckdb)
library(OmopSketch)

# Connect to Eunomia database
con <- DBI::dbConnect(duckdb::duckdb(), CDMConnector::eunomia_dir())
#> Creating CDM database /tmp/RtmpGWnfvO/GiBleed_5.3.zip
cdm <- CDMConnector::cdmFromCon(
  con = con, cdmSchema = "main", writeSchema = "main"
)
#> Note: method with signature 'DBIConnection#Id' chosen for function 'dbExistsTable',
#>  target signature 'duckdb_connection#Id'.
#>  "duckdb_connection#ANY" would also be valid

cdm 
#> 
#> ── # OMOP CDM reference (duckdb) of Synthea synthetic health database ──────────
#> • omop tables: person, observation_period, visit_occurrence, visit_detail,
#> condition_occurrence, drug_exposure, procedure_occurrence, device_exposure,
#> measurement, observation, death, note, note_nlp, specimen, fact_relationship,
#> location, care_site, provider, payer_plan_period, cost, drug_era, dose_era,
#> condition_era, metadata, cdm_source, concept, vocabulary, domain,
#> concept_class, concept_relationship, relationship, concept_synonym,
#> concept_ancestor, source_to_concept_map, drug_strength
#> • cohort tables: -
#> • achilles tables: -
#> • other tables: -

To start, we will use the summarisePopulationCharacteristics() function to generate a summarised result object that captures demographic characteristics at both observation_period_start_date and observation_period_end_date.

summarisedResult <- summarisePopulationCharacteristics(cdm)
#> ! cohort columns will be reordered to match the expected order:
#>   cohort_definition_id, subject_id, cohort_start_date, and cohort_end_date.
#> ℹ Building new trimmed cohort
#> Creating initial cohort
#> ✔ Cohort trimmed
#> ℹ adding demographics columns
#> 
#> ℹ summarising data
#> 
#> ✔ summariseCharacteristics finished!
#> 
#> ! The following column type were changed:
#> • variable_name: from integer to character

summarisedResult |> glimpse()
#> Rows: 42
#> Columns: 13
#> $ result_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name         <chr> "Synthea synthetic health database", "Synthea synthet…
#> $ group_name       <chr> "cohort_name", "cohort_name", "cohort_name", "cohort_…
#> $ group_level      <chr> "demographics", "demographics", "demographics", "demo…
#> $ strata_name      <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_level     <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_name    <chr> "Number records", "Number subjects", "Cohort start da…
#> $ variable_level   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ estimate_name    <chr> "count", "count", "min", "q25", "median", "q75", "max…
#> $ estimate_type    <chr> "integer", "integer", "date", "date", "date", "date",…
#> $ estimate_value   <chr> "2694", "2694", "1908-09-22", "1950-07-13", "1961-03-…
#> $ additional_name  <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…
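
Note that every estimate_value is stored as character, with estimate_type recording the intended type. The following is a minimal sketch of converting values back to numbers. It uses a small hand-built table that copies three rows visible in the output above; miniResult is a hypothetical stand-in for summarisedResult and is not produced by OmopSketch.

```r
library(dplyr)

# Hand-copied subset of the glimpse() output above, for illustration only
miniResult <- tibble(
  variable_name  = c("Number records", "Number subjects", "Cohort start date"),
  estimate_name  = c("count", "count", "min"),
  estimate_type  = c("integer", "integer", "date"),
  estimate_value = c("2694", "2694", "1908-09-22")
)

# Keep the integer estimates and convert their values from character
counts <- miniResult |>
  filter(estimate_type == "integer") |>
  mutate(estimate_value = as.integer(estimate_value))

counts
```

The same filter() and mutate() pattern should carry over to the full summarised result, since it has the same columns.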

To tidy and display the summarised result as a gt table, we can use the tablePopulationCharacteristics() function.

summarisedResult |>
  tablePopulationCharacteristics(type = "gt")
#> ! Results have not been suppressed.

Variable name	Variable level	Estimate name	Database name
Variable name	Variable level	Estimate name	Synthea synthetic health database
Number records	-	N	2,694
Number subjects	-	N	2,694
Cohort start date	-	Median [Q25 - Q75]	1961-03-18 [1950-07-13 - 1970-08-29]
		Range	1908-09-22 to 1986-11-03
Cohort end date	-	Median [Q25 - Q75]	2018-12-14 [2018-08-02 - 2019-04-06]
		Range	1945-07-20 to 2019-07-03
Age at start	-	Median [Q25 - Q75]	0 [0 - 0]
		Mean (SD)	0.00 (0.00)
		Range	0 to 0
Age at end	-	Median [Q25 - Q75]	57 [47 - 67]
		Range	31 to 110
Sex	Female	N%	1,373 (50.97)
	Male	N%	1,321 (49.03)
Prior observation	-	Median [Q25 - Q75]	0 [0 - 0]
		Mean (SD)	0.00 (0.00)
		Range	0 to 0
Future observation	-	Median [Q25 - Q75]	20,870 [17,494 - 24,701]
		Mean (SD)	21,601.60 (5,460.69)
		Range	11,396 to 40,348

To obtain a flextable instead of a gt table, simply change the type argument to "flextable". Note that age at start, prior observation, and future observation are calculated at the defined start date (in this case, each individual's observation_period_start_date), whereas age at end is calculated at the defined end date (here, each individual's observation_period_end_date).

Trim study period

To focus on a specific period rather than on individuals' entire observation period, we can trim the study period with the studyPeriod argument. The demographic metrics are then calculated within the defined time range instead of at the default observation period start and end dates.

summarisePopulationCharacteristics(cdm,
                                   studyPeriod = c("1950-01-01", "1999-12-31")) |>
  tablePopulationCharacteristics()
#> ! cohort columns will be reordered to match the expected order:
#>   cohort_definition_id, subject_id, cohort_start_date, and cohort_end_date.
#> ℹ Building new trimmed cohort
#> Creating initial cohort
#> ✔ Cohort trimmed
#> ℹ adding demographics columns
#> 
#> ℹ summarising data
#> 
#> ✔ summariseCharacteristics finished!
#> 
#> ! The following column type were changed:
#> • variable_name: from integer to character
#> ! Results have not been suppressed.

Variable name	Variable level	Estimate name	Database name
Variable name	Variable level	Estimate name	Synthea synthetic health database
Number records	-	N	2,693
Number subjects	-	N	2,693
Cohort start date	-	Median [Q25 - Q75]	1961-03-19 [1950-07-22 - 1970-08-30]
		Range	1950-01-01 to 1986-11-03
Cohort end date	-	Median [Q25 - Q75]	1999-12-31 [1999-12-31 - 1999-12-31]
		Range	1961-02-26 to 1999-12-31
Age at start	-	Median [Q25 - Q75]	0 [0 - 0]
		Mean (SD)	3.31 (8.32)
		Range	0 to 41
Age at end	-	Median [Q25 - Q75]	38 [29 - 49]
		Range	13 to 91
Sex	Female	N%	1,372 (50.95)
	Male	N%	1,321 (49.05)
Prior observation	-	Median [Q25 - Q75]	0 [0 - 0]
		Mean (SD)	1,252.46 (3,094.66)
		Range	0 to 15,076
Future observation	-	Median [Q25 - Q75]	20,489 [17,290 - 23,961]
		Mean (SD)	20,352.30 (3,799.83)
		Range	4,074 to 25,383
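
As a sanity check, the largest prior observation value in the table above can be reproduced with plain date arithmetic. The sketch below assumes that prior observation is the number of days between an individual's observation_period_start_date and the trimmed start date. That definition is an assumption here, although it agrees with the reported maximum.

```r
# Earliest observation_period_start_date, taken from the untrimmed table
observationStart <- as.Date("1908-09-22")
# Start of the study period used for trimming
studyStart <- as.Date("1950-01-01")

# Assumed definition: days elapsed between the two dates
priorObservation <- as.integer(studyStart - observationStart)
priorObservation
#> [1] 15076
```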

If you instead want to analyse the demographic characteristics from a specific date onwards without restricting the study end, you can define just the start of the study period. When the end date is not defined (NA), summarisePopulationCharacteristics() uses observation_period_end_date to calculate the end-point statistics.

summarisePopulationCharacteristics(cdm,
                                   studyPeriod = c("1950-01-01", NA)) |>
  tablePopulationCharacteristics()
#> ! cohort columns will be reordered to match the expected order:
#>   cohort_definition_id, subject_id, cohort_start_date, and cohort_end_date.
#> ℹ Building new trimmed cohort
#> Creating initial cohort
#> ✔ Cohort trimmed
#> ℹ adding demographics columns
#> 
#> ℹ summarising data
#> 
#> ✔ summariseCharacteristics finished!
#> 
#> ! The following column type were changed:
#> • variable_name: from integer to character
#> ! Results have not been suppressed.

Variable name	Variable level	Estimate name	Database name
Variable name	Variable level	Estimate name	Synthea synthetic health database
Number records	-	N	2,693
Number subjects	-	N	2,693
Cohort start date	-	Median [Q25 - Q75]	1961-03-19 [1950-07-22 - 1970-08-30]
		Range	1950-01-01 to 1986-11-03
Cohort end date	-	Median [Q25 - Q75]	2018-12-14 [2018-08-03 - 2019-04-06]
		Range	1961-02-26 to 2019-07-03
Age at start	-	Median [Q25 - Q75]	0 [0 - 0]
		Mean (SD)	3.31 (8.32)
		Range	0 to 41
Age at end	-	Median [Q25 - Q75]	57 [47 - 67]
		Range	31 to 110
Sex	Female	N%	1,372 (50.95)
	Male	N%	1,321 (49.05)
Prior observation	-	Median [Q25 - Q75]	0 [0 - 0]
		Mean (SD)	1,252.46 (3,094.66)
		Range	0 to 15,076
Future observation	-	Median [Q25 - Q75]	20,489 [17,290 - 23,961]
		Mean (SD)	20,352.30 (3,799.83)
		Range	4,074 to 25,383

Similarly, if you are only interested in the population characteristics up to a specific end date, you can define only the end date and set the start date (the first element of studyPeriod) to NA. In that case, observation_period_start_date is used as the start.
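
A sketch of this end-only case, mirroring the start-only call above. It assumes the cdm object created earlier in this vignette is still available. The end date "1999-12-31" is chosen only for illustration, and endOnly is a hypothetical name.

```r
library(OmopSketch)

# NA in the first position leaves the start of the study period untrimmed
endOnly <- summarisePopulationCharacteristics(
  cdm,
  studyPeriod = c(NA, "1999-12-31")
)

endOnly |>
  tablePopulationCharacteristics()
```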

Stratify by age groups and sex

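Population characteristics can also be stratified by age and sex using the ageGroup and sex arguments. In the output below, every individual falls in the <60 group: age at start is 0 for everyone in this dataset, which is consistent with age groups being assigned from age at the start date.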

summarisePopulationCharacteristics(cdm,
                                   sex = TRUE,
                                   ageGroup = list("<60" = c(0,59), ">=60" = c(60, Inf))) |>
  tablePopulationCharacteristics()
#> ! cohort columns will be reordered to match the expected order:
#>   cohort_definition_id, subject_id, cohort_start_date, and cohort_end_date.
#> ℹ Building new trimmed cohort
#> Creating initial cohort
#> ✔ Cohort trimmed
#> ℹ adding demographics columns
#> 
#> ℹ summarising data
#> 
#> ✔ summariseCharacteristics finished!
#> 
#> ! The following column type were changed:
#> • variable_name: from integer to character
#> ! Results have not been suppressed.

Variable name	Variable level	Estimate name	Database name
Variable name	Variable level	Estimate name	Synthea synthetic health database
overall; overall
Number records	-	N	2,694
Number subjects	-	N	2,694
Cohort start date	-	Median [Q25 - Q75]	1961-03-18 [1950-07-13 - 1970-08-29]
		Range	1908-09-22 to 1986-11-03
Cohort end date	-	Median [Q25 - Q75]	2018-12-14 [2018-08-02 - 2019-04-06]
		Range	1945-07-20 to 2019-07-03
Age at start	-	Median [Q25 - Q75]	0 [0 - 0]
		Mean (SD)	0.00 (0.00)
		Range	0 to 0
Age at end	-	Median [Q25 - Q75]	57 [47 - 67]
		Range	31 to 110
Sex	Female	N%	1,373 (50.97)
	Male	N%	1,321 (49.03)
Prior observation	-	Median [Q25 - Q75]	0 [0 - 0]
		Mean (SD)	0.00 (0.00)
		Range	0 to 0
Future observation	-	Median [Q25 - Q75]	20,870 [17,494 - 24,701]
		Mean (SD)	21,601.60 (5,460.69)
		Range	11,396 to 40,348
<60; overall
Number records	-	N	2,694
Number subjects	-	N	2,694
Cohort start date	-	Median [Q25 - Q75]	1961-03-18 [1950-07-13 - 1970-08-29]
		Range	1908-09-22 to 1986-11-03
Cohort end date	-	Median [Q25 - Q75]	2018-12-14 [2018-08-02 - 2019-04-06]
		Range	1945-07-20 to 2019-07-03
Age at start	-	Median [Q25 - Q75]	0 [0 - 0]
		Mean (SD)	0.00 (0.00)
		Range	0 to 0
Age at end	-	Median [Q25 - Q75]	57 [47 - 67]
		Range	31 to 110
Sex	Female	N%	1,373 (50.97)
	Male	N%	1,321 (49.03)
Prior observation	-	Median [Q25 - Q75]	0 [0 - 0]
		Mean (SD)	0.00 (0.00)
		Range	0 to 0
Future observation	-	Median [Q25 - Q75]	20,870 [17,494 - 24,701]
		Mean (SD)	21,601.60 (5,460.69)
		Range	11,396 to 40,348
overall; Female
Number records	-	N	1,373
Number subjects	-	N	1,373
Cohort start date	-	Median [Q25 - Q75]	1961-05-13 [1950-08-09 - 1971-01-04]
		Range	1908-09-22 to 1986-04-17
Cohort end date	-	Median [Q25 - Q75]	2018-12-18 [2018-08-12 - 2019-04-07]
		Range	1945-07-20 to 2019-07-01
Age at start	-	Median [Q25 - Q75]	0 [0 - 0]
		Mean (SD)	0.00 (0.00)
		Range	0 to 0
Age at end	-	Median [Q25 - Q75]	57 [47 - 67]
		Range	31 to 110
Sex	Female	N%	1,373 (100.00)
Prior observation	-	Median [Q25 - Q75]	0 [0 - 0]
		Mean (SD)	0.00 (0.00)
		Range	0 to 0
Future observation	-	Median [Q25 - Q75]	20,860 [17,381 - 24,682]
		Mean (SD)	21,665.77 (5,623.53)
		Range	11,396 to 40,348
overall; Male
Number records	-	N	1,321
Number subjects	-	N	1,321
Cohort start date	-	Median [Q25 - Q75]	1961-01-23 [1950-04-13 - 1970-04-19]
		Range	1909-02-14 to 1986-11-03
Cohort end date	-	Median [Q25 - Q75]	2018-12-09 [2018-07-26 - 2019-04-03]
		Range	1967-02-18 to 2019-07-03
Age at start	-	Median [Q25 - Q75]	0 [0 - 0]
		Mean (SD)	0.00 (0.00)
		Range	0 to 0
Age at end	-	Median [Q25 - Q75]	57 [48 - 67]
		Range	31 to 109
Sex	Male	N%	1,321 (100.00)
Prior observation	-	Median [Q25 - Q75]	0 [0 - 0]
		Mean (SD)	0.00 (0.00)
		Range	0 to 0
Future observation	-	Median [Q25 - Q75]	20,972 [17,556 - 24,703]
		Mean (SD)	21,534.91 (5,287.44)
		Range	11,438 to 40,005
<60; Female
Number records	-	N	1,373
Number subjects	-	N	1,373
Cohort start date	-	Median [Q25 - Q75]	1961-05-13 [1950-08-09 - 1971-01-04]
		Range	1908-09-22 to 1986-04-17
Cohort end date	-	Median [Q25 - Q75]	2018-12-18 [2018-08-12 - 2019-04-07]
		Range	1945-07-20 to 2019-07-01
Age at start	-	Median [Q25 - Q75]	0 [0 - 0]
		Mean (SD)	0.00 (0.00)
		Range	0 to 0
Age at end	-	Median [Q25 - Q75]	57 [47 - 67]
		Range	31 to 110
Sex	Female	N%	1,373 (100.00)
Prior observation	-	Median [Q25 - Q75]	0 [0 - 0]
		Mean (SD)	0.00 (0.00)
		Range	0 to 0
Future observation	-	Median [Q25 - Q75]	20,860 [17,381 - 24,682]
		Mean (SD)	21,665.77 (5,623.53)
		Range	11,396 to 40,348
<60; Male
Number records	-	N	1,321
Number subjects	-	N	1,321
Cohort start date	-	Median [Q25 - Q75]	1961-01-23 [1950-04-13 - 1970-04-19]
		Range	1909-02-14 to 1986-11-03
Cohort end date	-	Median [Q25 - Q75]	2018-12-09 [2018-07-26 - 2019-04-03]
		Range	1967-02-18 to 2019-07-03
Age at start	-	Median [Q25 - Q75]	0 [0 - 0]
		Mean (SD)	0.00 (0.00)
		Range	0 to 0
Age at end	-	Median [Q25 - Q75]	57 [48 - 67]
		Range	31 to 109
Sex	Male	N%	1,321 (100.00)
Prior observation	-	Median [Q25 - Q75]	0 [0 - 0]
		Mean (SD)	0.00 (0.00)
		Range	0 to 0
Future observation	-	Median [Q25 - Q75]	20,972 [17,556 - 24,703]
		Mean (SD)	21,534.91 (5,287.44)
		Range	11,438 to 40,005