Introduction
This vignette walks through the OmopSketch functions
designed to provide a concise overview of the OMOP person
table. Two functions cover this workflow:
- summarisePerson(): computes summary statistics and data-quality checks for the person table, including total subject counts, observation-period coverage, sex/race/ethnicity distributions, birth-date components, and summaries of id columns (location_id, provider_id, care_site_id).
- tablePerson(): renders the results as a formatted table (gt, reactable, or datatable).
Setup
Load the required packages and create a mock CDM using omock.
library(dplyr)
library(OmopSketch)
library(omock)
cdm <- mockCdmFromDataset(datasetName = "GiBleed", source = "duckdb")
cdm
Summarise the person table
Call summarisePerson() to compute all summaries. It
returns a summarised_result
object — a standardised tidy format used across the OMOP analytics
ecosystem.
result <- summarisePerson(cdm = cdm)
result |> glimpse()
#> Rows: 123
#> Columns: 13
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name <chr> "GiBleed", "GiBleed", "GiBleed", "GiBleed", "GiBleed"…
#> $ group_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ group_level <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_level <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_name <chr> "Number subjects", "Number subjects not in observatio…
#> $ variable_level <chr> NA, NA, NA, "Female", "Female", "Male", "Male", "None…
#> $ estimate_name <chr> "count", "count", "percentage", "count", "percentage"…
#> $ estimate_type <chr> "integer", "integer", "numeric", "integer", "numeric"…
#> $ estimate_value <chr> "2694", "0", "0", "1373", "50.9651076466221", "1321",…
#> $ additional_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…
What the function reports
summarisePerson() covers the following summaries, each
stored as a separate variable_name in the result:
| Variable | Description |
|---|---|
| Number subjects | Total row count in person. |
| Number subjects not in observation | Count and percentage of persons absent from observation_period. A warning is emitted when this is non-zero. |
| Sex | Count and percentage for Female, Male, and None (derived via PatientProfiles::addSexQuery()). |
| Sex source | Distribution of raw gender_source_value. |
| Race | Distribution of race_concept_id, resolved to concept names. |
| Race source | Distribution of raw race_source_value. |
| Ethnicity | Distribution of ethnicity_concept_id, resolved to concept names. |
| Ethnicity source | Distribution of raw ethnicity_source_value. |
| Year of birth | Numeric summary: missingness, quantiles (Q05, Q25, median, Q75, Q95), min/max. |
| Month of birth | Same numeric summary as year of birth. |
| Day of birth | Same numeric summary as year of birth. |
| Location | Missing count, zero count, and distinct values for location_id. When location_id is empty, the function attempts to derive it from care_site_id and notes this in a message. |
| Provider | Missing count, zero count, distinct values, and (below threshold) individual provider_name labels. |
| Care site | Missing count, zero count, distinct values, and (below threshold) individual care_site_name labels. |
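Every estimate in a summarised_result is stored as text in estimate_value, with its intended type recorded in estimate_type, so values need converting before any arithmetic. The sketch below illustrates this with a hand-built data frame (toy is a hypothetical name) that holds only the first six rows of the glimpse() output, trimmed to four columns. It is not produced by summarisePerson(), and the "Sex" variable name for the truncated rows is an assumption based on the Sex entry in the table above.

```r
# Toy stand-in for a few rows of a summarised_result (base R only).
# Values are copied from the glimpse() output; `toy` is a hypothetical name.
toy <- data.frame(
  variable_name = c("Number subjects",
                    "Number subjects not in observation",
                    "Number subjects not in observation",
                    "Sex", "Sex", "Sex"),  # "Sex" is assumed for the truncated rows
  variable_level = c(NA, NA, NA, "Female", "Female", "Male"),
  estimate_name = c("count", "count", "percentage",
                    "count", "percentage", "count"),
  estimate_value = c("2694", "0", "0", "1373", "50.9651076466221", "1321"),
  stringsAsFactors = FALSE
)

# Keep the sex counts and convert the character estimates to integers.
sex_counts <- toy[toy$variable_name == "Sex" & toy$estimate_name == "count", ]
sex_counts$estimate_value <- as.integer(sex_counts$estimate_value)

# Compare the summed Female and Male counts with the total number of subjects.
total <- as.integer(toy$estimate_value[toy$variable_name == "Number subjects"])
sum(sex_counts$estimate_value) == total
```

The same filter-then-convert pattern should apply to the full result object, since it shares these column names.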
Threshold
For location_id, provider_id, and
care_site_id, summarisePerson() always reports
three aggregate statistics: number missing, number of zeros, and number
of distinct values. Whether it goes further and lists each value
individually depends on a threshold.
If the number of distinct values is below 15 (the
default), the function joins to the corresponding lookup table and
appends one row per unique label — location_source_value,
provider_name, or care_site_name — each with
its own count and percentage. This gives you a readable breakdown rather
than just a count of how many distinct ids exist.
If the number of distinct values is at or above 15, only the aggregate stats are returned. Listing dozens or hundreds of individual sites would produce noise rather than insight.
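The rule described above comes down to a strict comparison between the number of distinct values and the threshold. A minimal sketch follows, assuming a hypothetical helper listsIndividualLabels() that is not part of OmopSketch and only mirrors the behaviour stated in the text:

```r
# Hypothetical helper illustrating the rule described above; not part of OmopSketch.
listsIndividualLabels <- function(nDistinct, threshold = 15) {
  # Labels are listed only when the distinct count is strictly below the threshold.
  nDistinct < threshold
}

listsIndividualLabels(14)                  # below the default of 15
listsIndividualLabels(15)                  # at the threshold: aggregates only
listsIndividualLabels(30, threshold = 50)  # a raised threshold admits more values
```

Because the comparison is strict, a column with exactly 15 distinct values gets only the aggregate statistics under the default.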
You can tune this cutoff for your database before calling
summarisePerson():
# Raise the threshold to show individual labels for fewer than 50 distinct values
options(OmopSketch.personLabels = 50)
result <- summarisePerson(cdm = cdm)
Estimate types
Depending on the variable, estimates include:
- count / percentage: for categorical variables (sex, race, ethnicity).
- count_missing / percentage_missing: missingness for id columns and birth-date fields.
- count_0 / percentage_0: zero values for id columns.
- distinct_values: number of unique non-null values for id columns.
- min, q05, q25, median, q75, q95, max: quantile summaries for birth-date fields.
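To make the id-column estimates concrete, the following base R sketch computes them by hand for a small, made-up vector, following the definitions in the list above. The vector ids is hypothetical, and using the total row count as the percentage denominator is an assumption; the package's internal computation may differ.

```r
# Hypothetical id column: two missing values, one zero, and the ids 1 and 2.
ids <- c(1, 1, 0, NA, 2, NA)

estimates <- list(
  count_missing      = sum(is.na(ids)),
  percentage_missing = 100 * mean(is.na(ids)),
  count_0            = sum(ids == 0, na.rm = TRUE),
  # Assumption: percentages are taken over all rows, including missing ones.
  percentage_0       = 100 * sum(ids == 0, na.rm = TRUE) / length(ids),
  # Unique values after dropping NA, per the "non-null" definition above.
  distinct_values    = length(unique(ids[!is.na(ids)]))
)
str(estimates)
```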
Visualise the results
tablePerson() formats the summarised_result
into a publication-ready table. The type argument accepts
"gt" (default), "reactable", or
"datatable".
tablePerson(result = result, type = "gt")
#> ! The following column type were changed:
#> • variable_name: from integer to character

| Variable name | Variable level | Estimate name | Data source: GiBleed |
|---|---|---|---|
| Number subjects | – | N | 2,694 |
| Number subjects not in observation | – | N (%) | 0 (0.00%) |
| Sex | Female | N (%) | 1,373 (50.97%) |
| | Male | N (%) | 1,321 (49.03%) |
| | None | N (%) | 0 (0.00%) |
| Sex source | F | N (%) | 1,373 (50.97%) |
| | M | N (%) | 1,321 (49.03%) |
| Race | Missing | N (%) | 2,243 (83.26%) |
| | No matching concept | N (%) | 451 (16.74%) |
| Race source | asian | N (%) | 212 (7.87%) |
| | black | N (%) | 338 (12.55%) |
| | hispanic | N (%) | 435 (16.15%) |
| | native | N (%) | 14 (0.52%) |
| | other | N (%) | 2 (0.07%) |
| | white | N (%) | 1,693 (62.84%) |
| Ethnicity | Missing | N (%) | 435 (16.15%) |
| | No matching concept | N (%) | 2,259 (83.85%) |
| Ethnicity source | african | N (%) | 119 (4.42%) |
| | american | N (%) | 79 (2.93%) |
| | american_indian | N (%) | 14 (0.52%) |
| | arab | N (%) | 2 (0.07%) |
| | asian_indian | N (%) | 81 (3.01%) |
| | central_american | N (%) | 75 (2.78%) |
| | chinese | N (%) | 131 (4.86%) |
| | dominican | N (%) | 105 (3.90%) |
| | english | N (%) | 218 (8.09%) |
| | french | N (%) | 129 (4.79%) |
| | french_canadian | N (%) | 74 (2.75%) |
| | german | N (%) | 130 (4.83%) |
| | greek | N (%) | 19 (0.71%) |
| | irish | N (%) | 438 (16.26%) |
| | italian | N (%) | 295 (10.95%) |
| | mexican | N (%) | 42 (1.56%) |
| | polish | N (%) | 107 (3.97%) |
| | portuguese | N (%) | 93 (3.45%) |
| | puerto_rican | N (%) | 258 (9.58%) |
| | russian | N (%) | 34 (1.26%) |
| | scottish | N (%) | 48 (1.78%) |
| | south_american | N (%) | 60 (2.23%) |
| | swedish | N (%) | 29 (1.08%) |
| | west_indian | N (%) | 114 (4.23%) |
| Year of birth | – | Missing (%) | 0 (0.00%) |
| | | Median [Q25 - Q75] | 1,961 [1,950 - 1,970] |
| | | 90% Range [Q05 to Q95] | 1,922 to 1,979 |
| | | Range [min to max] | 1,908 to 1,986 |
| Month of birth | – | Missing (%) | 0 (0.00%) |
| | | Median [Q25 - Q75] | 7 [4 - 10] |
| | | 90% Range [Q05 to Q95] | 1 to 12 |
| | | Range [min to max] | 1 to 12 |
| Day of birth | – | Missing (%) | 0 (0.00%) |
| | | Median [Q25 - Q75] | 16 [8 - 23] |
| | | 90% Range [Q05 to Q95] | 2 to 29 |
| | | Range [min to max] | 1 to 31 |
| Location | – | Missing (%) | 2,694 (100.00%) |
| | | Zero count (%) | 0 (0.00%) |
| | | Distinct values | 1 |
| Provider | – | Missing (%) | 2,694 (100.00%) |
| | | Zero count (%) | 0 (0.00%) |
| | | Distinct values | 1 |
| Care site | – | Missing (%) | 2,694 (100.00%) |
| | | Zero count (%) | 0 (0.00%) |
| | | Distinct values | 1 |
The data column header is driven by the CDM name, and estimate
combinations are rendered as human-readable labels
(e.g. N (%), Median [Q25 - Q75],
Missing (%)).
Disconnect
cdmDisconnect(cdm = cdm)