Summarise the person table

Introduction

This vignette walks through the OmopSketch functions designed to provide a concise overview of the OMOP person table. Two functions cover this workflow:

summarisePerson(): computes summary statistics and data-quality checks for the person table, including total subject counts, observation-period coverage, sex/race/ethnicity distributions, birth-date components, and summaries of id-columns (location_id, provider_id, care_site_id).
tablePerson(): renders the results as a formatted table (gt, reactable, or datatable).

Setup

Load the required packages and create a mock CDM using omock.

library(dplyr)
library(OmopSketch)
library(omock)

cdm <- mockCdmFromDataset(datasetName = "GiBleed", source = "duckdb")
cdm

Call summarisePerson() to compute all summaries. It returns a summarised_result object — a standardised tidy format used across the OMOP analytics ecosystem.

result <- summarisePerson(cdm = cdm)
result |> glimpse()
#> Rows: 123
#> Columns: 13
#> $ result_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ cdm_name         <chr> "GiBleed", "GiBleed", "GiBleed", "GiBleed", "GiBleed"…
#> $ group_name       <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ group_level      <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_name      <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_level     <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_name    <chr> "Number subjects", "Number subjects not in observatio…
#> $ variable_level   <chr> NA, NA, NA, "Female", "Female", "Male", "Male", "None…
#> $ estimate_name    <chr> "count", "count", "percentage", "count", "percentage"…
#> $ estimate_type    <chr> "integer", "integer", "numeric", "integer", "numeric"…
#> $ estimate_value   <chr> "2694", "0", "0", "1373", "50.9651076466221", "1321",…
#> $ additional_name  <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…
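
Because estimate_value is stored as character, it has to be converted before doing any arithmetic. The sketch below rebuilds the first six rows by hand from the glimpse() output above so that it runs without a CDM. The object name result_head is illustrative, and the variable names for rows three to six are inferred from the tables in this vignette rather than read from the truncated output.

```r
# First six rows of the result, copied from the glimpse() output above.
# The "Sex" variable names are inferred from the variable_level column.
result_head <- data.frame(
  variable_name = c(
    "Number subjects",
    "Number subjects not in observation",
    "Number subjects not in observation",
    "Sex", "Sex", "Sex"
  ),
  variable_level = c(NA, NA, NA, "Female", "Female", "Male"),
  estimate_name = c("count", "count", "percentage", "count", "percentage", "count"),
  estimate_type = c("integer", "integer", "numeric", "integer", "numeric", "integer"),
  estimate_value = c("2694", "0", "0", "1373", "50.9651076466221", "1321"),
  stringsAsFactors = FALSE
)

# Keep only the sex counts, then convert the character values to integers.
sex_counts <- subset(
  result_head,
  variable_name == "Sex" & estimate_name == "count"
)
sex_counts$estimate_value <- as.integer(sex_counts$estimate_value)
sex_counts[, c("variable_level", "estimate_value")]
```

The same filter-then-convert pattern should carry over to the full result object, for example with dplyr::filter() in place of subset().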

What the function reports

summarisePerson() covers the following summaries, each stored as a separate variable_name in the result:

Variable	Description
Number subjects	Total row count in `person`.
Number subjects not in observation	Count and percentage of persons absent from `observation_period`. A warning is emitted when this is non-zero.
Sex	Count and percentage for `Female`, `Male`, and `None` (derived via `PatientProfiles::addSexQuery()`).
Sex source	Distribution of raw `gender_source_value`.
Race	Distribution of `race_concept_id`, resolved to concept names.
Race source	Distribution of raw `race_source_value`.
Ethnicity	Distribution of `ethnicity_concept_id`, resolved to concept names.
Ethnicity source	Distribution of raw `ethnicity_source_value`.
Year of birth	Numeric summary: missingness, quantiles (Q05, Q25, median, Q75, Q95), min/max.
Month of birth	Same numeric summary as year of birth.
Day of birth	Same numeric summary as year of birth.
Location	Missing count, zero count, distinct values, and (below threshold) individual `location_source_value` labels for `location_id`. When `location_id` is empty, the function attempts to derive it from `care_site_id` and notes this in a message.
Provider	Missing count, zero count, distinct values, and (below threshold) individual `provider_name` labels.
Care site	Missing count, zero count, distinct values, and (below threshold) individual `care_site_name` labels.

Threshold

For location_id, provider_id, and care_site_id, summarisePerson() always reports three aggregate statistics: number missing, number of zeros, and number of distinct values. Whether it goes further and lists each value individually depends on a threshold.

If the number of distinct values is below 15 (the default), the function joins to the corresponding lookup table and appends one row per unique label — location_source_value, provider_name, or care_site_name — each with its own count and percentage. This gives you a readable breakdown rather than just a count of how many distinct ids exist.

If the number of distinct values is at or above 15, only the aggregate stats are returned. Listing dozens or hundreds of individual sites would produce noise rather than insight.
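
The rule described above can be sketched in a few lines of base R. The helper summarise_id_column() is hypothetical and is not part of OmopSketch. It only mirrors the described behaviour: three aggregate statistics, plus a flag that says whether individual labels would be listed. Counting NA as one distinct value is an assumption made here to match the example output in this vignette.

```r
# Hypothetical helper, not the OmopSketch implementation.
summarise_id_column <- function(ids, threshold = 15) {
  # unique() keeps NA as one value; treating it that way is an assumption.
  n_distinct <- length(unique(ids))
  list(
    n_missing = sum(is.na(ids)),
    n_zero = sum(ids == 0, na.rm = TRUE),
    n_distinct = n_distinct,
    # Labels are listed only when strictly below the threshold.
    list_labels = n_distinct < threshold
  )
}

few_sites <- summarise_id_column(c(1, 1, 2, 0, NA))
many_sites <- summarise_id_column(seq_len(15))

few_sites$list_labels   # 4 distinct values, below 15
many_sites$list_labels  # exactly 15 distinct values, not below 15
```

The strict comparison is what places a column with exactly 15 distinct values in the aggregate-only group.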

You can tune this cutoff for your database before calling summarisePerson():

# Raise the threshold to show individual labels for up to 50 distinct values
options(OmopSketch.personLabels = 50)
result <- summarisePerson(cdm = cdm)
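
Since options() changes a session-wide setting, it can be useful to restore the previous value once the summary has been computed. A minimal sketch using only base R follows. The names old and threshold_in_use are illustrative, and the summarisePerson() call is left as a comment so the snippet runs without a CDM.

```r
# options() returns the previous values, so they can be restored later.
old <- options(OmopSketch.personLabels = 50)
threshold_in_use <- getOption("OmopSketch.personLabels")

# ... call summarisePerson(cdm = cdm) here, as in the vignette ...

# Put the previous setting back (removes the option if it was unset before).
options(old)
```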

Estimate types

Depending on the variable, estimates include:

count / percentage: for subject counts and categorical variables (sex, race, ethnicity).
count_missing / percentage_missing: missingness for id-columns and birth-date fields.
count_0 / percentage_0: zero values for id-columns.
distinct_values: number of distinct values for id-columns. In the example output in this vignette, a column that is entirely missing still reports 1, so missing values appear to be counted as one value.
min, q05, q25, median, q75, q95, max: quantile summaries for birth-date fields.
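
As an illustration of working with the quantile estimates, the sketch below turns them into a named vector and runs a simple sanity check. The values are the year-of-birth figures shown in the rendered table in this vignette, and the object names are illustrative.

```r
# Year-of-birth estimates, copied from the rendered table in this vignette.
year_of_birth <- data.frame(
  estimate_name = c("min", "q05", "q25", "median", "q75", "q95", "max"),
  estimate_value = c("1908", "1922", "1950", "1961", "1970", "1979", "1986"),
  stringsAsFactors = FALSE
)

# Convert the character values and name them by estimate.
quantiles <- setNames(
  as.integer(year_of_birth$estimate_value),
  year_of_birth$estimate_name
)

# Quantiles should never decrease from min to max.
quantiles_ordered <- !is.unsorted(quantiles)

# Interquartile range of the year of birth.
iqr <- unname(quantiles["q75"] - quantiles["q25"])
```

A decreasing sequence here would point to a problem in the underlying birth-date fields or in how the result was assembled.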

Visualise the results

tablePerson() formats the summarised_result into a publication-ready table. The type argument accepts "gt" (default), "reactable", or "datatable".

tablePerson(result = result, type = "gt")
#> ! The following column types were changed:
#> • variable_name: from integer to character

Summary of person table
Variable name	Variable level	Estimate name	Data source: GiBleed
Number subjects	–	N	2,694
Number subjects not in observation	–	N (%)	0 (0.00%)
Sex	Female	N (%)	1,373 (50.97%)
	Male	N (%)	1,321 (49.03%)
	None	N (%)	0 (0.00%)
Sex source	F	N (%)	1,373 (50.97%)
	M	N (%)	1,321 (49.03%)
Race	Missing	N (%)	2,243 (83.26%)
	No matching concept	N (%)	451 (16.74%)
Race source	asian	N (%)	212 (7.87%)
	black	N (%)	338 (12.55%)
	hispanic	N (%)	435 (16.15%)
	native	N (%)	14 (0.52%)
	other	N (%)	2 (0.07%)
	white	N (%)	1,693 (62.84%)
Ethnicity	Missing	N (%)	435 (16.15%)
	No matching concept	N (%)	2,259 (83.85%)
Ethnicity source	african	N (%)	119 (4.42%)
	american	N (%)	79 (2.93%)
	american_indian	N (%)	14 (0.52%)
	arab	N (%)	2 (0.07%)
	asian_indian	N (%)	81 (3.01%)
	central_american	N (%)	75 (2.78%)
	chinese	N (%)	131 (4.86%)
	dominican	N (%)	105 (3.90%)
	english	N (%)	218 (8.09%)
	french	N (%)	129 (4.79%)
	french_canadian	N (%)	74 (2.75%)
	german	N (%)	130 (4.83%)
	greek	N (%)	19 (0.71%)
	irish	N (%)	438 (16.26%)
	italian	N (%)	295 (10.95%)
	mexican	N (%)	42 (1.56%)
	polish	N (%)	107 (3.97%)
	portuguese	N (%)	93 (3.45%)
	puerto_rican	N (%)	258 (9.58%)
	russian	N (%)	34 (1.26%)
	scottish	N (%)	48 (1.78%)
	south_american	N (%)	60 (2.23%)
	swedish	N (%)	29 (1.08%)
	west_indian	N (%)	114 (4.23%)
Year of birth	–	Missing (%)	0 (0.00%)
		Median [Q25 - Q75]	1,961 [1,950 - 1,970]
		90% Range [Q05 to Q95]	1,922 to 1,979
		Range [min to max]	1,908 to 1,986
Month of birth	–	Missing (%)	0 (0.00%)
		Median [Q25 - Q75]	7 [4 - 10]
		90% Range [Q05 to Q95]	1 to 12
		Range [min to max]	1 to 12
Day of birth	–	Missing (%)	0 (0.00%)
		Median [Q25 - Q75]	16 [8 - 23]
		90% Range [Q05 to Q95]	2 to 29
		Range [min to max]	1 to 31
Location	–	Missing (%)	2,694 (100.00%)
		Zero count (%)	0 (0.00%)
		Distinct values	1
Provider	–	Missing (%)	2,694 (100.00%)
		Zero count (%)	0 (0.00%)
		Distinct values	1
Care site	–	Missing (%)	2,694 (100.00%)
		Zero count (%)	0 (0.00%)
		Distinct values	1

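The header of the data column is taken from the CDM name (here GiBleed), and the estimate combinations are rendered as human-readable labels (e.g. N (%), Median [Q25 - Q75], Missing (%)). Numeric estimates are shown with a thousands separator, which is why years of birth appear as 1,961 rather than 1961.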
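
To make the N (%) label concrete, the sketch below combines a count and a percentage into the same shape as the table cells. The helper format_n_pct() is hypothetical: it reproduces the displayed format in base R and is not the code tablePerson() uses internally.

```r
# Hypothetical helper that reproduces the "N (%)" cell format.
format_n_pct <- function(count, percentage) {
  paste0(
    format(count, big.mark = ","),          # thousands separator
    " (", sprintf("%.2f", percentage), "%)" # two decimal places
  )
}

# Count and percentage for Female, as stored in the summarised result.
label <- format_n_pct(1373L, 50.9651076466221)
label
```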

Disconnect

cdmDisconnect(cdm = cdm)