Introduction

This vignette walks through the OmopSketch functions designed to provide a concise overview of the OMOP person table. Two functions cover this workflow:

  • summarisePerson(): computes summary statistics and data-quality checks for the person table, including total subject counts, observation-period coverage, sex/race/ethnicity distributions, birth-date components, and summaries of id-columns (location_id, provider_id, care_site_id).
  • tablePerson(): renders the results as a formatted table (gt, reactable, or datatable).

Setup

Load the required packages and create a mock CDM using omock.

library(dplyr)
library(OmopSketch)
library(omock)

cdm <- mockCdmFromDataset(datasetName = "GiBleed", source = "duckdb")
cdm

Summarise the person table

Call summarisePerson() to compute all summaries. It returns a summarised_result object: a standardised long-format table with one row per estimate, used across the OMOP analytics ecosystem.

result <- summarisePerson(cdm = cdm)
result |> glimpse()
#> Rows: 123
#> Columns: 13
#> $ result_id        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
#> $ cdm_name         <chr> "GiBleed", "GiBleed", "GiBleed", "GiBleed", "GiBleed"…
#> $ group_name       <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ group_level      <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_name      <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_level     <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_name    <chr> "Number subjects", "Number subjects not in observatio…
#> $ variable_level   <chr> NA, NA, NA, "Female", "Female", "Male", "Male", "None…
#> $ estimate_name    <chr> "count", "count", "percentage", "count", "percentage"…
#> $ estimate_type    <chr> "integer", "integer", "numeric", "integer", "numeric"…
#> $ estimate_value   <chr> "2694", "0", "0", "1373", "50.9651076466221", "1321",
#> $ additional_name  <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…
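
Because estimate_value is stored as character, downstream calculations usually need a filter followed by a type conversion. The sketch below shows one way to do that in base R. It runs on a hand-built stand-in for a few rows of result rather than on the real object; the data frame toy and the name sex_counts are illustrative assumptions, with values copied from the output above.

```r
# Hand-built stand-in for a few rows of `result` (values copied from the
# glimpse() output above); this is not the real summarised_result object.
toy <- data.frame(
  variable_name  = c("Number subjects", "Sex", "Sex", "Sex"),
  variable_level = c(NA, "Female", "Female", "Male"),
  estimate_name  = c("count", "count", "percentage", "count"),
  estimate_type  = c("integer", "integer", "numeric", "integer"),
  estimate_value = c("2694", "1373", "50.9651076466221", "1321"),
  stringsAsFactors = FALSE
)

# Keep only the sex counts, then convert the character estimates to integers
sex_counts <- toy[toy$variable_name == "Sex" & toy$estimate_name == "count", ]
sex_counts$estimate_value <- as.integer(sex_counts$estimate_value)

sex_counts[, c("variable_level", "estimate_value")]
```

The same filter-then-convert pattern should carry over to the real result, since it has the same columns; the estimate_type column indicates which conversion is appropriate for each row.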

What the function reports

summarisePerson() covers the following summaries, each stored as a separate variable_name in the result:

Variable | Description
Number subjects | Total row count in person.
Number subjects not in observation | Count and percentage of persons absent from observation_period. A warning is emitted when this is non-zero.
Sex | Count and percentage for Female, Male, and None (derived via PatientProfiles::addSexQuery()).
Sex source | Distribution of raw gender_source_value.
Race | Distribution of race_concept_id, resolved to concept names.
Race source | Distribution of raw race_source_value.
Ethnicity | Distribution of ethnicity_concept_id, resolved to concept names.
Ethnicity source | Distribution of raw ethnicity_source_value.
Year of birth | Numeric summary: missingness, quantiles (Q05, Q25, median, Q75, Q95), min/max.
Month of birth | Same numeric summary as year of birth.
Day of birth | Same numeric summary as year of birth.
Location | Missing count, zero count, and distinct values for location_id. When location_id is empty, the function attempts to derive it from care_site_id and notes this in a message.
Provider | Missing count, zero count, distinct values, and (below threshold) individual provider_name labels.
Care site | Missing count, zero count, distinct values, and (below threshold) individual care_site_name labels.

Threshold

For location_id, provider_id, and care_site_id, summarisePerson() always reports three aggregate statistics: number missing, number of zeros, and number of distinct values. Whether it goes further and lists each value individually depends on a threshold.

If the number of distinct values is below 15 (the default), the function joins to the corresponding lookup table and appends one row per unique label — location_source_value, provider_name, or care_site_name — each with its own count and percentage. This gives you a readable breakdown rather than just a count of how many distinct ids exist.

If the number of distinct values is at or above 15, only the aggregate stats are returned. Listing dozens or hundreds of individual sites would produce noise rather than insight.

You can tune this cutoff for your database before calling summarisePerson():

# Raise the threshold so individual labels are shown for fewer than 50 distinct values
options(OmopSketch.personLabels = 50)
result <- summarisePerson(cdm = cdm)
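
The threshold rule in this section can be sketched in a few lines of base R. This is only an illustration of the decision logic, not the package's implementation: summariseIdColumn() is a hypothetical helper, and it assumes distinct values are counted over non-missing ids.

```r
# Hypothetical helper (not part of OmopSketch): compute the three aggregate
# statistics for an id column and decide whether labels would be listed.
summariseIdColumn <- function(ids, threshold = 15) {
  # Assumption: distinct values are counted over non-missing ids only
  distinct <- length(unique(ids[!is.na(ids)]))
  list(
    count_missing   = sum(is.na(ids)),
    count_0         = sum(ids == 0, na.rm = TRUE),
    distinct_values = distinct,
    # Labels are listed only when strictly below the threshold
    list_labels     = distinct < threshold
  )
}

few  <- summariseIdColumn(c(1, 2, 2, 0, NA))  # 3 distinct ids
many <- summariseIdColumn(1:20)               # 20 distinct ids

few$list_labels
many$list_labels
```

With the default of 15, the first call falls below the cutoff and the second does not, so only the first would get per-label rows.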

Estimate types

Depending on the variable, estimates include:

  • count / percentage — for categorical variables (sex, race, ethnicity).
  • count_missing / percentage_missing — missingness for id-columns and birth-date fields.
  • count_0 / percentage_0 — zero values for id-columns.
  • distinct_values — number of unique non-null values for id-columns.
  • min, q05, q25, median, q75, q95, max — quantile summaries for birth-date fields.
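
To make the estimate names concrete, the sketch below computes the same set of estimates for a birth-date field with base R. The birth years are made up for illustration, and quantile() with its default interpolation is an assumption; the package works against the database and may compute quantiles differently.

```r
# Toy birth years, made up for illustration (one value is missing)
birth_years <- c(1940, 1950, 1960, 1970, 1980, NA)

n_missing <- sum(is.na(birth_years))
probs <- c(q05 = 0.05, q25 = 0.25, median = 0.50, q75 = 0.75, q95 = 0.95)

# Named vector using the estimate names listed above
estimates <- c(
  count_missing      = n_missing,
  percentage_missing = 100 * n_missing / length(birth_years),
  min                = min(birth_years, na.rm = TRUE),
  setNames(
    quantile(birth_years, probs, na.rm = TRUE, names = FALSE),
    names(probs)
  ),
  max                = max(birth_years, na.rm = TRUE)
)

estimates
```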

Visualise the results

tablePerson() formats the summarised_result into a publication-ready table. The type argument accepts "gt" (default), "reactable", or "datatable".

tablePerson(result = result, type = "gt")
#> ! The following column type were changed:
#>  variable_name: from integer to character
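Summary of person table

Variable name | Variable level | Estimate name | Data source: GiBleed
Number subjects |  | N | 2,694
Number subjects not in observation |  | N (%) | 0 (0.00%)
Sex | Female | N (%) | 1,373 (50.97%)
Sex | Male | N (%) | 1,321 (49.03%)
Sex | None | N (%) | 0 (0.00%)
Sex source | F | N (%) | 1,373 (50.97%)
Sex source | M | N (%) | 1,321 (49.03%)
Race | Missing | N (%) | 2,243 (83.26%)
Race | No matching concept | N (%) | 451 (16.74%)
Race source | asian | N (%) | 212 (7.87%)
Race source | black | N (%) | 338 (12.55%)
Race source | hispanic | N (%) | 435 (16.15%)
Race source | native | N (%) | 14 (0.52%)
Race source | other | N (%) | 2 (0.07%)
Race source | white | N (%) | 1,693 (62.84%)
Ethnicity | Missing | N (%) | 435 (16.15%)
Ethnicity | No matching concept | N (%) | 2,259 (83.85%)
Ethnicity source | african | N (%) | 119 (4.42%)
Ethnicity source | american | N (%) | 79 (2.93%)
Ethnicity source | american_indian | N (%) | 14 (0.52%)
Ethnicity source | arab | N (%) | 2 (0.07%)
Ethnicity source | asian_indian | N (%) | 81 (3.01%)
Ethnicity source | central_american | N (%) | 75 (2.78%)
Ethnicity source | chinese | N (%) | 131 (4.86%)
Ethnicity source | dominican | N (%) | 105 (3.90%)
Ethnicity source | english | N (%) | 218 (8.09%)
Ethnicity source | french | N (%) | 129 (4.79%)
Ethnicity source | french_canadian | N (%) | 74 (2.75%)
Ethnicity source | german | N (%) | 130 (4.83%)
Ethnicity source | greek | N (%) | 19 (0.71%)
Ethnicity source | irish | N (%) | 438 (16.26%)
Ethnicity source | italian | N (%) | 295 (10.95%)
Ethnicity source | mexican | N (%) | 42 (1.56%)
Ethnicity source | polish | N (%) | 107 (3.97%)
Ethnicity source | portuguese | N (%) | 93 (3.45%)
Ethnicity source | puerto_rican | N (%) | 258 (9.58%)
Ethnicity source | russian | N (%) | 34 (1.26%)
Ethnicity source | scottish | N (%) | 48 (1.78%)
Ethnicity source | south_american | N (%) | 60 (2.23%)
Ethnicity source | swedish | N (%) | 29 (1.08%)
Ethnicity source | west_indian | N (%) | 114 (4.23%)
Year of birth |  | Missing (%) | 0 (0.00%)
Year of birth |  | Median [Q25 - Q75] | 1,961 [1,950 - 1,970]
Year of birth |  | 90% Range [Q05 to Q95] | 1,922 to 1,979
Year of birth |  | Range [min to max] | 1,908 to 1,986
Month of birth |  | Missing (%) | 0 (0.00%)
Month of birth |  | Median [Q25 - Q75] | 7 [4 - 10]
Month of birth |  | 90% Range [Q05 to Q95] | 1 to 12
Month of birth |  | Range [min to max] | 1 to 12
Day of birth |  | Missing (%) | 0 (0.00%)
Day of birth |  | Median [Q25 - Q75] | 16 [8 - 23]
Day of birth |  | 90% Range [Q05 to Q95] | 2 to 29
Day of birth |  | Range [min to max] | 1 to 31
Location |  | Missing (%) | 2,694 (100.00%)
Location |  | Zero count (%) | 0 (0.00%)
Location |  | Distinct values | 1
Provider |  | Missing (%) | 2,694 (100.00%)
Provider |  | Zero count (%) | 0 (0.00%)
Provider |  | Distinct values | 1
Care site |  | Missing (%) | 2,694 (100.00%)
Care site |  | Zero count (%) | 0 (0.00%)
Care site |  | Distinct values | 1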

The data column is headed by the CDM name (here GiBleed), and estimate combinations are rendered as human-readable labels (e.g. N (%), Median [Q25 - Q75], Missing (%)).
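
To see how a count and a percentage could collapse into a single N (%) cell, here is a minimal base R sketch. formatCountPct() is a hypothetical helper written for this illustration, not a function from OmopSketch or its table backend; the inputs are the Female count and percentage from the summarised result shown earlier.

```r
# Hypothetical helper (not part of OmopSketch): combine a count and a
# percentage into one "N (%)" label with a thousands separator and
# two decimal places.
formatCountPct <- function(count, percentage) {
  paste0(
    formatC(count, format = "d", big.mark = ","),
    " (", sprintf("%.2f", percentage), "%)"
  )
}

# estimate_value is stored as character, so convert before formatting
female_label <- formatCountPct(
  count      = as.integer("1373"),
  percentage = as.numeric("50.9651076466221")
)
female_label
# "1,373 (50.97%)"
```

The output matches the Female row of the Sex block in the rendered table.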

Disconnect

cdmDisconnect(cdm = cdm)