This vignette describes how to use Keeper to generate patient summaries for case adjudication from data mapped to the OMOP Common Data Model (CDM).
Keeper extracts and summarizes patient-level data for individuals in a specified cohort to facilitate the review of patient profiles. Such examinations can be used to interactively develop a phenotype (cohort definition) of a disease or - in its primary use case - to determine if patients truly have the disease and subsequently calculate positive predictive value (PPV).
This review should be conducted by someone familiar with the disease of interest, the underlying data, and the data collection process. Alternatively, the review can be performed by large language models (LLMs), which additionally enables the calculation of sensitivity and specificity. Please refer to the Using Keeper with LLMs vignette for more details on that use case.
The first step is to write a clinical definition. For this exercise, we will use a brief version, using Type 1 Diabetes Mellitus (T1DM) as our example phenotype.
T1DM is an autoimmune condition characterized by decreased insulin production by the pancreas. Onset most commonly occurs in childhood or adolescence, but it can present in adults. Symptoms include weight loss, polyuria, polydipsia, fatigue, and others. Common differential diagnoses (conditions that must be ruled out) include type 2 diabetes, pancreatic disorders such as cystic fibrosis, pancreatic necrosis, steroid-induced diabetes, renal glycosuria, and other conditions. Diagnostic procedures include glucose measurements, C-peptide, pancreatic and insulin antibodies, as well as HbA1c testing. It is primarily treated with insulin. Complications can include hypo- and hyperglycemia, neuropathy, nephropathy, cerebrovascular disease, and peripheral artery disease.
We will use this definition to construct our inputs. The concept of differential diagnosis is crucial; for each input category (except the disease of interest itself), we will consider concepts related to both T1DM and its differential diagnoses to evaluate evidence for the disease and to effectively rule out alternatives.
Keeper extracts data based on user-defined concept sets. If a concept belonging to an input concept set is found in a patient’s records, Keeper will extract it along with its date relative to the index date. Therefore, careful concept set selection is highly important.
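As a small illustration of what "date relative to the index date" means, the sketch below converts absolute event dates into day offsets using base R. The dates are invented, and this is not Keeper's internal code, only the arithmetic behind the idea.

```r
# Toy dates, invented for illustration only.
indexDate <- as.Date("2020-03-15")
eventDates <- as.Date(c("2020-03-01", "2020-03-15", "2020-04-14"))

# Days relative to the index date:
# negative = before index, 0 = index day, positive = after index.
relativeDay <- as.integer(eventDates - indexDate)
relativeDay
```

Working with offsets rather than calendar dates also makes it possible to share profiles without exposing absolute dates.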
These input concept sets can be created manually, but this process can be challenging and labor-intensive. Alternatively, we can use large language models (LLMs) to generate initial concept sets.
We use the ellmer package to connect to an LLM from your
provider of choice, including Anthropic, Google, OpenAI, or a local LLM.
For example, we can connect to OpenAI’s ChatGPT using:
library(ellmer)
client <- chat_openai()
This assumes you have set the OPENAI_API_KEY environment variable. See the ellmer package for details on connecting to various providers.
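Before creating the client, it can help to confirm that the key is visible to the R session. The following is a minimal base R sketch; the helper name `hasEnvVar` is hypothetical and is not part of ellmer or Keeper.

```r
# Hypothetical helper: TRUE if the named environment variable is set and non-empty.
hasEnvVar <- function(name) {
  nzchar(Sys.getenv(name, unset = ""))
}

if (!hasEnvVar("OPENAI_API_KEY")) {
  # Only a reminder; how you store the key (e.g. in .Renviron) is up to you.
  message("OPENAI_API_KEY is not set in this R session.")
}
```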
We also need access to a database containing the OHDSI Vocabulary tables. We specify the connection details like so:
library(DatabaseConnector)
connectionDetails <- createConnectionDetails(
  dbms = "postgresql",
  server = "localhost/ohdsi",
  user = "joe",
  password = "supersecret"
)
vocabularyDatabaseSchema <- "cdm"
See the createConnectionDetails() documentation for more information on connecting to your database server.
Next, we can generate the concept sets:
conceptSets <- generateKeeperConceptSets(
  phenotype = "Type I Diabetes Mellitus (T1DM)",
  client = client,
  vocabConnectionDetails = connectionDetails,
  vocabDatabaseSchema = vocabularyDatabaseSchema
)
conceptSets
## # A tibble: 481 × 5
## conceptId conceptName vocabularyId conceptSetName target
## <int> <chr> <chr> <chr> <chr>
## 1 201254 Type 1 diabetes mellitus SNOMED doi Disea…
## 2 435216 Disorder due to type 1 diabetes… SNOMED doi Disea…
## 3 201826 Type 2 diabetes mellitus SNOMED alternativeDi… Disea…
## 4 195771 Secondary diabetes mellitus SNOMED alternativeDi… Disea…
## 5 195212 Hypercortisolism SNOMED alternativeDi… Disea…
## 6 438476 Vasopressin resistance SNOMED alternativeDi… Disea…
## 7 40480068 Drug-induced hyperglycemia SNOMED alternativeDi… Disea…
## 8 4163735 Hemochromatosis SNOMED alternativeDi… Disea…
## 9 193170 Renal glycosuria SNOMED alternativeDi… Disea…
## 10 37016349 Hyperglycemia due to type 2 dia… SNOMED alternativeDi… Disea…
## # ℹ 471 more rows
We can also create the concept sets manually, or, if we used an LLM to generate them, review the outputs. The format of the concept sets should be a data frame with the following columns:
conceptId: The specific concept ID.
conceptName: The name of the concept.
vocabularyId: The vocabulary the concept belongs to.
conceptSetName: The category of the concept set. Allowed values are: "doi", "alternativeDiagnosis", "symptoms", "drugs", "diagnosticProcedures", "measurements", "treatmentProcedures", "complications".
target: Either "doi" or "alternativeDiagnosis", depending on which condition the concept is related to. This distinction only matters for color-coding within the Shiny app.
Importantly, when using useDescendants = TRUE in generateKeeper() (which is the default setting), all descendants of the concepts specified here will automatically be included.
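To make the expected format concrete, here is a minimal sketch of a hand-built concept set table together with a basic sanity check. The concept IDs come from the generated output shown earlier; the second concept name is truncated in that output and is written out in full here as an assumption. The `target` values follow the column description above, and `checkConceptSets` is a hypothetical helper, not a Keeper function.

```r
# Hand-built concept sets in the documented five-column format.
manualConceptSets <- data.frame(
  conceptId = c(201254L, 435216L, 201826L, 193170L),
  conceptName = c(
    "Type 1 diabetes mellitus",
    "Disorder due to type 1 diabetes mellitus", # assumed full name
    "Type 2 diabetes mellitus",
    "Renal glycosuria"
  ),
  vocabularyId = "SNOMED",
  conceptSetName = c("doi", "doi", "alternativeDiagnosis", "alternativeDiagnosis"),
  target = c("doi", "doi", "alternativeDiagnosis", "alternativeDiagnosis"),
  stringsAsFactors = FALSE
)

requiredColumns <- c("conceptId", "conceptName", "vocabularyId", "conceptSetName", "target")
allowedSetNames <- c(
  "doi", "alternativeDiagnosis", "symptoms", "drugs",
  "diagnosticProcedures", "measurements", "treatmentProcedures", "complications"
)

# Hypothetical helper: stops with an error if the table does not match the format.
checkConceptSets <- function(conceptSets) {
  missingColumns <- setdiff(requiredColumns, names(conceptSets))
  if (length(missingColumns) > 0) {
    stop("Missing columns: ", paste(missingColumns, collapse = ", "))
  }
  badNames <- setdiff(unique(conceptSets$conceptSetName), allowedSetNames)
  if (length(badNames) > 0) {
    stop("Unknown conceptSetName values: ", paste(badNames, collapse = ", "))
  }
  invisible(TRUE)
}

checkConceptSets(manualConceptSets)
```

Running a check like this before calling Keeper catches misspelled category names early, which would otherwise silently leave a category empty.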
Below, we discuss each concept set category in detail.
DOI (Disease of Interest) is the target condition being evaluated. Here, we select two concepts along with their descendants:
The first code represents T1DM itself, while the second code denotes diseases occurring due to T1DM, which implies the patient also has T1DM. A common strategy is to select the codes used as index event criteria in the phenotype definition. If useDescendants is set to TRUE (the default behavior), Keeper will use the vocabulary hierarchy to pull in descendants of the selected concepts.
The DOI is looked up in the CONDITION_OCCURRENCE
table.
Alternative diagnoses are the competing conditions we want to rule out. Differential diagnoses for T1DM include the following conditions:
Alternative diagnosis codes are looked up in the
CONDITION_OCCURRENCE table within 90 days before and after
the index date.
Note: For all subsequent categories, we want to select the concepts relevant to the DOI as well as those relevant to the alternative (competing/differential) diagnoses.
Here we input symptoms typically occurring in T1DM and its differential diagnoses. These are signs and symptoms that occur in a short time window before disease onset.
Based on our clinical definition, we selected the following codes:
These are broad SNOMED codes representing the symptoms we are interested in; source codes of the corresponding conditions map either to them directly or to their descendants.
A good approach for selecting codes for this section (and subsequent sections) is to input your term in ATLAS Search and click on the green shopping cart (the Phoebe initial code selection feature) to get a starting point. Then, use Phoebe (the Recommend tab within the ATLAS Concept Set module) to explore related recommendations. Instructions on how to use Phoebe can be found here. While you can explore your local data to find appropriate SNOMED codes using string searches, you are more likely to miss relevant codes this way.
Symptoms are looked up in the OBSERVATION and
CONDITION_OCCURRENCE tables within the 30 days prior to the
index date.
We selected drugs (ancestor terms with their descendants) used to treat T1DM as well as the differential diagnoses:
Drugs are looked up in the DRUG_ERA table any time prior
to and any time after the index date (displayed as two separate
columns).
Diagnostic procedures are the procedure codes used for diagnosing the disease of interest or alternative disease(s).
Diagnostic procedures are looked up in the
PROCEDURE_OCCURRENCE table within 30 days prior to and
after the index date.
Measurements are laboratory tests used to diagnose T1DM and differential diagnoses:
Note that there are often many variants of a given measurement. Be sure to include them all.
Measurements are looked up in the MEASUREMENT table
within 30 days prior to and after the index date.
Treatment procedures correspond to the treatment of the disease of interest or alternative disease(s). In this case, most procedures correspond to alternative diagnoses:
Treatment procedures are looked up in the
PROCEDURE_OCCURRENCE table any time after the index
date.
Complications are other conditions occurring as a result of the disease. We selected the following codes along with their descendants:
Complications are looked up in the CONDITION_OCCURRENCE
table any time before or after the index date (displayed as two separate
columns).
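The lookup tables and time windows described in the sections above can be collected into a small reference table. The sketch below does this in base R. It is a reading of the text rather than Keeper's configuration: whether boundary days and day 0 itself are included is not stated for every category, so the bounds are assumptions, and the DOI is omitted because no window is given for it. `isInWindow` is a hypothetical helper.

```r
# Windows in days relative to the index date (-Inf / Inf = any time before / after).
lookupWindows <- data.frame(
  conceptSetName = c(
    "alternativeDiagnosis", "symptoms", "drugs", "diagnosticProcedures",
    "measurements", "treatmentProcedures", "complications"
  ),
  cdmTable = c(
    "CONDITION_OCCURRENCE", "OBSERVATION, CONDITION_OCCURRENCE", "DRUG_ERA",
    "PROCEDURE_OCCURRENCE", "MEASUREMENT", "PROCEDURE_OCCURRENCE",
    "CONDITION_OCCURRENCE"
  ),
  startDay = c(-90, -30, -Inf, -30, -30, 1, -Inf),
  endDay = c(90, -1, Inf, 30, 30, Inf, Inf), # boundary handling is assumed
  stringsAsFactors = FALSE
)

# Hypothetical helper: is a relative day inside the window of a category?
isInWindow <- function(setName, relativeDay) {
  row <- lookupWindows[lookupWindows$conceptSetName == setName, ]
  stopifnot(nrow(row) == 1)
  relativeDay >= row$startDay & relativeDay <= row$endDay
}

isInWindow("alternativeDiagnosis", c(-120, -45, 90))
```

A table like this is useful when deciding where a concept belongs: a symptom that typically appears months before diagnosis would fall outside the symptoms window and is better placed in a category with a wider window.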
Keeper creates profiles of persons in a specified cohort. Cohorts can be created using ATLAS, R, or SQL. The cohort table must contain the following fields:
cohort_definition_id (INT): A unique identifier per cohort.
subject_id (BIGINT): A unique identifier per person. This should correspond to the person ID in the CDM data.
cohort_start_date (DATE): The date the person enters the cohort.
cohort_end_date (DATE): (Optional) The date the person exits the cohort.
More information on creating cohorts can be found in the Book of OHDSI.
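For orientation, the sketch below shows what rows in such a cohort table look like, using an in-memory data frame with invented values. In practice the table lives in the database; the R column types here are stand-ins for the SQL types listed above.

```r
# Toy cohort table with invented persons and dates.
cohortExample <- data.frame(
  cohort_definition_id = c(1L, 1L, 1L),
  subject_id = c(1001, 1002, 1003), # BIGINT in SQL; numeric in R
  cohort_start_date = as.Date(c("2019-01-10", "2020-06-02", "2021-11-23")),
  cohort_end_date = as.Date(c("2019-12-31", "2020-12-31", "2022-03-01"))
)

# Check that the required fields are present and that nobody exits before entering.
requiredFields <- c("cohort_definition_id", "subject_id", "cohort_start_date")
hasRequiredFields <- all(requiredFields %in% names(cohortExample))
datesAreOrdered <- all(cohortExample$cohort_end_date >= cohortExample$cohort_start_date)
c(hasRequiredFields = hasRequiredFields, datesAreOrdered = datesAreOrdered)
```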
When using LLMs, it is also possible to create a highly sensitive cohort - a cohort that is unlikely to miss any true cases. Please refer to the Using Keeper with LLMs vignette for instructions on creating highly sensitive cohorts.
For this example, we will use the Capr package to define
a simple cohort definition:
library(Capr)
t1dmConceptIds <- c(201254, 435216)
t1dmCs <- cs(
  descendants(t1dmConceptIds),
  name = "Type 1 Diabetes Mellitus"
)
t1dmCohort <- cohort(
  entry = entry(
    conditionOccurrence(t1dmCs, firstOccurrence())
  ),
  exit = exit(
    endStrategy = observationExit()
  )
)
# Note: this will automatically assign cohort ID 1:
cohortSet <- makeCohortSet(t1dmCohort)
We can instantiate this cohort in our database using the CohortGenerator package. First, we must specify how to connect to the server holding the CDM data, and where the cohort table will be created:
connectionDetails <- createConnectionDetails(
  dbms = "postgresql",
  server = "localhost/ohdsi",
  user = "joe",
  password = "supersecret"
)
cdmDatabaseSchema <- "cdm"
cohortDatabaseSchema <- "cdm"
cohortTable <- "cohort"
options(sqlRenderTempEmulationSchema = NULL)
Next, we create the cohort table and generate the cohort:
library(CohortGenerator)
connection <- connect(connectionDetails)
createCohortTables(
  connection = connection,
  cohortTableNames = getCohortTableNames(cohortTable),
  cohortDatabaseSchema = cohortDatabaseSchema
)
CohortGenerator::generateCohortSet(
  connection = connection,
  cdmDatabaseSchema = cdmDatabaseSchema,
  cohortDatabaseSchema = cohortDatabaseSchema,
  cohortTableNames = getCohortTableNames(cohortTable),
  cohortDefinitionSet = cohortSet
)
disconnect(connection)
With a cohort and input concept sets defined, we are ready to run Keeper. We generate the Keeper profiles using the following code:
# Reconnect first: the connection used for cohort generation was closed above.
connection <- connect(connectionDetails)
keeper <- generateKeeper(
  connection = connection,
  cohortDatabaseSchema = cohortDatabaseSchema,
  cdmDatabaseSchema = cdmDatabaseSchema,
  cohortTable = cohortTable,
  cohortDefinitionId = 1,
  sampleSize = 20,
  keeperConceptSets = conceptSets,
  phenotypeName = "Type I Diabetes Mellitus (T1DM)",
  removePii = TRUE
)
disconnect(connection)
Here, conceptSets is the data frame we created earlier holding the target concepts. We use cohortDefinitionId = 1 because that is the ID makeCohortSet() assigned to our cohort in the previous section. We specify a sampleSize of 20, meaning Keeper will randomly select up to 20 persons from the cohort. Alternatively, we could have provided a specific set of person IDs for Keeper to restrict its query to.
We provide a phenotypeName that will be stored with the Keeper profiles. We also set removePii = TRUE so that the output will not contain the original person IDs or absolute dates, ensuring the output is completely anonymized with no personally identifying information (PII).
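The phrase "up to 20 persons" means that cohorts smaller than the sample size are returned in full. The base R sketch below illustrates that behavior on invented person IDs; it is not Keeper's implementation (which samples in the database), and `samplePersons` is a hypothetical helper.

```r
# Hypothetical helper: draw at most `sampleSize` IDs without replacement.
samplePersons <- function(personIds, sampleSize) {
  n <- min(sampleSize, length(personIds))
  # sample.int() on positions avoids the sample(x, n) pitfall when x has length 1.
  personIds[sample.int(length(personIds), n)]
}

set.seed(42) # fixed seed so the draw is reproducible
largeCohort <- 1:500
smallCohort <- c(11, 12, 13, 14, 15, 16, 17, 18)

length(samplePersons(largeCohort, 20)) # 20 persons sampled
length(samplePersons(smallCohort, 20)) # all 8 persons returned
```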
Keeper will populate the following categories with concepts observed in the person’s records:
removePersonId = FALSE).
OBSERVATION_PERIOD records, formatted as days prior to days after the index date.
CONDITION_OCCURRENCE on day 0, along with their corresponding type and status.
CONDITION_ERA selected as symptoms within the 30 days prior to the index date, excluding day 0. This list does not include the disease of interest or complications. (If you want to track symptoms outside of this window, please place those codes in the complications concept set).
CONDITION_ERA selected as the disease of interest or complications at any time prior to the index date, excluding day 0.
DRUG_ERA selected as drugs of interest at any time prior to the index date, excluding day 0, formatted as the day the era starts and the length of the drug era.
PROCEDURE_OCCURRENCE selected as treatments of interest at any time prior to the index date, excluding day 0.
PROCEDURE_OCCURRENCE selected as diagnostic procedures at any time prior to the index date, excluding day 0.
MEASUREMENT selected as lab tests of interest within 30 days before and 30 days after day 0. These are formatted as value and unit (if available) and assessed against the reference range provided in the MEASUREMENT table (e.g., normal, abnormal high, abnormal low).
CONDITION_ERA selected as competing diagnoses within 90 days before and 90 days after day 0. This list does not include the disease of interest.
We can review the Keeper profiles manually or use an LLM, as described in the Using Keeper with LLMs vignette. When reviewing profiles manually, we can choose to do this in a spreadsheet program like Microsoft Excel or by using the built-in Shiny app.
We can convert the output of generateKeeper into a data
frame having one row per person, with one column for each of the Keeper
output categories:
keeperTable <- convertKeeperToTable(keeper)
keeperTable
## # A tibble: 20 × 21
## generatedId cohortPrevalence phenotype age sex observationPeriod race
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 4 0.0488 T1DM 67 MALE -640 days - 166 da… ""
## 2 8 0.0488 T1DM 75 FEMALE -961 days - 5557 d… ""
## 3 16 0.0488 T1DM 67 FEMALE -151 days - 36 days ""
## 4 10 0.0488 T1DM 79 FEMALE -1182 days - 1377 … ""
## 5 13 0.0488 T1DM 78 FEMALE -667 days - 2632 d… ""
## 6 18 0.0488 T1DM 70 FEMALE -26 days - 89 days ""
## 7 19 0.0488 T1DM 70 FEMALE -126 days - 118 da… ""
## 8 12 0.0488 T1DM 75 MALE -94 days - 1003 da… ""
## 9 17 0.0488 T1DM 81 MALE -718 days - 373 da… ""
## 10 7 0.0488 T1DM 76 MALE -685 days - 410 da… ""
## 11 3 0.0488 T1DM 81 FEMALE -145 days - 1598 d… ""
## 12 9 0.0488 T1DM 70 MALE -543 days - 181 da… ""
## 13 2 0.0488 T1DM 82 FEMALE -1437 days - 1502 … ""
## 14 6 0.0488 T1DM 67 MALE -7 days - 387 days ""
## 15 14 0.0488 T1DM 67 FEMALE -2107 days - 1375 … ""
## 16 11 0.0488 T1DM 70 MALE -1123 days - 576 d… ""
## 17 1 0.0488 T1DM 67 MALE -406 days - 51 days ""
## 18 15 0.0488 T1DM 64 FEMALE -16 days - 1050 da… ""
## 19 20 0.0488 T1DM 67 MALE -129 days - 2275 d… ""
## 20 5 0.0488 T1DM 97 MALE -2708 days - 1207 … ""
## # ℹ 14 more variables: ethnicity <chr>, presentation <chr>, visits <chr>,
## # symptoms <chr>, priorDisease <chr>, postDisease <chr>, priorDrugs <chr>,
## # postDrugs <chr>, priorTreatmentProcedures <chr>,
## # postTreatmentProcedures <chr>, alternativeDiagnoses <chr>,
## # diagnosticProcedures <chr>, measurements <chr>, death <chr>
We can then save this table to a CSV file and open it in Excel:
readr::write_csv(keeperTable, "e:/temp/KeeperT1dm.csv")
Alternatively, we can launch the built-in Shiny app to review the profiles interactively:
launchReviewerApp(
  keeper = keeper,
  keeperConceptSets = conceptSets,
  decisionsFileName = "decisions.csv"
)
This will launch the Shiny application, which looks like this:

Any decisions you log within the app will be written to the specified
decisions file (e.g., decisions.csv). If the decisions file
does not yet exist, Keeper will create it for you.
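Once all sampled profiles have been adjudicated, the positive predictive value mentioned at the start of this vignette is the fraction of reviewed cohort members judged to be true cases. The sketch below computes it from a hypothetical vector of decisions. The layout of the decisions file is not described here, so the labels and the vector are assumptions, as is the choice to exclude profiles with insufficient information from the denominator.

```r
# Hypothetical adjudication results for 20 reviewed profiles.
decisions <- c(
  rep("case", 15),
  rep("non-case", 4),
  "insufficient information"
)

# Keep only profiles with a definite decision (a design choice, see above).
adjudicated <- decisions[decisions %in% c("case", "non-case")]
truePositives <- sum(adjudicated == "case")

# PPV = true positives / all adjudicated cohort members.
ppv <- truePositives / length(adjudicated)

# Exact binomial confidence interval; with 20 profiles it will be wide.
ppvInterval <- binom.test(truePositives, length(adjudicated))$conf.int
ppv
ppvInterval
```

With small samples the interval matters more than the point estimate, which is one reason to increase sampleSize when the review workload allows it.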