Introduction

This vignette describes how you can implement existing logistic regression models in the PatientLevelPrediction framework. This allows you to for example externally validate them at scale in the OHDSI data network.

As an example we are going to implement the CHADS2 model:

Gage BF, Waterman AD, Shannon W, Boechler M, Rich MW, Radford MJ. Validation of clinical classification schemes for predicting stroke: results from the National Registry of Atrial Fibrillation. JAMA. 2001 Jun 13;285(22):2864-70

To implement the model you need to create three tables: the model table, the covariate table, and the intercept table. The model table specifies the modelId (sequence number), the modelCovariateId (sequence number) and the covariateValue (beta for the covariate). The covariate table specifies the mapping between the covariates from the published model and the standard Patient Level Prediction framework covariates, i.e. its maps to a combination of an analysisid and a concept_id (see below). The intercept table specifies per modelId the intercept.

Model implementation

Define the model

The CHADS2 is a score based model with:

##   Points                        Covariate
## 1      1         Congestive heart failure
## 2      1                     Hypertension
## 3      1                  Age >= 75 years
## 4      1                Diabetes mellitus
## 5      2 Stroke/transient ischemic attack

The model table should therefore be defined as:

##   modelId modelCovariateId covariateValue
## 1       1                1              1
## 2       1                2              1
## 3       1                3              1
## 4       1                4              1
## 5       1                5              2

The covariateTable will then specify what standard covariates need to be included in the model.

In this case we choose the following Standard SNOMED concept_ids: 319835 for congestive heart failure, 316866 for hypertensive disorder, 201820 for diabetes, and 381591 for cerebrovascular disease. It is allowed to add multiple concept_ids as seperate rows for the same modelCovariateId if concept sets are needed. These concept_ids can be found using the vocabulary search in ATLAS.

The Patient Level Prediction standard covariates are of the form: conceptid*1000 + analysisid. The analysisid specifies the domain of the covariate and its lookback window. Examples can be found here: https://github.com/OHDSI/FeatureExtraction/blob/master/inst/csv/PrespecAnalyses.csv

Our example of CHADS2 uses agegroup and conditions in the full history. Therefore we need to define the standard covariates using the FeatureExtraction::createCovariateSettings as follows:

library(PatientLevelPrediction)
covSet <- FeatureExtraction::createCovariateSettings(useDemographicsAgeGroup = T,                             
                                                     useConditionOccurrenceLongTerm = T,
                                                     includedCovariateIds = c(),
                                                     longTermStartDays = -9999, 
                                                     endDays = 0)

In the above code we used the useConditionOccurrenceLongTerm (these have an analysis id of 102) and we defined the longTermStartDays to be -9999 days relative to index (so we get the full history). We include the index date in our lookback period by specifying endDays = 0. The includeCovariateIds is set to 0, but this will be updated when you run the next code to pick out the standard covariates of interest. As we picked analysis id 102, the standard covariate for anytime prior congestive heart failure is 319835102, the same logic follows for the other conditions, so the covariate table will be:

##   modelCovariateId covariateId
## 1                1   319835102
## 2                2   316866102
## 3                3       15003
## 4                3       16003
## 5                3       17003
## 6                3       18003
## 7                3       19003
## 8                4   201820102
## 9                5   381591102

modelCovariateId 3 was age>= 75, as the standard covariate age groups are in 5 year groups, we needed to add the age groups 75-80, 80-85, 85-90, 90-95 and 95-100, these correspond to the covaraiteIds 15003, 16003, 17003, 18003 and 19003 respectively.

To create the tables in R for CHADS2 you need to make the following dataframes:

model_table <- data.frame(modelId = c(1,1,1,1,1),
                          modelCovariateId = 1:5, 
                          coefficientValue = c(1, 1, 1, 1, 2)
                          )

covariate_table <- data.frame(modelCovariateId = c(1,2,3,3,3,3,3,4,5),
                              covariateId = c(319835102, 316866102, 
                                            15003, 16003, 17003, 18003, 19003, 
                                            201820102, 381591102)
                              )

interceptTable <-  data.frame(modelId = 1, 
                              interceptValue = 0)

Create the model

Now you have everything in place actually create the existing model. First specify the current environment as executing createExistingModelSql creates two functions for running the existing model into the specificed environment. Next a few additional settings are needed: as some models require an intercept, there is an option for this (set it to 0 if an intercept isn’t needed), also the type specifies the final mapping (either logistic or linear/score), in our example we are calculating a score. We finally need to specify the analysisId for the newly created CHADS2 covariate.

e <- environment()
PatientLevelPrediction::createExistingModelSql(modelTable = model_table, 
                       modelNames = 'CHADS2', 
                       interceptTable = data.frame(modelId = 1, interceptValue = 0),
                       covariateTable = covariate_table, 
                       type = 'score',
                       analysisId = 112, covariateSettings = covSettings, e = e)

Once run you will find two new functions in your environment:

createExistingmodelsCovariateSettings()
getExistingmodelsCovariateSettings()

Run the model

Now you can use the functions you previously created to extract the existing model risk scores for a target population:

plpData <- PatientLevelPrediction::getPlpData(connectionDetails, 
                      cdmDatabaseSchema = 'databasename.dbo',
                      cohortId = 1,
                      outcomeIds = 2, 
                      cohortDatabaseSchema = 'databasename.dbo', 
                      cohortTable =  'cohort' , 
                      outcomeDatabaseSchema = 'databasename.dbo', 
                      outcomeTable = 'cohort', 
                      covariateSettings =  createExistingmodelsCovariateSettings(),
                      sampleSize = 20000
                      )

To implement and evaluate an existing model you can use the function:

PatientLevelPrediction::evaluateExistingModel()

with the following parameters:

modelTable - a data.frame containing the columns: modelId, covariateId and coefficientValue
covariateTable - a data.frame containing the columns: covariateId and standardCovariateId - this provides a set of standardCovariateId to define each model covariate.
interceptTable - a data.frame containing the columns modelId and interceptValue or NULL if the model doesn’t have an intercept (equal to zero).
type - the type of model (either: score or logistic)
covariateSettings - this is used to determine the startDay and endDay for the standard covariates
customCovariates - a data.frame with the covariateId and sql to generate the covariate value.
riskWindowStart - the time at risk starts at target cohort start date + riskWindowStart
addExposureDaysToEnd - if true then the time at risk window ends a the cohort end date + riskWindowEnd rather than cohort start date + riskWindowEnd
riskWindowEnd - the time at risk ends at target cohort start/end date + riskWindowStart
requireTimeAtRisk - whether to add a constraint to the number of days observed during the time at risk period in including people into the study
minTimeAtRisk - the minimum number of days observation during the time at risk a target population person needs to be included
includeAllOutcomes - Include outcomes even if they do not satisfy the minTimeAtRisk? (useful if the outcome is associated to death or rare)
removeSubjectsWithPriorOutcome - remove target population people who have the outcome prior to the time at tisk period?
connectionDetails - the connection to the CDM database

Finally you need to add the settings for downloading the new data:

cdmDatabaseSchema
cohortDatabaseSchema
cohortTable
cohortId
outcomeDatabaseSchema
outcomeTable
outcomeId
oracleTempSchema

To run the external validation of an existing model where the target population are those in the cohort table with id 1 and the outcome is those in the cohort table with id 2 and we are looking to predict first time occurrance of the outcome 1 day to 365 days after the target cohort start date (asusming you have the modelTable, covariateTable and interceptTable in the format explained above):

# if the existing model uses gender and condition groups looking back 200 days:
covSet <- FeatureExtraction::createCovariateSettings(useDemographicsGender = T,
                                                     useConditionGroupEraMediumTerm = T, 
                                                     mediumTermStartDays = -200)
result <- evaluateExistingModel(modelTable = modelTable,
                                covariateTable = covariateTable,
                                interceptTable = NULL,
                                type = 'score', 
                                covariateSettings =  covSet,
                                riskWindowStart = 1, 
                                addExposureDaysToEnd = F, 
                                riskWindowEnd = 365, 
                                requireTimeAtRisk = T, 
                                minTimeAtRisk = 364, 
                                includeAllOutcomes = T, 
                                removeSubjectsWithPriorOutcome = T, 
                                connectionDetails = connectionDetails, 
                                cdmDatabaseSchema = 'databasename.dbo',
                                cohortId = 1,
                                outcomeId = 2, 
                                cohortDatabaseSchema = 'databasename.dbo', 
                                cohortTable =  'cohort' , 
                                outcomeDatabaseSchema = 'databasename.dbo', 
                                outcomeTable = 'cohort'
                      )

Acknowledgments

Considerable work has been dedicated to provide the PatientLevelPrediction package.

citation("PatientLevelPrediction")

## 
## To cite PatientLevelPrediction in publications use:
## 
## Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek P (2018).
## "Design and implementation of a standardized framework to generate
## and evaluate patient-level prediction models using observational
## healthcare data." _Journal of the American Medical Informatics
## Association_, *25*(8), 969-975. <URL:
## https://doi.org/10.1093/jamia/ocy032>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     author = {J. M. Reps and M. J. Schuemie and M. A. Suchard and P. B. Ryan and P. Rijnbeek},
##     title = {Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data},
##     journal = {Journal of the American Medical Informatics Association},
##     volume = {25},
##     number = {8},
##     pages = {969-975},
##     year = {2018},
##     url = {https://doi.org/10.1093/jamia/ocy032},
##   }

This work is supported in part through the National Science Foundation grant IIS 1251151.

Implementing Existing Prediction Models using the OHDSI PatientLevelPrediction framework

Jenna Reps, Martijn J. Schuemie, Patrick B. Ryan, Peter R. Rijnbeek

2019-03-04