vignettes/AddingExistingModels.Rmd
This vignette describes how you can implement existing logistic regression models in the PatientLevelPrediction
framework. This allows you, for example, to externally validate them at scale in the OHDSI data network.
As an example we are going to implement the CHADS2 model:
Gage BF, Waterman AD, Shannon W, Boechler M, Rich MW, Radford MJ. Validation of clinical classification schemes for predicting stroke: results from the National Registry of Atrial Fibrillation. JAMA. 2001 Jun 13;285(22):2864-70
To implement the model you need to create three tables: the model table, the covariate table, and the intercept table. The model table specifies the modelId (sequence number), the modelCovariateId (sequence number) and the covariateValue (the beta for the covariate; this column is named coefficientValue in the R code further down). The covariate table specifies the mapping between the covariates from the published model and the standard Patient Level Prediction framework covariates, i.e. it maps each model covariate to a combination of an analysisid and a concept_id (see below). The intercept table specifies the intercept per modelId.
CHADS2 is a score-based model with the following points:
##   Points                        Covariate
## 1      1         Congestive heart failure
## 2      1                     Hypertension
## 3      1                  Age >= 75 years
## 4      1                Diabetes mellitus
## 5      2 Stroke/transient ischemic attack
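As a minimal sketch of how the points above combine into a score, the following base R snippet sums the points for the risk factors a patient has. The patient and the short factor names (`chf`, `age75plus`, and so on) are hypothetical and only serve as an illustration.

```r
# CHADS2 points per risk factor, as listed in the table above.
points <- c(chf = 1, hypertension = 1, age75plus = 1, diabetes = 1, strokeTia = 2)

# Hypothetical patient: 78 years old with hypertension and a prior stroke.
patient <- c(chf = FALSE, hypertension = TRUE, age75plus = TRUE,
             diabetes = FALSE, strokeTia = TRUE)

# The score is the sum of the points for the risk factors that are present.
chads2 <- sum(points[names(patient)] * patient)
chads2
```

For this hypothetical patient the score is 1 + 1 + 2 = 4.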
The model table should therefore be defined as:
##   modelId modelCovariateId covariateValue
## 1       1                1              1
## 2       1                2              1
## 3       1                3              1
## 4       1                4              1
## 5       1                5              2
The covariate table then specifies which standard covariates need to be included in the model.
In this case we choose the following Standard SNOMED concept_ids: 319835 for congestive heart failure, 316866 for hypertensive disorder, 201820 for diabetes, and 381591 for cerebrovascular disease. If concept sets are needed, you can add multiple concept_ids as separate rows for the same modelCovariateId. These concept_ids can be found using the vocabulary search in ATLAS.
The Patient Level Prediction standard covariates are of the form: conceptid*1000 + analysisid. The analysisid specifies the domain of the covariate and its lookback window. Examples can be found here: https://github.com/OHDSI/FeatureExtraction/blob/master/inst/csv/PrespecAnalyses.csv
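The covariate id arithmetic can be sketched in a few lines of base R. `toCovariateId` is a hypothetical helper written for this illustration (it is not a FeatureExtraction function), and the analysis id of 102 is the one this vignette uses later for long-term condition occurrence covariates.

```r
# Hypothetical helper: combine a concept id and an analysis id into a
# standard covariate id, following conceptid * 1000 + analysisid.
toCovariateId <- function(conceptId, analysisId) {
  conceptId * 1000 + analysisId
}

conceptIds <- c(chf = 319835, hypertension = 316866,
                diabetes = 201820, stroke = 381591)
covariateIds <- toCovariateId(conceptIds, analysisId = 102)
format(covariateIds, scientific = FALSE)

# The two parts can be recovered again with integer division and modulo.
covariateIds %/% 1000   # concept ids
covariateIds %% 1000    # analysis id
```

Decoding an id this way is a quick check that a covariate id in your covariate table refers to the concept and analysis you intended.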
Our CHADS2 example uses age group and conditions over the full history. We therefore define the standard covariates using FeatureExtraction::createCovariateSettings as follows:
library(PatientLevelPrediction)
covSet <- FeatureExtraction::createCovariateSettings(useDemographicsAgeGroup = TRUE,
                                                     useConditionOccurrenceLongTerm = TRUE,
                                                     includedCovariateIds = c(),
                                                     longTermStartDays = -9999,
                                                     endDays = 0)
In the code above we used useConditionOccurrenceLongTerm (these covariates have an analysis id of 102) and set longTermStartDays to -9999 days relative to index, so we get the full history. We include the index date in the lookback period by specifying endDays = 0. The includedCovariateIds argument is left empty (c()), but it will be updated when you run the next code to pick out the standard covariates of interest. As we picked analysis id 102, the standard covariate for anytime prior congestive heart failure is 319835102. The same logic follows for the other conditions, so the covariate table will be:
##   modelCovariateId covariateId
## 1                1   319835102
## 2                2   316866102
## 3                3       15003
## 4                3       16003
## 5                3       17003
## 6                3       18003
## 7                3       19003
## 8                4   201820102
## 9                5   381591102
modelCovariateId 3 is age >= 75. As the standard age group covariates are in 5-year groups, we need to add the age groups 75-80, 80-85, 85-90, 90-95 and 95-100, which correspond to the covariateIds 15003, 16003, 17003, 18003 and 19003 respectively.
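The age group ids listed above follow a simple pattern, sketched below. The formula `floor(age / 5) * 1000 + 3` is an assumption inferred from the ids in this vignette (it reproduces 15003 to 19003), and `ageToCovariateId` is a hypothetical helper, not a package function.

```r
# Hypothetical helper: 5-year age group index combined with analysis id 3.
ageToCovariateId <- function(age) {
  floor(age / 5) * 1000 + 3
}

# Lower bounds of the age groups used for "age >= 75" in this vignette.
ageGroupStarts <- seq(75, 95, by = 5)
ageCovariateIds <- ageToCovariateId(ageGroupStarts)
ageCovariateIds

# A 78-year-old falls into the 75-80 group.
ageToCovariateId(78)
```

Generating the ids this way avoids typing each one by hand when a published model uses a different age cut-off.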
To create the tables in R for CHADS2 you need to make the following data frames:
model_table <- data.frame(modelId = c(1,1,1,1,1),
                          modelCovariateId = 1:5,
                          coefficientValue = c(1, 1, 1, 1, 2)
                          )
covariate_table <- data.frame(modelCovariateId = c(1,2,3,3,3,3,3,4,5),
                              covariateId = c(319835102, 316866102,
                                              15003, 16003, 17003, 18003, 19003,
                                              201820102, 381591102)
                              )
interceptTable <- data.frame(modelId = 1,
                             interceptValue = 0)
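Before passing the data frames above to the package, it can help to check them locally. The sketch below rebuilds the same tables, verifies that the two tables reference the same modelCovariateIds, and scores a hypothetical patient in plain R. Counting each model covariate at most once is an assumption of this sketch, not a statement about how the package computes the score.

```r
model_table <- data.frame(modelId = 1,
                          modelCovariateId = 1:5,
                          coefficientValue = c(1, 1, 1, 1, 2))
covariate_table <- data.frame(
  modelCovariateId = c(1, 2, 3, 3, 3, 3, 3, 4, 5),
  covariateId = c(319835102, 316866102,
                  15003, 16003, 17003, 18003, 19003,
                  201820102, 381591102))

# Every model covariate should be mapped, and no standard covariate repeated.
stopifnot(setequal(model_table$modelCovariateId,
                   covariate_table$modelCovariateId))
stopifnot(!anyDuplicated(covariate_table$covariateId))

# Attach the points to each standard covariate id.
lookup <- merge(covariate_table, model_table, by = "modelCovariateId")

# Hypothetical patient with hypertension, age 75-80 and a prior stroke.
observed <- c(316866102, 15003, 381591102)

# Assumption: a model covariate counts once, even if several of its
# mapped standard covariates are present.
hits <- unique(lookup[lookup$covariateId %in% observed,
                      c("modelCovariateId", "coefficientValue")])
score <- sum(hits$coefficientValue)
score
```

The hypothetical patient scores 4, matching a manual calculation from the points table.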
Now that you have everything in place, you can create the existing model. First specify the current environment, because executing createExistingModelSql creates two functions for running the existing model in the specified environment. A few additional settings are also needed. As some models require an intercept, there is an option for this (set it to 0 if an intercept isn’t needed). The type specifies the final mapping (either logistic or linear/score); in our example we are calculating a score. Finally, we need to specify the analysisId for the newly created CHADS2 covariate.
# createExistingModelSql creates its two functions in this environment
e <- environment()
PatientLevelPrediction::createExistingModelSql(modelTable = model_table,
                                               modelNames = 'CHADS2',
                                               interceptTable = interceptTable,
                                               covariateTable = covariate_table,
                                               type = 'score',
                                               analysisId = 112,
                                               # covSet is the covariate settings object created earlier
                                               covariateSettings = covSet,
                                               e = e)
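To make the type setting described above more concrete, here is a rough base R sketch of the two final mappings. It assumes that 'score' returns the linear predictor unchanged and that 'logistic' applies the inverse logit. The function names are hypothetical, and the SQL the package actually generates may differ in detail.

```r
# Linear predictor: intercept plus the coefficients of the covariates present.
linearPredictor <- function(intercept, coefficients) {
  intercept + sum(coefficients)
}

# Assumed final mapping for each value of `type`.
applyType <- function(lp, type = c("score", "logistic")) {
  type <- match.arg(type)
  if (type == "logistic") 1 / (1 + exp(-lp)) else lp
}

# Hypothetical patient with hypertension, age >= 75 and a prior stroke.
lp <- linearPredictor(intercept = 0, coefficients = c(1, 1, 2))
applyType(lp, "score")      # the raw point total
applyType(lp, "logistic")   # a value between 0 and 1
```

For CHADS2 only the score mapping is meaningful, since the points are not logistic regression coefficients. The logistic call is shown purely to illustrate the difference.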
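Once createExistingModelSql has run, you will find two new functions in your environment, including createExistingmodelsCovariateSettings(), which is used in the next step.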
Now you can use the functions you previously created to extract the existing model risk scores for a target population:
plpData <- PatientLevelPrediction::getPlpData(connectionDetails,
                                              cdmDatabaseSchema = 'databasename.dbo',
                                              cohortId = 1,
                                              outcomeIds = 2,
                                              cohortDatabaseSchema = 'databasename.dbo',
                                              cohortTable = 'cohort',
                                              outcomeDatabaseSchema = 'databasename.dbo',
                                              outcomeTable = 'cohort',
                                              covariateSettings = createExistingmodelsCovariateSettings(),
                                              sampleSize = 20000
                                              )
To implement and evaluate an existing model you can use the function:
PatientLevelPrediction::evaluateExistingModel()
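Its parameters cover the model definition (the model, covariate and intercept tables, the type, and the covariate settings), the population and time-at-risk settings, and finally the settings for downloading the new data (connection details, database schemas, and the cohort and outcome tables).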
The following code runs the external validation of an existing model where the target population is the cohort with id 1 in the cohort table, the outcome is the cohort with id 2, and we want to predict the first-time occurrence of the outcome 1 day to 365 days after the target cohort start date (assuming you have the modelTable, covariateTable and interceptTable in the format explained above):
# if the existing model uses gender and condition groups looking back 200 days:
covSet <- FeatureExtraction::createCovariateSettings(useDemographicsGender = TRUE,
                                                     useConditionGroupEraMediumTerm = TRUE,
                                                     mediumTermStartDays = -200)
result <- evaluateExistingModel(modelTable = modelTable,
                                covariateTable = covariateTable,
                                interceptTable = NULL,
                                type = 'score',
                                covariateSettings = covSet,
                                riskWindowStart = 1,
                                addExposureDaysToEnd = FALSE,
                                riskWindowEnd = 365,
                                requireTimeAtRisk = TRUE,
                                minTimeAtRisk = 364,
                                includeAllOutcomes = TRUE,
                                removeSubjectsWithPriorOutcome = TRUE,
                                connectionDetails = connectionDetails,
                                cdmDatabaseSchema = 'databasename.dbo',
                                cohortId = 1,
                                outcomeId = 2,
                                cohortDatabaseSchema = 'databasename.dbo',
                                cohortTable = 'cohort',
                                outcomeDatabaseSchema = 'databasename.dbo',
                                outcomeTable = 'cohort'
                                )
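The time-at-risk settings in the call above (riskWindowStart = 1, riskWindowEnd = 365, minTimeAtRisk = 364) can be illustrated with a small base R sketch. The subjects are hypothetical, and the way days are counted here (window end minus window start, truncated at the end of observation) is an assumption chosen so that a full window gives exactly 364 days. The package's own bookkeeping may differ.

```r
riskWindowStart <- 1
riskWindowEnd <- 365
minTimeAtRisk <- 364

# Hypothetical subjects: days of observation remaining after cohort start.
subjects <- data.frame(subjectId = 1:4,
                       daysObservedAfterIndex = c(1000, 365, 364, 30))

# Assumption: time at risk ends at the earlier of the risk window end
# and the end of observation.
tarEnd <- pmin(riskWindowEnd, subjects$daysObservedAfterIndex)
subjects$timeAtRisk <- tarEnd - riskWindowStart

# With requireTimeAtRisk = TRUE, subjects below the minimum are dropped.
subjects$included <- subjects$timeAtRisk >= minTimeAtRisk
subjects
```

Under these assumptions only the first two subjects, who are observed for the full window, would stay in the evaluation population.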
Considerable work has been dedicated to providing the PatientLevelPrediction package.
citation("PatientLevelPrediction")
##
## To cite PatientLevelPrediction in publications use:
##
## Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek P (2018).
## "Design and implementation of a standardized framework to generate
## and evaluate patient-level prediction models using observational
## healthcare data." _Journal of the American Medical Informatics
## Association_, *25*(8), 969-975. <URL:
## https://doi.org/10.1093/jamia/ocy032>.
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## author = {J. M. Reps and M. J. Schuemie and M. A. Suchard and P. B. Ryan and P. Rijnbeek},
## title = {Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data},
## journal = {Journal of the American Medical Informatics Association},
## volume = {25},
## number = {8},
## pages = {969-975},
## year = {2018},
## url = {https://doi.org/10.1093/jamia/ocy032},
## }
This work is supported in part through the National Science Foundation grant IIS 1251151.