vignettes/BuildingEnsembleModels.Rmd
Ensemble models combine several models to improve overall performance. Traditionally, weak learners were combined to boost performance, but recent results show that combining several strong approaches can also improve performance. There are many examples in the literature where ensemble models outperform individual models using stacking, i.e. a final logistic regression layer across the individual model outputs, but other approaches such as weighting have also shown promising results.
This vignette describes how you can use the Observational Health Data Sciences and Informatics (OHDSI) PatientLevelPrediction package to build ensemble models. It assumes you have read and are comfortable with building single patient-level prediction models as described in the BuildingPredictiveModels vignette.
This will enable studying ensemble methods at scale in the OHDSI data network.
The PatientLevelPrediction package implements four ensemble strategies: mean, product, weighted, and stacked.
Use the PatientLevelPrediction package to generate a population and plpData object. Alternatively, you can make use of the data simulator. The following code snippet simulates data for 2000 patients and creates a study population from them.
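library(PatientLevelPrediction)
# Load the simulation profile shipped with the package and simulate the data
data(plpDataSimulationProfile)
set.seed(1234)
sampleSize <- 2000
plpData <- simulatePlpData(
  plpDataSimulationProfile,
  n = sampleSize
)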
population <- createStudyPopulation(
  plpData,
  outcomeId = 2,
  binary = TRUE,
  firstExposureOnly = FALSE,
  washoutPeriod = 0,
  removeSubjectsWithPriorOutcome = FALSE,
  priorOutcomeLookback = 99999,
  requireTimeAtRisk = FALSE,
  minTimeAtRisk = 0,
  riskWindowStart = 0,
  addExposureDaysToStart = FALSE,
  riskWindowEnd = 365,
  addExposureDaysToEnd = FALSE,
  verbosity = "INFO"
)
Specify the prediction algorithms to be combined.
# Use LASSO logistic regression and Random Forest as base predictors
model1 <- setLassoLogisticRegression()
model2 <- setRandomForest()
Specify the fraction of the data to hold out for testing.
testFraction <- 0.2
Specify an ensembleStrategy to combine the outputs of the individual models. It can be 'mean', 'product', 'weighted', or 'stacked':
- 'mean': the average probability from the different models.
- 'product': the product rule.
- 'weighted': the weighted average probability from the different models, using the training AUC as weights.
- 'stacked': the stacked ensemble, which trains a logistic regression on the outputs of the different models.
ensembleStrategy <- 'stacked'
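To make the four rules concrete, the following base R sketch applies them to simulated probabilities from two hypothetical base models. It is not the package's implementation: the simulated data, the helper `auc()`, and the reading of the product rule as a plain product of probabilities are assumptions made for illustration, and the AUC weights here are computed on the same data rather than on a separate training set.

```r
set.seed(42)
n <- 200
risk <- plogis(rnorm(n))      # hypothetical true risk per patient
y <- rbinom(n, 1, risk)       # simulated binary outcome

# Two hypothetical base-model outputs: noisy versions of the true risk
p1 <- plogis(qlogis(risk) + rnorm(n, sd = 1))
p2 <- plogis(qlogis(risk) + rnorm(n, sd = 2))
preds <- cbind(model1 = p1, model2 = p2)

# AUC via the rank (Mann-Whitney) formula; illustrative helper
auc <- function(score, label) {
  r <- rank(score)
  nPos <- sum(label == 1)
  nNeg <- sum(label == 0)
  (sum(r[label == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

# 'mean': unweighted average of the model probabilities
ensMean <- rowMeans(preds)

# 'product': product of the model probabilities (assumed, unnormalised form)
ensProduct <- apply(preds, 1, prod)

# 'weighted': average weighted by each model's AUC, weights scaled to sum to 1
w <- c(auc(p1, y), auc(p2, y))
ensWeighted <- as.vector(preds %*% (w / sum(w)))

# 'stacked': logistic regression fitted on the base-model outputs
stacker <- glm(y ~ model1 + model2,
               data = data.frame(y = y, preds),
               family = binomial())
ensStacked <- predict(stacker, type = "response")
```

The mean and weighted rules always return a value between the two base predictions, whereas the stacked rule learns its own coefficients and can rescale them.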
Specify the test split to be used.
# Use a split by person; alternatively, a time split is possible
testSplit <- 'person'
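The difference between the two split types can be sketched in base R as follows. This is an illustration only, not the package's splitting code: the toy data frame `pop`, its column names, and the interpretation of a time split as "the most recent rows form the test set" are assumptions.

```r
set.seed(1000)
testFraction <- 0.2

# Hypothetical population: 10 people, two of them with more than one row
pop <- data.frame(
  subjectId = c(1, 1, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10),
  cohortStartDate = as.Date("2015-01-01") + seq(0, by = 30, length.out = 12)
)

# Person split: sample people rather than rows, so all rows of a person
# end up on the same side of the split
subjects <- unique(pop$subjectId)
testSubjects <- sample(subjects, size = round(testFraction * length(subjects)))
pop$personSplit <- ifelse(pop$subjectId %in% testSubjects, "test", "train")

# Time split: rows after a cutoff date form the test set
cutoff <- sort(pop$cohortStartDate)[ceiling((1 - testFraction) * nrow(pop))]
pop$timeSplit <- ifelse(pop$cohortStartDate > cutoff, "test", "train")
```

A person split guards against the same person appearing in both sets, while a time split mimics applying the model to data collected after training.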
Run the ensemble learning to combine model1 and model2. You can also use different plpData for different models.
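The call for this step could look like the sketch below. It assumes the `runEnsembleModel()` interface of the package version this vignette was written for and reuses the objects created in the previous steps. The `nfold` and `splitSeed` values are illustrative assumptions, and the argument names should be checked against `?runEnsembleModel` in your installed version.

```r
library(PatientLevelPrediction)

# Sketch only: relies on population, plpData, model1, model2, testSplit,
# testFraction and ensembleStrategy defined in the steps above.
ensembleResults <- runEnsembleModel(
  population = population,
  dataList = list(plpData, plpData),  # one plpData object per model
  modelList = list(model1, model2),
  testSplit = testSplit,
  testFraction = testFraction,
  nfold = 3,         # assumed number of cross-validation folds
  splitSeed = 1000,  # assumed seed for a reproducible split
  ensembleStrategy = ensembleStrategy
)
```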
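To apply a trained ensemble to new data, first load the model.
ensembleModel <- loadEnsemblePlpModel("<model folder>")
Then load the new data and recreate the study population using the population settings stored in the model.
plpData <- loadPlpData("<data file>")
populationSettings <- ensembleModel$populationSettings
populationSettings$plpData <- plpData
population <- do.call(createStudyPopulation, populationSettings)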
Get the predictions by applying the model:
prediction <- applyEnsembleModel(
  population,
  dataList = list(plpData, plpData),
  ensembleModel = ensembleModel
)$prediction
Considerable work has been dedicated to providing the PatientLevelPrediction package.
citation("PatientLevelPrediction")
##
## To cite PatientLevelPrediction in publications use:
##
## Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek P (2018). "Design
## and implementation of a standardized framework to generate and evaluate
## patient-level prediction models using observational healthcare data."
## _Journal of the American Medical Informatics Association_, *25*(8),
## 969-975. <URL: https://doi.org/10.1093/jamia/ocy032>.
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## author = {J. M. Reps and M. J. Schuemie and M. A. Suchard and P. B. Ryan and P. Rijnbeek},
## title = {Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data},
## journal = {Journal of the American Medical Informatics Association},
## volume = {25},
## number = {8},
## pages = {969-975},
## year = {2018},
## url = {https://doi.org/10.1093/jamia/ocy032},
## }
Please reference the paper above if you use the PLP package in your work.