This provides a general framework for training patient-level prediction models. The user can select from various default feature selection methods or incorporate their own. The user can also select from a range of default classifiers or incorporate their own. There are three types of evaluation for the model: patient (randomly splits people into train/validation sets), year (splits data into train/validation sets based on index year, with older data in training and newer data in validation), or both (same as year splitting, but checks that no patients overlap between the training set and the validation set; any overlaps are removed from the validation set).

runPlp(population, plpData, minCovariateFraction = 0.001, normalizeData = T,
  modelSettings, testSplit = "time", testFraction = 0.25,
  trainFraction = NULL, splitSeed = NULL, nfold = 3, indexes = NULL,
  saveDirectory = NULL, savePlpData = T, savePlpResult = T,
  savePlpPlots = T, saveEvaluation = T, verbosity = "INFO",
  timeStamp = FALSE, analysisId = NULL, save = NULL)

Arguments

population

The population created using createStudyPopulation() that will be used to develop the model

plpData

An object of type plpData - the patient level prediction data extracted from the CDM.

minCovariateFraction

The minimum fraction of target population who must have a covariate for it to be included in the model training
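As a rough illustration of this threshold (a sketch in base R on made-up data, not the package's internal code), a covariate is kept only if the fraction of people who have it reaches minCovariateFraction. The `covariates` table, its values, and the inclusive treatment of the boundary are assumptions made for the example.

```r
# Toy long-format covariate table (hypothetical data): one row per
# person-covariate pair with a non-zero value.
covariates <- data.frame(
  rowId       = c(1, 2, 3, 4, 1, 2, 1),
  covariateId = c(101, 101, 101, 101, 102, 102, 103)
)
populationSize <- 4
minCovariateFraction <- 0.5  # deliberately large so the toy data shows an effect

# Fraction of the population that has each covariate
personsPerCovariate <- tapply(covariates$rowId, covariates$covariateId,
                              function(x) length(unique(x)))
fraction <- personsPerCovariate / populationSize

# Keep covariates at or above the threshold (inclusive boundary is an assumption)
keptIds <- as.numeric(names(fraction)[fraction >= minCovariateFraction])
filtered <- covariates[covariates$covariateId %in% keptIds, ]
```

Here covariate 103 is present in only a quarter of the population, so it is dropped before training.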

normalizeData

Whether to normalise the covariates before training (Default: TRUE)

modelSettings

An object of class modelSettings created using one of the functions:

  • setLassoLogisticRegression() A lasso logistic regression model

  • setGradientBoostingMachine() A gradient boosting machine

  • setAdaBoost() An ada boost model

  • setRandomForest() A random forest model

  • setDecisionTree() A decision tree model

  • setCovNN() A convolutional neural network model

  • setCIReNN() A recurrent neural network model

  • setMLP() A neural network model

  • setDeepNN() A deep neural network model

  • setKNN() A KNN model

testSplit

Either 'person' or 'time', specifying the type of evaluation used. 'time' finds the date after which testFraction of patients have their index date, then assigns patients with an index before this date to the training set and patients with an index after it to the test set. 'person' splits the data into test (testFraction of the data) and train (trainFraction of the data, by default 1 - testFraction) sets. The split is stratified by the class label.
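The two split types can be sketched in base R as follows. This is a minimal illustration on a made-up population, not the package's implementation; the `pop` data frame, its column names, and the rounding choices are assumptions.

```r
set.seed(42)  # plays the role of splitSeed in this sketch

# Hypothetical population: 20 people with an index date and an outcome label
pop <- data.frame(
  rowId = 1:20,
  indexDate = as.Date("2010-01-01") + seq(0, by = 30, length.out = 20),
  outcomeCount = rep(c(0, 1), times = c(12, 8))
)
testFraction <- 0.25

# 'time' split: people with the latest testFraction of index dates form the test set
cutoff <- sort(pop$indexDate)[ceiling(nrow(pop) * (1 - testFraction))]
timeSplit <- ifelse(pop$indexDate > cutoff, "test", "train")

# 'person' split: sample testFraction of people at random within each class,
# so the outcome rate is similar in the train and test sets
personSplit <- rep("train", nrow(pop))
for (label in unique(pop$outcomeCount)) {
  ids <- which(pop$outcomeCount == label)
  nTest <- round(length(ids) * testFraction)
  personSplit[ids[sample.int(length(ids), nTest)]] <- "test"
}
```

In the time split every test-set index date is later than every training-set index date, while the person split ignores dates and preserves the class balance instead.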

testFraction

The fraction of the data to be used as the test set in the patient split evaluation.

trainFraction

A real number between 0 and 1 indicating the train set fraction of the data. If not set, trainFraction is equal to 1 - testFraction

splitSeed

The seed used to split the test/train set when using a person type testSplit

nfold

The number of folds used in the cross validation (default 3)

indexes

A dataframe containing a rowId column and an index column, where an index value of -1 means the row is in the test set and a positive integer gives the cross validation fold (default is NULL)
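For illustration, a hand-built indexes data frame for six people might look like the sketch below. The values are made up; only the rowId and index columns and the meaning of -1 and positive integers come from the description above.

```r
# Hypothetical indexes data frame:
# -1 marks the test set; 1, 2, 3 are cross-validation folds within the train set
indexes <- data.frame(
  rowId = 1:6,
  index = c(-1, 1, 2, 3, 1, -1)
)

testRows  <- indexes$rowId[indexes$index == -1]   # people held out for testing
trainRows <- indexes$rowId[indexes$index > 0]     # people used for training
foldSizes <- table(indexes$index[indexes$index > 0])  # people per fold
```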

saveDirectory

The path to the directory where the results will be saved (if NULL uses working directory)

savePlpData

Binary indicating whether to save the plpData object (default is T)

savePlpResult

Binary indicating whether to save the object returned by runPlp (default is T)

savePlpPlots

Binary indicating whether to save the performance plots as pdf files (default is T)

saveEvaluation

Binary indicating whether to save the performance as csv files (default is T)

verbosity

Sets the level of the verbosity. If the log level is at or higher in priority than the logger threshold, a message will print. The levels are:

  • DEBUG Highest verbosity showing all debug statements

  • TRACE Showing information about start and end of steps

  • INFO Show informative information (Default)

  • WARN Show warning messages

  • ERROR Show error messages

  • FATAL Be silent except for fatal errors
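The threshold rule can be sketched as a simple comparison of positions in an ordered list. This is an illustration only: `logLevels` and `shouldPrint()` are hypothetical names, and the priority order is assumed to follow the order in which the levels are listed above.

```r
# Levels from most to least verbose, in the order listed above (assumed priority order)
logLevels <- c("DEBUG", "TRACE", "INFO", "WARN", "ERROR", "FATAL")

# Hypothetical helper: TRUE if a message at `level` is at or above the threshold
shouldPrint <- function(level, threshold = "INFO") {
  match(level, logLevels) >= match(threshold, logLevels)
}

shouldPrint("WARN")            # TRUE: warnings pass the default INFO threshold
shouldPrint("DEBUG")           # FALSE: debug statements are hidden at INFO
shouldPrint("DEBUG", "DEBUG")  # TRUE: everything prints at the most verbose setting
```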

timeStamp

If TRUE a timestamp will be added to each logging statement. Automatically switched on for TRACE level.

analysisId

Identifier for the analysis. It is used to create, e.g., the result folder. Default is a timestamp.

save

Old input - please now use saveDirectory

Value

An object containing the model or the location where the model is saved, the data selection settings, the preprocessing and training settings, as well as various performance measures obtained by the model.

predict

A function that can be applied to new data to apply the trained model and make predictions

model

A list of class plpModel containing the model, training metrics and model metadata

prediction

A dataframe containing the prediction for each person in the test set

evalType

The type of evaluation that was performed ('person' or 'time')

performanceTest

A list detailing the size of the test sets

performanceTrain

A list detailing the size of the train sets

time

The total time taken to run the modelling framework

Details

Users can define a risk period of interest for the prediction of the outcome relative to index, or use the cohort dates. The user can then specify whether they wish to exclude patients who are not observed during the whole risk period or cohort period, or who experienced the outcome prior to the risk period.
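A minimal base R sketch of a risk period defined relative to index is shown below. It is not how createStudyPopulation() is implemented; the `cohort` data frame, its dates, and the inclusive window boundaries are assumptions made for the example.

```r
# Hypothetical cohort: index date and first outcome date (NA = no outcome recorded)
cohort <- data.frame(
  rowId = 1:4,
  indexDate = as.Date(c("2015-01-01", "2015-03-01", "2015-06-01", "2015-09-01")),
  outcomeDate = as.Date(c("2015-02-15", NA, "2014-12-01", "2017-01-01"))
)
riskWindowStart <- 0    # days after index
riskWindowEnd <- 365    # days after index

daysToOutcome <- as.numeric(cohort$outcomeDate - cohort$indexDate)

# Outcome before the risk period starts: candidates for exclusion
priorOutcome <- !is.na(daysToOutcome) & daysToOutcome < riskWindowStart

# Outcome inside the risk period: the label to predict
cohort$outcomeInWindow <- !is.na(daysToOutcome) &
  daysToOutcome >= riskWindowStart & daysToOutcome <= riskWindowEnd

# Study population after removing people with a prior outcome
studyPop <- cohort[!priorOutcome, ]
```

Person 3 is removed because the outcome precedes index, and person 4 stays in the population as a non-case because the outcome falls after the end of the risk period.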

Examples

# NOT RUN {
#******** EXAMPLE 1 ********* 
#load plpData:
plpData <- loadPlpData(file.path('C:','User','home','data'))

#create study population to develop model on
#require minimum of 365 days observation prior to at risk start
#no prior outcome and person must be observed for 365 after index (minTimeAtRisk)
#with risk window from 0 to 365 days after index
population <- createStudyPopulation(plpData,outcomeId=2042,
                                    firstExposureOnly = FALSE,
                                    washoutPeriod = 365,
                                    removeSubjectsWithPriorOutcome = TRUE,
                                    priorOutcomeLookback = 99999,
                                    requireTimeAtRisk = TRUE,
                                    minTimeAtRisk=365,
                                    riskWindowStart = 0,
                                    addExposureDaysToStart = FALSE,
                                    riskWindowEnd = 365,
                                    addExposureDaysToEnd = FALSE)

#lasso logistic regression predicting the outcome in the study population created above
#using no feature selection with a time split evaluation with 30% in test set
#70% in train set where the model hyper-parameters are selected using 3-fold cross validation
#and results are saved to file.path('C:','User','myPredictionName')
model.lr <- setLassoLogisticRegression()
mod.lr <- runPlp(population=population,
                        plpData= plpData, minCovariateFraction = 0.001,
                        modelSettings = model.lr ,
                        testSplit = 'time', testFraction=0.3,
                        nfold=3, indexes=NULL,
                        saveDirectory =file.path('C:','User','myPredictionName'),
                        verbosity='INFO')

#******** EXAMPLE 2 *********                                               
# Gradient boosting machine with a grid search to select hyper parameters  
# using the test/train/folds created for the lasso logistic regression above                       
# (setGradientBoostingMachine() as listed under modelSettings;
# argument names may differ between package versions)
model.gbm <- setGradientBoostingMachine(ntrees=c(10,100),
                           maxDepth=c(4,5), learnRate=c(0.1,0.01))
mod.gbm <- runPlp(population=population,
                        plpData= plpData,
                        modelSettings = model.gbm,
                        testSplit = 'time', testFraction=0.3,
                        nfold=3, indexes=mod.lr$indexes,
                        saveDirectory =file.path('C:','User','myPredictionName2'))
# }