Building Deep Learning Models

Introduction

DeepPatientLevelPrediction

Patient level prediction aims to use historic data to learn a function between an input (a patient’s features such as age/gender/comorbidities at index) and an output (whether the patient experienced an outcome during some time-at-risk). Deep learning is example of the the current state-of-the-art classifiers that can be implemented to learn the function between inputs and outputs.

Deep Learning models are widely used to automatically learn high-level feature representations from the data, and have achieved remarkable results in image processing, speech recognition and computational biology. Recently, interesting results have been shown using large observational healthcare data (e.g., electronic healthcare data or claims data), but more extensive research is needed to assess the power of Deep Learning in this domain.

This vignette describes how you can use the Observational Health Data Sciences and Informatics (OHDSI) PatientLevelPrediction package and DeepPatientLevelPrediction package to build Deep Learning models. This vignette assumes you have read and are comfortable with building patient level prediction models as described in the BuildingPredictiveModels vignette. Furthermore, this vignette assumes you are familiar with Deep Learning methods.

Background

Deep Learning models are build by stacking an often large number of neural network layers that perform feature engineering steps, e.g embedding, and are collapsed in a final linear layer (equivalent to logistic regression). These algorithms need a lot of data to converge to a good representation, but currently the sizes of the large observational healthcare databases are growing fast which would make Deep Learning an interesting approach to test within OHDSI’s Patient-Level Prediction Framework. The current implementation allows us to perform research at scale on the value and limitations of Deep Learning using observational healthcare data.

In the package we use pytorch through the reticulate package.

Note that training Deep Learning models is computationally intensive, our implementation therefore supports both GPU and CPU. A GPU is highly recommended and neccesary for most models for Deep Learning!

Requirements

Full details about the package requirements and instructions on installing the package can be found here.

Integration with PatientLevelPrediction

The DeepPatientLevelPrediction package provides additional model settings that can be used within the PatientLevelPrediction package runPlp() and runMultiplePlp() functions. To use both packages you first need to pick the deep learning architecture you wish to fit (see below) and then you specify this as the modelSettings inside runPlp().

# load the data
plpData <- PatientLevelPrediction::loadPlpData('locationOfData')

# pick the set<Model> from  DeepPatientLevelPrediction
deepLearningModel <- DeepPatientLevelPrediction::setDefaultResNet()

# use PatientLevelPrediction to fit model
deepLearningResult <- PatientLevelPrediction::runPlp(
    plpData = plpData, 
    outcomeId = 1230, 
    modelSettings = deepLearningModel,
    analysisId = 'resNetTorch', 
    ...
  )

Non-Temporal Architectures

We implemented the following non-temporal (2D data matrix) architectures:

Simple MultiLayerPerceptron

Overall concept

A multilayer perceptron (MLP) model is a directed graph consisting of an input layer, one or more hidden layers and an output layer. The model takes in the input feature values and feeds these forward through the graph to determine the output class. A process known as ‘backpropagation’ is used to train the model. Backpropagation requires some ground truth and involves automatically calculating the derivative of the model parameters with respect to the the error between the model’s predictions and ground truth. Then the model learns how to adjust the model’s parameters to reduce the error.

Example

Set Function

To use the package to fit a MLP model you can use the setMultiLayerPerceptron() function to specify the hyper-parameter settings for the MLP.

Inputs

The numLayers and sizeHidden inputs define the network topology via the number of layers and neurons in the network’s hidden layers.

The dropout input specifies the probability that a layer randomly sets some inputs to 0 at each step during training time. A value of 0.2 means that 20% of the layers inputs will be set to 0. This is used to reduce overfitting.

The sizeEmbedding input specifies the size of the embedding used. The first layer is an embedding layer which converts each sparse feature to a dense learned vector. An embedding is a lower dimensional projection of the features where distance between points is a measure of similarity.

The weightDecay input corresponds to the weight decay in the objective function. During model fitting the aim is to minimize the objective function. The objective function is made up of the prediction error (the difference between the prediction vs the truth) plus the square of the weights multiplied by the weight decay. The larger the weight decay, the more you penalize having large weights. If you set the weight decay too large, the model will never fit well enough, if you set it too low, you need to be careful of overfitting (so try to stop model fitting earlier).

The learningRate input is the learning rate which is a hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated. The smaller the learningRate the longer it will take to fit the model and the model weights may get stuck, but if the learningRate is too large, the weights may sub-optimally converge too fast.

The seed lets the user use the same random initialization of the network’s weights as a previous run.

The hyperParamSearch chooses the strategy to find the best hyperparameters. Currently a random search and grid search are supported. Grid search searches every possible combination of hyperparameters while random search samples randomly from the combinations. Since neural networks can be very flexible and have many hyperparameter combinations it’s almost never feasible to do a full grid search unless the network is really small.

The randomSample chooses how many random samples to use.

The device specifies what device to use. Either cpu or cuda. Or if you have many GPU’s cuda:x where x is the gpu number as seen in nvidia-smi.

The batchSize corresponds to the number of data points (patients) used per iteration to estimate the network error during model fitting.

The epochs corresponds to how many time to run through the entire training data while fitting the model.

Example Code

For example, the following code will try 10 different network configurations sampled from the possible combinations given and pick the one that obtains the greatest AUROC via cross validation in the training data and then fit the model with that configuration using all the training data. The standard output of runPlp() will be returned - this contains the MLP model along with the performance details and settings. Note that all possible combinations are 222*2 or 16 but specify randomSample=10 to only try 10 of those.

modelSettings <- setMultiLayerPerceptron(
  numLayers = c(3L, 5L),
  sizeHidden = c(64L, 128L), 
  dropout = c(0.2),
  sizeEmbedding = c(32L, 64L),
  estimatorSettings = setEstimator(
    learningRate = c(1e-3, 1e-4),
    weightDecay = c(1e-5),
    batchSize = c(128L),
    epochs=c(5L),
    seed=12L
  ),
  randomSample=10L
)

mlpResult <- PatientLevelPrediction::runPlp(
    plpData = plpData, 
    outcomeId = 3, 
    modelSettings = modelSettings,
    analysisId = 'MLP', 
    analysisName = 'Testing Deep Learning', 
    populationSettings = populationSet, 
    splitSettings = PatientLevelPrediction::createDefaultSplitSetting(), 
    preprocessSettings = PatientLevelPrediction::createPreprocessSettings(), 
    executeSettings = PatientLevelPrediction::createExecuteSettings(
      runSplitData = T, 
      runSampleData = F, 
      runfeatureEngineering = F, 
      runPreprocessData = T, 
      runModelDevelopment = T, 
      runCovariateSummary = F
    ), 
    saveDirectory = file.path(testLoc, 'DeepNNTorch')
  )

ResNet

Overall concept

Deep learning models are often trained via a process known as gradient descent. During this process the network weights are updated based on the gradient of the error function for the current weights. However, as the number of layers in the network increase, there is a greater chance of experiencing an issue known vanishing or exploding gradients. The vanishing or exploding gradient is when the gradient goes to 0 or infinity, which negatively impacts the model fitting.

The residual network (ResNet) was introduced to address the vanishing or exploding gradient issue. It works by adding connections between non-adjacent layers, termed a ‘skip connection’.

The ResNet calculates embeddings for every feature and then averages them to compute an embedding per patient.

Our implementation of a ResNet for tabular data is based on this paper.

Example

Set Function

To use the package to fit a ResNet model you can use the setResNet() function to specify the hyperparameter settings for the network.

Inputs

Model inputs:

numLayers: How many layers to use in the model.

sizeHidden: How many neurons in each hidden layer

hiddenFactor: How much to increase number of neurons in each layer (see paper)

residualDropout andhiddenDropout : How much dropout to apply in hidden layer or residual connection

sizeEmbedding : The size of the initial embedding layer

Estimator inputs:

weightDecay : How much weight decay to apply, which penalizes bigger weights

learningRate : Which learning rate to use

seed : seed for weight initialization

device : Which device to use, such as a cpu or a gpu

batchSize : Size of batch of data used per iteration during training

epochs : How many runs through the data

Hyperparameter tuning inputs:

hyperParamSearch : Which type of hyperparameter search to use, either random sampling or exhaustive (grid) search

randomSample: If doing a random search for hyperparameters, how many random samples to use

randomSampleSeed: Seed to make hyperparameter search reproducible

Example Code

For example, the following code will fit a two layer ResNet where each layer has 32 neurons which increases by a factor of two before decreasing again (hiddenFactor). 10% of inputs to each layer and residual connection within the layer are randomly zeroed during training but not testing.The embedding layer has 32 neurons. Learning rate of 3e-4 with weight decay of 1e-6 is used for the optimizer. No hyperparameter search is done since each input only includes one option.

resset <- setResNet(
  numLayers = c(2L), 
  sizeHidden = c(32L),
  hiddenFactor = c(2L),
  residualDropout = c(0.1), 
  hiddenDropout = c(0.1),
  sizeEmbedding = c(32L),
  estimatorSettings = setEstimator(learningRate = c(3e-4),
                                   weightDecay = c(1e-6),
                                   #device='cuda:0', # uncomment to use GPU
                                   batchSize = 128L, 
                                   epochs = 3L,
                                   seed = 42L),
  hyperParamSearch = 'random',
  randomSample = 1
)

resResult <- PatientLevelPrediction::runPlp(
    plpData = plpData, 
    outcomeId = 3, 
    modelSettings = resset,
    analysisId = 'ResNet', 
    analysisName = 'Testing ResNet', 
    populationSettings = populationSet, 
    splitSettings = PatientLevelPrediction::createDefaultSplitSetting(), 
    preprocessSettings = PatientLevelPrediction::createPreprocessSettings(), 
    executeSettings = PatientLevelPrediction::createExecuteSettings(
      runSplitData = T, 
      runSampleData = F, 
      runfeatureEngineering = F, 
      runPreprocessData = T, 
      runModelDevelopment = T, 
      runCovariateSummary = F
    ), 
    saveDirectory = file.path(getwd(), 'ResNet') # change to save elsewhere
  )

Transformer

Overall concept

Recently there has been a surge of models in natural language processing and computer vision that utilize attention. This is a technique where the model learns where to look and what to focus on in the input data. This was first described in the attention is all you need paper. Here we have used an implementation that has shown good performance on non-temporal tabular data from this paper.

This architecture is computationally expensive and scales badly with longer sequence length. In this case the sequence is the amount of features each patient has. Users need to be aware of how many features they are feeding to the model since this will effect the computation time heavily. This is something you control in FeatureExtraction when you create your covariateSettings.

Examples

Set Function

To use the package to fit a Transformer model you can use the setTransformer() function to specify the hyperparameter settings for the network.

Inputs

The training and hyperparameter tuning inputs are the same as for the ResNet.

Model inputs:

numBlocks : How many Transformer blocks to use, each block includes a self-attention layer and a feedforward block with two linear layers.

dimToken : Dimension of the embedding for each feature.

dimOut : Dimension of output, for binary problems this is 1.

numHeads : Number of attention heads for the self-attention, dimToken needs to be divisible by numHeads.

attDropout, ffnDropout and resDropout : How much dropout to apply on attentions, feedforward block or in residual connections

dimHidden : How many neurons in linear layers inside the feedforward block

Example Code

modelSettings <- setTransformer(numBlocks = 3L,
                                dimToken = 32L,
                                dimOut = 1, 
                                numHeads = 4L,
                                attDropout = 0.25,
                                ffnDropout = 0.25,
                                resDropout = 0,
                                dimHidden = 128L,
                                estimatorSettings = setEstimator(
                                  learningRate = 3e-4,
                                  weightDecay = 1e-6,
                                  batchSize = 128L,
                                  epochs = 10L,
                                  device = 'cpu'
                                ),
                                randomSample=1L)
                              


TransformerResult <- PatientLevelPrediction::runPlp(
    plpData = plpData, 
    outcomeId = 3, 
    modelSettings = modelSettings,
    analysisId = 'Transformer', 
    analysisName = 'Testing transformer', 
    populationSettings = populationSet, 
    splitSettings = PatientLevelPrediction::createDefaultSplitSetting(), 
    preprocessSettings = PatientLevelPrediction::createPreprocessSettings(), 
    executeSettings = PatientLevelPrediction::createExecuteSettings(
      runSplitData = T, 
      runSampleData = F, 
      runfeatureEngineering = F, 
      runPreprocessData = T, 
      runModelDevelopment = T, 
      runCovariateSummary = F
    ), 
    saveDirectory = file.path(getwd(), 'Transformer') # change to save elsewhere
  )

Acknowledgments

Considerable work has been dedicated to provide the DeepPatientLevelPrediction package.

citation("DeepPatientLevelPrediction")

## To cite package 'DeepPatientLevelPrediction' in publications use:
## 
##   Fridgeirsson E, Reps J, Chan You S, Kim C, John H (8).
##   _DeepPatientLevelPrediction: Deep Learning For Patient Level
##   Prediction Using Data In The OMOP Common Data Model_. R package
##   version 2.1.0, <https://github.com/OHDSI/DeepPatientLevelPrediction>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {DeepPatientLevelPrediction: Deep Learning For Patient Level Prediction Using Data In The
## OMOP Common Data Model},
##     author = {Egill Fridgeirsson and Jenna Reps and Seng {Chan You} and Chungsoo Kim and Henrik John},
##     year = {8},
##     note = {R package version 2.1.0},
##     url = {https://github.com/OHDSI/DeepPatientLevelPrediction},
##   }

Please reference this paper if you use the PLP Package in your work:

Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek PR. Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data. J Am Med Inform Assoc. 2018;25(8):969-975.

Jenna Reps, Egill Fridgeirsson, Chungsoo Kim, Henrik John, Seng Chan You, Xiaoyong Pan

2024-08-02

Introduction

DeepPatientLevelPrediction

Background

Requirements

Integration with PatientLevelPrediction

Non-Temporal Architectures

Simple MultiLayerPerceptron

Overall concept

Example

Set Function

Inputs

Example Code

ResNet

Overall concept

Example

Set Function

Inputs

Model inputs:

Estimator inputs:

Hyperparameter tuning inputs:

Example Code

Transformer

Overall concept

Examples

Set Function

Inputs

Model inputs:

Example Code

Acknowledgments