vignettes/BuildingDeepModels.Rmd
BuildingDeepModels.Rmd
Patient level prediction aims to use historic data to learn a function between an input (a patient’s features such as age/gender/comorbidities at index) and an output (whether the patient experienced an outcome during some time-at-risk). Deep learning is example of the the current state-of-the-art classifiers that can be implemented to learn the function between inputs and outputs.
Deep Learning models are widely used to automatically learn high-level feature representations from the data, and have achieved remarkable results in image processing, speech recognition and computational biology. Recently, interesting results have been shown using large observational healthcare data (e.g., electronic healthcare data or claims data), but more extensive research is needed to assess the power of Deep Learning in this domain.
This vignette describes how you can use the Observational Health Data
Sciences and Informatics (OHDSI) PatientLevelPrediction
package and DeepPatientLevelPrediction
package to build Deep Learning models. This vignette assumes you have
read and are comfortable with building patient level prediction models
as described in the BuildingPredictiveModels
vignette. Furthermore, this vignette assumes you are familiar with
Deep Learning methods.
Deep Learning models are build by stacking an often large number of neural network layers that perform feature engineering steps, e.g embedding, and are collapsed in a final linear layer (equivalent to logistic regression). These algorithms need a lot of data to converge to a good representation, but currently the sizes of the large observational healthcare databases are growing fast which would make Deep Learning an interesting approach to test within OHDSI’s Patient-Level Prediction Framework. The current implementation allows us to perform research at scale on the value and limitations of Deep Learning using observational healthcare data.
In the package we use pytorch
through the
reticulate
package.
Many network architectures have recently been proposed and we have implemented a number of them, however, this list will grow in the near future. It is important to understand that some of these architectures require a 2D data matrix, i.e. |patient|x|feature|, and others use a 3D data matrix |patient|x|feature|x|time|. The FeatureExtraction Package has been extended to enable the extraction of both data formats as will be described with examples below.
Note that training Deep Learning models is computationally intensive, our implementation therefore supports both GPU and CPU. A GPU is highly recommended and neccesary for most models for Deep Learning!
Full details about the package requirements and instructions on installing the package can be found here.
The DeepPatientLevelPrediction
package provides
additional model settings that can be used within the
PatientLevelPrediction
package runPlp()
and
runMultiplePlp()
functions. To use both packages you first
need to pick the deep learning architecture you wish to fit (see below)
and then you specify this as the modelSettings inside
runPlp()
.
# load the data
plpData <- PatientLevelPrediction::loadPlpData('locationOfData')
# pick the set<Model> from DeepPatientLevelPrediction
deepLearningModel <- DeepPatientLevelPrediction::setDefaultResNet()
# use PatientLevelPrediction to fit model
deepLearningResult <- PatientLevelPrediction::runPlp(
plpData = plpData,
outcomeId = 1230,
modelSettings = deepLearningModel,
analysisId = 'resNetTorch',
...
)
We implemented the following non-temporal (2D data matrix) architectures:
A multilayer perceptron (MLP) model is a directed graph consisting of an input layer, one or more hidden layers and an output layer. The model takes in the input feature values and feeds these forward through the graph to determine the output class. A process known as ‘backpropagation’ is used to train the model. Backpropagation requires some ground truth and involves automatically calculating the derivative of the model parameters with respect to the the error between the model’s predictions and ground truth. Then the model learns how to adjust the model’s parameters to reduce the error.
To use the package to fit a MLP model you can use the
setMultiLayerPerceptron()
function to specify the
hyper-parameter settings for the MLP.
The numLayers
and sizeHidden
inputs define
the network topology via the number of layers and neurons in the
network’s hidden layers.
The dropout
input specifies the probability that a layer
randomly sets some inputs to 0 at each step during training time. A
value of 0.2
means that 20% of the layers inputs will be
set to 0. This is used to reduce overfitting.
The sizeEmbedding
input specifies the size of the
embedding used. The first layer is an embedding layer which converts
each sparse feature to a dense learned vector. An embedding is a lower
dimensional projection of the features where distance between points is
a measure of similarity.
The weightDecay
input corresponds to the weight decay in
the objective function. During model fitting the aim is to minimize the
objective function. The objective function is made up of the prediction
error (the difference between the prediction vs the truth) plus the
square of the weights multiplied by the weight decay. The larger the
weight decay, the more you penalize having large weights. If you set the
weight decay too large, the model will never fit well enough, if you set
it too low, you need to be careful of overfitting (so try to stop model
fitting earlier).
The learningRate
input is the learning rate which is a
hyperparameter that controls how much to change the model in response to
the estimated error each time the model weights are updated. The smaller
the learningRate
the longer it will take to fit the model
and the model weights may get stuck, but if the
learningRate
is too large, the weights may sub-optimally
converge too fast.
The seed
lets the user use the same random
initialization of the network’s weights as a previous run.
The hyperParamSearch
chooses the strategy to find the
best hyperparameters. Currently a random search and grid search are
supported. Grid search searches every possible combination of
hyperparameters while random search samples randomly from the
combinations. Since neural networks can be very flexible and have many
hyperparameter combinations it’s almost never feasible to do a full grid
search unless the network is really small.
The randomSample
chooses how many random samples to
use.
The device
specifies what device to use. Either
cpu
or cuda
. Or if you have many GPU’s
cuda:x
where x is the gpu number as seen in
nvidia-smi
.
The batchSize
corresponds to the number of data points
(patients) used per iteration to estimate the network error during model
fitting.
The epochs
corresponds to how many time to run through
the entire training data while fitting the model.
For example, the following code will try 10 different network
configurations sampled from the possible combinations given and pick the
one that obtains the greatest AUROC via cross validation in the training
data and then fit the model with that configuration using all the
training data. The standard output of runPlp()
will be
returned - this contains the MLP model along with the performance
details and settings. Note that all possible combinations are
222*2 or 16 but specify randomSample=10
to only
try 10 of those.
modelSettings <- setMultiLayerPerceptron(
numLayers = c(3L, 5L),
sizeHidden = c(64L, 128L),
dropout = c(0.2),
sizeEmbedding = c(32L, 64L),
estimatorSettings = setEstimator(
learningRate = c(1e-3, 1e-4),
weightDecay = c(1e-5),
batchSize = c(128L),
epochs=c(5L),
seed=12L
),
randomSample=10L
)
mlpResult <- PatientLevelPrediction::runPlp(
plpData = plpData,
outcomeId = 3,
modelSettings = modelSettings,
analysisId = 'MLP',
analysisName = 'Testing Deep Learning',
populationSettings = populationSet,
splitSettings = PatientLevelPrediction::createDefaultSplitSetting(),
preprocessSettings = PatientLevelPrediction::createPreprocessSettings(),
executeSettings = PatientLevelPrediction::createExecuteSettings(
runSplitData = T,
runSampleData = F,
runfeatureEngineering = F,
runPreprocessData = T,
runModelDevelopment = T,
runCovariateSummary = F
),
saveDirectory = file.path(testLoc, 'DeepNNTorch')
)
Deep learning models are often trained via a process known as gradient descent. During this process the network weights are updated based on the gradient of the error function for the current weights. However, as the number of layers in the network increase, there is a greater chance of experiencing an issue known vanishing or exploding gradients. The vanishing or exploding gradient is when the gradient goes to 0 or infinity, which negatively impacts the model fitting.
The residual network (ResNet) was introduced to address the vanishing or exploding gradient issue. It works by adding connections between non-adjacent layers, termed a ‘skip connection’.
The ResNet calculates embeddings for every feature and then averages them to compute an embedding per patient.
Our implementation of a ResNet for tabular data is based on this paper.
To use the package to fit a ResNet model you can use the
setResNet()
function to specify the hyperparameter settings
for the network.
numLayers
: How many layers to use in the model.
sizeHidden
: How many neurons in each hidden layer
hiddenFactor
: How much to increase number of neurons in
each layer (see paper)
residualDropout
andhiddenDropout
: How much
dropout to apply in hidden layer or residual connection
sizeEmbedding
: The size of the initial embedding
layer
weightDecay
: How much weight decay to apply, which
penalizes bigger weights
learningRate
: Which learning rate to use
seed
: seed for weight initialization
device
: Which device to use, such as a cpu
or a gpu
batchSize
: Size of batch of data used per iteration
during training
epochs
: How many runs through the data
hyperParamSearch
: Which type of hyperparameter search
to use, either random sampling or exhaustive (grid) search
randomSample
: If doing a random search for
hyperparameters, how many random samples to use
randomSampleSeed
: Seed to make hyperparameter search
reproducible
For example, the following code will fit a two layer ResNet where each layer has 32 neurons which increases by a factor of two before decreasing again (hiddenFactor). 10% of inputs to each layer and residual connection within the layer are randomly zeroed during training but not testing.The embedding layer has 32 neurons. Learning rate of 3e-4 with weight decay of 1e-6 is used for the optimizer. No hyperparameter search is done since each input only includes one option.
resset <- setResNet(
numLayers = c(2L),
sizeHidden = c(32L),
hiddenFactor = c(2L),
residualDropout = c(0.1),
hiddenDropout = c(0.1),
sizeEmbedding = c(32L),
estimatorSettings = setEstimator(learningRate = c(3e-4),
weightDecay = c(1e-6),
#device='cuda:0', # uncomment to use GPU
batchSize = 128L,
epochs = 3L,
seed = 42L),
hyperParamSearch = 'random',
randomSample = 1
)
resResult <- PatientLevelPrediction::runPlp(
plpData = plpData,
outcomeId = 3,
modelSettings = resset,
analysisId = 'ResNet',
analysisName = 'Testing ResNet',
populationSettings = populationSet,
splitSettings = PatientLevelPrediction::createDefaultSplitSetting(),
preprocessSettings = PatientLevelPrediction::createPreprocessSettings(),
executeSettings = PatientLevelPrediction::createExecuteSettings(
runSplitData = T,
runSampleData = F,
runfeatureEngineering = F,
runPreprocessData = T,
runModelDevelopment = T,
runCovariateSummary = F
),
saveDirectory = file.path(getwd(), 'ResNet') # change to save elsewhere
)
Recently there has been a surge of models in natural language processing and computer vision that utilize attention. This is a technique where the model learns where to look and what to focus on in the input data. This was first described in the attention is all you need paper. Here we have used an implementation that has shown good performance on non-temporal tabular data from this paper.
This architecture is computationally expensive and scales badly with
longer sequence length. In this case the sequence is the amount of
features each patient has. Users need to be aware of how many features
they are feeding to the model since this will effect the computation
time heavily. This is something you control in
FeatureExtraction
when you create your
covariateSettings
.
To use the package to fit a Transformer model you can use the
setTransformer()
function to specify the hyperparameter
settings for the network.
The training and hyperparameter tuning inputs are the same as for the ResNet.
numBlocks
: How many Transformer blocks to use, each
block includes a self-attention layer and a feedforward block with two
linear layers.
dimToken
: Dimension of the embedding for each
feature.
dimOut
: Dimension of output, for binary problems this
is 1.
numHeads
: Number of attention heads for the
self-attention, dimToken
needs to be divisible by
numHeads
.
attDropout
, ffnDropout
and
resDropout
: How much dropout to apply on attentions,
feedforward block or in residual connections
dimHidden
: How many neurons in linear layers inside the
feedforward block
modelSettings <- setTransformer(numBlocks = 3L,
dimToken = 32L,
dimOut = 1,
numHeads = 4L,
attDropout = 0.25,
ffnDropout = 0.25,
resDropout = 0,
dimHidden = 128L,
estimatorSettings = setEstimator(
learningRate = 3e-4,
weightDecay = 1e-6,
batchSize = 128L,
epochs = 10L,
device = 'cpu'
),
randomSample=1L)
TransformerResult <- PatientLevelPrediction::runPlp(
plpData = plpData,
outcomeId = 3,
modelSettings = modelSettings,
analysisId = 'Transformer',
analysisName = 'Testing transformer',
populationSettings = populationSet,
splitSettings = PatientLevelPrediction::createDefaultSplitSetting(),
preprocessSettings = PatientLevelPrediction::createPreprocessSettings(),
executeSettings = PatientLevelPrediction::createExecuteSettings(
runSplitData = T,
runSampleData = F,
runfeatureEngineering = F,
runPreprocessData = T,
runModelDevelopment = T,
runCovariateSummary = F
),
saveDirectory = file.path(getwd(), 'Transformer') # change to save elsewhere
)
Considerable work has been dedicated to provide the
DeepPatientLevelPrediction
package.
citation("DeepPatientLevelPrediction")
## To cite package 'DeepPatientLevelPrediction' in publications use:
##
## Fridgeirsson E, Reps J, Chan You S, Kim C, John H (8).
## _DeepPatientLevelPrediction: Deep Learning For Patient Level
## Prediction Using Data In The OMOP Common Data Model_. R package
## version 2.1.0, <https://github.com/OHDSI/DeepPatientLevelPrediction>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {DeepPatientLevelPrediction: Deep Learning For Patient Level Prediction Using Data In The
## OMOP Common Data Model},
## author = {Egill Fridgeirsson and Jenna Reps and Seng {Chan You} and Chungsoo Kim and Henrik John},
## year = {8},
## note = {R package version 2.1.0},
## url = {https://github.com/OHDSI/DeepPatientLevelPrediction},
## }
Please reference this paper if you use the PLP Package in your work: