Split the plpData into test and train sets using split settings of class splitSettings.

Usage

splitData(
  plpData = plpData,
  population = population,
  splitSettings = createDefaultSplitSetting(splitSeed = 42)
)

Arguments

plpData

An object of type plpData - the patient level prediction data extracted from the CDM.

population

The population created using createStudyPopulation that defines who will be used to develop the model.

splitSettings

An object of type splitSettings specifying the split - the default can be created using createDefaultSplitSetting

Value

Returns a list containing the training data (Train) and optionally the test data (Test). Train is an Andromeda object containing

  • covariateRef: a table with the covariate information

  • labels: a table (rowId, outcomeCount, ...) for each data point in the train data (outcomeCount is the class label)

  • folds: a table (rowId, index) specifying which training fold each data point is in.

Test is an Andromeda object containing

  • covariateRef: a table with the covariate information

  • labels: a table (rowId, outcomeCount, ...) for each data point in the test data (outcomeCount is the class label)
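To make the shape of the labels and folds tables more concrete, here is a minimal base-R sketch of a class-stratified random split with fold assignment. It does not call splitData and is not the package's implementation. The simulated labels table, the use of -1 to mark test rows, and the round-robin fold assignment are all assumptions made for illustration.

```r
# Illustrative sketch only: base R, no PatientLevelPrediction calls.
set.seed(42)

# Hypothetical labels table: rowId plus a binary outcomeCount (the class label)
labels <- data.frame(
  rowId = 1:1000,
  outcomeCount = rbinom(1000, size = 1, prob = 0.4)
)

testFraction <- 0.5
nfold <- 5

# Sketch convention (an assumption): index -1 marks a test row,
# 1..nfold marks the training fold.
labels$index <- NA_integer_
for (cls in unique(labels$outcomeCount)) {
  rows <- which(labels$outcomeCount == cls)
  rows <- rows[sample.int(length(rows))]   # shuffle within the class
  nTest <- round(length(rows) * testFraction)
  testRows <- rows[seq_len(nTest)]
  trainRows <- setdiff(rows, testRows)
  labels$index[testRows] <- -1L
  # deal the training rows round-robin into the folds
  labels$index[trainRows] <- rep_len(seq_len(nfold), length(trainRows))
}

isTrain <- labels$index > 0L
train <- list(
  labels = labels[isTrain, c("rowId", "outcomeCount")],
  folds = labels[isTrain, c("rowId", "index")]
)
test <- list(
  labels = labels[!isTrain, c("rowId", "outcomeCount")]
)
```

Because the split is done within each class, the outcome rate in train$labels and test$labels should be nearly identical, and every training row carries exactly one fold index.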

Examples

data("simulationProfile")
plpData <- simulatePlpData(simulationProfile, n = 1000)
#> Generating covariates
#> Generating cohorts
#> Generating outcomes
population <- createStudyPopulation(plpData)
#> Outcome is 0 or 1
#> Population created with: 969 observations, 969 unique subjects and 428 outcomes
#> Population created in 0.0459 secs
splitSettings <- createDefaultSplitSetting(testFraction = 0.50, 
                                           trainFraction = 0.50, nfold = 5)
data <- splitData(plpData, population, splitSettings)
#> seed: 7588
#> Creating a 50% test and 50% train (into 5 folds) random stratified split by class
#> Data split into 484 test cases and 485 train cases (98, 97, 97, 97, 96)
#> Data split in 0.253 secs
# test data should be ~500 rows (the exact count depends on the study population)
nrow(data$Test$labels)
#> [1] 484
# train data should be ~500 rows
nrow(data$Train$labels)
#> [1] 485
# there should be five folds in the train data
length(unique(data$Train$folds$index))
#> [1] 5