Split the plpData into test/train sets using a splitting settings of class splitSettings
Source: R/DataSplitting.R
splitData.Rd
Split the plpData into test/train sets using a splitting settings of class
splitSettings
Usage
splitData(
plpData = plpData,
population = population,
splitSettings = createDefaultSplitSetting(splitSeed = 42)
)
Arguments
- plpData
An object of type
plpData
- the patient level prediction data extracted from the CDM.- population
The population created using
createStudyPopulation
that define who will be used to develop the model- splitSettings
An object of type
splitSettings
specifying the split - the default can be created usingcreateDefaultSplitSetting
Value
Returns a list containing the training data (Train) and optionally the test data (Test). Train is an Andromeda object containing
covariateRef: a table with the covariate information
labels: a table (rowId, outcomeCount, ...) for each data point in the train data (outcomeCount is the class label)
folds: a table (rowId, index) specifying which training fold each data point is in.
Test is an Andromeda object containing
covariateRef: a table with the covariate information
labels: a table (rowId, outcomeCount, ...) for each data point in the test data (outcomeCount is the class label)
Examples
data("simulationProfile")
plpData <- simulatePlpData(simulationProfile, n = 1000)
#> Generating covariates
#> Generating cohorts
#> Generating outcomes
population <- createStudyPopulation(plpData)
#> Outcome is 0 or 1
#> Population created with: 969 observations, 969 unique subjects and 428 outcomes
#> Population created in 0.0459 secs
splitSettings <- createDefaultSplitSetting(testFraction = 0.50,
trainFraction = 0.50, nfold = 5)
data = splitData(plpData, population, splitSettings)
#> seed: 7588
#> Creating a 50% test and 50% train (into 5 folds) random stratified split by class
#> Data split into 484 test cases and 485 train cases (98, 97, 97, 97, 96)
#> Data split in 0.253 secs
# test data should be ~500 rows (changes because of study population)
nrow(data$Test$labels)
#> [1] 484
# train data should be ~500 rows
nrow(data$Train$labels)
#> [1] 485
# should be five fold in the train data
length(unique(data$Train$folds$index))
#> [1] 5