
A wrapper around FeatureExtraction::tidyCovariateData that normalises the covariate data and removes rare or redundant features.

Usage

preprocessData(covariateData, preprocessSettings = createPreprocessSettings())

Arguments

covariateData

The covariate part of the training data created by splitData, after any sampling and feature engineering have been applied.

preprocessSettings

The preprocessing settings created by createPreprocessSettings.

Value

The covariateData object with the processed covariates

Details

Returns an object of class covariateData that has been processed. Processing includes normalising the data and removing rare or redundant features. Redundant features are features that, within an analysisId, together cover all observations.
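
The three steps described above can be illustrated with a minimal base R sketch on a toy sparse covariate table. This is not the FeatureExtraction implementation: the toy tables, the isBinary flag, the exaggerated minFraction value, and the rule of dropping the most frequent covariate in a fully covered analysis are all assumptions made for illustration.

```r
# Toy data (hypothetical): one row per non-zero (rowId, covariateId) pair.
nRows <- 4          # number of subjects in the toy population
minFraction <- 0.3  # exaggerated threshold so the toy data has a "rare" covariate

covariates <- data.frame(
  rowId          = c(1, 2, 3, 4, 1, 2, 3, 4, 1),
  covariateId    = c(1002, 1002, 1002, 1002, 8507, 8507, 8532, 8532, 9999),
  covariateValue = c(40, 47, 33, 36, 1, 1, 1, 1, 1)
)
# Hypothetical reference table linking covariates to their analysisId.
covariateRef <- data.frame(
  covariateId = c(1002, 8507, 8532, 9999),
  analysisId  = c(2, 1, 1, 3),
  isBinary    = c(FALSE, TRUE, TRUE, TRUE)
)

# Number of subjects with a non-zero value, per covariate.
counts <- aggregate(rowId ~ covariateId, data = covariates, FUN = length)
names(counts)[2] <- "n"
counts <- merge(counts, covariateRef, by = "covariateId")

# Redundancy: binary covariates in one analysisId that together cover every
# subject (assumes they are mutually exclusive). Drop one of them per analysis.
binary <- counts[counts$isBinary, ]
covered <- tapply(binary$n, binary$analysisId, sum)
fullAnalyses <- as.numeric(names(covered)[covered == nRows])
redundant <- vapply(fullAnalyses, function(id) {
  inAnalysis <- binary[binary$analysisId == id, ]
  inAnalysis$covariateId[which.max(inAnalysis$n)]
}, numeric(1))

# Rare: covariates present in fewer than minFraction of subjects.
rare <- counts$covariateId[counts$n / nRows < minFraction]

# Remove both sets, then normalise each remaining covariate by its max value.
kept <- covariates[!covariates$covariateId %in% c(redundant, rare), ]
maxValue <- ave(kept$covariateValue, kept$covariateId, FUN = max)
kept$covariateValue <- kept$covariateValue / maxValue

print(kept)
```

In this toy data, covariates 8507 and 8532 share an analysisId and together cover all four subjects, so one of them carries no extra information; covariate 9999 appears for only one subject in four and falls below the threshold; and the continuous covariate 1002 ends up scaled so that its largest value is 1.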

Examples

library(PatientLevelPrediction)
library(dplyr)
data("simulationProfile")
plpData <- simulatePlpData(simulationProfile, n=1000)
#> Generating covariates
#> Generating cohorts
#> Generating outcomes
preProcessedData <- preprocessData(plpData$covariateData, createPreprocessSettings())
#> Removing 0 redundant covariates
#> Removing 0 infrequent covariates
#> Normalizing covariates
#> Tidying covariates took 0.622 secs
# check age is normalized by max value
preProcessedData$covariates %>% dplyr::filter(.data$covariateId == 1002)
#> # Source:   SQL [?? x 3]
#> # Database: sqlite 3.47.1 [/tmp/RtmpPJeNgk/file20b1759ba368.sqlite]
#>    rowId covariateId covariateValue
#>    <int>       <dbl>          <dbl>
#>  1     1        1002          0.851
#>  2     2        1002          0.830
#>  3     3        1002          0.702
#>  4     4        1002          0.766
#>  5     5        1002          0.723
#>  6     6        1002          1    
#>  7     7        1002          0.723
#>  8     8        1002          0.872
#>  9     9        1002          0.915
#> 10    10        1002          0.809
#> # ℹ more rows