Creating Cohort Subset Definitions

Introduction

This guide aims to describe the process of cohort subsetting using CohortGenerator. The purpose of Cohort subsetting operations is to allow the creation of common operations that can be applied to generated cohorts in order to subset to different operations in a consistent manner.

Subset definitions

Subset definitions are named sets of operations that can be applied to a set of one or more cohorts. The current operations that you can apply to cohorts are:

Limit subsets
Demographic subsets
Cohort subsetting

Operations can be sequentially chained within subset definitions and all outputs are considered full cohorts that can be passed into to other packages as if they are cohorts designed in packages

Demographic subset operations

This subsetting process allows you to capture the age, race/ethnicity gender within a cohort as subgroups. For example, “subset cohorts to subjects that are male between the ages 1 and 5 years old”.

Subsetting to other cohorts

This type of operation allows you to subset a cohort to only those subjects included in one or more other cohorts

Defining a subset definition

First get a Cohort definition set:

cohortDefinitionSet <- getCohortDefinitionSet(
  settingsFileName = "testdata/name/Cohorts.csv",
  jsonFolder = "testdata/name/cohorts",
  sqlFolder = "testdata/name/sql/sql_server",
  cohortFileNameFormat = "%s",
  cohortFileNameValue = c("cohortName"),
  packageName = "CohortGenerator",
  verbose = FALSE
)

cohortDefinitionSet$cohortId <- cohortDefinitionSet$cohortId + 1778210 # Match the cohort Ids taken from Atlas
cohortIds <- cohortDefinitionSet$cohortId
cohortDefinitionSet$atlasId <- cohortDefinitionSet$cohortId
cohortDefinitionSet$logicDescription <- ""

A definition can include different subset operations - these are applied strictly in order:

# Example, we want to have a HTN cohort that starts any time prior to the index start
# and the HTN cohort ends any time after the index start
subsetDef <- createCohortSubsetDefinition(
  name = "Patients in cohort cohort 1778213 with 365 days prior observation",
  definitionId = 1,
  subsetOperators = list(
    # here we are saying 'first subset to only those patients in cohort 1778213'
    createCohortSubset(
      name = "Subset to patients in cohort 1778213",
      # Note that this can be set to any id - if the
      # cohort is empty or doesn't exist this will not error
      cohortIds = 1778213,
      cohortCombinationOperator = "any",
      negate = FALSE,
      startWindow = createSubsetCohortWindow(
        startDay = -9999,
        endDay = 0,
        targetAnchor = "cohortStart"
      ),
      endWindow = createSubsetCohortWindow(
        startDay = 0,
        endDay = 9999,
        targetAnchor = "cohortStart"
      )
    ),

    # Next, subset to only those with 365 days of prior observation
    createLimitSubset(
      name = "Observation of at least 365 days prior",
      priorTime = 365,
      followUpTime = 0,
      limitTo = "all"
    )
  )
)

Reusing subset operators in multiple definitions

Next we create a similar definition that also subsetOperators the specified cohorts to require patients with specific demographic criteria. We can do that by copying the subset operations from our first definition and modifying them.

subsetOperations2 <- subsetDef$subsetOperators

# subset to those between aged 18 an 64
subsetOperations2[[3]] <-
  createDemographicSubset(
    name = "18 - 65",
    ageMin = 18,
    ageMax = 64
  )

subsetDef2 <- createCohortSubsetDefinition(
  name = "Patients in cohort 1778213 with 365 days prior obs, aged 18 - 64",
  definitionId = 2,
  subsetOperators = subsetOperations2
)

Applying subset definitions to a cohort definition set

Next we need to add the subset definitions to the base cohort set. This will automatically add identifiers and OHDSI SQL for the subset cohorts as well as storing references for saving definition sets for re-use.

cohortDefinitionSet <- cohortDefinitionSet |>
  addCohortSubsetDefinition(subsetDef)

knitr::kable(cohortDefinitionSet[, names(cohortDefinitionSet)[which(!names(cohortDefinitionSet) %in% c("json", "sql"))]])

cohortId	cohortName	atlasId	logicDescription	subsetParent	isSubset	subsetDefinitionId
1778211	celecoxib	1778211		1778211	FALSE	NA
1778212	celecoxibAge40	1778212		1778212	FALSE	NA
1778213	celecoxibAge40Male	1778213		1778213	FALSE	NA
1778214	celecoxibCensored	1778214		1778214	FALSE	NA
1778211001	celecoxib - Patients in cohort cohort 1778213 with 365 days prior observation Subset to patients in cohort 1778213, Observation of at least 365 days prior	NA	NA	1778211	TRUE	1
1778212001	celecoxibAge40 - Patients in cohort cohort 1778213 with 365 days prior observation Subset to patients in cohort 1778213, Observation of at least 365 days prior	NA	NA	1778212	TRUE	1
1778213001	celecoxibAge40Male - Patients in cohort cohort 1778213 with 365 days prior observation Subset to patients in cohort 1778213, Observation of at least 365 days prior	NA	NA	1778213	TRUE	1
1778214001	celecoxibCensored - Patients in cohort cohort 1778213 with 365 days prior observation Subset to patients in cohort 1778213, Observation of at least 365 days prior	NA	NA	1778214	TRUE	1

We can also apply a subset definition to only a limited number of target cohorts as follows

cohortDefinitionSet <- cohortDefinitionSet |>
  addCohortSubsetDefinition(subsetDef2, targetCohortIds = 1778212)

knitr::kable(cohortDefinitionSet[, names(cohortDefinitionSet)[which(!names(cohortDefinitionSet) %in% c("json", "sql"))]])

cohortId	cohortName	atlasId	logicDescription	subsetParent	isSubset	subsetDefinitionId
1778211	celecoxib	1778211		1778211	FALSE	NA
1778212	celecoxibAge40	1778212		1778212	FALSE	NA
1778213	celecoxibAge40Male	1778213		1778213	FALSE	NA
1778214	celecoxibCensored	1778214		1778214	FALSE	NA
1778211001	celecoxib - Patients in cohort cohort 1778213 with 365 days prior observation Subset to patients in cohort 1778213, Observation of at least 365 days prior	NA	NA	1778211	TRUE	1
1778212001	celecoxibAge40 - Patients in cohort cohort 1778213 with 365 days prior observation Subset to patients in cohort 1778213, Observation of at least 365 days prior	NA	NA	1778212	TRUE	1
1778213001	celecoxibAge40Male - Patients in cohort cohort 1778213 with 365 days prior observation Subset to patients in cohort 1778213, Observation of at least 365 days prior	NA	NA	1778213	TRUE	1
1778214001	celecoxibCensored - Patients in cohort cohort 1778213 with 365 days prior observation Subset to patients in cohort 1778213, Observation of at least 365 days prior	NA	NA	1778214	TRUE	1
1778212002	celecoxibAge40 - Patients in cohort 1778213 with 365 days prior obs, aged 18 - 64 Subset to patients in cohort 1778213, Observation of at least 365 days prior, 18 - 65	NA	NA	1778212	TRUE	2

The cohortDefinitionSet data.frame now has some additional columns:

subsetParent, isSubset, subsetDefinitionId

subsetParent indicates the parent cohort. For standard cohorts this will be their own ID. For out newly defined subsets, this will be the base cohort.

subsetDefinitionId displays the id of the subset applied to the cohort.

In addition, the name of the cohort displayed in this table is automatically generated from the base cohort name, the subset name and the names defined for the subset operations applied in the subset definition. As the number of resulting subsets can become very large, it is crucial to choose human interpretable naming conventions. For example, see the name of our first cohort and the resulting name of a child subset:

writeLines(c(
  paste("Cohort Id:", cohortDefinitionSet$cohortId[1]),
  paste("Name", cohortDefinitionSet$cohortName[1])
))

#> Cohort Id: 1778211
#> Name celecoxib

writeLines(c(
  paste("Cohort Id:", cohortDefinitionSet$cohortId[4]),
  paste("Subset Parent Id:", cohortDefinitionSet$subsetParent[4]),
  paste("Name", cohortDefinitionSet$cohortName[4])
))

#> Cohort Id: 1778214
#> Subset Parent Id: 1778214
#> Name celecoxibCensored

Note that when adding a subset definition to a cohort definition set, the target cohort ids e.g (1778211, 1778212) must exist in the cohortDefinitionSet and the output ids (1778211002, 1778212003) must be unique. As with all cohorts, any cohorts with these ids will be deleted prior to execution to prevent collisions. Note that the default expression for output cohort ids is targetId * 1000 + definitionId this may cause collisions that will cause addCohortSubsetDefinition to error. This can be modified by changing the identifierExpression parameter to createSubsetDefinition. This expression should be defined to guarantee uniqueness or adding the definition to a cohort definition set will fail.

Generating subsets

Executing CohortGenerator, we can now include the subset operations when our cohorts are generated:

connectionDetails <- Eunomia::getEunomiaConnectionDetails()
createCohortTables(
  connectionDetails = connectionDetails,
  cohortDatabaseSchema = "main",
  cohortTableNames = getCohortTableNames("my_cohort")
)
# ### As subsets are a big side effect we need to be clear what was generated and have good naming conventions
generatedCohorts <- generateCohortSet(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "main",
  cohortDatabaseSchema = "main",
  cohortTableNames = getCohortTableNames("my_cohort"),
  cohortDefinitionSet = cohortDefinitionSet,
  incremental = TRUE,
  incrementalFolder = file.path(someFolder, "RecordKeeping")
)

Cohort subset definitions can be run incrementally. In fact, if the base cohort definition changes for any reason, any subsets will automatically be re-executed when calling generateCohortSet.

Saving and loading subset definitions

Saving to packages/directories

Saving applied subsets can automatically be added to a project using saveCohortDefinitionSet

saveCohortDefinitionSet(cohortDefinitionSet,
  subsetJsonFolder = "<path_to_my_subset_definition>"
)

loading is also achieved with getCohortDefinitionSet

cohortDefinitionSet <- getCohortDefinitionSet(
  subsetJsonFolder = "<path_to_my_subset_definition>"
)

Any subset definitions should automatically be loaded and applied to the cohort definition set.

Writing json objects

Subset definitions can be converted to JSON objects as follows:

jsonDefinition <- subsetDef$toJSON()

For the purpose of writing to disk we recommend the use of ParallelLogger for consistency.

# Save to a file
ParallelLogger::saveSettingsToJson(subsetDef$toList(), "subsetDefinition1.json")

options(old)

Creating Cohort Subset Definitions

James P. Gilbert and Anthony G. Sena

2024-09-30

Introduction

Subset definitions

Demographic subset operations

Subsetting to other cohorts