Running Cohort Diagnostics

Introduction

This vignette describes how to generate a results set with CohortDiagnostics, starting with cohort generation. See the HADES library for more background information.

Prerequisites

Ensure that CohortDiagnostics is installed on your system and updated to the latest version. For this example we will also be using the Eunomia test package. Optionally, you may install the ROhdsiWebApi package to download cohort definitions from an ATLAS instance:

remotes::install_github("OHDSI/Eunomia")
remotes::install_github("OHDSI/ROhdsiWebApi")

Configuring the connection to the server

We need to tell R how to connect to the server where the data are. CohortDiagnostics uses the DatabaseConnector package, which provides the createConnectionDetails function. Type ?createConnectionDetails for the specific settings required for the various database management systems (DBMS). For example, one might connect to a PostgreSQL database using this code:

library(CohortDiagnostics)

connectionDetails <- createConnectionDetails(
  dbms = "postgresql",
  server = "localhost/ohdsi",
  user = "joe",
  password = "supersecret"
)

For the purposes of this example, we will use the Eunomia test CDM package, which provides a locally stored SQLite database.

connectionDetails <- Eunomia::getEunomiaConnectionDetails()

cdmDatabaseSchema <- "main"
tempEmulationSchema <- NULL
cohortDatabaseSchema <- "main"
cohortTable <- "cohort"

The last four lines define the cdmDatabaseSchema, tempEmulationSchema, cohortDatabaseSchema, and cohortTable variables. We'll use cdmDatabaseSchema later to tell R where the data in CDM format live. The tempEmulationSchema is needed only for Oracle users, since Oracle does not support temporary tables. The cohortDatabaseSchema and cohortTable specify where we want to instantiate our cohorts. Note that for Microsoft SQL Server, database schemas need to specify both the database and the schema, for example cdmDatabaseSchema <- "my_cdm_data.dbo".
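To make the schema convention concrete, the sketch below shows how a schema and a table name combine into a fully qualified table name. The helper qualifyTableName is hypothetical (it is not part of CohortDiagnostics or DatabaseConnector), and "my_results" is an assumed database name used only for illustration.

```r
# Hypothetical helper for illustration only: joins a schema and a table name
# with a dot, as they would appear in a SQL statement.
qualifyTableName <- function(databaseSchema, tableName) {
  paste(databaseSchema, tableName, sep = ".")
}

# Eunomia (SQLite): the schema is a single name
eunomiaTable <- qualifyTableName("main", "cohort")

# SQL Server: the schema variable carries both database and schema
# ("my_results" is an assumed database name)
sqlServerTable <- qualifyTableName("my_results.dbo", "cohort")

print(eunomiaTable)
print(sqlServerTable)
```

On SQL Server the qualified name therefore has three parts (database, schema, table), which is why the schema variable itself must contain the dot.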

Loading cohort references from a package

The preferred way to use CohortDiagnostics is through a study package: a dedicated R package that can be installed on a system and run. The primary reason for this is reproducibility, because cohort definitions and resources change frequently. A study package, by contrast, can be seen as a snapshot, frozen at the time of creation and incrementally updated.

For example, the CohortDiagnostics package includes an example set of cohort SQL and JSON files to run on the Eunomia test data in the OMOP Common Data Model format.

library(CohortDiagnostics)
cohortDefinitionSet <- CohortGenerator::getCohortDefinitionSet(
  settingsFileName = "Cohorts.csv",
  jsonFolder = "cohorts",
  sqlFolder = "sql/sql_server",
  packageName = "CohortDiagnostics"
)

Looking at this data.frame of cohorts, you will see the SQL and JSON for each cohort:

View(cohortDefinitionSet)
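Before generating cohorts it can help to check the cohort definition set programmatically rather than by eye. The following is a minimal sketch, assuming the set has the columns cohortId, cohortName, sql and json (verify with names(cohortDefinitionSet) on your own object). The function checkCohortDefinitionSet and the mock data frame are hypothetical and only stand in for a real cohort definition set.

```r
# Assumed core columns of a cohort definition set; confirm them on your own
# object with names(cohortDefinitionSet).
requiredColumns <- c("cohortId", "cohortName", "sql", "json")

# Hypothetical helper: stops with a message if columns are missing or
# cohort ids are duplicated, otherwise returns TRUE invisibly.
checkCohortDefinitionSet <- function(cohortDefinitionSet) {
  missingColumns <- setdiff(requiredColumns, names(cohortDefinitionSet))
  if (length(missingColumns) > 0) {
    stop("Missing columns: ", paste(missingColumns, collapse = ", "))
  }
  if (anyDuplicated(cohortDefinitionSet$cohortId) > 0) {
    stop("Cohort ids must be unique")
  }
  invisible(TRUE)
}

# Mock stand-in for a real cohort definition set
mockSet <- data.frame(
  cohortId = c(1, 2),
  cohortName = c("Example A", "Example B"),
  sql = c("SELECT 1;", "SELECT 2;"),
  json = c("{}", "{}"),
  stringsAsFactors = FALSE
)

checkCohortDefinitionSet(mockSet)
```

The same function can be called on the cohortDefinitionSet object loaded above in place of mockSet.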

Loading cohort references from WebAPI

It is often desirable to perform cohort diagnostics on definitions stored in an ATLAS instance. Although this is not the preferred way of running studies (and certainly not the preferred method for an OHDSI network study involving multiple sites), it is possible to load references into a data frame used by cohort diagnostics.

The following code demonstrates how to create a set of cohort references from ATLAS that can be used by cohort diagnostics:

# Set up url
baseUrl <- "https://atlas.hosting.com/WebAPI"
# list of cohort ids
cohortIds <- c(18345, 18346)

cohortDefinitionSet <- ROhdsiWebApi::exportCohortDefinitionSet(
  baseUrl = baseUrl,
  cohortIds = cohortIds,
  generateStats = TRUE
)

Consult the ROhdsiWebApi documentation for details on authenticating to your ATLAS instance. Note that in order to generate inclusion rule statistics (a useful diagnostic tool), the parameter generateStats should be set to TRUE.

Generating cohorts

Cohorts must be generated before cohort diagnostics can be run.

Using CohortGenerator to instantiate cohorts

For example, the following code creates the cohort tables and then generates the cohort set:

cohortTableNames <- CohortGenerator::getCohortTableNames(cohortTable = cohortTable)

# Next create the tables on the database
CohortGenerator::createCohortTables(
  connectionDetails = connectionDetails,
  cohortTableNames = cohortTableNames,
  cohortDatabaseSchema = cohortDatabaseSchema,
  incremental = FALSE
)

# Generate the cohort set
CohortGenerator::generateCohortSet(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = cdmDatabaseSchema,
  cohortDatabaseSchema = cohortDatabaseSchema,
  cohortTableNames = cohortTableNames,
  cohortDefinitionSet = cohortDefinitionSet,
  incremental = FALSE
)

Note that the above code will delete an existing cohort table. To avoid this, use incremental mode by setting the parameter incremental = TRUE.
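As a sketch of what an incremental run might look like, the wrapper below repeats the two calls shown above with incremental = TRUE. The function name generateCohortsIncrementally and the folder name in the usage comment are hypothetical. The incrementalFolder argument of generateCohortSet is, to the best of our knowledge, where CohortGenerator records which cohorts have already been generated; check ?CohortGenerator::generateCohortSet for the version you have installed.

```r
# Hypothetical wrapper around the two CohortGenerator calls, run in
# incremental mode. Defining the function does not touch the database;
# nothing happens until it is called.
generateCohortsIncrementally <- function(connectionDetails,
                                         cdmDatabaseSchema,
                                         cohortDatabaseSchema,
                                         cohortTableNames,
                                         cohortDefinitionSet,
                                         incrementalFolder) {
  # With incremental = TRUE, existing cohort tables are kept
  CohortGenerator::createCohortTables(
    connectionDetails = connectionDetails,
    cohortTableNames = cohortTableNames,
    cohortDatabaseSchema = cohortDatabaseSchema,
    incremental = TRUE
  )

  # Progress is tracked in incrementalFolder so that a later run can skip
  # cohorts that were already generated
  CohortGenerator::generateCohortSet(
    connectionDetails = connectionDetails,
    cdmDatabaseSchema = cdmDatabaseSchema,
    cohortDatabaseSchema = cohortDatabaseSchema,
    cohortTableNames = cohortTableNames,
    cohortDefinitionSet = cohortDefinitionSet,
    incremental = TRUE,
    incrementalFolder = incrementalFolder
  )
}

# Example call (assumes the objects defined earlier in this vignette and a
# hypothetical folder name):
# generateCohortsIncrementally(
#   connectionDetails, cdmDatabaseSchema, cohortDatabaseSchema,
#   cohortTableNames, cohortDefinitionSet,
#   incrementalFolder = file.path("export", "incremental")
# )
```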

The resulting cohort table should include the columns:

cohort_definition_id, subject_id, cohort_start_date, cohort_end_date

Executing cohort diagnostics

Once the cohort definitions are loaded and the cohort tables have been populated, cohort diagnostics is ready to be executed.

First we set an export folder, which is where the results will be stored.

exportFolder <- "export"

Then we execute the function (using the default settings) as follows:

executeDiagnostics(cohortDefinitionSet,
  connectionDetails = connectionDetails,
  cohortTable = cohortTable,
  cohortDatabaseSchema = cohortDatabaseSchema,
  cdmDatabaseSchema = cdmDatabaseSchema,
  exportFolder = exportFolder,
  databaseId = "MyCdm",
  minCellCount = 5
)

Cohort Statistics Table Clean up

The above cohort generation process creates a number of residual tables. Once the diagnostics have completed, these tables are no longer required and can be removed.

CohortGenerator::dropCohortStatsTables(
  connectionDetails = connectionDetails,
  cohortDatabaseSchema = cohortDatabaseSchema,
  cohortTableNames = cohortTableNames
)

Cohort Diagnostics Output

Once the diagnostics have completed, a zip file will have been created in the specified export folder. This zip file can be shared between sites, as it does not contain patient-identifiable information. When unzipped, the zip file will contain several .csv files that may be easily audited. Note that cell counts smaller than 5 have been removed, as specified using the minCellCount argument, to ensure non-identifiability.
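To audit the output without unpacking it, the files inside each zip can be listed with base R. This is a minimal sketch: the helper listExportContents is hypothetical, and it assumes only that the export folder contains one or more .zip files.

```r
# Hypothetical helper: returns, for every zip file in the export folder,
# the names of the files it contains (without extracting anything).
listExportContents <- function(exportFolder) {
  zipFiles <- list.files(exportFolder, pattern = "\\.zip$", full.names = TRUE)
  contents <- lapply(zipFiles, function(zipFile) {
    utils::unzip(zipFile, list = TRUE)$Name
  })
  names(contents) <- basename(zipFiles)
  contents
}

# Returns an empty list if the folder is missing or holds no zip files
exportContents <- listExportContents("export")
print(exportContents)
```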

Creating an SQLite database file

Assuming you completed the steps described above for one or more databases, you should now have a set of zip files, one per database. Make sure to place all zip files in a single folder.
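Gathering the zip files can be done by hand or with a few lines of base R. The sketch below is one possible approach: the helper collectZipFiles, the site folders and the zip file names are all hypothetical, and empty placeholder files stand in for real results so that the example runs on its own.

```r
# Hypothetical helper: copies every zip file found in the source folders
# into one target folder and returns the paths that were copied.
collectZipFiles <- function(sourceFolders, targetFolder) {
  dir.create(targetFolder, recursive = TRUE, showWarnings = FALSE)
  zipFiles <- list.files(sourceFolders, pattern = "\\.zip$", full.names = TRUE)
  # overwrite = FALSE: a file whose name already exists in the target
  # folder is not copied
  copied <- file.copy(zipFiles, targetFolder, overwrite = FALSE)
  zipFiles[copied]
}

# Demonstration with placeholder files in temporary folders
siteA <- tempfile("siteA")
siteB <- tempfile("siteB")
dir.create(siteA)
dir.create(siteB)
file.create(file.path(siteA, "Results_SiteA.zip"))
file.create(file.path(siteB, "Results_SiteB.zip"))

mergedFolder <- tempfile("allResults")
collectZipFiles(c(siteA, siteB), mergedFolder)
print(list.files(mergedFolder))
```

Because files with identical names are skipped rather than overwritten, each database should produce a uniquely named zip file before the files are combined.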

Optionally, we can pre-merge the zip files into an SQLite database so that we can view the results in the Shiny app:

createMergedResultsFile(exportFolder)

This file can be used in the Shiny app to explore the results. See the vignette “Viewing results using Diagnostics Explorer” for more details.

Gowtham Rao and James P. Gilbert

2025-01-10
