Single studies using the CohortMethod package

Introduction

This vignette describes how you can use the CohortMethod package to perform a single new-user cohort study. We will walk through all the steps needed to perform an exemplar study, and we have selected the well-studied topic of the effect of coxibs versus non-selective non-steroidal anti-inflammatory drugs (NSAIDs) on gastrointestinal (GI) bleeding-related hospitalization. For simplicity, we focus on one coxib – celecoxib – and one non-selective NSAID – diclofenac.

Installation instructions

Before installing the CohortMethod package make sure you have Java available. For Windows users, RTools is also necessary. See these instructions for properly configuring your R environment.

The CohortMethod package is currently maintained in a Github repository, and has dependencies on other packages in Github. All of these packages can be downloaded and installed from within R using the drat package:

install.packages("remotes")
remotes::install_github("ohdsi/CohortMethod")

Once installed, you can type library(CohortMethod) to load the package.

Data extraction

The first step in running the CohortMethod is extracting all necessary data from the database server holding the data in the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) format.

Configuring the connection to the server

We need to tell R how to connect to the server where the data are. CohortMethod uses the DatabaseConnector package, which provides the createConnectionDetails function. Type ?createConnectionDetails for the specific settings required for the various database management systems (DBMS). For example, one might connect to a PostgreSQL database using this code:

connectionDetails <- createConnectionDetails(dbms = "postgresql",
                                             server = "localhost/ohdsi",
                                             user = "joe",
                                             password = "supersecret")

cdmDatabaseSchema <- "my_cdm_data"
cohortDatabaseSchema <- "my_results"
cohortTable <- "my_cohorts"
options(sqlRenderTempEmulationSchema = NULL)

The last few lines define the cdmDatabaseSchema, cohortDatabaseSchema, and cohortTable variables. We’ll use these later to tell R where the data in CDM format live, and where we want to write intermediate tables. Note that for Microsoft SQL Server, databaseschemas need to specify both the database and the schema, so for example cdmDatabaseSchema <- "my_cdm_data.dbo". For database platforms that do not support temp tables, such as Oracle, it is also necessary to provide a schema where the user has write access that can be used to emulate temp tables. PostgreSQL supports temp tables, so we can set options(sqlRenderTempEmulationSchema = NULL) (or not set the sqlRenderTempEmulationSchema at all.)

Preparing the exposures and outcome(s)

We need to define the exposures and outcomes for our study. Here, we will define our exposures using the OHDSI Capr package. We define two cohorts, one for celecoxib and one for diclofenac. For each cohort we require a prior diagnosis of ‘osteoarthritis of knee’, and 365 days of continuous prior observation. we restrict to the first exposure per person:

library(Capr)

osteoArthritisOfKneeConceptId <- 4079750
celecoxibConceptId <- 1118084
diclofenacConceptId <- 1124300
osteoArthritisOfKnee <- cs(
  descendants(osteoArthritisOfKneeConceptId),
  name = "Osteoarthritis of knee"
)
attrition = attrition(
  "prior osteoarthritis of knee" = withAll(
    atLeast(1, conditionOccurrence(osteoArthritisOfKnee), duringInterval(eventStarts(-Inf, 0)))
  )
)
celecoxib <- cs(
  descendants(celecoxibConceptId),
  name = "Celecoxib"
)
diclofenac  <- cs(
  descendants(diclofenacConceptId),
  name = "Diclofenac"
)
celecoxibCohort <- cohort(
  entry = entry(
    drugExposure(celecoxib, firstOccurrence()),
    observationWindow = continuousObservation(priorDays = 365)
  ),
  attrition = attrition,
  exit = exit(endStrategy = drugExit(celecoxib,
                                     persistenceWindow = 30,
                                     surveillanceWindow = 0))
)
diclofenacCohort <- cohort(
  entry = entry(
    drugExposure(diclofenac, firstOccurrence()),
    observationWindow = continuousObservation(priorDays = 365)
  ),
  attrition = attrition,
  exit = exit(endStrategy = drugExit(diclofenac,
                                     persistenceWindow = 30,
                                     surveillanceWindow = 0))
)
# Note: this will automatically assign cohort IDs 1 and 2, respectively:
exposureCohorts <- makeCohortSet(celecoxibCohort, diclofenacCohort)

We’ll pull the outcome definition from the OHDSI PhenotypeLibrary:

library(PhenotypeLibrary)
outcomeCohorts <- getPlCohortDefinitionSet(77) # GI bleed

We combine the exposure and outcome cohort definitions, and use CohortGenerator to generate the cohorts:

allCohorts <- bind_rows(outcomeCohorts,
                        exposureCohorts)

library(CohortGenerator)
cohortTableNames <- getCohortTableNames(cohortTable = cohortTable)
createCohortTables(connectionDetails = connectionDetails,
                   cohortDatabaseSchema = cohortDatabaseSchema,
                   cohortTableNames = cohortTableNames)
generateCohortSet(connectionDetails = connectionDetails,
                  cdmDatabaseSchema = cdmDatabaseSchema,
                  cohortDatabaseSchema = cohortDatabaseSchema,
                  cohortTableNames = cohortTableNames,
                  cohortDefinitionSet = allCohorts)

If all went well, we now have a table with the cohorts of interest. We can see how many entries per cohort:

connection <- DatabaseConnector::connect(connectionDetails)
sql <- "SELECT cohort_definition_id, COUNT(*) AS count FROM @cohortDatabaseSchema.@cohortTable GROUP BY cohort_definition_id"
DatabaseConnector::renderTranslateQuerySql(connection, sql,  cohortDatabaseSchema = cohortDatabaseSchema, cohortTable = cohortTable)
DatabaseConnector::disconnect(connection)

##   cohort_concept_id  count
## 1                 1 109307
## 2                 2 176675
## 3                77 733601

Extracting the data from the server

Now we can tell CohortMethod to extract the cohorts, construct covariates, and extract all necessary data for our analysis.

Important: The target and comparator drug must not be included in the covariates, including any descendant concepts. You will need to manually add the drugs and descendants to the excludedCovariateConceptIds of the covariate settings. In this example code we exclude the concepts for celecoxib and diclofenac and specify addDescendantsToExclude = TRUE:

# Define which types of covariates must be constructed:
covSettings <- createDefaultCovariateSettings(
  excludedCovariateConceptIds = c(diclofenacConceptId, celecoxibConceptId),
  addDescendantsToExclude = TRUE
)

#Load data:
cohortMethodData <- getDbCohortMethodData(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = cdmDatabaseSchema,
  targetId = 1,
  comparatorId = 2,
  outcomeIds = 77,
  exposureDatabaseSchema = cohortDatabaseSchema,
  exposureTable = cohortTable,
  outcomeDatabaseSchema = cohortDatabaseSchema,
  outcomeTable = cohortTable,
  covariateSettings = covSettings
)
cohortMethodData

## # CohortMethodData object
## 
## Target cohort ID: 1
## Comparator cohort ID: 2
## Outcome cohort ID(s): 77
## 
## Inherits from CovariateData:
## # CovariateData object
## 
## All cohorts
## 
## Inherits from Andromeda:
## # Andromeda object
## # Physical location:  C:\Users\admin_mschuemi\AppData\Local\Temp\2\RtmpQ3vMpG\file3ed05b654a73.sqlite
## 
## Tables:
## $analysisRef (analysisId, analysisName, domainId, startDay, endDay, isBinary, missingMeansZero)
## $cohorts (rowId, personSeqId, personId, treatment, cohortStartDate, daysFromObsStart, daysToCohortEnd, daysToObsEnd)
## $covariateRef (covariateId, covariateName, analysisId, conceptId, valueAsConceptId, collisions)
## $covariates (rowId, covariateId, covariateValue)
## $outcomes (rowId, outcomeId, daysToEvent)

There are many parameters, but they are all documented in the CohortMethod manual. The createDefaultCovariateSettings function is described in the FeatureExtraction package. In short, we are pointing the function to the table created earlier and indicating which concept IDs in that table identify the target, comparator and outcome. We instruct that the default set of covariates should be constructed, including covariates for all conditions, drug exposures, and procedures that were found on or before the index date. To customize the set of covariates, please refer to the FeatureExtraction package vignette by typing vignette("UsingFeatureExtraction", package="FeatureExtraction").

All data about the cohorts, outcomes, and covariates are extracted from the server and stored in the cohortMethodData object. This object uses the Andromeda package to store information in a way that ensures R does not run out of memory, even when the data are large. We can use the generic summary() function to view some more information of the data we extracted:

summary(cohortMethodData)

## CohortMethodData object summary
## 
## Target cohort ID: 1
## Comparator cohort ID: 2
## Outcome cohort ID(s): 77
## 
## Target persons: 116987
## Comparator persons: 185248
## 
## Outcome counts:
##    Event count Person count
## 77       53774        28762
## 
## Covariates:
## Number of covariates: 90758
## Number of non-zero covariate values: 152579109

Saving the data to file

Creating the cohortMethodData file can take considerable computing time, and it is probably a good idea to save it for future sessions. Because cohortMethodData uses Andromeda, we cannot use R’s regular save function. Instead, we’ll have to use the saveCohortMethodData() function:

saveCohortMethodData(cohortMethodData, "coxibVsNonselVsGiBleed.zip")

We can use the loadCohortMethodData() function to load the data in a future session.

Defining new users

Typically, a new user is defined as first time use of a drug (either target or comparator), and typically a washout period (a minimum number of days prior first use) is used to make sure it is truly first use. When using the CohortMethod package, you can enforce the necessary requirements for new use in three ways:

When creating the cohorts in the database, for example using Capr.
When loading the cohorts using the getDbCohortMethodData function, you can use the firstExposureOnly, removeDuplicateSubjects, restrictToCommonPeriod, and washoutPeriod arguments. (As shown in the example above).
When defining the study population using the createStudyPopulation function (see below) using the firstExposureOnly, removeDuplicateSubjects, restrictToCommonPeriod, and washoutPeriod arguments.

The advantage of option 1 is that the input cohorts are already fully defined outside of the CohortMethod package, and for example external cohort characterization tools can be used on the same cohorts used in this package. The advantage of options 2 and 3 is that it saves you the trouble of limiting to first use yourself, for example allowing you to directly use the drug_era table in the CDM. Option 2 is more efficient than 3, since only data for first use will be fetched, while option 3 is less efficient but allows you to compare the original cohorts to the study population.

Defining the study population

Typically, the exposure cohorts and outcome cohorts will be defined independently of each other. When we want to produce an effect size estimate, we need to further restrict these cohorts and put them together, for example by removing exposed subjects that had the outcome prior to exposure, and only keeping outcomes that fall within a defined risk window. For this we can use the createStudyPopulation function:

studyPop <- createStudyPopulation(
  cohortMethodData = cohortMethodData,
  outcomeId = 3,
  firstExposureOnly = FALSE,
  restrictToCommonPeriod = FALSE,
  washoutPeriod = 0,
  removeDuplicateSubjects = "keep all",
  removeSubjectsWithPriorOutcome = TRUE,
  minDaysAtRisk = 1,
  riskWindowStart = 0,
  startAnchor = "cohort start",
  riskWindowEnd = 30,
  endAnchor = "cohort end"
)

Note that we’ve set firstExposureOnly and removeDuplicateSubjects to FALSE, and washoutPeriod to zero because we already filtered on these arguments when using the getDbCohortMethodData function. During loading we set restrictToCommonPeriod to FALSE, and we do the same here because we do not want to force the comparison to restrict only to time when both drugs are recorded. We specify the outcome ID we will use, and that people with outcomes prior to the risk window start date will be removed. The risk window is defined as starting at the cohort start date (the index date, riskWindowStart = 0 and startAnchor = "cohort start"), and the risk windows ends 30 days after the cohort ends (riskWindowEnd = 30 and endAnchor = "cohort end"). Note that the risk windows are truncated at the end of observation or the study end date. We also remove subjects who have no time at risk. To see how many people are left in the study population we can always use the getAttritionTable function:

getAttritionTable(studyPop)

## # A tibble: 4 × 5
##   description                     targetPersons comparatorPersons targetExposures comparatorExposures
##   <chr>                                   <int>             <int>           <int>               <int>
## 1 Original cohorts                       116987            185248          116987              185248
## 2 Restricting duplicate subje ...        106629            172368          106629              172368
## 3 No prior outcome                       102860            163868          102860              163868
## 4 Have at least 1 days at ris ...        102813            163794          102813              163794

One additional filtering step that is often used is matching or trimming on propensity scores, as will be discussed next.

Propensity scores

The CohortMethod can use propensity scores to adjust for potential confounders. Instead of the traditional approach of using a handful of predefined covariates, CohortMethod typically uses thousands to millions of covariates that are automatically constructed based on conditions, procedures and drugs in the records of the subjects.

Fitting a propensity model

We can fit a propensity model using the covariates constructed by the getDbcohortMethodData() function:

ps <- createPs(cohortMethodData = cohortMethodData, population = studyPop)

The createPs() function uses the Cyclops package to fit a large-scale regularized logistic regression.

To fit the propensity model, Cyclops needs to know the hyperparameter value which specifies the variance of the prior. By default Cyclops will use cross-validation to estimate the optimal hyperparameter. However, be aware that this can take a really long time. You can use the prior and control parameters of the createPs() to specify Cyclops behavior, including using multiple CPUs to speed-up the cross-validation.

Propensity score diagnostics

We can compute the area under the receiver-operator curve (AUC) for the propensity score model:

computePsAuc(ps)

## [1] 0.81

We can also plot the propensity score distribution, although we prefer the preference score distribution:

plotPs(ps, 
       scale = "preference", 
       showCountsLabel = TRUE, 
       showAucLabel = TRUE, 
       showEquiposeLabel = TRUE)

It is also possible to inspect the propensity model itself by showing the covariates that have non-zero coefficients:

getPsModel(ps, cohortMethodData)

## # A tibble: 6 × 3
##   coefficient covariateId covariateName                              
##         <dbl>       <dbl> <chr>                                      
## 1       -3.55  1150871413 ...gh 0 days relative to index: misoprostol
## 2        3.28     2001006 index year: 2001                           
## 3        3.11     2002006 index year: 2002                           
## 4        2.71     2003006 index year: 2003                           
## 5        2.52     2004006 index year: 2004                           
## 6        1.80     2007006 index year: 2007

One advantage of using the regularization when fitting the propensity model is that most coefficients will shrink to zero and fall out of the model. It is a good idea to inspect the remaining variables for anything that should not be there, for example variations of the drugs of interest that we forgot to exclude.

Finally, we can inspect the percent of the population in equipoise, meaning they have a prefence score between 0.3 and 0.7:

CohortMethod::computeEquipoise(ps)

## [1] 0.484931

A low equipoise indicates there is little overlap between the target and comparator populations.

Using the propensity score

We can use the propensity scores to trim, stratify, match, or weigh our population. For example, one could trim to equipoise, meaning only subjects with a preference score between 0.25 and 0.75 are kept:

trimmedPop <- trimByPsToEquipoise(ps)
plotPs(trimmedPop, ps, scale = "preference")

## Population size after trimming is 129286

Instead (or additionally), we could stratify the population based on the propensity score:

stratifiedPop <- stratifyByPs(ps, numberOfStrata = 5)
plotPs(stratifiedPop, ps, scale = "preference")

We can also match subjects based on propensity scores. In this example, we’re using one-to-one matching:

  matchedPop <- matchOnPs(ps, caliper = 0.2, caliperScale = "standardized logit", maxRatio = 1)
  plotPs(matchedPop, ps)

## Population size after matching is 123092

Note that for both stratification and matching it is possible to specify additional matching criteria such as age and sex using the stratifyByPsAndCovariates() and matchOnPsAndCovariates() functions, respectively.

We can see the effect of trimming and/or matching on the population using the getAttritionTable function:

getAttritionTable(matchedPop)

## # A tibble: 5 × 5
##   description                     targetPersons comparatorPersons targetExposures comparatorExposures
##   <chr>                                   <int>             <int>           <int>               <int>
## 1 Original cohorts                       116987            185248          116987              185248
## 2 Restricting duplicate subje ...        106629            172368          106629              172368
## 3 No prior outcome                       102860            163868          102860              163868
## 4 Have at least 1 days at ris ...        102813            163794          102813              163794
## 5 Matched on propensity score             61546             61546           61546               61546

Or, if we like, we can plot an attrition diagram:

drawAttritionDiagram(matchedPop)

Evaluating covariate balance

To evaluate whether our use of the propensity score is indeed making the two cohorts more comparable, we can compute the covariate balance before and after trimming, matching, and/or stratifying:

balance <- computeCovariateBalance(matchedPop, cohortMethodData)

plotCovariateBalanceScatterPlot(balance, showCovariateCountLabel = TRUE, showMaxLabel = TRUE)

## Warning: Removed 33422 rows containing missing values or values outside the scale range (`geom_point()`).

plotCovariateBalanceOfTopVariables(balance)

The ‘before matching’ population is the population as extracted by the getDbCohortMethodData function, so before any further filtering steps.

Inspecting select population characteristics

It is customary to include a table in your paper that lists some select population characteristics before and after matching/stratification/trimming. This is usually the first table, and so will be referred to as ‘table 1’. To generate this table, you can use the createCmTable1 function:

createCmTable1(balance)

                                                         Before matching                      After matching                     
                                                         Target          Comparator           Target         Comparator          
 Characteristic                                          %               %          Std. diff %              %          Std. diff
 Age group                                                                                                                       
    40 -  44                                              0.0             0.0        0.00      0.0            0.0        0.00    
    45 -  49                                              0.0             0.0        0.00      0.0            0.0        0.00    
    50 -  54                                              0.1             0.2        0.00      0.1            0.2       -0.01    
    55 -  59                                              0.4             0.5       -0.01      0.5            0.5       -0.01    
    60 -  64                                              1.0             1.2       -0.02      1.1            1.2        0.00    
    65 -  69                                             22.1            19.2        0.07     22.3           22.2        0.00    
    70 -  74                                             28.5            26.0        0.06     27.5           27.4        0.00    
    75 -  79                                             22.4            21.7        0.02     21.7           21.8        0.00    
    80 -  84                                             14.9            16.2       -0.04     15.2           15.1        0.00    
    85 -  89                                              7.5             9.8       -0.08      8.0            8.0        0.00    
    90 -  94                                              2.5             4.0       -0.08      2.9            2.9        0.00    
    95 -  99                                              0.4             1.0       -0.06      0.6            0.6        0.00    
   100 - 104                                              0.1             0.1       -0.02      0.1            0.1       -0.01    
   105 - 109                                              0.0             0.0        0.00                     0.0                
 Gender: female                                          64.2            67.8       -0.08     66.3           66.1        0.00    
 Medical history: General                                                                                                        
   Acute respiratory disease                             22.8            25.6       -0.06     23.9           23.9        0.00    
   Attention deficit hyperactivity disorder               0.2             0.2        0.00      0.2            0.2        0.00    
   Chronic liver disease                                  1.1             1.5       -0.03      1.1            1.1        0.00    
   Chronic obstructive lung disease                      10.0            11.6       -0.05      9.9            9.9        0.00    
   Crohn's disease                                        0.4             0.4       -0.01      0.4            0.4        0.00    
   Dementia                                               2.6             4.3       -0.09      2.9            2.9        0.00    
   Depressive disorder                                   10.0            12.2       -0.07     10.5           10.4        0.00    
   Diabetes mellitus                                     22.8            27.6       -0.11     24.0           23.8        0.00    
   Gastroesophageal reflux disease                       17.8            19.9       -0.05     18.0           17.6        0.01    
   Gastrointestinal hemorrhage                            3.6             3.9       -0.02      2.3            2.3        0.00    
   Human immunodeficiency virus infection                 0.1             0.1       -0.01      0.1            0.1        0.00    
   Hyperlipidemia                                        47.6            54.2       -0.13     49.7           49.0        0.01    
   Lesion of liver                                        0.5             0.7       -0.03      0.5            0.5        0.00    
   Obesity                                               10.7            11.4       -0.02     10.0            9.7        0.01    
   Osteoarthritis                                        88.3            83.5        0.14     85.8           85.8        0.00    
   Pneumonia                                              4.5             5.6       -0.05      4.4            4.5        0.00    
   Psoriasis                                              1.3             1.5       -0.02      1.4            1.4        0.00    
   Renal impairment                                       7.5            13.5       -0.19      8.0            8.0        0.00    
   Rheumatoid arthritis                                   3.5             4.3       -0.04      3.7            3.8        0.00    
   Schizophrenia                                          0.1             0.1       -0.01      0.1            0.1        0.00    
   Ulcerative colitis                                     0.5             0.6       -0.01      0.5            0.5        0.00    
   Urinary tract infectious disease                      11.6            13.9       -0.07     12.0           12.0        0.00    
   Viral hepatitis C                                      0.2             0.3       -0.02      0.2            0.2        0.00    
 Medical history: Cardiovascular disease                                                                                         
   Atrial fibrillation                                    9.8            12.2       -0.07      9.7            9.8        0.00    
   Cerebrovascular disease                               11.0            12.5       -0.05     10.9           11.0        0.00    
   Coronary arteriosclerosis                             19.3            21.5       -0.05     18.4           18.7       -0.01    
   Heart disease                                         43.8            45.7       -0.04     41.5           41.9       -0.01    
   Heart failure                                          8.0            10.7       -0.09      7.6            7.7        0.00    
   Ischemic heart disease                                 9.0             9.8       -0.03      8.3            8.4       -0.01    
   Peripheral vascular disease                            9.3            12.8       -0.11      9.9            9.9        0.00    
   Pulmonary embolism                                     1.1             1.3       -0.02      1.1            1.1        0.00    
   Venous thrombosis                                      3.2             3.8       -0.03      3.3            3.3        0.00    
 Medical history: Neoplasms                                                                                                      
   Malignant lymphoma                                     1.0             1.1       -0.01      1.0            1.0        0.00    
   Malignant neoplasm of anorectum                        0.3             0.3        0.01      0.3            0.3        0.00    
   Malignant neoplastic disease                          18.7            19.5       -0.02     18.7           18.8        0.00    
   Malignant tumor of breast                              3.8             4.2       -0.02      4.0            4.0        0.00    
   Malignant tumor of colon                               0.7             0.7        0.00      0.6            0.6        0.00    
   Malignant tumor of lung                                0.3             0.4       -0.02      0.4            0.4        0.00    
   Malignant tumor of urinary bladder                     0.8             0.9       -0.01      0.8            0.8        0.00    
   Primary malignant neoplasm of prostate                 3.7             3.4        0.02      3.5            3.5        0.00    
 Medication use                                                                                                                  
   Agents acting on the renin-angiotensin system         50.7            53.9       -0.07     52.0           51.9        0.00    
   Antibacterials for systemic use                       70.1            71.5       -0.03     69.3           69.5        0.00    
   Antidepressants                                       29.7            31.9       -0.05     30.9           31.0        0.00    
   Antiepileptics                                        22.1            23.3       -0.03     21.7           21.7        0.00    
   Antiinflammatory and antirheumatic products           37.8            35.8        0.04     37.7           37.7        0.00    
   Antineoplastic agents                                  5.1             5.9       -0.03      5.4            5.3        0.00    
   Antipsoriatics                                         0.9             1.3       -0.04      0.9            0.9       -0.01    
   Antithrombotic agents                                 29.5            25.6        0.09     24.0           24.3       -0.01    
   Beta blocking agents                                  37.2            41.7       -0.09     38.0           38.0        0.00    
   Calcium channel blockers                              28.8            32.0       -0.07     29.5           29.4        0.00    
   Diuretics                                             45.4            47.7       -0.05     45.4           45.7       -0.01    
   Drugs for acid related disorders                      43.5            45.4       -0.04     41.4           41.8       -0.01    
   Drugs for obstructive airway diseases                 52.9            55.4       -0.05     54.1           53.9        0.00    
   Drugs used in diabetes                                18.6            22.0       -0.08     19.3           19.4        0.00    
   Immunosuppressants                                     4.7             6.0       -0.06      5.1            5.1        0.00    
   Lipid modifying agents                                56.1            59.7       -0.07     58.0           57.8        0.00    
   Opioids                                               51.8            44.6        0.15     45.8           46.0        0.00    
   Psycholeptics                                         35.4            34.5        0.02     34.9           34.9        0.00    
   Psychostimulants, agents used for adhd and nootropics  1.9             2.1       -0.01      2.1            2.1        0.00

Generalizability

The goal of any propensity score adjustments is typically to make the target and comparator cohorts comparably, to allow proper causal inference. However, in doing so, we often need to modify our population, for example dropping subjects that have no counterpart in the other exposure cohort. The population we end up estimating an effect for may end up being very different from the population we started with. An important question is: how different? And it what ways? If the populations before and after adjustment are very different, our estimated effect may not generalize to the original population (if effect modification is present). The getGeneralizabilityTable() function informs on these differences:

getGeneralizabilityTable(balance)

 [38;5;246m# A tibble: 90,758 × 5 [39m
   covariateId covariateName                     beforeMatchingMean afterMatchingMean stdDiff
          [3m [38;5;246m<dbl> [39m [23m  [3m [38;5;246m<chr> [39m [23m                                           [3m [38;5;246m<dbl> [39m [23m              [3m [38;5;246m<dbl> [39m [23m    [3m [38;5;246m<dbl> [39m [23m
 [38;5;250m 1 [39m     4.31 [38;5;246me [39m 9 ...sia for total knee replacement             0.117            0.010 [4m3 [24m    0.332
 [38;5;250m 2 [39m     2.11 [38;5;246me [39m 9 ...cing (total knee arthroplasty)             0.114            0.009 [4m9 [24m [4m4 [24m   0.327
 [38;5;250m 3 [39m     3.80 [38;5;246me [39m10 ...s and Devices - Other Implants             0.125            0.017 [4m6 [24m    0.326
 [38;5;250m 4 [39m     3.80 [38;5;246me [39m10 ...vices - General Classification             0.140            0.029 [4m7 [24m    0.318
 [38;5;250m 5 [39m     3.80 [38;5;246me [39m10 ... Room - General Classification             0.134            0.028 [4m2 [24m    0.310
 [38;5;250m 6 [39m     3.80 [38;5;246me [39m10 ...hesia - General Classification             0.121            0.025 [4m1 [24m    0.295
 [38;5;250m 7 [39m     3.80 [38;5;246me [39m10 ...rmacy - General Classification             0.176            0.067 [4m9 [24m    0.285
 [38;5;250m 8 [39m     3.80 [38;5;246me [39m10 ... - Evaluation Or Re-Evaluation             0.120            0.029 [4m8 [24m    0.278
 [38;5;250m 9 [39m     3.29 [38;5;246me [39m14 ...ncounter procedure = Procedure             0.096 [4m5 [24m           0.014 [4m9 [24m    0.276
 [38;5;250m10 [39m     4.31 [38;5;246me [39m 9 ...sia for total knee replacement             0.171            0.070 [4m7 [24m    0.267
 [38;5;246m# ℹ 90,748 more rows [39m

In this case, because we used PS matching, we are likely aiming to estimate the average treatment effect in the treated (ATT). For this reason, the getGeneralizabilityTable() function automatically selected the target cohort as the basis for evaluating generalizability: it shows, for each covariate, the mean value before and PS adjustment in the target cohort. Also shown is the standardized difference of mean, and the table is reverse sorted by the absolute standard difference of mean (ASDM).

Follow-up and power

Before we start fitting an outcome model, we might be interested to know whether we have sufficient power to detect a particular effect size. It makes sense to perform these power calculations once the study population has been fully defined, so taking into account loss to the various inclusion and exclusion criteria (such as no prior outcomes), and loss due to matching and/or trimming. Since the sample size is fixed in retrospective studies (the data has already been collected), and the true effect size is unknown, the CohortMethod package provides a function to compute the minimum detectable relative risk (MDRR) instead:

computeMdrr(
  population = studyPop,
  modelType = "cox",
  alpha = 0.05,
  power = 0.8,
  twoSided = TRUE
)

##   targetPersons comparatorPersons targetExposures comparatorExposures targetDays comparatorDays totalOutcomes     mdrr        se
## 1        102813            163794          102813              163794   14048422       12735984          1432 1.164277 0.0542909

In this example we used the studyPop object, so the population before any matching or trimming. If we want to know the MDRR after matching, we use the matchedPop object we created earlier instead:

computeMdrr(
  population = matchedPop,
  modelType = "cox",
  alpha = 0.05,
  power = 0.8,
  twoSided = TRUE
)

##   targetPersons comparatorPersons targetExposures comparatorExposures targetDays comparatorDays totalOutcomes     mdrr         se
## 1         61546             61546           61546               61546    8757268        5122251           683 1.239117 0.07652787

Even thought the MDRR in the matched population is higher, meaning we have less power, we should of course not be fooled: matching most likely eliminates confounding, and is therefore preferred to not matching.

To gain a better understanding of the amount of follow-up available we can also inspect the distribution of follow-up time. We defined follow-up time as time at risk, so not censored by the occurrence of the outcome. The getFollowUpDistribution can provide a simple overview:

getFollowUpDistribution(population = matchedPop)

##   100% 75% 50% 25%   0% Treatment
## 1    2  60  60 132 4232         1
## 2    2  45  60  75 3833         0

The output is telling us number of days of follow-up each quantile of the study population has. We can also plot the distribution:

plotFollowUpDistribution(population = matchedPop)

Outcome models

The outcome model is a model describing which variables are associated with the outcome.

Fitting a simple outcome model

In theory we could fit an outcome model without using the propensity scores. In this example we are fitting an outcome model using a Cox regression:

outcomeModel <- fitOutcomeModel(population = studyPop,
                                modelType = "cox")
outcomeModel

## Model type: cox
## Stratified: FALSE
## Use covariates: FALSE
## Use inverse probability of treatment weighting: FALSE
## Target estimand: ate
## Status: OK
## 
##           Estimate lower .95 upper .95    logRr seLogRr
## treatment  0.88382   0.79276   0.98512 -0.12350  0.0554

But of course we want to make use of the matching done on the propensity score:

outcomeModel <- fitOutcomeModel(population = matchedPop,
                                modelType = "cox",
                                stratified = TRUE)
outcomeModel

## Model type: cox
## Stratified: TRUE
## Use covariates: FALSE
## Use inverse probability of treatment weighting: FALSE
## Target estimand: att
## Status: OK
## 
##            Estimate lower .95 upper .95     logRr seLogRr
## treatment  0.983258  0.831067  1.160009 -0.016883  0.0851

Note that we define the sub-population to be only those in the matchedPop object, which we created earlier by matching on the propensity score. We also now use a stratified Cox model, conditioning on the propensity score match sets.

Instead of matching or stratifying we can also perform Inverse Probability of Treatment Weighting (IPTW):

outcomeModel <- fitOutcomeModel(population = ps,
                                modelType = "cox",
                                inversePtWeighting = TRUE)
outcomeModel

## Model type: cox
## Stratified: FALSE
## Use covariates: FALSE
## Use inverse probability of treatment weighting: TRUE
## Target estimand: att
## Status: OK
## 
##            Estimate lower .95 upper .95     logRr seLogRr
## treatment  0.982762  0.794931  1.219049 -0.017388  0.1091

Adding interaction terms

We may be interested whether the effect is different across different groups in the population. To explore this, we may include interaction terms in the model. In this example we include three interaction terms:

interactionCovariateIds <- c(8532001, 201826210, 21600960413) 
# 8532001 = Female
# 201826210 = Type 2 Diabetes
# 21600960413 = Concurent use of antithrombotic agents
outcomeModel <- fitOutcomeModel(population = matchedPop,
                                modelType = "cox",
                                stratified = TRUE,
                                interactionCovariateIds = interactionCovariateIds)
outcomeModel

## Model type: cox
## Stratified: TRUE
## Use covariates: FALSE
## Use inverse probability of treatment weighting: FALSE
## Target estimand: ate
## Status: OK
## 
##                                                                                                                                         Estimate lower .95 upper .95     logRr seLogRr
## treatment                                                                                                                               0.868311  0.699857  1.076989 -0.141205  0.1100
## treatment * condition_era group (ConditionGroupEraLongTerm) during day -365 through 0 days relative to index: Type 2 diabetes mellitus  0.906303  0.716481  1.144418 -0.098381  0.1195
## treatment * gender = FEMALE                                                                                                             1.108776  0.892401  1.377832  0.103257  0.1108
## treatment * drug_era group (DrugGroupEraOverlapping) during day 0 through 0 days relative to index: ANTITHROMBOTIC AGENTS               0.794549  0.630726  1.000737 -0.229981  0.1178

Note that you can use the grepCovariateNames to find covariate IDs.

It is prudent to verify that covariate balance has also been achieved in the subgroups of interest. For example, we can check the covariate balance in the subpopulation of females:

balanceFemale <- computeCovariateBalance(population = matchedPop, 
                                         cohortMethodData = cohortMethodData, 
                                         subgroupCovariateId = 8532001)
plotCovariateBalanceScatterPlot(balanceFemale)

Adding covariates to the outcome model

One final refinement would be to use the same covariates we used to fit the propensity model to also fit the outcome model. This way we are more robust against misspecification of the model, and more likely to remove bias. For this we use the regularized Cox regression in the Cyclops package. (Note that the treatment variable is automatically excluded from regularization.)

outcomeModel <- fitOutcomeModel(population = matchedPop,
                                cohortMethodData = cohortMethodData,
                                modelType = "cox",
                                stratified = TRUE,
                                useCovariates = TRUE)
outcomeModel

## Model type: cox
## Stratified: TRUE
## Use covariates: TRUE
## Use inverse probability of treatment weighting: FALSE
## Target estimand: att
## Status: OK
## Prior variance: 0.00918146205529235
## 
##            Estimate lower .95 upper .95     logRr seLogRr
## treatment  0.989443  0.818475  1.192098 -0.010613  0.0959

Inspecting the outcome model

We can inspect more details of the outcome model:

exp(coef(outcomeModel))

## 4393168283078717 
##        0.9894435

exp(confint(outcomeModel))

## [1] 0.8184755 1.1920978

We can also see the covariates that ended up in the outcome model:

getOutcomeModel(outcomeModel, cohortMethodData)

##   coefficient           id      name
## 1 -0.01061263 4.393168e+15 Treatment

Kaplan-Meier plot

We can create the Kaplan-Meier plot:

plotKaplanMeier(matchedPop, includeZero = FALSE)

Note that the Kaplan-Meier plot will automatically adjust for any stratification, matching, or trimming that may have been applied.

Time-to-event plot

We can also plot time-to-event, showing both events before and after the index date, and events during and outside the defined time-at-risk window. This plot can provide insight into the temporal pattern of the outcome relative to the exposures:

plotTimeToEvent(cohortMethodData = cohortMethodData,
                outcomeId = 77,
                firstExposureOnly = FALSE,
                washoutPeriod = 0,
                removeDuplicateSubjects = "keep all",
                minDaysAtRisk = 1,
                riskWindowStart = 0,
                startAnchor = "cohort start",
                riskWindowEnd = 30,
                endAnchor = "cohort end")

Note that this plot does not show any adjustment for the propensity score.

Acknowledgments

Considerable work has been dedicated to provide the CohortMethod package.

citation("CohortMethod")

## To cite package 'CohortMethod' in publications use:
## 
##   Schuemie M, Suchard M, Ryan P (2024). _CohortMethod: New-User Cohort Method with Large Scale Propensity and Outcome Models_. R package version 5.4.0,
##   https://github.com/OHDSI/CohortMethod, <https://ohdsi.github.io/CohortMethod>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {CohortMethod: New-User Cohort Method with Large Scale Propensity and Outcome
## Models},
##     author = {Martijn Schuemie and Marc Suchard and Patrick Ryan},
##     year = {2024},
##     note = {R package version 5.4.0, 
## https://github.com/OHDSI/CohortMethod},
##     url = {https://ohdsi.github.io/CohortMethod},
##   }

Further, CohortMethod makes extensive use of the Cyclops package.

citation("Cyclops")

## To cite Cyclops in publications use:
## 
##   Suchard MA, Simpson SE, Zorych I, Ryan P, Madigan D (2013). "Massive parallelization of serial inference algorithms for complex generalized linear models." _ACM Transactions on
##   Modeling and Computer Simulation_, *23*, 10. <https://dl.acm.org/doi/10.1145/2414416.2414791>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     author = {M. A. Suchard and S. E. Simpson and I. Zorych and P. Ryan and D. Madigan},
##     title = {Massive parallelization of serial inference algorithms for complex generalized linear models},
##     journal = {ACM Transactions on Modeling and Computer Simulation},
##     volume = {23},
##     pages = {10},
##     year = {2013},
##     url = {https://dl.acm.org/doi/10.1145/2414416.2414791},
##   }

This work is supported in part through the National Science Foundation grant IIS 1251151.

Martijn J. Schuemie, Marc A. Suchard and Patrick Ryan

2024-09-30