CohortSymmetry provides tools to perform Sequence Symmetry Analysis (SSA). Before using the package, it is highly recommended that the method is tested against well-known positive and negative controls. Details of SSA and the relevant controls can be found in Pratt et al. (2015).
The functions you will interact with are:
- generateSequenceCohortSet(): this function will create a cohort with individuals present in both (the index and the marker) cohorts.
- summariseSequenceRatios(): this function will calculate sequence ratios.
- tableSequenceRatios() and plotSequenceRatios(): these functions will help us to visualise the sequence ratio results.
- summariseTemporalSymmetry(): this function will produce aggregated results based on the time difference between two cohort start dates.
- plotTemporalSymmetry(): this function will help us to visualise the results from summariseTemporalSymmetry().
Below, you will find an example analysis that offers a brief but comprehensive overview of the package’s functionality. More context and further examples for each of these functions are provided in later vignettes.
First, let’s load the relevant libraries.
library(CDMConnector)
library(dplyr)
library(DBI)
library(omock)
library(CohortSymmetry)
library(duckdb)
The CohortSymmetry package works with data mapped to the OMOP CDM, so the first step is to connect to a database. As an example, we will use the omock package to generate a mock database with two mock cohorts: index_cohort and marker_cohort.
# Build a mock CDM with 1000 people and two single-cohort tables
cdm <- emptyCdmReference(cdmName = "mock") |>
  mockPerson(nPerson = 1000) |>
  mockObservationPeriod() |>
  mockCohort(
    name = "index_cohort",
    numberCohorts = 1,
    cohortName = c("index_cohort"),
    seed = 1
  ) |>
  mockCohort(
    name = "marker_cohort",
    numberCohorts = 1,
    cohortName = c("marker_cohort"),
    seed = 2
  )
# Copy the mock CDM into an in-memory DuckDB database
con <- dbConnect(duckdb::duckdb())
cdm <- copyCdmTo(con = con, cdm = cdm, schema = "main", overwrite = TRUE)
cdm$index_cohort |>
dplyr::glimpse()
#> Rows: ??
#> Columns: 4
#> Database: DuckDB v1.1.1 [unknown@Linux 6.5.0-1025-azure:R 4.4.1/:memory:]
#> $ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ subject_id <int> 1, 3, 3, 5, 5, 6, 7, 8, 8, 11, 13, 13, 15, 16, 16…
#> $ cohort_start_date <date> 2002-11-28, 1996-05-02, 1997-06-18, 2004-06-30, …
#> $ cohort_end_date <date> 2006-09-05, 1997-06-17, 1998-03-06, 2004-10-22, …
cdm$marker_cohort |>
dplyr::glimpse()
#> Rows: ??
#> Columns: 4
#> Database: DuckDB v1.1.1 [unknown@Linux 6.5.0-1025-azure:R 4.4.1/:memory:]
#> $ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ subject_id <int> 1, 3, 4, 7, 8, 8, 9, 9, 9, 9, 10, 11, 12, 13, 13,…
#> $ cohort_start_date <date> 2002-07-27, 1993-02-02, 2019-07-08, 1996-08-14, …
#> $ cohort_end_date <date> 2006-10-16, 2007-04-16, 2019-07-26, 1999-10-08, …
Once we have established a connection to the database, we can use the generateSequenceCohortSet() function to find the intersection of the two cohorts. This function returns the individuals who appear in both cohorts as a new cohort in the cdm reference, which we name intersect.
cdm <- generateSequenceCohortSet(
  cdm = cdm,
  indexTable = "index_cohort",
  markerTable = "marker_cohort",
  name = "intersect",
  combinationWindow = c(0, Inf)
)
See below that the generated cohort follows the format of an OMOP CDM cohort with the addition of two extra columns: index_date and marker_date. These columns correspond to the cohort_start_date in the index_cohort and the marker_cohort, respectively.
cdm$intersect |>
dplyr::glimpse()
#> Rows: ??
#> Columns: 6
#> Database: DuckDB v1.1.1 [unknown@Linux 6.5.0-1025-azure:R 4.4.1/:memory:]
#> $ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ subject_id <int> 35, 54, 62, 85, 168, 194, 229, 235, 323, 326, 444…
#> $ cohort_start_date <date> 2002-11-12, 1990-07-21, 1962-03-28, 1993-04-19, …
#> $ cohort_end_date <date> 2003-05-27, 1991-03-02, 1962-10-10, 1993-05-19, …
#> $ index_date <date> 2002-11-12, 1991-03-02, 1962-03-28, 1993-04-19, …
#> $ marker_date <date> 2003-05-27, 1990-07-21, 1962-10-10, 1993-05-19, …
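To make the ordering concrete, the sketch below rebuilds the first four index_date and marker_date pairs from the output above as a local tibble and labels which event came first for each person. The names toy_intersect, gap_days, first_event and order_counts are illustrative and not part of the package. This is a minimal illustration on local data, not a description of how the package computes its results.

```r
library(dplyr)

# First four subject_id / index_date / marker_date values copied from the
# glimpse() output above; toy_intersect is a hypothetical local object.
toy_intersect <- tibble(
  subject_id  = c(35, 54, 62, 85),
  index_date  = as.Date(c("2002-11-12", "1991-03-02", "1962-03-28", "1993-04-19")),
  marker_date = as.Date(c("2003-05-27", "1990-07-21", "1962-10-10", "1993-05-19"))
)

order_counts <- toy_intersect |>
  mutate(
    # Days from the index start to the marker start (negative = marker first)
    gap_days = as.integer(marker_date - index_date),
    # The toy data contain no zero gaps, so a simple two-way split is enough here
    first_event = if_else(gap_days > 0, "index first", "marker first")
  ) |>
  count(first_event)

order_counts
```

On the real data the same mutate() and count() steps could be applied to cdm$intersect, although date arithmetic on a database-backed table may need to be translated differently depending on the backend.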
Once we have the intersect cohort, we can explore the temporal symmetry using summariseTemporalSymmetry() and plotTemporalSymmetry():
result <- summariseTemporalSymmetry(
  cohort = cdm$intersect,
  timescale = "year"
)
result |> dplyr::glimpse()
#> Rows: 18
#> Columns: 13
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
#> $ cdm_name <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ group_name <chr> "index_name &&& marker_name", "index_name &&& marker_…
#> $ group_level <chr> "index_cohort &&& marker_cohort", "index_cohort &&& m…
#> $ strata_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_level <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_name <chr> "temporal_symmetry", "temporal_symmetry", "temporal_s…
#> $ variable_level <chr> "-5", "2", "3", "5", "-11", "4", "-3", "1", "-1", "-6…
#> $ estimate_name <chr> "count", "count", "count", "count", "count", "count",…
#> $ estimate_type <chr> "integer", "integer", "integer", "integer", "integer"…
#> $ estimate_value <chr> NA, "9", "14", NA, NA, "5", "14", "37", "125", NA, NA…
#> $ additional_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…
plotTemporalSymmetry(result = result)
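The summarised result stores both the time gap (variable_level) and the count (estimate_value) as character strings, so some tidying is needed before inspecting the numbers directly. The sketch below uses the first nine values from the output above in a local tibble; toy_result and counts_by_gap are illustrative names. The NA entries are dropped here on the assumption that they are suppressed small counts, which the output itself does not confirm.

```r
library(dplyr)

# First nine variable_level / estimate_value entries copied from the
# glimpse() output above; toy_result is a hypothetical local object.
toy_result <- tibble(
  variable_level = c("-5", "2", "3", "5", "-11", "4", "-3", "1", "-1"),
  estimate_value = c(NA, "9", "14", NA, NA, "5", "14", "37", "125")
)

counts_by_gap <- toy_result |>
  # Assumption: NA marks a count that was not reported, so it is dropped
  filter(!is.na(estimate_value)) |>
  # Convert the character columns to integers for sorting and arithmetic
  transmute(
    time_gap = as.integer(variable_level),
    count    = as.integer(estimate_value)
  ) |>
  arrange(time_gap)

counts_by_gap
```

The same filter(), transmute() and arrange() steps should carry over to the full result object, since it has the same two columns.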
Next, we will use the summariseSequenceRatios() function to get the crude sequence ratios, adjusted sequence ratios, and the corresponding confidence intervals.
result <- summariseSequenceRatios(cohort = cdm$intersect)
result |> dplyr::glimpse()
#> Rows: 10
#> Columns: 13
#> $ result_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
#> $ cdm_name <chr> "mock", "mock", "mock", "mock", "mock", "mock", "mock…
#> $ group_name <chr> "index_cohort_name &&& marker_cohort_name", "index_co…
#> $ group_level <chr> "index_cohort &&& marker_cohort", "index_cohort &&& m…
#> $ strata_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ strata_level <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ variable_name <chr> "crude", "adjusted", "crude", "crude", "adjusted", "a…
#> $ variable_level <chr> "sequence_ratio", "sequence_ratio", "sequence_ratio",…
#> $ estimate_name <chr> "point_estimate", "point_estimate", "lower_CI", "uppe…
#> $ estimate_type <chr> "numeric", "numeric", "numeric", "numeric", "numeric"…
#> $ estimate_value <chr> "1.17977528089888", "1475.1822721598", "0.96657938259…
#> $ additional_name <chr> "overall", "overall", "overall", "overall", "overall"…
#> $ additional_level <chr> "overall", "overall", "overall", "overall", "overall"…
Finally, we can visualise the results using tableSequenceRatios():
tableSequenceRatios(result)
Database name | Index | Marker | Study population | Index first, N (%) | Marker first, N (%) | CSR (95% CI) | ASR (95% CI) |
---|---|---|---|---|---|---|---|
mock | Index cohort | Marker cohort | 388 | 210 (54.1 %) | 178 (45.9 %) | 1.18 (0.97 - 1.44) | 1,475.18 (1,208.60 - 1,802.02) |
Or create a plot with the adjusted sequence ratios:
plotSequenceRatios(
  result = result,
  onlyaSR = TRUE,
  colours = "black"
)
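As a sanity check, the crude sequence ratio reported above can be reproduced by hand from the counts in the table: it is the number of people with the index event first divided by the number with the marker event first. The variable names below are illustrative. The adjusted ratio and the confidence intervals involve further steps that are not reproduced here.

```r
# Counts copied from the tableSequenceRatios() output above
index_first  <- 210
marker_first <- 178

# Crude sequence ratio: index-first count over marker-first count
crude_sr <- index_first / marker_first
round(crude_sr, 2)

# Share of the study population (210 + 178 = 388) with the index event first
pct_index_first <- round(100 * index_first / (index_first + marker_first), 1)
pct_index_first
```

The unrounded value of crude_sr agrees with the crude point_estimate shown in the summariseSequenceRatios() output (1.17977528089888), and pct_index_first matches the 54.1 % in the table.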