synthetic_datasets • omock

Introduction

As seen in other vignettes omock provides you with functionality to build. synthetic datasets. omock also provides some prebuilt synthetics datasets, those datasets are widely available and were created by the OHDSI community.

library(omock)

Avialable datasets

The available datasets are listed below:

datasetName	CDM name	CDM version	Size	Number individuals	Number records	Number concepts	Link
GiBleed	GiBleed	5.3	6.44 MB	2,694	215,978	320	🔗
empty_cdm	empty_cdm	5.3	783. MB	0	0	0	🔗
synpuf-1k_5.3	synpuf-1k	5.3	566. MB	1,000	290,059	16,131	🔗
synpuf-1k_5.4	synpuf-1k	5.4	379. MB	1,000	290,059	16,131	🔗
synthea-allergies-10k	synthea-allergies-10k	5.3	801. MB	10,703	354,551	46	🔗
synthea-anemia-10k	synthea-anemia-10k	5.3	801. MB	10,679	354,713	46	🔗
synthea-breast_cancer-10k	synthea-breast_cancer-10k	5.3	802. MB	10,751	364,817	85	🔗
synthea-contraceptives-10k	synthea-contraceptives-10k	5.3	803. MB	10,728	367,118	72	🔗
synthea-covid19-10k	synthea-covid19-10k	5.3	802. MB	10,754	371,322	105	🔗
synthea-covid19-200k	synthea-covid19-200k	5.3	1.10 GB	213,953	7,304,754	110	🔗
synthea-dermatitis-10k	synthea-dermatitis-10k	5.3	801. MB	10,713	355,896	50	🔗
synthea-heart-10k	synthea-heart-10k	5.3	801. MB	10,683	354,908	46	🔗
synthea-hiv-10k	synthea-hiv-10k	5.3	801. MB	10,682	354,518	46	🔗
synthea-lung_cancer-10k	synthea-lung_cancer-10k	5.3	802. MB	10,756	374,965	69	🔗
synthea-medications-10k	synthea-medications-10k	5.3	801. MB	10,681	354,828	46	🔗
synthea-metabolic_syndrome-10k	synthea-metabolic_syndrome-10k	5.3	801. MB	10,682	354,599	46	🔗
synthea-opioid_addiction-10k	synthea-opioid_addiction-10k	5.3	803. MB	10,738	360,930	54	🔗
synthea-rheumatoid_arthritis-10k	synthea-rheumatoid_arthritis-10k	5.3	801. MB	10,734	356,966	49	🔗
synthea-snf-10k	synthea-snf-10k	5.3	801. MB	10,680	354,680	46	🔗
synthea-surgery-10k	synthea-surgery-10k	5.3	801. MB	10,679	354,775	46	🔗
synthea-total_joint_replacement-10k	synthea-total_joint_replacement-10k	5.3	801. MB	10,682	354,858	46	🔗
synthea-veteran_prostate_cancer-10k	synthea-veteran_prostate_cancer-10k	5.3	801. MB	10,718	356,324	46	🔗
synthea-veterans-10k	synthea-veterans-10k	5.3	801. MB	10,678	354,791	46	🔗
synthea-weight_loss-10k	synthea-weight_loss-10k	5.3	801. MB	10,677	354,689	46	🔗

For more details on those synthetic datasets you can check the OmopSketch ShinyApp: https://dpa-pde-oxford.shinyapps.io/OmopSketchCharacterisation/ that characterise those datasets.

You can also check programatically which are the synthetic datasets that you can use with:

availableMockDatasets()
#>  [1] "GiBleed"                             "empty_cdm"                          
#>  [3] "synpuf-1k_5.3"                       "synpuf-1k_5.4"                      
#>  [5] "synthea-allergies-10k"               "synthea-anemia-10k"                 
#>  [7] "synthea-breast_cancer-10k"           "synthea-contraceptives-10k"         
#>  [9] "synthea-covid19-10k"                 "synthea-covid19-200k"               
#> [11] "synthea-dermatitis-10k"              "synthea-heart-10k"                  
#> [13] "synthea-hiv-10k"                     "synthea-lung_cancer-10k"            
#> [15] "synthea-medications-10k"             "synthea-metabolic_syndrome-10k"     
#> [17] "synthea-opioid_addiction-10k"        "synthea-rheumatoid_arthritis-10k"   
#> [19] "synthea-snf-10k"                     "synthea-surgery-10k"                
#> [21] "synthea-total_joint_replacement-10k" "synthea-veteran_prostate_cancer-10k"
#> [23] "synthea-veterans-10k"                "synthea-weight_loss-10k"

Download a dataset

To prevent having to download the dataset everytime that you want to use a dataset, it is recommended to set up a permanent folder where the synthetic datasets are stored. This allows the user to have to download each dataset only once. To set up a permanent location for your dataset please create an environmental variable (usethis::edit_r_environ()) pointing to an existing folder like:

OMOP_DATA_FOLDER="path/to/my/folder"

This folder is in fact defined by omopgenerics and it is used also by other packages. You can check the folder by using the following function:

omopDataFolder()
#> [1] "/tmp/Rtmp5cT5gf/OMOP_DATASETS"

Note that if you would have set up an environment variable the message of temporary folder would not appear and you would see the path to you folder.

You can download a dataset using downloadMockDataset():

downloadMockDataset(datasetName = "GiBleed")
#> ℹ Deleting prior version of GiBleed.

This will download the dataset and store it as a zip file in you OMOP_DATA_FOLDER:

list.files(path = omopDataFolder(), recursive = TRUE)
#> [1] "mockDatasets/GiBleed.zip"

Note datasets are stored in a subfolder named mockDatasets to account for the fact that this folder is used also by other packages to store data.

Create a cdm reference of a mock dataset

You can easily create a mock dataset reference using the mockCdmFromDataset() function:

cdm <- mockCdmFromDataset(datasetName = "GiBleed")
#> ℹ Reading GiBleed tables.
#> ℹ Adding drug_strength table.
#> ℹ Creating local <cdm_reference> object.
cdm
#> 
#> ── # OMOP CDM reference (local) of GiBleed ─────────────────────────────────────
#> • omop tables: care_site, cdm_source, concept, concept_ancestor, concept_class,
#> concept_relationship, concept_synonym, condition_era, condition_occurrence,
#> cost, death, device_exposure, domain, dose_era, drug_era, drug_exposure,
#> drug_strength, fact_relationship, location, measurement, metadata, note,
#> note_nlp, observation, observation_period, payer_plan_period, person,
#> procedure_occurrence, provider, relationship, source_to_concept_map, specimen,
#> visit_detail, visit_occurrence, vocabulary
#> • cohort tables: -
#> • achilles tables: -
#> • other tables: -

Downloading the dataset before hand was not needed and that if you try to create a reference of a dataset that is not downloaded it will be downloaded in the process (in interactive sessions you will be asked):

cdm <- mockCdmFromDataset(datasetName = "GiBleed")
#> ℹ Reading GiBleed tables.
#> ℹ Adding drug_strength table.
#> ℹ Creating local <cdm_reference> object.
cdm
#> 
#> ── # OMOP CDM reference (local) of GiBleed ─────────────────────────────────────
#> • omop tables: care_site, cdm_source, concept, concept_ancestor, concept_class,
#> concept_relationship, concept_synonym, condition_era, condition_occurrence,
#> cost, death, device_exposure, domain, dose_era, drug_era, drug_exposure,
#> drug_strength, fact_relationship, location, measurement, metadata, note,
#> note_nlp, observation, observation_period, payer_plan_period, person,
#> procedure_occurrence, provider, relationship, source_to_concept_map, specimen,
#> visit_detail, visit_occurrence, vocabulary
#> • cohort tables: -
#> • achilles tables: -
#> • other tables: -

Finally, you can also insert the local dataset into a duckdb connection using the source argument:

cdm <- mockCdmFromDataset(datasetName = "GiBleed", source = "duckdb")
#> ℹ Reading GiBleed tables.
#> ℹ Adding drug_strength table.
#> ℹ Creating local <cdm_reference> object.
#> ℹ Inserting <cdm_reference> into duckdb.
cdm
#> 
#> ── # OMOP CDM reference (duckdb) of GiBleed ────────────────────────────────────
#> • omop tables: care_site, cdm_source, concept, concept_ancestor, concept_class,
#> concept_relationship, concept_synonym, condition_era, condition_occurrence,
#> cost, death, device_exposure, domain, dose_era, drug_era, drug_exposure,
#> drug_strength, fact_relationship, location, measurement, metadata, note,
#> note_nlp, observation, observation_period, payer_plan_period, person,
#> procedure_occurrence, provider, relationship, source_to_concept_map, specimen,
#> visit_detail, visit_occurrence, vocabulary
#> • cohort tables: -
#> • achilles tables: -
#> • other tables: -

Note the local datasets can be inserted in many different sources using the function insertCdmTo() from omopgenerics.