Gaia

Introduction

TODO

Current Scope

TODO

Limitations

TODO

Overview

Gaia refers to the amalgamation of infrastructure, software, standards, tools, and the overall workflow that the OHDSI GIS Workgroup has developed to assist researchers with integrating place-based datasets into their patient-based health database and subsequent analyses.

Gaia includes multiple major elements:

- gaiaCatalog: a functional metadata catalog containing references to publicly hosted geospatial datasets and instructions for their download and standardization
- gaiaCore: a PostGIS database for managing harmonized data sources; the dockerized DeGauss geocoding tool; and gaiaR, an R package for managing interactions between gaiaCore, gaiaCatalog, and any of the Gaia “extensions”
- Extensions: a broad suite of software packages that are powered by gaiaCore. The most relevant of these packages is gaiaOhdsi, an R package that contains operations specific to interacting with an OMOP CDM or external OHDSI software. Other examples of extensions are the gaiaVis tools, which provide a set of visualizations for data in gaiaCore.

Purpose

What is the purpose of Gaia? Why are we doing all of this?

Gaia provides a standardized, automated, reproducible, and easily shareable means for integrating place-based datasets into a database of longitudinal patient health data.

The Case for Gaia

Simplest case

The simplest case for Gaia is a single researcher looking to leverage place-based data. After standing up a local or cloud instance of gaiaCore, the researcher has access to a wealth of curated geospatial data sources, ranging from environmental toxin data to any of several Social Determinants of Health indexes derived from US Census data. Instead of spending the countless hours typically needed to munge multiple disparate geospatial datasets, the researcher can use functions from the gaiaR package to load datasets into their PostGIS database, all in a harmonized geospatial data format. Datasets spanning many domains, years, and regions are then available in a single PostGIS database, which the researcher can connect to with the software of their choice to perform ad hoc exploratory data analyses, create visualizations, or even power their own geospatial applications.

Using gaiaCore with an OMOP-shaped database

Taking this scenario a step further, a researcher with an established OMOP CDM database may wish to incorporate a subset of geospatial variables into their CDM database alongside their patient health data. The steps necessary to perform this ingestion, which requires geocoding of patient addresses and a spatiotemporal join, are all handled by gaiaCore and the gaiaOhdsi extension. The DeGauss geocoder, a lightweight geocoder that operates fully locally to ensure that patient information is not transmitted, is easily used through a gaiaR wrapper. Standardized spatiotemporal joins from the gaiaOhdsi extension relate patient addresses to polygon, line, or point geometries. Once the place-based data has been transformed into patient-level information, it is ready to be inserted into the CDM extension table “exposure_occurrence”. The DDL and insert scripts for this table are also contained in the gaiaOhdsi extension. Once the data has been added to the CDM, it can be used to create cohort definitions, develop predictive models, and feed any relevant external OHDSI tooling.
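To make the idea of a spatiotemporal join concrete, here is a minimal, self-contained Python sketch. It is not the gaiaOhdsi implementation, and none of it reflects the real gaiaR or gaiaOhdsi API. The class names, field names, and output keys are hypothetical, and the output rows do not follow the actual “exposure_occurrence” schema. The sketch assumes a geocoded address is a point, a place-based value belongs to a simple polygon with no holes, and both carry a validity date range. A pair matches when the point falls inside the polygon and the two date ranges overlap. In a PostGIS database this kind of join would more likely be written in spatial SQL.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Residence:
    """Hypothetical geocoded patient address and the period lived there."""
    person_id: int
    lon: float
    lat: float
    start: date
    end: date


@dataclass(frozen=True)
class PlaceValue:
    """Hypothetical polygon carrying one value that is valid for a date range."""
    variable: str
    value: float
    ring: tuple  # (lon, lat) vertices of a simple polygon without holes
    start: date
    end: date


def point_in_ring(lon, lat, ring):
    """Ray casting: count polygon edges crossed by a ray going east from the point."""
    inside = False
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[i - 1]  # previous vertex; index -1 closes the ring
        if (y1 > lat) != (y2 > lat):  # edge straddles the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def spatiotemporal_join(residences, place_values):
    """Return one person-level row per pair that matches in both space and time."""
    rows = []
    for r in residences:
        for p in place_values:
            overlap_start = max(r.start, p.start)
            overlap_end = min(r.end, p.end)
            if overlap_start <= overlap_end and point_in_ring(r.lon, r.lat, p.ring):
                rows.append({
                    "person_id": r.person_id,
                    "variable": p.variable,
                    "value": p.value,
                    "start_date": overlap_start,
                    "end_date": overlap_end,
                })
    return rows


# Made-up example data: one square region with two yearly values.
square = ((0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0))
place_values = [
    PlaceValue("example_index", 12.5, square, date(2020, 1, 1), date(2020, 12, 31)),
    PlaceValue("example_index", 14.0, square, date(2022, 1, 1), date(2022, 12, 31)),
]
residences = [
    Residence(1, 5.0, 5.0, date(2019, 6, 1), date(2021, 3, 31)),    # inside the square
    Residence(2, 20.0, 20.0, date(2019, 6, 1), date(2021, 3, 31)),  # outside the square
]

rows = spatiotemporal_join(residences, place_values)
for row in rows:
    print(row)
```

Each output row is clipped to the period during which the patient lived at the address and the value was valid, which is the step that turns place-based data into person-level records. The nested loop keeps the sketch short; a real workload would rely on a spatial index rather than comparing every pair.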

Federated networks and research

Finally, Gaia enables standardized and reproducible workflows for federated data networks and studies. The process described above (retrieving and harmonizing geospatial datasets, performing spatiotemporal joins to transform place-based data into person-level information, inserting that information into an OMOP CDM, and defining cohorts) is fully reproducible. Each step carries detailed, structured metadata focused on the provenance of source data and the rationale for transformation methods. By scripting and containerizing an entire Gaia workflow, the process of pairing place-based data with patient data, which is often handled with undocumented ad hoc methods unique to single sites, can be packaged and shipped across an entire network with minimal effort.
