Adding a Data Source

Background

Data sources in gaiaDB are stored in the gaiaDB.backbone.data_source table in a Postgres database. Each row in this table is a single “data source”, though the name may be a bit misleading. All “data sources” in gaiaDB are actually bits of descriptive, technical, and operational metadata that refer to an externally hosted dataset. For example, the gaiaDB data source “daily_aqi_by_county_2020” is actually just a single row of data that contains information on how to extract, transform, and load (ETL) the EPA dataset located on their data catalog. The gaiaDB data source includes all the information necessary to acquire and use this dataset (including the dataset URL, documentation URL, and R code to transform it to gaia’s own data standard), without actually storing the dataset itself. Not only does this keep gaiaDB very lightweight (a data source takes only about 1 KB of storage), it also allows gaiaDB to handle a boundless amount of variability in source datasets.

The trade-off to this design is that each data source requires a certain degree of custom tailoring. Fortunately, many well-curated sources of geospatial data publish all of their own datasets with the same schema. So, while it may take a bit of effort to create the gaiaDB data source for “daily_aqi_by_county_2020”, the gaiaDB data source for “daily_aqi_by_county_2019”, and any other years, can be created in a matter of minutes by simply redirecting the data source to other years’ data.

Creating a Data Source

As of now, there are two main ways that gaiaDB data sources can be created:

  1. Manually, by writing INSERT INTO SQL statements. Usually this is most easily done with a combination of spreadsheet software like MS Excel (for writing the original record), an R environment for creating the geom_spec (detailed below), and some sort of automation tool for recycling the original data source across multiple datasets.
  2. Using the gaiaSourceCreator RShiny form, which can help guide the creation of data sources.

Creating the geom_spec

TODO

Sharing a Data Source

Motivation

If you have taken the time and effort to add a data source to your own local instance of gaiaDB, consider sharing it back to the rest of OHDSI GIS community. By sharing your data source you effectively reduce the duplication of effort, help to canonize a single “gaia version” of a web-hosted resource and increase the size of the public gaiaDB repository. All of this helps to nourish the OHDSI GIS system of tools and data sources.

GitHub Pull Request

At this time, the most straightforward way to share your local data source is via a GitHub Pull Request:

  1. The source file for the gaiaDB.backbone.data_source table is on the OHDSI Github GIS/source. Make sure that your local gaiaDB instance reflects this source when you add data sources locally.
  2. After you have added one or more data sources to your local gaiaDB instance, export the gaiaDB.backbone.data_source table as a CSV file.
  3. Create a pull request using this template, attaching the CSV file that should replace the current source csv.
  4. The gaiaDB software maintainers will accept and merge your contributions or reply to your issue with feedback.

Geocoding

TODO

Using a Local Dataset

TODO