For background information on Gaia, see Design.
For details on the Gaia Data Model, see Data Model.
The quickest way to start using gaiaDB is by building or downloading the docker image and running a container. Alternatively, you can execute the init.sql script against your own PostgreSQL database.
You can build and run gaiaDB as a docker container with the following commands:
git clone https://github.com/OHDSI/gaiaDB.git
cd gaiaDB
docker build -t gaia-db .
docker run -itd --rm -e POSTGRES_PASSWORD=SuperSecret -e POSTGRES_USER=postgres --network gaia -p 5432:5432 --name gaia-db gaia-db
Once deployed and (automatically) initialized, the containerized
Postgres database includes: - GIS Catalog (backbone
schema)
- Constrained GIS vocabulary tables (vocabulary
schema) -
postgis tools (native to image, tiger
schema)
Once you have a postgres database with PostGIS installed, the helper
script initialize.sql
in the inst directory
will do the rest of the setup. This includes:
library(gaiaCore)
connectionDetails <- DatabaseConnector::createConnectionDetails(
dbms = "postgresql",
server = "localhost/gaiaDB",
port = 5432,
user="postgres",
password = "mysecretpassword")
Data sources in gaiaDB are stored in the gaiaDB.backbone.data_source table in a Postgres database. Each row in this table is a single “data source”, though the name may be a bit misleading. All “data sources” in gaiaDB are actually bits of descriptive, technical, and operational metadata that refer to an externally hosted dataset. For example, the gaiaDB data source “daily_aqi_by_county_2020” is actually just a single row of data that contains information on how to extract, transform, and load (ETL) the EPA dataset located on their data catalog. The gaiaDB data source includes all the information necessary to acquire and use this dataset (including the dataset URL, documentation URL, and R code to transform it to gaia’s own data standard), without actually storing the dataset itself. Not only does this keep gaiaDB very lightweight (a data source takes only about 1 KB of storage), it also allows gaiaDB to handle a boundless amount of variability in source datasets.
The trade-off to this design is that each data source requires a certain degree of custom tailoring. Fortunately, many well-curated sources of geospatial data publish all of their own datasets with the same schema. So, while it may take a bit of effort to create the gaiaDB data source for “daily_aqi_by_county_2020”, the gaiaDB data source for “daily_aqi_by_county_2019”, and any other years, can be created in a matter of minutes by simply redirecting the data source to other years’ data.
As of now, there are two main ways that gaiaDB data sources can be created:
INSERT INTO
SQL statements.
Usually this is most easily done with a combination of spreadsheet
software like MS Excel (for writing the original record), an R
environment for creating the geom_spec (detailed below), and some sort
of automation tool for recycling the original data source across
multiple datasets.TODO
TODO
TODO