Data sources in gaiaDB are stored in the gaiaDB.backbone.data_source table in a Postgres database. Each row in this table is a single “data source”, though the name may be a bit misleading. All “data sources” in gaiaDB are actually bits of descriptive, technical, and operational metadata that refer to an externally hosted dataset. For example, the gaiaDB data source “daily_aqi_by_county_2020” is actually just a single row of data that contains information on how to extract, transform, and load (ETL) the EPA dataset located on their data catalog. The gaiaDB data source includes all the information necessary to acquire and use this dataset (including the dataset URL, documentation URL, and R code to transform it to gaia’s own data standard), without actually storing the dataset itself. Not only does this keep gaiaDB very lightweight (a data source takes only about 1 KB of storage), it also allows gaiaDB to handle a boundless amount of variability in source datasets.
The trade-off to this design is that each data source requires a certain degree of custom tailoring. Fortunately, many well-curated sources of geospatial data publish all of their own datasets with the same schema. So, while it may take a bit of effort to create the gaiaDB data source for “daily_aqi_by_county_2020”, the gaiaDB data source for “daily_aqi_by_county_2019”, and any other years, can be created in a matter of minutes by simply redirecting the data source to other years’ data.
As of now, there are two main ways that gaiaDB data sources can be created:
INSERT INTO
SQL statements.
Usually this is most easily done with a combination of spreadsheet
software like MS Excel (for writing the original record), an R
environment for creating the geom_spec (detailed below), and some sort
of automation tool for recycling the original data source across
multiple datasets.TODO