Architecture & Data Flows

This page describes the technical architecture of the Gaia toolchain, including component interactions, data flows, and design principles.


System Architecture Overview

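The Gaia toolchain consists of four components: gaiaCatalog (discovery), gaiaDb (processing), gaiaCore (access), and gaiaDocker (deployment). Together they integrate geospatial data with the OMOP Common Data Model. gaiaDocker deploys the other components, so it does not appear in the data path shown in the diagram.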

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    External Data Sources                         │
│  (Census, EPA, Weather, Social Services, Geographic Boundaries)  │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                       gaiaCatalog                                │
│  • Data source discovery                                         │
│  • Metadata management                                           │
│  • Schema.org compliance                                         │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                        gaiaDb                                    │
│  • PostgreSQL/PostGIS staging database                           │
│  • Transformation recipes                                        │
│  • Spatial indexing                                              │
│  • Raw geospatial data storage                                   │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                       gaiaCore                                   │
│  • Multi-language connector framework                            │
│  • PostgREST API access                                          │
│  • Database connection orchestration                             │
│  • Language-specific client libraries                            │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                  OMOP CDM + GIS Extensions                       │
│  • Location_History table                                        │
│  • External_Exposure table                                       │
│  • Integrated with standard OMOP tables                          │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                     HADES Analytics                              │
│  • Cohort definition                                             │
│  • Patient-level prediction                                      │
│  • Population-level estimation                                   │
│  • Evidence generation                                           │
└─────────────────────────────────────────────────────────────────┘


Component Details

1. gaiaCatalog

Purpose: Data source discovery and metadata management

Technology: Schema.org, JSON-LD

I/O: External data URLs/metadata → Searchable catalog with standardized metadata
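
As an illustration of the catalog output, the sketch below builds a Schema.org Dataset description as JSON-LD. The helper name make_catalog_entry, the example URL, and the selected properties are assumptions made for this example; they are not taken from the gaiaCatalog codebase.

```python
import json


def make_catalog_entry(name, url, description, variables):
    """Build a Schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "url": url,
        "description": description,
        # Schema.org describes measured variables with variableMeasured.
        "variableMeasured": [
            {"@type": "PropertyValue", "name": variable} for variable in variables
        ],
    }


entry = make_catalog_entry(
    name="Example air quality dataset",          # hypothetical dataset
    url="https://example.org/data/air-quality",  # placeholder URL
    description="Annual PM2.5 concentrations by census tract.",
    variables=["pm25_annual_mean"],
)
print(json.dumps(entry, indent=2))
```

A real catalog entry would likely carry more metadata (license, spatial and temporal coverage, distribution links); those fields are omitted here to keep the sketch short.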


2. gaiaDb

Purpose: Core data repository with all geospatial processing logic and OMOP integration

Technology: PostgreSQL 12+, PostGIS 3.0+, PostgREST, LinkML

Processing: SQL routines, PostGIS functions, exposure calculations, privacy-preserving aggregation, temporal alignment

I/O: Raw geospatial data + patient locations → OMOP External_Exposure and Location_History tables

Schema: backbone/ (core), working/ (OMOP tables), data_sources/, transformations/, functions/, api/


3. gaiaCore

Purpose: Multi-language connector framework (access layer only - NO processing logic)

Technology: PostgREST, language-specific connectors (R, Python)

I/O: API/database requests → Routed to gaiaDb → Results in client format

Key: gaiaCore orchestrates access to gaiaDb functions. All processing happens in gaiaDb.
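
To make the access-layer role concrete, the sketch below prepares a call to a gaiaDb function through PostgREST, which exposes database functions under /rpc/<function_name>. The base URL, the function name calculate_exposure, and its parameters are hypothetical; the actual function names are defined by gaiaDb.

```python
import json
import urllib.request


def build_rpc_request(base_url, function_name, params):
    """Prepare a PostgREST RPC call without sending it."""
    body = json.dumps(params).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/rpc/{function_name}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint, function name, and parameters.
request = build_rpc_request(
    "http://localhost:3000",
    "calculate_exposure",
    {"variable_id": 42, "location_id": 1001},
)
print(request.get_method(), request.full_url)
# To send it against a running PostgREST instance:
#     with urllib.request.urlopen(request) as response:
#         rows = json.load(response)
```

The client only builds and forwards the request; the computation itself stays in the database, which matches the separation described above.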


Data Flow Details

End-to-End Data Flow

Phase 1: Data Discovery

External Source → gaiaCatalog
    Input: Dataset URL, documentation
    Process: Metadata extraction, cataloging
    Output: Cataloged data source with metadata

Phase 2: Data Ingestion

Cataloged Source → gaiaDb
    Input: Raw geospatial files
    Process: Data loading, spatial indexing, transformation
    Output: Staged spatial tables in PostGIS
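
One small piece of the loading step can be sketched as follows: turning a CSV with coordinate columns into WKT points that PostGIS can parse (for example with ST_GeomFromText). The column names longitude and latitude and the sample row are assumptions for illustration; the actual loaders in gaiaDb may work differently.

```python
import csv
import io


def rows_to_wkt_points(csv_text, lon_field="longitude", lat_field="latitude"):
    """Yield (row, wkt) pairs; WKT points are written as POINT(x y), i.e. lon then lat."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        lon = float(row[lon_field])
        lat = float(row[lat_field])
        yield row, f"POINT({lon} {lat})"


# Made-up sample data.
sample = "station_id,longitude,latitude\nA1,-71.06,42.36\n"
for row, wkt in rows_to_wkt_points(sample):
    print(row["station_id"], wkt)
```

Note the axis order: WKT puts x (longitude) before y (latitude), which is a frequent source of swapped coordinates when loading point data.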

Phase 3: Exposure Calculation

Patient Locations → gaiaDb (via gaiaCore)
    Input:
        - Patient geocoded addresses
        - Residence time periods
        - Variable specifications
        - Location_History records
    Process (in gaiaDb):
        - Spatial joins (PostGIS)
        - Temporal alignment (SQL)
        - Aggregation (e.g., average exposure during residence)
        - Privacy preservation (no raw addresses exported)
    Output:
        - External_Exposure records
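
The temporal alignment and aggregation steps can be illustrated with a time-weighted average over a residence period. This is a minimal Python sketch of the idea, not the SQL that gaiaDb runs; the function names and the sample values are made up for illustration.

```python
from datetime import date


def overlap_days(start_a, end_a, start_b, end_b):
    """Number of days shared by two half-open date intervals [start, end)."""
    latest_start = max(start_a, start_b)
    earliest_end = min(end_a, end_b)
    return max((earliest_end - latest_start).days, 0)


def mean_exposure(residence_start, residence_end, measurements):
    """Time-weighted mean of measurements over a residence period.

    measurements: iterable of (period_start, period_end, value) tuples.
    Returns None when no measurement overlaps the residence period.
    """
    weighted_sum = 0.0
    total_days = 0
    for period_start, period_end, value in measurements:
        days = overlap_days(residence_start, residence_end, period_start, period_end)
        weighted_sum += value * days
        total_days += days
    return weighted_sum / total_days if total_days else None


# Illustrative yearly means for one geographic unit.
measurements = [
    (date(2020, 1, 1), date(2021, 1, 1), 10.0),
    (date(2021, 1, 1), date(2022, 1, 1), 8.0),
]
average = mean_exposure(date(2020, 7, 1), date(2021, 7, 1), measurements)
print(average)
```

Each measurement is weighted by the number of days it overlaps the residence period, so a patient who lived at a location for half of each year receives roughly equal contributions from both yearly values.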

Phase 4: OMOP Integration

gaiaDb Output → OMOP CDM
    Input: Processed exposure data
    Process:
        - Map to OMOP vocabulary concepts
        - Link to Location table (privacy-preserved)
        - Link to Person via Location_History
        - Populate External_Exposure table
    Output: OMOP CDM with integrated geospatial exposures
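
The mapping step can be pictured as filling one record per person, location, and exposure period. The sketch below uses an illustrative subset of column names that should be checked against the Schema Extensions DDL; the concept IDs are placeholders (0), not real vocabulary mappings.

```python
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class ExternalExposureRow:
    # Illustrative subset of columns; verify names against the Schema Extensions DDL.
    person_id: int
    location_id: int
    exposure_concept_id: int
    exposure_start_date: date
    exposure_end_date: date
    value_as_number: float
    unit_concept_id: int


def to_exposure_row(person_id, location_id, concept_id, unit_concept_id, start, end, value):
    """Validate the period and build one exposure record."""
    if end < start:
        raise ValueError("exposure_end_date precedes exposure_start_date")
    return ExternalExposureRow(
        person_id=person_id,
        location_id=location_id,
        exposure_concept_id=concept_id,
        exposure_start_date=start,
        exposure_end_date=end,
        value_as_number=value,
        unit_concept_id=unit_concept_id,
    )


row = to_exposure_row(
    person_id=1,
    location_id=1001,
    concept_id=0,        # placeholder; a real row would use a vocabulary concept
    unit_concept_id=0,   # placeholder
    start=date(2020, 7, 1),
    end=date(2021, 7, 1),
    value=9.0,
)
print(asdict(row))
```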

Phase 5: Analytics

OMOP CDM + GIS Extensions → HADES
    Input: Integrated clinical + geospatial data
    Process: Standard OHDSI analytics
    Output: Evidence generation


Privacy Architecture

Gaia enables geospatial analysis without sharing sensitive patient addresses:

  1. Geocoding at site - Patient addresses are geocoded locally; only geographic IDs (e.g., Census block) leave the site
  2. Aggregated units - Patients are linked to census blocks or ZIP codes, which is sufficient for exposure calculation
  3. Exposure values only - Outputs are numeric values (e.g., PM2.5) that cannot be reverse-geocoded to an address
  4. OMOP integration - The Location table contains aggregated geography only

Workflow: Raw Address (protected) → Geographic ID (can leave site) → Exposure Values → OMOP External_Exposure
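
The first step of this workflow can be sketched as a function that swaps the raw address for a geographic ID before a record leaves the site. The field names, the geographic ID, and the stand-in geocoder are hypothetical; a real deployment would call a locally hosted geocoder.

```python
def to_shareable(record, geocode):
    """Return a copy of the record with the raw address replaced by a geographic ID."""
    return {
        "person_id": record["person_id"],
        "geo_id": geocode(record["address"]),  # coarse unit, e.g. a census block ID
        "start_date": record["start_date"],
        "end_date": record["end_date"],
    }


def fake_geocode(address):
    """Stand-in for a local geocoder; the lookup table is made up."""
    lookup = {"1 Example St, Springfield": "EXAMPLE-BLOCK-001"}
    return lookup[address]


local_record = {
    "person_id": 1,
    "address": "1 Example St, Springfield",
    "start_date": "2020-07-01",
    "end_date": "2021-07-01",
}
shareable = to_shareable(local_record, fake_geocode)
print(shareable)
```

Building the output from an explicit list of allowed fields, rather than deleting the address from a copy, means a newly added sensitive field is not exported by accident.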


Design Principles

  1. Separation of Concerns: gaiaCatalog (discovery), gaiaDb (processing), gaiaCore (access), gaiaDocker (deployment)
  2. Standards: OMOP CDM, Schema.org, OGC geospatial, OHDSI vocabulary
  3. Privacy: No PHI sharing, aggregated geographies, local geocoding, exposure values only
  4. Modularity: Independent components, clear interfaces, pluggable sources
  5. Reproducibility: Version-controlled recipes, documented provenance, open source


Inputs and Outputs by Component

gaiaCatalog

Input       | Input Format | Output           | Output Format
------------+--------------+------------------+--------------
Dataset URL | String       | Catalog entry    | JSON-LD
Metadata    | Schema.org   | Searchable index | Database
Variables   | CSV/JSON     | Variable catalog | JSON

gaiaDb

Input                                   | Input Format    | Output             | Output Format
----------------------------------------+-----------------+--------------------+--------------
Shapefiles                              | .shp            | PostGIS tables     | SQL
GeoJSON                                 | .geojson        | Indexed geometries | PostGIS
Raster                                  | .tif, .nc       | Raster tables      | PostGIS
CSV with coords                         | .csv            | Point geometries   | PostGIS
Transformation recipe                   | SQL             | Transformed data   | PostGIS
Patient locations and residence periods | Geocoded coords | External_Exposure  | OMOP table
Variable IDs                            | Integer         | Exposure values    | Numeric
SQL function calls                      | SQL/PostgREST   | Query results      | JSON/Tabular

gaiaCore

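Input             | Input Format | Output          | Output Format
------------------+--------------+-----------------+--------------
API requests      | HTTP/JSON    | Exposure data   | JSON
Database queries  | SQL          | Query results   | Tabular
Function calls    | REST/SQL     | Processed data  | Client format
Connection params | Config       | Database access | Connection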

gaiaDocker

Input               | Input Format | Output              | Output Format
--------------------+--------------+---------------------+-----------------------
docker-compose.yaml | YAML         | Running containers  | Docker services
Profile selection   | CLI flag     | Stack configuration | Deployed services
Environment config  | .env         | Service parameters  | Runtime config
Component images    | Docker       | Orchestrated stack  | Running Gaia toolchain
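
As a sketch of how these inputs combine, the snippet below assembles a docker compose invocation from a compose file, an environment file, and a profile. The profile name "gaia" is hypothetical; the profiles that actually exist are defined in the project's docker-compose.yaml.

```python
import shlex


def compose_up_command(profile, compose_file="docker-compose.yaml", env_file=".env"):
    """Assemble a `docker compose up` invocation for one Compose profile."""
    return [
        "docker", "compose",
        "--file", compose_file,
        "--env-file", env_file,
        "--profile", profile,
        "up", "--detach",
    ]


command = compose_up_command("gaia")  # "gaia" is a hypothetical profile name
print(shlex.join(command))
# To run it: subprocess.run(command, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting issues if a file path contains spaces.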


OMOP Integration

Extension Tables: Location_History (patient-location-time), External_Exposure (place-based exposures)

HADES: Standard packages work with the extended CDM, and cohort definitions can include exposure criteria.

See Schema Extensions for complete DDL.


Performance & Scalability

  • Indexing: GiST spatial indexes, materialized views
  • Caching: Datasets, transformations, API responses
  • Scaling: Horizontal gaiaCore instances, connection pooling, batch processing
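
For the indexing item, the sketch below generates the kind of statement that creates a GiST index on a geometry column. The table name census_tracts and the column name geom are hypothetical, and the index naming scheme is an assumption made for this example.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def gist_index_sql(table, geometry_column="geom"):
    """Return a CREATE INDEX statement for a GiST index on a geometry column."""
    for name in (table, geometry_column):
        # Reject anything that is not a plain SQL identifier.
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    index_name = f"{table}_{geometry_column}_gist"
    return (
        f"CREATE INDEX IF NOT EXISTS {index_name} "
        f"ON {table} USING GIST ({geometry_column});"
    )


print(gist_index_sql("census_tracts"))  # hypothetical table name
```

Spatial joins that filter with bounding-box-aware predicates such as ST_Intersects can use an index of this kind instead of scanning every geometry.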

See Deployment Strategies for deployment options.


Future Enhancements

Planned: Real-time data streams, ML integration, multi-site federation, enhanced catalog, unified API gateway

Research: Differential privacy, federated learning, spatiotemporal modeling