1. OMOP GIS Vocabulary Package

The OMOP GIS (Geographic Information System) Vocabulary Package supports data-driven healthcare research by integrating spatial, environmental, behavioral, socioeconomic, phenotypic, and toxin-related determinants of health into standardized data structures. The framework gives a multi-dimensional view of health outcomes that accounts for both external environmental exposures and intrinsic patient characteristics.

This package extends the OMOP CDM to address the growing need to contextualize healthcare data with external environmental and societal factors. Developed and maintained by the GIS Working Group, it provides vocabularies, scripts, and documentation for integrating terminology into existing OHDSI vocabularies.

  • Objective: To provide a comprehensive, standardized framework that enables the incorporation of geographic, toxicological, healthcare, behavioral, and socioeconomic terminology into the OMOP Common Data Model (CDM).
  • Application: Ideal for terminologists, ontologists, researchers, epidemiologists, and data analysts.
  • Integration:
    • The package is compatible with the existing OMOP CDM.
    • All vocabularies, scripts, and mappings conform to OMOP standards.
    • Designed to scale with evolving data needs and expanding global health contexts.

Find the Delta Vocabulary files for the Vocabulary Package here


1.1. Vocabularies

  • OMOP GIS Vocabulary: Standardizes geographical terminologies and spatial data, supporting geospatial epidemiology, healthcare accessibility studies, and population health research. It compiles terminologies related to geography, boundaries, and spatial elements, and comprises 159 concepts in the OMOP Observation domain.
  • OMOP Exposome Vocabulary: Integrates environmental and toxicological factors (exposomes) into the OMOP CDM through a taxonomy and classification system of environmental pollutants, toxins, and chemical agents. Designed to facilitate structured data capture, analysis, and interpretation, this vocabulary forms the foundation for toxicological studies within the observational health data paradigm. Its concepts belong to the ‘Observation’ domain and the ‘Substance’ concept class in OMOP.
  • OMOP SDOH (Social Determinants of Health) Vocabulary: Provides a set of terminologies describing the environmental and societal factors that influence individual and community health outcomes. The vocabulary is organized hierarchically to support precise categorization and data navigation. It integrates key components from recognized standards such as the Social Vulnerability Index (SVI), the Agency for Healthcare Research and Quality (AHRQ) frameworks, and the Social Determinants of Health Ontology (SDOHO) nodes, capturing determinants that range from socioeconomic status to healthcare access and from educational opportunities to neighborhood and built environment.

Examples of Main Nodes of the SDOH Hierarchy

  • Element Relevant To Demographics
  • Element Relevant To Education
  • Element Relevant To Geographic Location
  • Element Relevant To Health
  • Element Relevant To Physical Environment
  • Element Relevant To Population
  • Element Relevant To Social And Community Context

1.2. Sources

The following data sources have been processed to enrich and integrate spatial, environmental, behavioral, socioeconomic, phenotypic, and toxin-related entities into the OMOP GIS Vocabulary Package.

  1. Area Deprivation Index (ADI)
    • Link: ADI Story Map
    • Use Case: Issue #290
    • Description: The ADI measures socioeconomic deprivation at the neighborhood level, which is a critical factor in understanding disparities in health outcomes. The dataset helps identify areas with high social and economic challenges that impact population health.
  2. AHRQ Social Determinants of Health (SDOH) Database
    • Link: AHRQ SDOH Data Sources Documentation
    • Use Case: Issue #175
    • Description: This database, provided by the Agency for Healthcare Research and Quality (AHRQ), offers a collection of data sources on social determinants of health (SDOH). It includes key indicators such as income, education, employment, and healthcare access, which influence health outcomes across populations.
  3. Child Opportunity Index (COI)
    • Link: COI Database
    • Use Case: Issue #288
    • Description: The COI provides insights into the quality of resources and opportunities available to children across different neighborhoods. The index covers multiple dimensions, including education, health, and social environment, critical for analyzing the impact of socioeconomic factors on child development.
  4. Environmental Justice Index (EJI)
    • Link: EJI Data Dictionary
    • Use Case: Issue #307
    • Description: The EJI measures the cumulative impacts of environmental injustice on health, particularly in vulnerable communities. It integrates environmental, social, and health data to highlight areas where health outcomes are disproportionately affected by environmental hazards.
  5. Sustainable Development Goals (SDG)
    • Link: SDG Overview
    • Use Case: Issue #288
    • Description: The United Nations’ 17 Sustainable Development Goals (SDGs) and their indicators aim to address global challenges, including poverty, inequality, climate change, environmental degradation, and health.
  6. Social Determinants of Health Ontology (SDOHO)
    • Link 1: SDOHO PubMed Article
    • Link 2: SDOHO OWL
    • Use Case: Issue #198
    • Description: SDOHO is an ontology that models the social determinants of health, providing a structured framework to represent the complex relationships between social, economic, and environmental factors that influence individual and population health outcomes.
  7. Social and Environmental Determinants of Health (SEDH)
    • Link: SEDH Report
    • Use Case: Issue #307
    • Description: This dataset focuses on the social and environmental determinants of health, providing a rich source of data on the external factors influencing public health, including physical environment, socioeconomic status, and behavioral factors.
  8. Social Vulnerability Index (SVI)
    • Link: SVI Documentation
    • Use Case: Issue #9
    • Description: The Social Vulnerability Index (SVI) identifies communities that may need support before, during, or after disasters. It uses census data to evaluate the vulnerability of different areas based on factors like poverty, housing, and access to transportation.
  9. Toxin and Toxin Target Database (T3DB)
    • Link: T3DB Downloads
    • Use Case: Issue #194
    • Description: The Toxin and Toxin Target Database (T3DB) is a comprehensive resource that catalogs toxins and their biological targets.

1.3. Concept Names

Concept names adhere to source data specifications or are guided by relevant literature.

1.4. Concept Codes

Concept codes were either adopted directly from the source data or autogenerated for newly developed terms (autogenerated codes start with the ‘GIS’ prefix).

1.5. Domains

domain_id | Definition | Example | Exists in OMOP
Behavioral Feature | Refers to actions or behaviors by individuals that influence health outcomes, often related to lifestyle choices. | Element Relevant To Physical Activity | NO
Demographic Feature | Describes the characteristics of a population, such as age, gender, race, or ethnicity. | Race In Population | NO
Environmental Feature | Involves natural or man-made environmental factors that affect health and well-being. | Air Quality Index (AQI) | NO
Geographic Feature | Describes physical locations, spatial relationships, and geographical characteristics relevant to health studies. | State-County FIPS Code (5-Digit) | NO
Healthcare Feature | Involves elements directly related to healthcare services, access, and utilization. | Element Relevant To Health Care | NO
Observation | Captures clinical or non-clinical observations relevant to the contextual data. | Total Number Of Households | YES
Phenotypic Feature | Refers to observable traits or characteristics of an individual, influenced by genetic and environmental factors. | Element Relevant To Depression | NO
Socioeconomic Feature | Involves social and economic factors that impact health, such as income, education, or employment status. | Employment In Population | NO
Type Concept | A categorization used to define where the record comes from. | Air Quality Database | YES

1.6. Concept Classes

concept_class_id | Definition | Example
ADI Construct | Represents a high-level conceptual framework within the Area Deprivation Index (ADI) for analyzing socioeconomic factors. | Area Deprivation Index (ADI)
ADI Item | Refers to specific elements or data points that make up the ADI framework. | % Families Below Federal Poverty Level
AHRQ Construct | A conceptual framework from the Agency for Healthcare Research and Quality (AHRQ) related to social determinants of health. | Food Access
AHRQ Determinant | A measurable factor derived from AHRQ’s data, influencing health outcomes. | Crime And Violence
AHRQ Item | Specific data points or elements within the AHRQ framework. | Total Number Of Households
COI Construct | A framework within the Child Opportunity Index (COI) for assessing child well-being and opportunity. | Economic Resource Index
COI Determinant | A measurable factor in the COI influencing child development and health. | Access To Green Spaces
COI Item | Specific data points or factors that make up the COI. | Mean Estimated 8-Hour Average Ozone Concentration
EJI EBM Item | Refers to an Environmental Burden Measure (EBM) within the Environmental Justice Index (EJI). | Ambient Concentrations Of Diesel PM/M3
EJI HVM Item | Refers to a Health Vulnerability Measure (HVM) within the EJI. | Percentage Of Individuals With Cancer
EJI Item | A general item from the EJI, integrating environmental and health vulnerability data. | Census Tract Code
Exposome Target | Represents specific biological targets within the exposome (a measure of all environmental exposures across a lifetime). | Tissue-type plasminogen activator
Exposome Transporter | Refers to biological transporters related to the exposome, responsible for moving substances within an organism. | SLCO2B1 (OATP2B1, OATP-B)
Exposure Type Concept | A category defining types of exposures relevant to toxicology and environmental health studies. | Census Data
Geometry Relationship | Refers to spatial relationships within geographic data, such as spatial proximity or overlap. | Near/Proximity to
Geometry Type | Defines the type of geometry used in spatial data. | Polygon
GIS Measure | A specific metric or quantitative value derived from GIS data. | Estimate
Location | Refers to the specific geographic location or spatial point in GIS data. | Administrative Boundary
SDG Goal | Represents one of the United Nations’ Sustainable Development Goals (SDGs) related to health, environment, and equity. | Significantly reduce all forms of violence and related death rates everywhere
SDG Indicator | A measurable indicator for tracking progress toward SDG goals. | Proportion of bodies of water with good ambient water quality
SDOH Construct | A high-level framework for understanding social determinants of health (SDOH) and their impact on population health. | Neighborhood Quality
SDOH Determinant | A specific social or economic factor that directly influences health outcomes. | Air Quality Index (AQI)
SDOH Item | A specific data element within the SDOH framework. | The Air Quality Index For The Day For PM2.5
SDOHO Construct | A conceptual framework from the Social Determinants of Health Ontology (SDOHO) focused on categorizing health determinants. | Smoking
SDOHO Determinant | A measurable factor in the SDOHO framework affecting health. | Alcohol Use
SDOHO Item | A specific measurable element in the SDOHO framework. | Occupational Prestige Score
SDOHO Value | A specific value or outcome within the SDOHO framework that reflects health disparities or social conditions. | Intersex
SEDH Construct | A conceptual framework for Social and Environmental Determinants of Health (SEDH). | Social Capital Index
SEDH Item | A data element within the SEDH framework. | Veteran Segments By Census Block Group
Substance | A chemical or biological substance relevant to environmental exposure or toxicology. | Nicotine
SVI Construct | A conceptual framework for the Social Vulnerability Index (SVI), representing factors that make communities vulnerable. | Household Characteristics
SVI Determinant | A specific factor in the SVI that directly impacts a community’s resilience to health risks or environmental hazards. | Housing Type & Transportation
SVI Item | A specific data point within the SVI framework. | Persons Below 150% Poverty Estimate MOE

1.6.1. Glossary for Understanding Concept Classes

  • Construct: Represents conceptual or behavioral elements that are often measured through subjective or indirect means. Constructs are used to characterize complex or abstract social, psychological, or environmental phenomena that contribute to understanding health outcomes but are not necessarily directly measurable or causal by themselves.

    Examples include:

    • Social Norms And Attitude (SDOHO Construct)
    • Sexual Orientation (SDOHO Construct)
    • Occupational Hazards (SDOH Construct)
    • Neighborhood Safety (SDOH Construct)

    These examples reflect behaviors, relationships, and environmental factors that influence health outcomes, but they are abstract and typically involve interpretation, surveys, or proxies for measurement.

  • Determinant: A specific, measurable factor that has a more direct influence on health outcomes. Determinants are often quantifiable and can be linked more concretely to causes or risk factors affecting a person’s health, such as economic status, education, or access to healthcare.

    Examples include:

    • Veteran Status (SDOHO Determinant)
    • Healthcare Provider Availability (SDOH Determinant)
    • Income (AHRQ Determinant)
    • Tobacco Use (SDOHO Determinant)
    • Vaccination Status (SDOH Determinant)
  • Item: A specific, measurable data point or element that is used to evaluate a larger construct or determinant.

    Examples include:

    • % Households Without A Motor Vehicle (ADI Item)
    • All Cause Readmissions Per 100 Male Admissions (AHRQ Item)
    • Ambient Concentrations Of Diesel PM/M3 (EJI EBM Item)
    • Civilian (Age 16+) Unemployed Estimate (SVI Item)
  • Item Value: A specific measurable value or state that the item can take.
    For example, the values of an employment-status item might include “employed,” “unemployed,” “self-employed,” and “retired.”

1.7. Concept Status (Standardness)

If a full semantic match is identified in OMOP, GIS codes are mapped to the corresponding standard concepts and reclassified as non-standard. If no match is found, GIS codes are retained as standard concepts.
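The rule above can be sketched as a small decision function. This is an illustration only: the function name, the `ConceptStatus` container, and the choice to record a self-map for retained standard concepts are assumptions, not part of the package.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConceptStatus:
    standard_concept: Optional[str]  # 'S' for standard, None for non-standard
    maps_to: int                     # concept_id that the 'Maps to' row points at


def resolve_status(gis_concept_id: int, matched_standard_id: Optional[int]) -> ConceptStatus:
    """Apply the standardness rule to one GIS concept (hypothetical helper).

    matched_standard_id is the concept_id of a full semantic match among
    existing OMOP standard concepts, or None when no match was found.
    """
    if matched_standard_id is not None:
        # Full match: the GIS code becomes non-standard and maps to the match.
        return ConceptStatus(standard_concept=None, maps_to=matched_standard_id)
    # No match: the GIS code stays standard (self-map assumed here).
    return ConceptStatus(standard_concept="S", maps_to=gis_concept_id)
```

The concept_id values passed in are whatever the curation process has assigned; the sketch does not perform the semantic matching itself.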

1.8. Valid Start Date

vocabulary_id | valid_start_date
OMOP Exposome | source field updated_at in ‘MM-DD-YYYY’ format
OMOP Exposome (cas is null) | 09-14-2024
OMOP SDOH (concept_class_id_1 ~ ‘SDOHO’) | 01-01-2022
OMOP SDOH (concept_class_id_1 ~ ‘ADI’) | 01-01-2018
OMOP SDOH (concept_class_id_1 ~ ‘AHRQ’) | 01-01-2022
OMOP SDOH (concept_class_id_1 ~ ‘COI’) | 01-01-2020
OMOP SDOH (concept_class_id_1 ~ ‘EJI’) | 01-01-2022
OMOP SDOH (concept_class_id_1 ~ ‘SEDH’) | 01-01-2021
OMOP SDOH (concept_class_id_1 ~ ‘SVI’) | 01-01-2018
OMOP SDOH (concept_class_id_1 ~ ‘SDG’) | 03-01-2017
All other cases | 09-14-2024
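One way to read the table is as an ordered set of rules, sketched below. The function and parameter names are hypothetical, the `~` operator is interpreted as a regular-expression match against the concept class, and falling back to the default date when an Exposome record has no `updated_at` value is an assumption.

```python
import re
from typing import Optional

DEFAULT_DATE = "09-14-2024"

# (pattern matched against the concept class, valid_start_date), as in the table.
SDOH_CLASS_DATES = [
    ("SDOHO", "01-01-2022"),
    ("ADI", "01-01-2018"),
    ("AHRQ", "01-01-2022"),
    ("COI", "01-01-2020"),
    ("EJI", "01-01-2022"),
    ("SEDH", "01-01-2021"),
    ("SVI", "01-01-2018"),
    ("SDG", "03-01-2017"),
]


def valid_start_date(vocabulary_id: str, concept_class_id: str,
                     cas: Optional[str] = None,
                     updated_at: Optional[str] = None) -> str:
    """Return the valid_start_date (MM-DD-YYYY) for one concept."""
    if vocabulary_id == "OMOP Exposome":
        # Records without a CAS number get the default date; a missing
        # updated_at is treated the same way (assumption).
        if cas is None or updated_at is None:
            return DEFAULT_DATE
        return updated_at
    if vocabulary_id == "OMOP SDOH":
        for pattern, date in SDOH_CLASS_DATES:
            if re.search(pattern, concept_class_id):
                return date
    return DEFAULT_DATE
```

Under this reading, a class such as ‘SDOH Construct’ matches none of the listed patterns and therefore falls through to the "All other cases" date.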

1.9. Relationships

relationship_id | reverse_relationship_id | Meaning
Locates in cell | Cell contains | Indicates that a certain agent or substance is found within or targets a cellular entity.
Locates in tissue | Tissue contains | Suggests that a certain agent or substance is present within or targets a specific tissue type.
Impacts on process | Impacted by | Signifies that an agent or substance exerts an influence on a specific process.
Affects biostructure | Affected by | Suggests that an agent or substance has an impact on a certain biological structure.
Maps to | Mapped from | Indicates a relationship where a concept is equated to or represented as a standard OMOP concept.
Is a | Subsumes | Hierarchical relationship where a concept is a subset or instance of a more general concept.
Has associated finding | Asso finding of | Indicates a relationship between a concept and an associated finding related to it.
Has relat context | Relat context of | Describes the contextual relationship between two related concepts.
Has geometry | Is geometry of | Represents the spatial or geometric relationship between an entity and its geographic or spatial structure.

Examples:

  • Hierarchical relationships (‘Is a’ - ‘Subsumes’): ‘Polygon’ - ‘Is a’ - ‘2D (Two-Dimensional) Geometry’ / ‘2D (Two-Dimensional) Geometry’ - ‘Subsumes’ - ‘Polygon’
  • Supplemental GIS-specific relationships (e.g., ‘Is geometry of’ - ‘Has geometry’): ‘LineString’ - ‘Is geometry of’ - ‘International Border’ / ‘International Border’ - ‘Has geometry’ - ‘LineString’
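The relationship pairs listed in this section can be used to derive the reverse row for any forward row. The sketch below is illustrative: `reverse_row` is a hypothetical helper, and it uses concept names in place of the integer concept_ids that the OMOP tables actually store.

```python
# Forward relationship_id -> reverse_relationship_id, as listed in this section.
FORWARD = {
    "Locates in cell": "Cell contains",
    "Locates in tissue": "Tissue contains",
    "Impacts on process": "Impacted by",
    "Affects biostructure": "Affected by",
    "Maps to": "Mapped from",
    "Is a": "Subsumes",
    "Has associated finding": "Asso finding of",
    "Has relat context": "Relat context of",
    "Has geometry": "Is geometry of",
}

# Lookup that works in both directions.
REVERSE_OF = {**FORWARD, **{rev: fwd for fwd, rev in FORWARD.items()}}


def reverse_row(concept_1, relationship_id, concept_2):
    """Return the mirrored (concept_1, relationship_id, concept_2) triple."""
    return (concept_2, REVERSE_OF[relationship_id], concept_1)


print(reverse_row("LineString", "Is geometry of", "International Border"))
```

An unknown relationship_id raises `KeyError`, which is a deliberate choice here so that unregistered relationships are noticed rather than silently skipped.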

1.10. Mapping and Hierarchy

target_vocabulary_id | number of associations
OMOP Exposome | 82,150
OMOP SDOH | 6,738
RxNorm | 4,418
RxNorm Extension | 2,221
SNOMED | 1,769
LOINC | 776
OMOP GIS | 423
ICD10CM | 122
PPI | 56
OMOP Genomic | 49
OMOP Extension | 32
OSM | 25
UK Biobank | 24
Type Concept | 24
CPT4 | 10
HCPCS | 9
Nebraska Lexicon | 3
Race | 2
ATC | 2

1.11. Future work

  • Test the vocabulary with more use cases.
  • Fix hidden errors.
  • Build additional hierarchical relationships.
  • Enrich and refine the vocabulary.

2. Vocabulary and Mapping Source

The OMOP GIS Vocabulary Package is built and maintained through a structured Google Spreadsheet that supports collaborative editing, centralized curation, and version control. This spreadsheet functions as the backbone of the vocabulary development process, enabling distributed subject-matter experts, curators, and developers to participate in real-time. It is composed of multiple interrelated tabs that each fulfill a specialized role in the construction of a standardized, computable terminology layer.

2.1. Overview of Source Spreadsheet Components

The structure adheres to the principles of transparency, auditability, and semantic alignment with the OMOP CDM. The spreadsheet is logically organized into several functional layers:

2.1.1. Source Terms Layer

Captures raw terminology originating from environmental, geographic, exposomic, or socio-behavioral data sources. Each record includes a unique source code, human-readable description, vocabulary ID, domain assignment, concept class identifier, and provenance information such as date of review, expert attribution, ORCID ID, and review status.

2.1.2. Mapping Layer

Establishes the semantic correspondences between the collected source terms and OMOP standard concepts. Each mapping contains:

  • A relationship type (e.g., Maps to, Is a)
  • A predicate aligned with SSSOM (e.g., skos:exactMatch, skos:narrowMatch, skos:broadMatch, skos:relatedMatch)
  • An author-assigned confidence score (0.0–1.0)
  • Metadata for mapping validation, status, and reviewer feedback (if exists)
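A mapping row with these fields could be checked along the following lines. The class, its field names, and the restriction of predicates to the four examples named above are assumptions made for illustration; the actual spreadsheet columns may differ.

```python
from dataclasses import dataclass
from typing import List

# Only the predicates named in this section; the real list may be longer.
ALLOWED_PREDICATES = {
    "skos:exactMatch",
    "skos:narrowMatch",
    "skos:broadMatch",
    "skos:relatedMatch",
}


@dataclass(frozen=True)
class MappingRow:
    source_code: str        # hypothetical field names throughout
    relationship_id: str    # e.g. 'Maps to' or 'Is a'
    predicate: str          # SSSOM-aligned predicate
    confidence: float       # author-assigned score in [0.0, 1.0]
    target_concept_id: int


def validate(row: MappingRow) -> List[str]:
    """Return a list of human-readable problems; empty when the row is fine."""
    problems = []
    if row.predicate not in ALLOWED_PREDICATES:
        problems.append(f"unexpected predicate: {row.predicate}")
    if not 0.0 <= row.confidence <= 1.0:
        problems.append(f"confidence out of range: {row.confidence}")
    return problems
```

Returning a list of problems instead of raising on the first one lets a reviewer see every issue with a row in a single pass.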

2.1.3. Hierarchy Layer

Supports parent-child relationships among concepts and extends the OMOP CDM’s ontology-like capabilities. This is particularly important for representing aggregate social constructs (e.g., Area Deprivation Index) and nested features.

2.1.4. Semantic Extensions Layer

Defines custom Domains, Concept Classes, Vocabularies, and Relationships that expand the CDM’s expressivity in the context of real-world data. These extensions are consistently registered and versioned (e.g., OMOP GIS || 20250424).

Each term progresses through a structured lifecycle: initial entry, expert validation, decision logging, and integration readiness. Fields such as change_required, author_comment, and status support prioritization and triage workflows.

2.2. Alignment with SSSOM

The mapping layer leverages SSSOM-style predicates, enabling:

  • Semantic Precision: Mappings clearly define relationship types for improved analytic reliability
  • Bidirectional Navigation: Reverse mappings (e.g., Is geometry of ↔︎ Has geometry) support symmetric reasoning
  • Crosswalk Potential: Mappings bridge environmental, clinical, and social data sources for integrated analysis

2.3. Automation and Deployment

Google Apps Script automates the parsing, change detection, and transformation of the spreadsheet into vocabulary delta tables consumable by OMOP ETL workflows. This enables continuous deployment of vocabulary updates without manual intervention. Extensions are serialized into OMOP-compatible formats and managed according to OHDSI governance protocols.

2.4. Contribution Workflow

The spreadsheet functions as both a collaborative workspace and a vocabulary staging environment. Contributors may propose new terms or mappings by adding rows to designated tabs. Each entry is subject to transparent peer review, with review states tracked via controlled values. Reviewers are encouraged to document decisions with ORCID and institutional affiliation.

For contributions, questions, or access requests, please contact the GIS Vocabulary Coordination Team:

3. Vocabulary Implementation

This section outlines the implementation framework for the OMOP GIS Vocabulary Package, detailing both the underlying ontology architecture and the practical processes for vocabulary ingestion and deployment. By combining semantic formalism with operational scalability, the implementation ensures that spatial and contextual vocabularies are conceptually aligned with the OMOP CDM and readily usable in real-world analytics environments.

3.1. OMOP GIS Ontology Design

The OMOP GIS Ontology utilizes the GIS Vocabulary Package as its foundational layer, which is collaboratively maintained through a Google Spreadsheet system integrated with Google Apps Scripts and GitHub-based automation pipelines. This ontology serves as a semantic scaffold for spatial and contextual reasoning in health data, enabling structured representation and analysis of geographically-linked, environmental, social, and behavioral determinants of health. It also functions as a machine-interpretable layer that supports standardized analytics, ontology-informed feature generation, and federated ETL workflows across distributed data networks.

3.1.1. Ontology Definition in Context

In general, an ontology is a structured framework for representing knowledge as a set of concepts within a domain and the relationships between those concepts. Ontologies enable formal semantics, reasoning, and integration across diverse datasets by providing consistent definitions and hierarchical structure.

Within the OMOP GIS framework, the ontology performs a similar function - defining and categorizing geographic features, environmental exposures, social determinants of health, and their interrelationships - using the language and constraints of the OMOP CDM. It extends the existing OMOP vocabulary model to support location-aware analyses and geospatial semantics without breaking conformance with OMOP’s relational architecture.

3.1.2. Architecture and Workflow

The GIS Ontology is constructed through the following components and processes:

  • Source Definition Layer: Concepts and relationships are entered and curated in a structured Google Spreadsheet format. This spreadsheet includes validated fields for source code, concept class, domain, mappings, predicates, and metadata. Collaborative access and semantic protections are enforced using Google Apps Scripts, which regulate row-level editability.

  • Version-Controlled Vocabulary Pipeline: Approved concepts and mappings are automatically synchronized from Google Sheets to GitHub using scheduled Apps Script tasks. This process creates a persistent and auditable version history while simultaneously preparing mapping data for downstream transformation.

  • Ontology Transformation Pipeline: A GitHub Action orchestrates a multi-step workflow that converts the spreadsheet-based mappings into relational OMOP-compatible vocabulary tables. This includes:

    1. Ingestion of mapping rows into a PostgreSQL instance hosted in Azure
    2. Syntactic validation of concept fields, metadata, and predicate consistency
    3. Differencing logic to detect novel or updated concepts relative to the baseline OMOP vocabularies
    4. Construction of staging tables, including new concept, concept_relationship, and concept_synonym records with assigned concept_ids in the reserved space (>2,000,000,000)
    5. Insertion of validated concepts into constrained tables in a controlled schema
    6. Export of delta tables (vocabulary overlays) to GitHub for use in external ETL workflows

Two Azure-hosted components support this automation: a Container App acting as a virtual GitHub runner and a Flexible Postgres Server that stores the ontology’s relational tables. These components ensure that updates can be executed securely and reproducibly.
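Steps 3 and 4 of the transformation pipeline (differencing and concept_id assignment) might look roughly like the sketch below. It is not the pipeline's actual code: the function name, the use of (vocabulary_id, concept_code) as the identity key, and the sequential numbering scheme are all assumptions.

```python
from typing import Dict, Iterable, Set, Tuple

Key = Tuple[str, str]  # (vocabulary_id, concept_code)

RESERVED_FLOOR = 2_000_000_000  # new concept_ids must be greater than this


def assign_concept_ids(baseline: Set[Key], incoming: Iterable[Key],
                       already_assigned: Dict[Key, int]) -> Dict[Key, int]:
    """Give every novel concept a concept_id in the reserved space.

    baseline         keys already present in the baseline OMOP vocabularies
    incoming         keys found in the current spreadsheet export
    already_assigned ids handed out by earlier runs (kept stable)
    """
    result = dict(already_assigned)
    next_id = max(result.values(), default=RESERVED_FLOOR) + 1
    for key in incoming:
        if key in baseline or key in result:
            continue  # not novel: skip
        result[key] = next_id
        next_id += 1
    return result
```

Carrying `already_assigned` forward keeps identifiers stable across releases, so a concept does not change its concept_id when the pipeline is rerun.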

3.1.3. Ontological Table Structure

The ontology is materialized through a suite of relational “delta” tables. Each table mirrors a specific component of the OMOP vocabulary schema, while systematically extending it to accommodate geospatial logic, including location-referenced features, environmental indices, and spatially-resolved determinants. For example:

  • concept_delta.csv: Defines both standard and non-standard GIS concepts, including representatives of new domains like Geographic Feature, Environmental Feature, and Socioeconomic Feature.
  • concept_relationship_delta.csv: Encodes semantic links using relationships such as Has geometry, Affects biostructure, and Locates in cell, facilitating ontology-driven inferences.
  • concept_ancestor_delta.csv: Reconstructs hierarchical ancestry for reasoning across spatial or categorical groupings.
  • concept_synonym_delta.csv: Includes synonyms to support flexible querying across GIS, public health, and environmental terminology variants.

This table set collectively reproduces an ontological graph within a relational schema, enabling semantic linkage between OMOP-standard concepts and domain-specific enhancements required for contextualized health research.
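To illustrate how hierarchical ancestry can be reconstructed from ‘Is a’ links, the sketch below computes ancestor/descendant pairs with their minimum levels of separation using a breadth-first walk. It is a simplified illustration, not the package's implementation; the function name is hypothetical, concept names stand in for concept_ids, and the inclusion of zero-level self rows is an assumption.

```python
from collections import deque
from typing import Dict, Iterable, Tuple


def build_ancestors(is_a_pairs: Iterable[Tuple[str, str]]) -> Dict[Tuple[str, str], int]:
    """Map (ancestor, descendant) -> min_levels_of_separation.

    is_a_pairs holds (child, parent) tuples taken from 'Is a' relationships.
    """
    parents: Dict[str, set] = {}
    nodes = set()
    for child, parent in is_a_pairs:
        parents.setdefault(child, set()).add(parent)
        nodes.update((child, parent))

    rows: Dict[Tuple[str, str], int] = {}
    for node in nodes:
        rows[(node, node)] = 0  # self row
        seen = {node}
        queue = deque([(node, 0)])
        while queue:
            current, depth = queue.popleft()
            for parent in parents.get(current, ()):
                if parent not in seen:  # first visit is the shortest path
                    seen.add(parent)
                    rows[(parent, node)] = depth + 1
                    queue.append((parent, depth + 1))
    return rows
```

Because the walk is breadth-first, the first time an ancestor is reached is along a shortest path, which is what makes the recorded depth the minimum separation.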

3.1.4. Summary

The OMOP GIS Ontology integrates community-based term curation, semantic standardization via SSSOM predicates, and automated deployment pipelines to construct a modular, versioned vocabulary system. This infrastructure supports not only geospatial analysis but also cross-domain reasoning on determinants of health, exposures, and environment. It positions the OMOP CDM for expanded utility in real-world evidence generation that incorporates place, population context, and environmental burden.

3.2. OMOP GIS Ontology Download and Installation Instructions

The OMOP GIS Ontology can be integrated into a local OMOP CDM instance by combining standard vocabulary files obtained from Athena OHDSI with curated delta tables provided via GitHub. This integration enables structured support for spatial, environmental, and contextual reasoning through GIS-aligned concepts and relationships, while preserving OMOP CDM conformance. The process leverages relational structures familiar to OMOP implementers and is compatible with federated ETL workflows and AI-driven pipelines.

3.2.1. Required Tools and Access

To begin the installation, ensure you have:

  • An Athena OHDSI account to download standard vocabularies.
  • A SQL client (e.g., DBeaver, pgAdmin) with write access to the OMOP CDM vocabulary schema.
  • Access to a PostgreSQL-compatible OMOP instance.
  • GIS delta files from Tufts CTSI GitHub.
  • Basic familiarity with OMOP CDM vocabulary architecture.
  • A dedicated development schema (e.g., dev_gis) separate from your production environment, to safely test integration and validate results before promotion.

3.2.2. Workspace Preparation

Your OMOP schema must contain all required core tables:

  • concept, concept_ancestor, concept_class, concept_relationship, concept_synonym
  • domain, relationship, vocabulary, drug_strength

If missing, create them using the OMOP CDM DDL.

To prepare for GIS enrichment, create delta tables via:


3.2.3. Download Standard OMOP Vocabularies (Athena)

All vocabularies listed below are mandatory. Do not skip any.
These vocabularies are referenced in the delta tables and are essential for resolving mappings and relationships. Partial ingestion will result in structural or referential integrity errors.

Select the following vocabularies from Athena OHDSI Download section, ensuring any license-restricted vocabularies (e.g., CPT4) are only selected if your organization holds a valid license:

Required Vocabulary
ATC
CPT4*
HCPCS
ICD10CM
LOINC
Nebraska Lexicon
OMOP Extension
OSM
PPI
RxNorm
RxNorm Extension
SNOMED
UK Biobank

After selecting the vocabularies, click Download Vocabularies, name the bundle, and download the resulting ZIP file directly from the Athena website once it is ready. Unzip the archive and confirm that the following files are present:

Expected File
CONCEPT.csv
CONCEPT_ANCESTOR.csv
CONCEPT_CLASS.csv
CONCEPT_RELATIONSHIP.csv
CONCEPT_SYNONYM.csv
DOMAIN.csv
DRUG_STRENGTH.csv
RELATIONSHIP.csv
VOCABULARY.csv
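The presence check can be automated with a few lines of code. The sketch below assumes the archive was unzipped into a single flat directory; the function name and the directory argument are hypothetical.

```python
from pathlib import Path
from typing import List

EXPECTED_FILES = [
    "CONCEPT.csv",
    "CONCEPT_ANCESTOR.csv",
    "CONCEPT_CLASS.csv",
    "CONCEPT_RELATIONSHIP.csv",
    "CONCEPT_SYNONYM.csv",
    "DOMAIN.csv",
    "DRUG_STRENGTH.csv",
    "RELATIONSHIP.csv",
    "VOCABULARY.csv",
]


def missing_files(directory: str) -> List[str]:
    """Return the expected Athena files that are absent from the directory."""
    present = {p.name for p in Path(directory).iterdir() if p.is_file()}
    return [name for name in EXPECTED_FILES if name not in present]
```

An empty result means every expected file is present; the comparison is case-sensitive, matching the file names as listed above.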

3.2.4. Download GIS Delta Tables

Download the delta tables from the GIS Vocabulary GitHub repository. These include:

Delta Table
CONCEPT_DELTA.CSV
CONCEPT_ANCESTOR_DELTA.CSV
CONCEPT_CLASS_DELTA.CSV
CONCEPT_RELATIONSHIP_DELTA.CSV
CONCEPT_SYNONYM_DELTA.CSV
DOMAIN_DELTA.CSV
RELATIONSHIP_DELTA.CSV
VOCABULARY_DELTA.CSV
MAPPING_METADATA.CSV
SOURCE_TO_CONCEPT_MAP.CSV

Note: Files such as restore.sql and update_log.csv are not required for ingestion.

3.2.5. Ingest Standard Vocabularies (Athena → OMOP)

Import all downloaded Athena .csv files into the corresponding OMOP vocabulary tables using your preferred SQL client.

Recommended tools: Use PostgreSQL COPY command via psql, or GUI tools such as DBeaver or pgAdmin for loading the files.

Important formatting requirements:

  • Files must use UTF-8 character encoding.
  • Comma should be used as the delimiter.
  • Text fields should be enclosed in double quotes.

Match the CSV files to OMOP tables as follows:

CSV File → OMOP Table
CONCEPT.csv → CONCEPT
CONCEPT_ANCESTOR.csv → CONCEPT_ANCESTOR
CONCEPT_CLASS.csv → CONCEPT_CLASS
CONCEPT_RELATIONSHIP.csv → CONCEPT_RELATIONSHIP
CONCEPT_SYNONYM.csv → CONCEPT_SYNONYM
DOMAIN.csv → DOMAIN
DRUG_STRENGTH.csv → DRUG_STRENGTH
RELATIONSHIP.csv → RELATIONSHIP
VOCABULARY.csv → VOCABULARY
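For loading through psql, the file-to-table mapping can be turned into one client-side `\copy` command per file. The helper below only builds the command text and is a sketch: the function name is hypothetical, the schema name `dev_gis` is the example development schema mentioned earlier, and the options simply restate the formatting requirements listed in this section (UTF-8, comma delimiter, double-quoted text) plus an assumed header row.

```python
def copy_command(csv_name: str, schema: str) -> str:
    """Build a psql \\copy command that loads one CSV file into its table.

    The target table name is the file name without its extension, lowercased.
    """
    table = csv_name.rsplit(".", 1)[0].lower()
    return (
        f"\\copy {schema}.{table} FROM '{csv_name}' "
        "WITH (FORMAT csv, HEADER true, DELIMITER ',', QUOTE '\"', ENCODING 'UTF8')"
    )


for name in ["CONCEPT.csv", "VOCABULARY.csv"]:
    print(copy_command(name, "dev_gis"))
```

The printed lines can be pasted into a psql session opened in the directory that holds the files; adjust the delimiter and quoting options if your downloaded files are formatted differently.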

After upload, run QA checks.

3.2.6. Ingest GIS Delta Content

Insert delta rows into the already existing tables using:

  • insert_delta_tables_into_omop.sql

3.2.7. Integrate GIS Delta Tables Into Basic OMOP Vocabulary Tables

This step inserts data from the GIS delta files into the corresponding OMOP vocabulary tables. The mapping between each delta file and its target table is shown below:

Delta File → Target Table
concept_delta.csv → CONCEPT
concept_ancestor_delta.csv → CONCEPT_ANCESTOR
concept_class_delta.csv → CONCEPT_CLASS
concept_relationship_delta.csv → CONCEPT_RELATIONSHIP
concept_synonym_delta.csv → CONCEPT_SYNONYM
domain_delta.csv → DOMAIN
relationship_delta.csv → RELATIONSHIP
vocabulary_delta.csv → VOCABULARY
mapping_metadata.csv → MAPPING_METADATA
source_to_concept_map.csv → SOURCE_TO_CONCEPT_MAP

Important: Always validate your integration in a development schema before applying changes to a production vocabulary schema. Ensure referential integrity and uniqueness constraints are preserved.
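The general pattern of inserting delta rows without violating uniqueness can be tried out safely on a throwaway database. The sketch below uses an in-memory SQLite database with a heavily reduced CONCEPT table; it is not the contents of insert_delta_tables_into_omop.sql, and the table columns, sample rows, and function names are assumptions for illustration.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a reduced concept table and its delta."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE concept (
            concept_id INTEGER PRIMARY KEY, concept_name TEXT, vocabulary_id TEXT);
        CREATE TABLE concept_delta (
            concept_id INTEGER, concept_name TEXT, vocabulary_id TEXT);
        INSERT INTO concept VALUES (1, 'Existing concept', 'SNOMED');
        -- hypothetical concept_id in the reserved space
        INSERT INTO concept_delta VALUES (2000000001, 'Polygon', 'OMOP GIS');
    """)
    return conn


def apply_concept_delta(conn: sqlite3.Connection) -> int:
    """Insert delta rows that are not yet in concept; return how many were added."""
    before = conn.execute("SELECT COUNT(*) FROM concept").fetchone()[0]
    with conn:  # commit on success, roll back on error
        conn.execute("""
            INSERT INTO concept (concept_id, concept_name, vocabulary_id)
            SELECT d.concept_id, d.concept_name, d.vocabulary_id
            FROM concept_delta d
            WHERE NOT EXISTS (
                SELECT 1 FROM concept c WHERE c.concept_id = d.concept_id)
        """)
    after = conn.execute("SELECT COUNT(*) FROM concept").fetchone()[0]
    return after - before
```

The `NOT EXISTS` guard makes the step safe to rerun: a second call adds nothing, which is a useful property when validating in a development schema.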

3.2.8. Validate Integration

Use check_delta_tables_inserts.sql to verify the successful application of the delta content. This includes validation of record counts, relationship integrity, and domain coverage.

3.2.9. Outcome

After completing this workflow, your OMOP CDM instance will:

  • Contain both standard and GIS-extended vocabularies.
  • Support population of EXTERNAL_EXPOSURE and SOURCE_TO_CONCEPT_MAP tables.
  • Enable structured representation of spatial and environmental health data.
  • Be interoperable with OHDSI tools, federated queries, and AI pipelines.

Use the information in the Vocabulary QA section to confirm the completeness and correctness of the loaded data.

For feedback or bug reports, please open an issue on GitHub.

4. Vocabulary QA

This checklist is designed to validate new GIS Vocabulary Package releases in OMOP CDM format.


4.1. General QA for Core Vocabulary Tables

4.1.1. Table Row Counts

Ensure the following delta tables are populated unless explicitly expected to be empty:

  • concept_delta
  • concept_relationship_delta
  • concept_synonym_delta (if synonyms are defined)
  • concept_ancestor_delta (if hierarchical terms are used)
  • source_to_concept_map (if supplemental mappings are included)

4.1.2. Vocabulary Metadata

  • Confirm vocabulary_delta, domain_delta, relationship_delta, and concept_class_delta only contain new or modified entries.
  • Verify required fields exist (e.g., vocabulary_id, vocabulary_name, vocabulary_reference, etc.).

4.2. Semantic Consistency Checks

4.2.1. Mapping Validity

  • Concepts with standard_concept IS NULL should have at least one outbound Maps to or Mapped from relationship.
  • Concepts with standard_concept = 'S' should not map to other standard concepts unless it’s a self-map.
  • Non-standard to non-standard mappings are permitted for auxiliary relationships (e.g., Has geometry, Locates in cell) but must be flagged for review.
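
The first two rules above can be sketched as queries. This is an illustrative example rather than the package's own QA script: it runs against an in-memory SQLite database with simplified two- and three-column delta tables, and all concept_id values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_delta (concept_id INTEGER, standard_concept TEXT);
CREATE TABLE concept_relationship_delta (
    concept_id_1 INTEGER, concept_id_2 INTEGER, relationship_id TEXT);
-- Illustrative rows: 10 and 11 are non-standard, 20 and 21 are standard.
INSERT INTO concept_delta VALUES (10, NULL), (11, NULL), (20, 'S'), (21, 'S');
INSERT INTO concept_relationship_delta VALUES
    (10, 20, 'Maps to'),   -- valid non-standard to standard mapping
    (20, 20, 'Maps to'),   -- self-map, allowed
    (21, 20, 'Maps to');   -- standard to a different standard, flagged
""")

# Rule 1: non-standard concepts with no outbound 'Maps to' / 'Mapped from'.
unmapped = [row[0] for row in conn.execute("""
    SELECT c.concept_id
    FROM concept_delta c
    WHERE c.standard_concept IS NULL
      AND NOT EXISTS (
          SELECT 1 FROM concept_relationship_delta r
          WHERE r.concept_id_1 = c.concept_id
            AND r.relationship_id IN ('Maps to', 'Mapped from'))
    ORDER BY c.concept_id
""")]

# Rule 2: standard concepts that map to a different standard concept.
standard_to_standard = conn.execute("""
    SELECT r.concept_id_1, r.concept_id_2
    FROM concept_relationship_delta r
    JOIN concept_delta c1 ON c1.concept_id = r.concept_id_1
    JOIN concept_delta c2 ON c2.concept_id = r.concept_id_2
    WHERE r.relationship_id = 'Maps to'
      AND c1.standard_concept = 'S'
      AND c2.standard_concept = 'S'
      AND r.concept_id_1 <> r.concept_id_2
""").fetchall()
```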

4.2.2. Duplicate Mappings

  • No duplicate (concept_id_1, relationship_id, concept_id_2) combinations.
  • Multiple Maps to targets for the same source must be clinically or hierarchically justified.
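
Both checks reduce to a GROUP BY with a HAVING filter. The sketch below uses an in-memory SQLite table with made-up rows; the second query only lists sources with several Maps to targets, since deciding whether they are justified remains a manual review.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_relationship_delta (
    concept_id_1 INTEGER, relationship_id TEXT, concept_id_2 INTEGER);
INSERT INTO concept_relationship_delta VALUES
    (1, 'Maps to', 2),
    (1, 'Maps to', 2),   -- exact duplicate of the row above
    (1, 'Maps to', 3),   -- second target for the same source
    (4, 'Is a', 5);
""")

# Exact duplicate (concept_id_1, relationship_id, concept_id_2) combinations.
duplicates = conn.execute("""
    SELECT concept_id_1, relationship_id, concept_id_2, COUNT(*) AS n
    FROM concept_relationship_delta
    GROUP BY concept_id_1, relationship_id, concept_id_2
    HAVING COUNT(*) > 1
""").fetchall()

# Sources with more than one distinct 'Maps to' target, for manual review.
multi_target = conn.execute("""
    SELECT concept_id_1, COUNT(DISTINCT concept_id_2) AS targets
    FROM concept_relationship_delta
    WHERE relationship_id = 'Maps to'
    GROUP BY concept_id_1
    HAVING COUNT(DISTINCT concept_id_2) > 1
""").fetchall()
```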

4.2.3. Invalid Target Concepts

Ensure all target_concept_id values from the source mapping table:

  • Exist in the current concept or concept_delta table
  • Have invalid_reason IS NULL
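
A possible form of this check is a left join from the mapping table to the union of concept and concept_delta. In the sketch below the table name source_mapping, the reduced columns, and every row are assumptions made for illustration, and SQLite stands in for the vocabulary database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (concept_id INTEGER, invalid_reason TEXT);
CREATE TABLE concept_delta (concept_id INTEGER, invalid_reason TEXT);
CREATE TABLE source_mapping (source_code TEXT, target_concept_id INTEGER);
INSERT INTO concept VALUES (100, NULL), (101, 'D');
INSERT INTO concept_delta VALUES (200, NULL);
INSERT INTO source_mapping VALUES
    ('A', 100),   -- valid target in concept
    ('B', 101),   -- target exists but is invalidated
    ('C', 200),   -- valid target in concept_delta
    ('D', 999);   -- target does not exist anywhere
""")

invalid_targets = conn.execute("""
    SELECT m.source_code, m.target_concept_id
    FROM source_mapping m
    LEFT JOIN (
        SELECT concept_id, invalid_reason FROM concept
        UNION ALL
        SELECT concept_id, invalid_reason FROM concept_delta
    ) c ON c.concept_id = m.target_concept_id
    WHERE c.concept_id IS NULL          -- missing target
       OR c.invalid_reason IS NOT NULL  -- deprecated or replaced target
    ORDER BY m.source_code
""").fetchall()
```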

4.2.4. Domain Compatibility

  • Detect domain inconsistencies (e.g., source Procedure → target Measurement) unless justified.
  • Validate that concepts in specific GIS domains consistently map to expected domains.

4.3. Syntactic Integrity

4.3.1. Concept Code Naming

  • Validate the concept_code format and ensure each code matches its declared source_code value.

4.3.2. Concept Names

  • No duplicate concept_name entries unless codes differ.
  • Avoid placeholder or malformed names.

4.3.3. Field Completeness

Ensure these fields are always populated:

  • concept_id
  • concept_name
  • domain_id
  • vocabulary_id
  • concept_code
  • valid_start_date
  • valid_end_date
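
One way to run this check is to count NULL or blank values per required field. The sketch below assumes the check targets concept_delta and uses made-up rows in an in-memory SQLite table.

```python
import sqlite3

REQUIRED_FIELDS = [
    "concept_id", "concept_name", "domain_id", "vocabulary_id",
    "concept_code", "valid_start_date", "valid_end_date",
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concept_delta (concept_id INTEGER, concept_name TEXT, "
    "domain_id TEXT, vocabulary_id TEXT, concept_code TEXT, "
    "valid_start_date TEXT, valid_end_date TEXT)"
)
conn.executemany(
    "INSERT INTO concept_delta VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "Census tract", "Observation", "OMOP GIS", "gis1",
         "1970-01-01", "2099-12-31"),
        # Illustrative bad row: blank name and missing end date.
        (2, "", "Observation", "OMOP GIS", "gis2", "1970-01-01", None),
    ],
)

# Map each required field to the number of rows where it is NULL or blank.
missing = {}
for field in REQUIRED_FIELDS:
    count = conn.execute(
        f"SELECT COUNT(*) FROM concept_delta "
        f"WHERE {field} IS NULL OR TRIM(CAST({field} AS TEXT)) = ''"
    ).fetchone()[0]
    if count:
        missing[field] = count
```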

4.4. Source Mapping QA

4.4.1. Coverage

  • Every source_code in the source mapping table should exist in concept_delta.
  • Every source_description should match a concept_name.
  • Every source_description_synonym should match a concept_synonym_name.

4.4.2. Source-to-Concept Map QA

  • source_to_concept_map must contain complete mappings:
    • source_code
    • source_concept_id
    • target_concept_id

4.5. Concept Synonym Validation

  • Each concept_synonym_delta entry must:
    • Reference a valid concept_id from concept_delta
    • Include a non-empty, distinct concept_synonym_name
    • Not duplicate the main concept_name
  • Cross-check expected synonyms (from source mapping table) are present.

4.6. Concept Ancestor Integrity

  • Ensure all transitive or hierarchical relationships exist in concept_ancestor_delta.
  • No cycles or broken ancestry chains.
  • Confirm taxonomic hierarchies are correctly represented.
  • Only standard concepts participate in the hierarchy.
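
The "no cycles" rule can be tested with a depth-first search over direct child-to-parent links. The sketch below assumes those links have already been extracted as (child_concept_id, parent_concept_id) pairs, for example from hierarchical relationships; the function name and sample edges are hypothetical, and the recursive form is intended for hierarchies of modest depth.

```python
def find_cycle(edges):
    """Return one cycle as a list of concept_ids, or None if there is none.

    `edges` is an iterable of (child_concept_id, parent_concept_id) pairs.
    """
    graph = {}
    for child, parent in edges:
        graph.setdefault(child, []).append(parent)

    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def visit(node, path):
        state[node] = IN_PROGRESS
        path.append(node)
        for parent in graph.get(node, []):
            status = state.get(parent, UNVISITED)
            if status == IN_PROGRESS:
                # Parent is already on the current path: a cycle is closed.
                return path[path.index(parent):] + [parent]
            if status == UNVISITED:
                found = visit(parent, path)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in list(graph):
        if state.get(node, UNVISITED) == UNVISITED:
            found = visit(node, [])
            if found:
                return found
    return None

# Illustrative edges only; the concept_ids are placeholders.
acyclic_result = find_cycle([(1, 2), (2, 3)])
cyclic_result = find_cycle([(1, 2), (2, 3), (3, 1)])
```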

4.7. Naming & Formatting Edge Cases

  • Validate encoding of Unicode characters.
  • Ensure special characters (e.g., brackets, slashes) do not break parsing.
  • Detect changes in concept_name between source and derived vocabularies.

4.8. Summary and Reporting

  • Provide counts of:
    • New and modified rows per table
    • Failed validations (e.g., invalid mappings, duplicates)
    • Mapping coverage (e.g., % source codes mapped)
  • Generate a .csv or .md QA report to accompany the vocabulary release.
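
A .csv report of this kind could be assembled as follows. The column layout (metric, item, value), the function name, and all counts are assumptions chosen for illustration, not a format prescribed by the package.

```python
import csv
import io

def build_qa_report(row_counts, failures, mapped_codes, total_codes):
    """Return the QA summary as CSV text with columns metric, item, value."""
    coverage = round(100.0 * mapped_codes / total_codes, 1) if total_codes else 0.0
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["metric", "item", "value"])
    for table, count in sorted(row_counts.items()):
        writer.writerow(["rows", table, count])
    for check, count in sorted(failures.items()):
        writer.writerow(["failed_validation", check, count])
    writer.writerow(["mapping_coverage_pct", "source_codes", coverage])
    return buffer.getvalue()

# Made-up numbers, for illustration only.
report = build_qa_report(
    row_counts={"concept_delta": 120, "concept_relationship_delta": 340},
    failures={"duplicate_mappings": 1, "invalid_targets": 2},
    mapped_codes=45,
    total_codes=50,
)
# To save alongside a release: open("qa_report.csv", "w", encoding="utf-8").write(report)
```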

These checks complement automated scripts and should be validated by vocabulary experts and domain specialists, especially for novel environmental or spatial concepts.

5. Vocabulary Usability Validation

This section describes a structured approach for validating the usability of the GIS Vocabulary Package in real-world OMOP CDM integration scenarios. The process focuses on semantic coverage, geospatial linkage, and practical implementation using environmental exposure data such as EJI, EPA air quality, or other standardized GIS sources.


5.1. Step-by-Step Validation Workflow

5.1.1. Define Validation Objectives

  • What is being tested?
    • Does the GIS Vocabulary cover all environmental exposure concepts of interest?
    • Can it support integration of specific datasets (e.g., EJI, SVI)?
  • What decisions will the results support?
    • Use in OMOP CDM EXTERNAL_EXPOSURE analytics.
    • Extension to GIS-enabled observational studies.

5.1.2. Acquire and Preprocess GIS Data

  • Select an authoritative GIS dataset (e.g., EJI, EPA AirNow, WHO noise pollution index).
  • Ensure spatial identifiers are available:
    • GEOID (Census Tract), ZIP, or latitude/longitude.
  • Normalize location granularity:
    • Convert ZIP+4 to Census Tract.
    • Crosswalk coordinates to GEOID using reverse geocoding or lookup tables.
  • Assess temporal granularity (e.g., 3-year average, annual, monthly).

5.1.3. Link GIS Data to OMOP Locations

  • Extract location attributes from OMOP:
    • location_id, state, county, zip, lat, lon, location_source_value.
  • Match GIS-to-OMOP using hierarchical fallback:
    • GEOID → direct match on location_source_value.
    • (state, county, zip) → match against parsed GEOID.
    • lat/lon → reverse geocode to determine tract or area.
  • Validate spatial linkage:
    • Detect inconsistencies or missing mappings.
    • If location_history is implemented (CDM v6.0+), consider temporal changes.
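
The hierarchical fallback can be sketched as a function that tries each matching strategy in order and records which one succeeded. Everything named here is an assumption for illustration: the function link_location, the (state, county, zip) crosswalk dictionary, the injected reverse-geocoding callable, and the placeholder tract identifiers.

```python
def link_location(loc, tract_geoids, crosswalk, reverse_geocode):
    """Return (geoid, method) for one OMOP location, or (None, 'unmatched')."""
    # 1. Direct match: location_source_value already holds a known GEOID.
    source_value = (loc.get("location_source_value") or "").strip()
    if source_value in tract_geoids:
        return source_value, "geoid"
    # 2. Fallback: look up (state, county, zip) in a crosswalk table.
    key = (loc.get("state"), loc.get("county"), loc.get("zip"))
    if key in crosswalk:
        return crosswalk[key], "state_county_zip"
    # 3. Last resort: reverse geocode the coordinates.
    if loc.get("lat") is not None and loc.get("lon") is not None:
        geoid = reverse_geocode(loc["lat"], loc["lon"])
        if geoid:
            return geoid, "reverse_geocode"
    return None, "unmatched"

def stub_reverse_geocode(lat, lon):
    """Stand-in for a real geocoding service or spatial join."""
    return "TRACT_C" if (lat, lon) == (1.0, 2.0) else None

# Placeholder identifiers, not real GEOIDs.
tract_geoids = {"TRACT_A"}
crosswalk = {("ST", "County X", "00001"): "TRACT_B"}
locations = [
    {"location_id": 1, "location_source_value": "TRACT_A"},
    {"location_id": 2, "state": "ST", "county": "County X", "zip": "00001"},
    {"location_id": 3, "lat": 1.0, "lon": 2.0},
    {"location_id": 4},
]
links = {
    loc["location_id"]: link_location(
        loc, tract_geoids, crosswalk, stub_reverse_geocode
    )
    for loc in locations
}
unmatched_ids = [lid for lid, (geoid, _) in links.items() if geoid is None]
```

Keeping the matching method next to each result makes it easy to report how many locations were linked at each level and which ones need manual review.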

5.1.4. Map GIS Variables to OMOP Concepts

  • For each environmental variable (e.g., PM2.5 days above threshold):
    • Map to exposure_concept_id in the GIS Vocabulary.
    • If no match exists, flag the concept for vocabulary extension.
  • Map measurement units (e.g., percent, μg/m³) to unit_concept_id via OHDSI Athena.
  • Build a mapping table:
    source_variable → concept_id, unit_concept_id, value_as_concept_id/value_as_number.

5.1.5. Populate the external_exposure Table

  • Ensure that each patient in OMOP has a geospatial identifier that can be linked to GIS datasets.
  • For each matched patient-location pair, assign exposure_concept_id based on the mapped GIS variable.
  • Set exposure_start_date (the reference date from the GIS dataset).
  • Populate value_as_number and unit_concept_id.
  • Populate other fields if applicable.

Field Name | Description | Data Example
exposure_occurrence_id | Unique identifier for each exposure record | 123456
location_id | Foreign key linking to the location table, indicating where exposure occurred | 789
person_id | Foreign key linking to the person table, identifying the individual exposed | 100234
cohort_definition_id | (Optional) Links to a defined cohort in research studies | 25
exposure_concept_id | Standard OMOP concept_id representing the type of exposure | 2052498173 (Percentile Rank Of Annual Mean Days Above PM2.5 Regulatory Standard - 3-Year Average)
exposure_start_date | Date when the exposure event started | 2024-01-15
exposure_end_date | Date when the exposure event ended (NULL if ongoing exposure) | NULL (ongoing)
exposure_type_concept_id | Concept ID defining the origin of the exposure record | 2052499258 (Government Data)
exposure_relationship_concept_id | Concept ID describing how exposure relates to the person | NULL
exposure_source_concept_id | Source-specific concept ID before standardization to OMOP | 90000001
exposure_source_value | Raw exposure value from source data | "EPL_PM"
exposure_relationship_source_value | Raw value describing the exposure-person relationship | NULL
dose_unit_source_value | Source unit before standardization | NULL
quantity | Number of exposure occurrences (if applicable) | 1
modifier_source_value | (Optional) Modifier describing the exposure type or intensity | NULL
operator_concept_id | Concept ID defining operator logic (e.g., <, >, =) | NULL
value_as_number | Numerical value of the exposure (e.g., concentration level) | 0.8503
value_as_concept_id | Concept ID for categorical exposure values | NULL
unit_concept_id | Concept ID representing the measurement unit | NULL

Ensure complete and consistent population of required fields. Non-null exposure values and units are critical for downstream analytics.
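
The population steps can be sketched as a function that turns linked patient locations and GIS values into external_exposure rows. The function name, the input shapes, and the placeholder tract identifiers are assumptions; the concept IDs, source value, and numeric value are the ones from the example table above. Only a subset of the table's fields is filled.

```python
from datetime import date

def build_exposure_rows(person_locations, gis_values, variable_map,
                        reference_date, type_concept_id):
    """Build external_exposure rows as dictionaries.

    person_locations: iterable of (person_id, location_id, geoid)
    gis_values:       {geoid: {source_variable: numeric value}}
    variable_map:     {source_variable: {"exposure_concept_id", "unit_concept_id"}}
    """
    rows = []
    for person_id, location_id, geoid in person_locations:
        for variable, value in gis_values.get(geoid, {}).items():
            mapping = variable_map.get(variable)
            if mapping is None or value is None:
                continue  # unmapped variable or missing value: leave for review
            rows.append({
                "exposure_occurrence_id": len(rows) + 1,
                "location_id": location_id,
                "person_id": person_id,
                "exposure_concept_id": mapping["exposure_concept_id"],
                "exposure_start_date": reference_date,
                "exposure_end_date": None,
                "exposure_type_concept_id": type_concept_id,
                "exposure_source_value": variable,
                "value_as_number": value,
                "unit_concept_id": mapping["unit_concept_id"],
            })
    return rows

variable_map = {
    "EPL_PM": {"exposure_concept_id": 2052498173, "unit_concept_id": None},
}
rows = build_exposure_rows(
    person_locations=[(100234, 789, "TRACT_A"), (100235, 790, "TRACT_B")],
    gis_values={"TRACT_A": {"EPL_PM": 0.8503, "UNMAPPED_VAR": 0.1}},
    variable_map=variable_map,
    reference_date=date(2024, 1, 15),
    type_concept_id=2052499258,
)
```

Variables without a mapping are skipped rather than written with a placeholder concept, so they can be collected separately and flagged for vocabulary extension.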


5.2. Proposed Usability Validation Criteria

5.2.1. Coverage

  • Do GIS concepts cover all variables in your dataset?
  • Are spatial scales represented (e.g., tract, zip, grid, regional)?
  • Are distinctions like satellite vs. ground sensor captured?
  • Are measurement units and result values standardized?

5.2.2. Interoperability

  • Can patient-level OMOP locations link unambiguously to GIS identifiers?
  • Does the external_exposure table schema support all relevant metadata?

5.2.3. Practical Usability

  • Are mapping guidelines and examples provided?
  • Are variable-to-concept lookup tables available or easy to construct?
  • Can you populate external_exposure using real-world datasets without data loss or transformation ambiguity?

5.3. Additional QA Recommendations

  • Check temporal coherence: Do exposure dates align with person observation periods?
  • Validate concept synonym coverage: Are source variable names discoverable via synonyms?
  • Spot check semantic overlap: Are concepts with similar meaning unnecessarily duplicated?
  • Ensure data pipeline readiness: Can the vocabulary support ETL workflows using PostgreSQL, Spark, or Python?
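
The temporal coherence check, for instance, can be expressed as an anti-join between external_exposure and observation_period. The sketch below uses SQLite with reduced columns and made-up rows; dates are stored as ISO 8601 text so that string comparison matches date order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE observation_period (
    person_id INTEGER,
    observation_period_start_date TEXT,
    observation_period_end_date TEXT);
CREATE TABLE external_exposure (
    exposure_occurrence_id INTEGER,
    person_id INTEGER,
    exposure_start_date TEXT);
INSERT INTO observation_period VALUES (1, '2020-01-01', '2022-12-31');
INSERT INTO external_exposure VALUES
    (1, 1, '2021-06-01'),   -- inside the observation period
    (2, 1, '2024-01-15'),   -- after the observation period ends
    (3, 2, '2021-01-01');   -- person has no observation period at all
""")

# Exposures whose start date falls in none of the person's observation periods.
out_of_period = [row[0] for row in conn.execute("""
    SELECT e.exposure_occurrence_id
    FROM external_exposure e
    WHERE NOT EXISTS (
        SELECT 1 FROM observation_period o
        WHERE o.person_id = e.person_id
          AND e.exposure_start_date
              BETWEEN o.observation_period_start_date
                  AND o.observation_period_end_date)
    ORDER BY e.exposure_occurrence_id
""")]
```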

5.4. Suggested Output

  • Mapping dictionary (source_variable → concept_id)
  • Table population sample (external_exposure)
  • Coverage and failure summary (% mapped, unmapped terms)
  • Recommendations for vocabulary enhancement

This usability validation should be iteratively improved and coordinated across GIS WG stakeholders. Contributions welcome!