Introduction
This guide walks through the creation of a new OHDSI stack on Windows using
Broadsea, with Databricks hosting the CDM. It assumes that you have
Docker installed and a CDM instance in Databricks that you can point to.
Getting Started
Install Ponos
Ponos is a Java application that can be used to automate certain tasks
associated with setting up a new OHDSI instance in Databricks, including
creating an instance of the Eunomia CDM test data set in
Databricks and connecting an existing instance of the CDM in Databricks
to the OHDSI tools.
Ponos can be downloaded directly from Github at
https://github.com/NACHC-CAD/ponos.
After downloading the zip file, unzip it and update
./auth/bs-databricks-public-demo.properties to use your specific
configuration. More detailed instructions on downloading and installing
Ponos are available on the Ponos Install page.
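For orientation, a minimal sketch of what that properties file contains is shown below. The key names and values here are illustrative placeholders only; the authoritative keys are in the file that ships with Ponos.

```
# Hypothetical sketch of ./auth/bs-databricks-public-demo.properties.
# Key names are placeholders; use the keys in the file shipped with Ponos.
databricks.host=adb-<workspace-id>.azuredatabricks.net
databricks.http-path=<your-cluster-http-path>
databricks.token=<your-personal-access-token>
```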
Optional: Install Eunomia CDM
If you do not have an existing OHDSI CDM in Databricks you would like to
use, you can install the Eunomia CDM to use as a test case.
To create a test instance of the CDM in Databricks, download and install
Ponos and then run the following:
run-ponos.bat db-demo
The code that creates the demo_cdm in Databricks can be found in the
fhir-to-omop BuildDemoCdmInDatabricks class. The demo_cdm from the
Broadsea distribution is created in Databricks using the Ponos tool.
Data are sourced from .csv files included in the Ponos project that were
created as an extract from a PostgreSQL instance of the demo_cdm. This
install includes the following:
- Upload of the .csv files for the CDM to the Databricks FileStore
- Creation of the CDM database in Databricks using the DDL files from the Common Data Model (CDM) (version 5.3 is used)
- Population of the CDM (including vocabulary tables) from the uploaded .csv files
Shutdown PostgreSql
Before you get started, make sure you do not have a local instance of
PostgreSQL running as a service.
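For example, from an elevated PowerShell prompt you can check for and stop a local PostgreSQL service. The service name varies by version; "postgresql-x64-15" below is a hypothetical example.

```shell
# List any installed PostgreSQL services (the name pattern is an assumption)
Get-Service -Name "postgresql*"
# Stop the service; "postgresql-x64-15" is a hypothetical example name
Stop-Service -Name "postgresql-x64-15"
```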
Start Docker
If you do not have Docker Desktop you can download and install it from
https://www.docker.com/products/docker-desktop/.
Start Docker Desktop from the shortcut, if one was installed, or from
the Windows search menu.
Clean Docker
To get started, it’s not a bad idea to clear out your Docker instance.
The two scripts below can be run from PowerShell. The first script will
delete all volumes, containers, and images from your Docker instance.
The second script will show you if there are any volumes, containers, or
images remaining in your Docker instance.
clean-docker.sh
show-docker.bat
The clean script will launch a bash window and ask for confirmation
that you really want to delete everything.
The show script can be used to confirm there is nothing left in Docker.
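If you would rather run the cleanup by hand, the scripts boil down to Docker commands along these lines (a sketch; the actual scripts may differ in detail):

```shell
# Stop and remove all containers, if any exist
docker ps -aq | xargs -r docker stop
docker ps -aq | xargs -r docker rm
# Remove all unused images, volumes, and networks
docker system prune --all --volumes --force
# Confirm nothing is left
docker ps -a
docker images
docker volume ls
```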
Confirm Docker is Running
After Docker has been started and cleaned you should see something like
what is shown below in Docker Desktop.
Clone Broadsea from Github
Clone Broadsea using:
git clone https://github.com/OHDSI/Broadsea
Update Broadsea
Next, copy the Spark JDBC Jar File and Update the docker-compose.yml
file:
Paste a copy of the Spark JDBC driver you are using into
your Broadsea directory (the directory that contains the
docker-compose.yml file). Replace the existing docker-compose.yml file
with this docker-compose.yml file (I usually back up the original as
shown below). The new docker-compose.yml file simply adds the following
lines to the ohdsi-webapi-from-image section:
volumes:
- ./SparkJDBC42.jar:/var/lib/ohdsi/webapi/WEB-INF/lib/SparkJDBC42.jar
The Spark JDBC driver and new docker-compose.yml file should
now be in the root directory of the Broadsea project as shown below.
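For example, from a cmd prompt in the Broadsea directory (the download path and the jar file name SparkJDBC42.jar are assumptions; use the name that matches the volume mapping above):

```shell
:: Back up the original compose file before replacing it
copy docker-compose.yml docker-compose.yml.bak
:: Copy the Spark JDBC driver into the Broadsea directory
copy %USERPROFILE%\Downloads\SparkJDBC42.jar .
```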
Start Broadsea
Start Broadsea in the usual way. Open a cmd prompt, navigate to the
Broadsea directory, and execute:
docker compose pull && docker compose --profile default up -d
Confirm Broadsea
When Broadsea is running you should see something like what is shown
below in the Docker Desktop application.
Connect to OHDSI
The next step is to connect your Databricks CDM to OHDSI. This can be
done using Ponos bs-init.
Install Ponos
If you have not already done so, install and configure Ponos.
Instructions for downloading and installing Ponos can be found on the
Ponos Install page.
Configure Parameters
After downloading and installing Ponos, edit the file
./auth/bs-databricks-public-demo.properties to use your parameters.
Then run the following command from the directory where you installed
Ponos to connect your Databricks CDM to OHDSI:
run-ponos.bat bs-init
The source for this process is in the fhir-to-omop tool suite
OhdsiEnableExistingBroadseaOnDatabricksCdm class.
This process will do the following:
- Create the Achilles results database in Databricks
- Create the Achilles tables in Databricks
- Create the achilles_analysis table from the AchillesAnalysisDetails.csv file
- Run Achilles to populate the Achilles results tables
- Create the appropriate source and source_daimon records in the PostgreSQL instance of WebAPI included with Broadsea (existing records for the key in the properties file will be overwritten)
Configure SSL and URL
The next step is to configure the JDBC URL (UseNativeQuery) and SSL for
Databricks:
Most Databricks instances use SSL by default. A Databricks instance that
uses SSL will have “ssl=1;” as a parameter in the JDBC URL. To enable a
connection that uses SSL, follow the instructions on the Notes on the
Databricks JDBC URL, SSL, and UseNativeQuery page.
It should also be noted that the parameter “UseNativeQuery=1;” must be
added to the URL that is inserted into the webapi.source table. If you
are using the Ponos application to create your OHDSI on Databricks
instance, Ponos will add this parameter to the URL if it is not already
there. This is also described on the Notes on the Databricks JDBC URL,
SSL, and UseNativeQuery page.
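Putting the two parameters together, the JDBC URL stored in webapi.source ends up looking something like the following sketch. The host and httpPath values are placeholders, and the other parameters depend on your Databricks configuration:

```
jdbc:spark://adb-<workspace-id>.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=<your-http-path>;AuthMech=3;UID=token;PWD=<your-personal-access-token>;UseNativeQuery=1
```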
Restart Docker and Launch Atlas
Restart Docker:
docker compose --profile default down
docker compose pull && docker compose --profile default up -d
Open
http://127.0.0.1/atlas
in a browser and navigate to Data Sources. Select your data source (in
this case “Databricks Demo”) and then select a report (the Person
report is shown in the screenshot below). When you do this you should
not see any errors in the Docker output for webapi, as shown below.
Configuration of Vocabularies
Atlas allows for the use of multiple data sources and each data source
is generally associated with a vocabulary. Therefore, Atlas needs to
know what vocabulary to use for certain operations such as concept
searches and the creation of concept sets. At the very bottom of the
left side menu there is a “Configuration” option. Select this option to
indicate what vocabulary should be used.
This is important: if you skip this step you will have issues with
vocabulary operations, and the resulting errors/exceptions will not
necessarily point back to this step as the cause!
- You need to do this step even if you only have ONE data source
- You need to select both options: Vocabulary Version and Record Counts (RC/DRC)
Select the Configuration option from the left side menu and then select
the radio buttons for Vocabulary Version and Record Counts (RC/DRC).