WhiteRabbit is a software tool to help prepare for ETLs (Extraction, Transformation, Loading) of longitudinal health care databases into the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM). The source data can be in comma-separated text files, SAS files, or in a database (MySQL, SQL Server, Oracle, PostgreSQL, Microsoft APS, Microsoft Access, Amazon RedShift, PDW, Teradata, Google BigQuery). Note that for support of the OHDSI analytical tooling, the OMOP CDM will need to be in one of a limited set of database platforms (SQL Server, Oracle, PostgreSQL, Microsoft APS, Amazon RedShift, Google BigQuery, Impala).
WhiteRabbit’s main function is to perform a scan of the source data, providing detailed information on the tables and fields in the data, and the values that appear in each field. This scan generates a report that can be used as a reference when designing the ETL, for instance by using the Rabbit-In-a-Hat tool. WhiteRabbit differs from standard data profiling tools in that it attempts to prevent the display of personally identifiable information (PII) in the generated output.
The typical sequence for using this software to scan source data in preparation for developing an ETL into an OMOP CDM is:

1. Set the working folder, the location on the local desktop computer where results will be exported.
2. Connect to the source database or CSV text file and test the connection.
3. Select the tables of interest for the scan and scan the tables.
4. WhiteRabbit creates an export of information about the source data.
Once the scan report is created, this report can then be used in the Rabbit-In-a-Hat tool or as a stand-alone data profiling document.
Download the latest version from the GitHub repository (X.X.X is the latest version).
Note: on releases earlier than version 0.8.0, open the respective WhiteRabbit.jar or RabbitInAHat.jar files instead.
All source code, descriptions and input/output examples are available on GitHub: https://github.com/OHDSI/WhiteRabbit
Any bugs/issues/enhancements should be posted to the GitHub repository: https://github.com/OHDSI/WhiteRabbit/issues
Any questions/comments/feedback/discussion can be posted on the OHDSI Developer Forum: http://forums.ohdsi.org/c/developers
Any files that WhiteRabbit creates will be exported to this local folder. Use the “Pick Folder” button to navigate to the location in your local environment where you would like the scan document to be saved.
Here you can specify the location of the source data. The following source types are supported: delimited text files, SAS files, MySQL, SQL Server, Oracle, PostgreSQL, Microsoft APS, Microsoft Access, Amazon RedShift, PDW, Teradata, Google BigQuery. Below are connection instructions for each type of data source. Once you have entered the necessary information, use the “Test connection” button to verify that a connection can be made.
WhiteRabbit will look for the files to scan in the same folder you set up as a working directory.
WhiteRabbit will look for .sas7bdat files to scan in the same folder you set up as the working folder.
Note that it is currently not possible to produce fake data for SAS files from a scan report.
Optionally, a port can be specified (<host>:<port>); the port defaults to 3306.
<host>/<service name>, or
Optionally, a port can be specified (<host>:<port>); the port defaults to 1433.
When the SQL Server JDBC drivers are installed, you can also use Windows authentication. In this case, the user name and password should be left empty.

1. Download the .exe from http://msdn.microsoft.com/en-us/sqlserver/aa937724.aspx.
2. Run it, thereby extracting its contents to a folder.
3. In the extracted folder you will find the file _sqljdbc_4.0/enu/auth/x64/sqljdbc_auth.dll_ (64-bit) or _sqljdbc_4.0/enu/auth/x86/sqljdbc_auth.dll_ (32-bit), which needs to be moved to a location on the system path, for example to
Google BigQuery (GBQ) supports two different connection/authentication methods: application default credentials and service account authentication. The former method is considered more secure because it writes auditing events to Stackdriver. The specific method used is determined by the arguments provided to the configuration panel, as described below.
Authentication via application default credentials:
When using application default credentials authentication, you must run the following gcloud command in the user account only once: `gcloud auth application-default login`. An application key is written to
Authentication via service account credentials:
A scan generates a report containing information on the source data that can be used to help design the ETL. Using the Scan tab in WhiteRabbit you can either select individual tables in the selected source database by clicking on ‘Add’ (Ctrl + mouse click), or automatically select all tables in the database by clicking on ‘Add all in DB’.
There are a few setting options as well with the scan:
Once all settings are completed, press the “Scan tables” button. After the scan is completed the report will be written to the working folder.
For various reasons, one could prefer to run WhiteRabbit from the command line. This is possible by specifying all the options one would normally select in the user interface in an .ini file. An example ini file can be found in the iniFileExamples folder. The ini file can then be referenced when calling WhiteRabbit from the command line:
java -jar WhiteRabbit.jar -ini WhiteRabbit.ini
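The fragment below is an illustrative sketch of what such an ini file can contain; the key names and values shown here are assumptions modeled on the kind of settings the user interface exposes, so consult the bundled file in the iniFileExamples folder for the authoritative key list.

```ini
; Illustrative sketch of a WhiteRabbit scan configuration.
; Key names are assumptions - check iniFileExamples for the real ones.
WORKING_FOLDER = C:\WhiteRabbit\output   ; where the scan report is written
DATA_TYPE = PostgreSQL                   ; one of the supported source types
SERVER_LOCATION = localhost/source_db
USER_NAME = scan_user
PASSWORD = secret
TABLES_TO_SCAN = *                       ; or a comma-separated list of tables
SCAN_FIELD_VALUES = yes                  ; scan values, not just table structure
MIN_CELL_COUNT = 5                       ; minimum frequency for a value to be reported
ROWS_PER_TABLE = 100000                  ; number of rows sampled per table
CALCULATE_NUMERICAL_STATS = no
```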
After the scan is completed, a “ScanReport” Excel document will be created in the working folder location selected earlier. The document will have multiple tabs: one “Overview” tab and then one tab for each database table or delimited text file selected for the scan. The “Overview” tab describes each table selected, the columns in each table, the data type of each column, the number of rows in the table, and other statistics.
Below is an example image of the “Overview” tab.

- Column A lists the table the information is about
- Column B: the column name
- Column C: the data type
- Column D: the maximum length of the values (number of characters/digits)
- Column E: the number of rows (with text files it will return -1)
- Column F: how many of the N rows were scanned
- Column G: how many of the scanned rows are empty
- Column H: a count of the unique values within the scanned rows. This number is sometimes an upper limit of the unique values, indicated by a <= sign (v0.9.0)
- Column I: the percentage of unique values among all rows (0% = constant value, 100% = unique column) (v0.9.0)
For a tab that describes a single table, the column names from the source table (or delimited text file) will appear across the columns of the Excel tab. Each source table column will generate two columns in the Excel tab. One column will list all distinct values that have a count greater than the “Min cell count” set at the time of the scan (Section - Performing the Scan). If a list of unique values was truncated, the last value in the list will be “List truncated”; this indicates that there are one or more additional unique source values that appear fewer times than the “Min cell count” described in Section - Performing the Scan. Next to each distinct value will be a second column that contains the frequency, i.e. the number of times that value occurs in the data. These two columns (distinct values and frequency) repeat for all the source columns in the table profiled in the workbook.
The report is powerful for understanding your source data because it highlights what actually exists. For example, the results above were returned for the “SEX” column in one of the tables scanned; we can see that there were two common values (1 and 2) that appeared 61,491 and 35,401 times respectively. WhiteRabbit will not define 1 as male and 2 as female; the data holder will typically need to define the source codes unique to the source system. However, these two values (1 and 2) are not the only values present in the data, because we can see this list was truncated. The other values appear with very low frequency (below the “Min cell count”) and often represent incorrect or highly suspicious values. When developing the ETL, we should plan to handle not only the high-frequency gender values 1 and 2 but also the low-frequency values that exist within this column.
If the option for numerical statistics is checked, then a set of statistics is calculated for all integer, real and date data types. The following statistics are added to the Overview sheet:
If the number of values is smaller than the set reservoir size, then the standard deviation and three quartile boundaries are the exact population statistics. Otherwise, the statistics are approximated based on a representative sample. The average, minimum and maximum are always true population statistics. For dates, the numerical statistics are calculated by using epoch days. The standard deviation of dates is given in days, the other statistics are converted to a date representation.
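The sampling scheme described above can be sketched in Python. This is a minimal illustration of the idea (reservoir sampling plus exact min/max/average), not WhiteRabbit's actual Java implementation; the function and parameter names are invented for the example.

```python
import random
import statistics

def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of at most k items from a stream."""
    sample = []
    for i, v in enumerate(stream):
        if i < k:
            sample.append(v)
        else:
            # Replace an existing element with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = v
    return sample

def numeric_stats(values, reservoir_size=100_000):
    """Exact average/min/max; sampled stdev and quartiles if data is large."""
    values = list(values)
    stats = {
        # Average, minimum and maximum are always true population statistics.
        "average": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }
    # Stdev and quartile boundaries are exact only if everything fits in
    # the reservoir; otherwise they are approximated from a uniform sample.
    if len(values) <= reservoir_size:
        basis = values
    else:
        basis = reservoir_sample(values, reservoir_size, random.Random(0))
    q1, median, q3 = statistics.quantiles(basis, n=4)
    stats.update({"stdev": statistics.stdev(basis),
                  "q1": q1, "median": median, "q3": q3})
    return stats
```

For dates, the same machinery would apply after converting each date to epoch days, with all results except the standard deviation converted back to a date representation.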
This feature allows one to create a fake dataset based on a WhiteRabbit scan report. Fake data can be generated and output directly to database tables or as CSV files. The resulting dataset could be used to develop ETL code when direct access to the data is not available. This feature is currently still experimental.