Using Python for PostgreSQL uploads
James P. Gilbert
2025-09-18
Source: vignettes/UsingPythonUploads.Rmd
Introduction
Note: this feature should be considered experimental until further notice.
Using DatabaseConnector with PostgreSQL without bulk upload functionality is often slow, while its bulk upload path requires installing and configuring the PostgreSQL binaries on the system, which may be challenging or restricted in many environments. In addition, that method requires writing an R data.frame to disk, which is prohibitively slow if your data is already in CSV format.
As a consequence, this package supports bulk uploading through Python with a small amount of configuration. This uses no R memory; CSV files are transferred directly through Python and uploads will be considerably faster.
This process uses the psycopg2 Python library, which can be installed via compilation or in binary form. This vignette demonstrates usage with the psycopg2-binary package.
Installing psycopg2
Using a virtualenv
ResultModelManager provides an interactive function for enabling the Python library. If psycopg2 is already available, this function will do nothing. However, if there is no available binary (and the reticulate package is not installed) you will be asked to install these packages. To do this, run the following:
ResultModelManager::enablePythonUploads()
Alternatively, you can trigger the installation manually:
ResultModelManager::install_psycopg2()
Using conda or system Python installs
Please consult the reticulate documentation on how to install the psycopg2-binary package.
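For example, a minimal sketch using reticulate directly; the conda environment name "r-rmm" below is purely illustrative.

library(reticulate)

# Point reticulate at an existing conda environment
# ("r-rmm" is a hypothetical name; substitute your own)
use_condaenv("r-rmm", required = TRUE)

# Install the pre-built psycopg2 wheel into that environment
py_install("psycopg2-binary", pip = TRUE)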
Usage within functions
By default, this functionality is not enabled when uploading tables, and the function pyUploadCsv will fail. To enable it and upload a CSV directly, try the following example code.
ResultModelManager::enablePythonUploads()
connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms = "postgresql",
  server = "myserver.com",
  port = 5432,
  password = "s",
  user = "me",
  database = "some_db"
)
connection <- DatabaseConnector::connect(connectionDetails)
readr::write_csv(
  data.frame(
    id = 1:1e6,
    text = paste(1:1e6, "bottle(s) on the wall")
  ),
  "my_massive_csv.csv"
)
ResultModelManager::pyUploadCsv(
  connection,
  table = "my_table",
  filepath = "my_massive_csv.csv",
  schema = "my_schema"
)
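As a quick sanity check, you can count the rows that arrived using DatabaseConnector's standard query helper; this assumes the table and schema from the example above.

DatabaseConnector::renderTranslateQuerySql(
  connection,
  "SELECT COUNT(*) FROM @schema.@table;",
  schema = "my_schema",
  table = "my_table"
)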
Note that you are not required to call ResultModelManager::enablePythonUploads() every time. As an alternative, add the following line to your .Renviron file (note that this will assume that setup of the Python libraries has been completed):
RMM_USE_PYTHON_UPLOADS=TRUE
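If you prefer not to edit .Renviron, the same switch can be set for the current R session only; this assumes the variable is read when the upload functions are called and, again, that the Python setup has been completed.

Sys.setenv(RMM_USE_PYTHON_UPLOADS = "TRUE")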
The astute reader will realize that this approach still requires an IO call to write the CSV to disk. In many situations this will be a major bottleneck. A much more sensible approach is to use a string buffer. Thankfully, the author of this package has provided such an interface! In the sketch below, the data argument name for the in-memory data.frame is assumed:
bigData <- data.frame(
  id = 1:1e6,
  text = paste(1:1e6, "bottle(s) on the wall")
)

ResultModelManager::pyUploadDataframe(
  connection,
  table = "my_table",
  data = bigData, # assumed argument name; no CSV is written to disk
  schema = "my_schema"
)
Note that this approach is already implemented for you when you use the uploadResults functionality. That's right: if you call ResultModelManager::enablePythonUploads() (and you are using postgres) you will be able to upload your large R data.frames to postgres!
ResultModelManager::enablePythonUploads()

ResultModelManager::uploadResults(
  connectionDetails,
  schema = "my_schema",
  resultsFolder = "my_results_folder",
  tablePrefix = "cm_",
  purgeSiteDataBeforeUploading = FALSE,
  specifications = getResultsDataModelSpec()
)
Better yet, calling ResultModelManager::enablePythonUploads() before uploading results from any OHDSI package will automatically give you this fast(er) upload functionality.