Chapter 3 Open Science

Chapter lead: Kees van Bochove

From the inception of the OHDSI community, the goal was to establish an international collaborative by building on open-science values, such as the use of open-source software, public availability of all conference proceedings and materials, and transparent, open-access publication of generated medical evidence. But what exactly is open science? And how could OHDSI build an open-science or open-data strategy around medical data, which is very privacy-sensitive and typically not open for good reasons? Why is it so important to have reproducibility of analysis, and how does the OHDSI community aim to achieve this? These are some of the questions that we touch on in this chapter.

3.1 Open Science

The term ‘open science’ has been used since the nineties, but it really gained traction in the 2010s, during the same period OHDSI was born. Wikipedia (Wikipedia 2019a) defines it as “the movement to make scientific research (including publications, data, physical samples, and software) and its dissemination accessible to all levels of an inquiring society, amateur or professional,” and goes on to state that it is typically developed through collaborative networks. Although the OHDSI community never positioned itself explicitly as an ‘open-science’ collective or network, the term is frequently used to explain the driving concepts and principles behind OHDSI. For example, in 2015, Jon Duke presented OHDSI as “An Open Science Approach to Medical Evidence Generation,”⁷ and in 2019, the EHDEN consortium’s introductory webinar hailed the OHDSI network approach as “21st Century Real World Open Science.”⁸ Indeed, as we shall see in this chapter, many of the practices of open science can be found in today’s OHDSI community. One could argue that the OHDSI community is a grassroots open-science collective driven by a shared desire for improving the transparency and reliability of medical evidence generation.

Open-science or “Science 2.0” (Wikipedia 2019b) approaches aim to address a number of perceived problems in current scientific practice. Information technology has led to an explosion of data generation and analysis methods, and for individual researchers, it is very hard to keep up with all literature published in their area of expertise. This holds even more true for medical doctors who have a practice to run as their day job, but still need to keep abreast of the latest medical evidence. In addition, there is growing concern that many experiments may suffer from poor statistical designs, publication bias, p-hacking and similar statistical problems, and are hard to reproduce. The traditional method of addressing these concerns, peer review of published articles, often fails to identify and tackle these problems. The special 2018 Nature edition on “Challenges in irreproducible research”⁹ includes several examples of this. A group of authors attempting to apply systematic peer review to the articles in their field found that, for various reasons, it was very hard to get the errors they identified rectified. Experiments that have a flawed design to begin with are especially hard to correct. In the words of Ronald Fisher: “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” (Wikiquote 2019) The authors encountered common statistical problems such as poor randomization designs leading to false conclusions about statistical significance, miscalculations in meta-analyses, and inappropriate baseline comparisons. (Allison et al. 2016) Another paper from the same collection, taking experiences from physics as an example, argues that it is critical to not only provide access to the underlying data, but also to publish and properly document the data processing and analysis scripts to achieve full reproducibility. (Chen et al. 2018)

The OHDSI community addresses these challenges in its own way, and it puts significant emphasis on the importance of generating medical evidence at scale. As stated in Schuemie, Ryan, et al. (2018), while the current paradigm “centers on generating one estimate at a time using a unique study design with unknown reliability and publishing (or not) one estimate at a time,” the OHDSI community “advocates for high-throughput observational studies using consistent and standardized methods, allowing evaluation, calibration and unbiased dissemination to generate a more reliable and complete evidence base.” This is achieved by a combination of a network of medical data sources that map their data to the OMOP common data model, open source analytics code that can be used and verified by all, and large-scale baseline data such as the condition occurrences published at howoften.org. In the following paragraphs, concrete examples are provided and the open-science approach of OHDSI is detailed further using the four principles of Open Standards, Open Source, Open Data and Open Discourse as a guide. The chapter is concluded with a brief reference to the FAIR principles and outlook for OHDSI from an open-science perspective.

3.2 Open-Science in Action: the Study-a-Thon

A recent development in the community is the emergence of ‘study-a-thons’: short, concentrated face-to-face gatherings of a multidisciplinary group of scientists aimed at answering an important, clinically relevant research question using the OMOP data model and the OHDSI tools. A nice example is the 2018 Oxford study-a-thon, which is explained in an EHDEN webinar¹⁰ that provides a walkthrough of the process and also highlights the openly available results. In the period leading up to the study-a-thon, the participants propose medically relevant research questions, and one or more of these are selected for study during the study-a-thon itself. Data is provided by participants who have access to patient-level data in OMOP format and are able to run queries on these data sources. Much of the actual study-a-thon time is devoted to discussing the statistical approach (see also chapter 2), the suitability of the data sources, the results which are interactively produced and the follow-up questions that are inevitably raised by these results. In the case of the Oxford study-a-thon, the questions centered around studying adverse post-surgical effects of different knee replacement methods, and the results were published interactively during the study-a-thon using the OHDSI forums and tools (see chapter 8). The OHDSI tools such as ATLAS facilitate rapid creation, exchange, discussion and testing of cohort definitions, which greatly speeds up the initial process of achieving consensus on problem definition and choice of methods. Thanks to the usage of the OMOP Common Data Model by the involved data sources and the availability of the OHDSI open source patient-level prediction packages (see chapter 13), it was possible to create a prediction model for 90-day post-operative mortality in one day, and validate the model externally in several large data sources the day after.
The study-a-thon also resulted in a traditional scholarly paper (Development and validation of patient-level prediction models for adverse outcomes following total knee arthroplasty, Ross Williams, Daniel Prieto-Alhambra et al., manuscript in preparation), which took months to process through peer review. But the fact that the analysis scripts and results for several healthcare databases covering hundreds of millions of patient records were conceived, produced and published from scratch within a week illustrates the fundamental improvements OHDSI can bring to medical science, reducing the turnaround time for evidence to become available from months to days.

3.3 Open Standards

A very significant community resource maintained by the OHDSI community is the OMOP Common Data Model (see chapter 4) and the associated Standardized Vocabularies (see chapter 5). The model itself is scoped to capture observational healthcare data, and it was originally meant to analyze associations between exposures such as drugs, procedures, devices, etc., and outcomes such as conditions and measurements. It has since been extended for various analysis use cases (see also chapter 7). However, harmonizing healthcare data worldwide from a wide variety of coding systems, healthcare paradigms and types of healthcare sources requires a massive number of ‘mappings’ between source codes and their closest standardized counterparts. The OMOP Standardized Vocabularies, further described in chapter 5, include mappings from hundreds of medical coding systems that are used worldwide, and are browsable through the OHDSI Athena tool. By providing these vocabularies and mappings as a freely available community resource, OMOP and the OHDSI community make a significant contribution to healthcare data analytics. The OMOP Common Data Model is, by several accounts, the most comprehensive model for this purpose, representing approximately 1.2 billion healthcare records worldwide.¹¹ (Garza et al. 2016)
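To illustrate what such a mapping looks like in practice, the following minimal sketch resolves a source code to its standardized counterpart through a ‘Maps to’ relationship. The two-table layout loosely mirrors the CONCEPT and CONCEPT_RELATIONSHIP tables of the Standardized Vocabularies as we understand them, reduced to a handful of columns. The vocabulary names, codes, concept IDs and concept names are invented for this example, and an in-memory SQLite database stands in for a real vocabulary download.

```python
import sqlite3


def build_toy_vocabulary():
    """Create an in-memory database holding two invented concepts and one mapping."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE concept (
            concept_id       INTEGER PRIMARY KEY,
            concept_name     TEXT,
            vocabulary_id    TEXT,
            concept_code     TEXT,
            standard_concept TEXT
        );
        CREATE TABLE concept_relationship (
            concept_id_1    INTEGER,
            concept_id_2    INTEGER,
            relationship_id TEXT
        );
        """
    )
    # Hypothetical rows: one source concept and one standard concept ('S').
    conn.executemany(
        "INSERT INTO concept VALUES (?, ?, ?, ?, ?)",
        [
            (1001, "Toy source term", "ToySource", "K-01", None),
            (2001, "Toy standard concept", "ToyStandard", "S-01", "S"),
        ],
    )
    conn.execute(
        "INSERT INTO concept_relationship VALUES (1001, 2001, 'Maps to')"
    )
    return conn


def map_to_standard(conn, vocabulary_id, concept_code):
    """Return (concept_id, concept_name) of the standard concepts a source code maps to."""
    query = """
        SELECT target.concept_id, target.concept_name
        FROM concept AS source
        JOIN concept_relationship AS rel
          ON rel.concept_id_1 = source.concept_id
         AND rel.relationship_id = 'Maps to'
        JOIN concept AS target
          ON target.concept_id = rel.concept_id_2
         AND target.standard_concept = 'S'
        WHERE source.vocabulary_id = ?
          AND source.concept_code = ?
    """
    return conn.execute(query, (vocabulary_id, concept_code)).fetchall()


if __name__ == "__main__":
    toy = build_toy_vocabulary()
    print(map_to_standard(toy, "ToySource", "K-01"))
```

A source code without a mapping simply yields an empty list, which is the situation an ETL team would need to review by hand.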

3.4 Open Source

Another key resource the OHDSI community provides is its open source software. This can be divided into several categories, such as the helper tools to map data to OMOP (see chapter 6), the OHDSI Methods Library, which contains a powerful suite of commonly used statistical methods, open source code for published observational studies, and ATLAS, Athena and other infrastructure-related software which underpins the OHDSI ecosystem (see chapter 8). From an open-science perspective, one of the most important resources is the code for the actual execution of studies, such as studies from the OHDSI Research Network (see chapter 20). In turn, these programs leverage the fully open source OHDSI stack, which can be inspected, reviewed and contributed to via GitHub. For example, network studies often build on the Methods Library, which ensures a consistent re-use of statistical methods across analytical use cases. See chapter 17 for a more detailed overview of how the use of and collaboration on open source software in OHDSI ultimately underpins the quality and reliability of the generated evidence.

3.5 Open Data

Because of the privacy-sensitive nature of healthcare data, fully open, comprehensive patient-level datasets are typically not available. However, it is possible to leverage OMOP-mapped datasets to publish important aggregated data and result sets, such as the earlier mentioned http://howoften.org and other public result sets that are published to http://data.ohdsi.org. Also, the OHDSI community provides simulated datasets such as SynPUF for testing and development purposes, and the OHDSI Research Network (see chapter 20) can be leveraged to run studies in a network of available data sources that have mapped their data to OMOP. In order to make the mapping between the source data and the OMOP CDM transparent, data sources are encouraged to re-use the OHDSI ETL or ‘mapping’ tools and to publish their mapping code as open source as well.

3.6 Open Discourse

Open standards, open source and open data are great assets, but left by themselves, they will not impact medical practice. Key to the open-science practice and impact of OHDSI is the implementation of medical evidence generation and the translation of the science to medical practice. The OHDSI community has several annual OHDSI Symposia, held in the United States, Europe, and Asia, as well as dedicated communities of practice in, amongst others, China and Korea. These symposia discuss the advancements in statistical methods, data and software tooling, the standardized vocabularies, and all other aspects of the OHDSI open source community. The OHDSI forums¹² and wiki¹³ facilitate thousands of researchers worldwide in practicing observational research. The community calls¹⁴ and the code, issues and pull requests in GitHub¹⁵ constantly evolve the open-community assets such as code and the CDM, and in the OHDSI Network Studies, global observational research is practiced in an open and transparent way using hundreds of millions of patient records worldwide. Openness and open discourse are encouraged throughout the community, and this very book is written via an open process facilitated by the OHDSI wiki, community calls and a GitHub repository.¹⁶ It needs to be stressed, however, that without all the OHDSI collaborators, the processes and tools would be empty shells. Indeed, one could argue that the true value of the OHDSI community lies with its members, who share a vision of improving health through collaborative and open science, as discussed in Chapter 1.

3.7 OHDSI and the FAIR Guiding Principles

3.7.1 Introduction

This last section of the chapter looks at the current state of the OHDSI community and tooling through the lens of the FAIR Data Guiding Principles published in Wilkinson et al. (2016).

3.7.2 Findability

Any healthcare database that is mapped to OMOP and used for analytics should, from a scientific perspective, persist for future reference and reproducibility. The use of persistent identifiers for OMOP databases is not yet widespread, partly because these databases are often contained behind firewalls and on internal networks and not necessarily connected to the internet. However, it is entirely possible to publish summaries of the databases as a descriptor record that can be referenced, e.g. for citation purposes. This method is followed, for example, in the EMIF catalog¹⁷, which provides a comprehensive record of the database in terms of data-gathering purpose, sources, vocabularies and terms, access control mechanisms, license, consents, etc. (Oliveira, Trifan, and Silva 2019) This approach is further developed in the IMI EHDEN project.

3.7.3 Accessibility

Accessibility of OMOP mapped data through an open protocol is typically achieved through the SQL interface, which combined with the OMOP CDM provides a standardized and well-documented method for accessing OMOP data. However, as discussed above, OMOP sources are often not directly available over the internet for security reasons. Creating a secure worldwide healthcare data network that is accessible for researchers is an active research topic and operational goal of projects like IMI EHDEN. However, results of analyses in multiple OMOP databases, as shown through OHDSI initiatives such as LEGEND and http://howoften.org, can be openly published.
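As a rough sketch of how the SQL interface and the CDM combine, the example below runs a standard SQL query against a toy table and returns only aggregate counts of the kind that could be shared openly. The table is a heavily reduced stand-in for the CDM's CONDITION_OCCURRENCE table with only four columns, all rows and concept IDs are invented, and an in-memory SQLite database replaces a real, access-controlled data source.

```python
import sqlite3


def build_toy_cdm():
    """Create an in-memory stand-in for a (much reduced) condition_occurrence table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE condition_occurrence (
            condition_occurrence_id INTEGER PRIMARY KEY,
            person_id               INTEGER,
            condition_concept_id    INTEGER,
            condition_start_date    TEXT
        )
        """
    )
    # Invented patient-level rows: persons 10 and 11 have condition 2001,
    # person 12 has condition 2002.
    conn.executemany(
        "INSERT INTO condition_occurrence VALUES (?, ?, ?, ?)",
        [
            (1, 10, 2001, "2018-01-05"),
            (2, 10, 2001, "2018-03-01"),
            (3, 11, 2001, "2018-02-10"),
            (4, 12, 2002, "2018-04-20"),
        ],
    )
    return conn


def persons_per_condition(conn):
    """Return (condition_concept_id, number of distinct persons), sorted by concept."""
    query = """
        SELECT condition_concept_id,
               COUNT(DISTINCT person_id) AS person_count
        FROM condition_occurrence
        GROUP BY condition_concept_id
        ORDER BY condition_concept_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for concept_id, person_count in persons_per_condition(build_toy_cdm()):
        print(concept_id, person_count)
```

Because the query only refers to standardized table and column names, the same statement could in principle be sent unchanged to every participating data source, with only the aggregated rows leaving each site.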

3.7.4 Interoperability

Interoperability is arguably the strong suit of the OMOP data model and OHDSI tooling. In order to build a strong network of medical data sources worldwide which can be leveraged for evidence generation, achieving interoperability between healthcare data sources is key, and this is achieved through the OMOP model and Standardized Vocabularies. Moreover, by sharing cohort definitions and statistical approaches, the OHDSI community goes beyond code mapping and also provides a platform to build an interoperable understanding of the analysis methods for healthcare data. Since healthcare systems such as hospitals are often the source of record for OMOP data, the interoperability of the OHDSI approach could be further enhanced by alignment with operational healthcare interoperability standards such as HL7 FHIR, HL7 CIMI and openEHR. The same is true for alignment with clinical interoperability standards such as CDISC and biomedical ontologies. Especially in areas such as oncology, this is an important topic, and the Oncology Working Group and Clinical Trials Working Group in the OHDSI community provide good examples of forums where these issues are actively discussed. In terms of references to other data and specifically ontology terms, ATLAS and OHDSI Athena are important tools, as they allow the exploration of the OMOP Standardized Vocabularies in the context of other available medical coding systems.

3.7.5 Reusability

The FAIR principles around reusability focus on important issues such as the data license, provenance (clarifying how the data came into existence) and the link to relevant community standards. Data licensing is a complicated topic, especially across jurisdictions, and it would fall outside of the scope of this book to cover it extensively. However, it is important to state that if you intend for your data (e.g. analysis results) to be freely used by others, it is good practice to explicitly provide these permissions via a data license. This is not yet a common practice for most data that can be found on the internet, and the OHDSI community is unfortunately not an exception here. Concerning the data provenance of OMOP databases, potential improvements exist for making metadata available in an automated way, including, for example, CDM version, Standardized Vocabularies release, custom code lists, etc. The OHDSI ETL tools do not currently produce this information automatically, but working groups such as the Data Quality Working Group and Metadata Working Group actively work on these. Another important aspect is the provenance of the underlying databases themselves; it is important to know if a hospital or GP information system was replaced or changed, and when known data omissions or other data issues occurred historically. Exploring ways to attach this metadata systematically in the OMOP CDM is the domain of the Metadata Working Group.
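To make the kind of provenance metadata listed above more concrete, the following sketch collects it in a small machine-readable record and serializes it to JSON. This is purely illustrative: the class, its field names and all values are hypothetical and do not correspond to an existing OHDSI tool, CDM table or FAIR-mandated schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ProvenanceRecord:
    """Hypothetical descriptor for one OMOP-mapped database (illustrative field names)."""

    database_name: str
    cdm_version: str
    vocabulary_release: str
    data_license: str
    custom_code_lists: List[str] = field(default_factory=list)
    known_issues: List[str] = field(default_factory=list)


def to_json(record: ProvenanceRecord) -> str:
    """Serialize the record to JSON with sorted keys so that outputs are comparable."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


if __name__ == "__main__":
    # All values below are placeholders, not a description of a real database.
    example = ProvenanceRecord(
        database_name="Example Hospital EHR",
        cdm_version="placeholder-cdm-version",
        vocabulary_release="placeholder-vocabulary-release",
        data_license="placeholder-license",
        custom_code_lists=["local-lab-codes"],
        known_issues=["Source system replaced; records before the switch may be incomplete"],
    )
    print(to_json(example))
```

Publishing such a record alongside analysis results would let a reader see at a glance which CDM version, vocabulary release and license apply, without needing access to the underlying database.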

The OHDSI community can be seen as an open-science community that is actively pursuing the interoperability and reproducibility of medical evidence generation.
It also advocates a paradigm shift from single study and single estimate medical research to large-scale systematic evidence generation, where facts such as baseline occurrence are known and the evidence focuses on statistically estimating the effects of interventions and treatments from real world healthcare sources.

References

Allison, D. B., A. W. Brown, B. J. George, and K. A. Kaiser. 2016. “Reproducibility: A tragedy of errors.” Nature 530 (7588): 27–29.

Chen, Xiaoli, Sünje Dallmeier-Tiessen, Robin Dasler, Sebastian Feger, Pamfilos Fokianos, Jose Benito Gonzalez, Harri Hirvonsalo, et al. 2018. “Open Is Not Enough.” Nature Physics 15 (2): 113–19. https://doi.org/10.1038/s41567-018-0342-2.

Garza, M., G. Del Fiol, J. Tenenbaum, A. Walden, and M. N. Zozus. 2016. “Evaluating common data models for use with a longitudinal community registry.” J Biomed Inform 64 (December): 333–41.

Oliveira, José Luı́s, Alina Trifan, and Luı́s A. Bastião Silva. 2019. “EMIF Catalogue: A Collaborative Platform for Sharing and Reusing Biomedical Data.” International Journal of Medical Informatics 126 (June): 35–45. https://doi.org/10.1016/j.ijmedinf.2019.02.006.

Schuemie, M. J., P. B. Ryan, G. Hripcsak, D. Madigan, and M. A. Suchard. 2018. “Improving reproducibility by using high-throughput observational studies with empirical calibration.” Philos Trans A Math Phys Eng Sci 376 (2128).

Wikipedia. 2019a. “Open science — Wikipedia, the Free Encyclopedia.” http://en.wikipedia.org/w/index.php?title=Open%20science&oldid=900178688.

Wikipedia. 2019b. “Science 2.0 — Wikipedia, the Free Encyclopedia.” http://en.wikipedia.org/w/index.php?title=Science%202.0&oldid=887565958.

Wikiquote. 2019. “Ronald Fisher — Wikiquote.” https://en.wikiquote.org/w/index.php?title=Ronald_Fisher&oldid=2638030.

Wilkinson, M. D., M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, et al. 2016. “The FAIR Guiding Principles for scientific data management and stewardship.” Sci Data 3 (March): 160018.