OPTUM EHR Oncology Module ETL documentation

Background

The Optum® Enriched Oncology Data set is a group of tables that can supplement the Optum® de-identified Electronic Health Record dataset. It contains additional cancer-specific information on a subset of the EHR population that has at least one solid or non-solid tumor diagnosis. In addition to structured data obtained from medical records, the enriched oncology tables provide clinical information extracted from provider notes using natural language processing (NLP) machine learning models (for example, tumor progression, histology, etc.).

Optum oncology data initiatives include enriching Optum EHR data by extracting essential information from the oncology patient’s medical records and making it usable for researchers. Specific oncology concepts important for understanding the progression of the disease are often not available in structured formats, particularly the tumor, node, and metastasis (TNM) values, stage information and biomarkers.

As of 2022, there are approximately 1.9 million patients with at least one solid tumor ICD-9 or ICD-10 diagnosis included in the data set. Given the potential for research, the dataset was converted to the OMOP CDM to allow integration with standard OHDSI methods and tools.

How to run this conversion

This conversion is run after the main corpus of the data is converted, then the results are inserted in CDM tables.

Overall logic – linkage between source and CDM tables

alt_text

Table name: condition_occurrence

Reading from onc_neoplasm_histology, Onc_characteristic

Onc_neoplasm_histology contains neoplasm type (topography) and histology, which are the typical axes for the OMOP cancer diagnosis, it also has ‘direction’ field corresponding to the laterality which is a part of diagnosis in SNOMED as well.

Onc_characteristic table has neoplasm_characteristic and histology_characteristic fields that mostly describe typical cancer modifiers but also have ‘in situ’, ‘malignant’, ‘benign’ values that are a part of a diagnosis in OMOP.

If onc_characteristic.histology_characteristic = ‘in situ’ and at the same day onc_characteristic.histology_characteristic = ‘invasive’, do not use these onc_characteristic entries in the ETL.

Thus, Condition concept is a result of a precordination of 5 columns:

Onc_neoplasm_histology.neoplasm_type
Onc_neoplasm_histology.histology
Onc_neoplasm_histology.direction
Onc_characteristic.neoplasm_characteristic
Onc_characteristic.histology_characteristic

We put only positive actual values into OMOP CDM:

Read from onc_neoplasm_histology where (neoplasm_temporal_status  =’current’ OR  histology_temporal_status  =’current’)

AND (neoplasm_qualifier =’actual’ OR histology_qualifier =’actual’)

Read from the Onc_characteristic where (ch.neoplasm_char_temporal_status =’current’ OR histology_char_temporal_status =’current’)

Note, one of the histology or neoplasm column can have empty value, but the other one is always filled, so we use OR in these cases.

Once table is generated, de-dup it by (condition_source_value, person_id , condition_start_date)

You also have to create a crosswalk between condition_occurrence_id to neoplasm_histology_key with 1:1 to 1:M condition_occurrence_id: neoplasm_histology_key

Direction with o’clock goes to measurement

Direction with laterality goes to Condition

onc_neoplasm_histology diagram

alt_text

onc_characteristic diagram

alt_text

Destination Field Source Field Logic Comment
condition_occurrence_id
person_id ptid
condition_concept_id Precoordination of

onc_neoplasm_histology: neoplasm_type
direction
histology

+ onc_characteristic: neoplasm_characteristic histology_characteristic

See the script how to put them together

Mapping of the output of the script that concatenates these 5 fields and get rid of duplicates. Same logic is used to map the source data.

Note, for now not merge only histology or only neoplasm entries using a query because several cancers can be presents at one day

Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters:

Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCNHC’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero

condition_start_date onc_neoplasm_histology .note_date

onc_neoplasm_histology .neoplasm_dx_date

When neoplasm_dx_date =’exact’ then to_date (neoplasm_dx_date, ‘yyyyMMdd’)

When neoplasm_dx_date =’incomplete’ and day is unknown and

neoplasm_dx_date (yyMM) < note_date (yyMM) then to_date (neoplasm_dx_date, ‘yyyyMM(dd =15)’

When neoplasm_dx_date =’incomplete’ and day and month are unknown and

neoplasm_dx_date (yy) < note_date (yy) then to_date (neoplasm_dx_date, ‘yyyy(MM =06 dd =30)’

Else note_date

condition_start_datetime
condition_end_date
condition_end_datetime
condition_type_concept_id 32882 Standard algorithm from EHR
condition_status_concept_id
stop_reason
provider_id
visit_occurrence_id
visit_detail_id
condition_source_value See the logic in condition_concept_id
condition_source_concept_id 0
condition_status_source_value NULL

Table name: measurement

Reading from onc_stage

Where stage!=''

alt_text

Destination Field Source Field Logic Comment
measurement_id
person_id ptid
measurement_concept_id replace ( 'stage '|| stage_prefix || ' '|| stage, ' ', ' ') Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCSTG’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL \ If there is no mapping available, set concept_id to zero
measurement_date note_date
measurement_source_value replace ( 'stage '|| stage_prefix || ' '|| stage, ' ', ' ')
value_as_number
unit_concept_id
value_as_concept_id
meas_event_field_concept_id 1147127 'condition_occurrence.condition_occurrence_id'
measurement_event_id fill by CONDITION.condition_id using neoplasm_histology_key OR if neoplasm_histology_key is null, use (ptid, note_date) to connect with the onc_neoplasm_histology table
measurement_source_concept_id 0
measurement_datetime
measurement_time
measurement_type_concept_id 32882 Standard algorithm from EHR
operator_concept_id
range_low
range_high
provider_id
visit_occurrence_id
visit_detail_id
unit_source_value
unit_source_concept_id
value_source_value

Reading from onc_metastatic_location

alt_text

Rule: populate CDM only when mets_temporal_status = ‘current’, history of measurement is not supported by the model. Rows where metastasis_location is NULL (metastasis_location=’’) are included, they mean that metastasis is present but the exact location is unknown.

Destination Field Source Field Logic Comment
measurement_id
person_id ptid
measurement_concept_id ‘Metastasis to ‘|| metastasis_location Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCMET’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL

If there is no mapping available, set concept_id to zero

measurement_date note_date

metastasis_dx_date

When metastasis_dx_date_type=’exact’ then to_date (metastasis_dx_date, ‘yyyyMMdd’)

When metastasis_dx_date_type=’incomplete’ and day is unknown and metastasis_dx_date_type(yyMM) < note_date (yyMM) \ then to_date (metastasis_dx_date, ‘yyyyMM(dd =15)’

When metastasis_dx_date_type=’incomplete’ and day and month are unknown and

metastasis_dx_date_type(yy) < note_date (yy) \ then to_date (metastasis_dx_date, ‘yyyy(MM =06 dd =30)’

Else note_date

measurement_source_value ‘Metastasis to ‘|| metastasis_location
value_as_number
unit_concept_id
value_as_concept_id mets_qualifier When mets_qualifier = ‘actual’ then 4181412 -- Present

When mets_qualifier =’absent’ then 4132135 -- Absent

meas_event_field_concept_id 1147127 'condition_occurrence.condition_occurrence_id'
measurement_event_id fill by CONDITION.condition_id using neoplasm_histology_key OR if neoplasm_histology_key is null, use (ptid, note_date), this should be analyzed better, maybe we do not connect the events at all if neoplasm_histology_key is null
measurement_source_concept_id 0
measurement_datetime
measurement_time
measurement_type_concept_id 32882 Standard algorithm from EHR
operator_concept_id
range_low
range_high
provider_id
visit_occurrence_id
visit_detail_id
unit_source_value
unit_source_concept_id
value_source_value mets_qualifier mets_qualifier

Reading from onc_tumor_node_metastasis

alt_text

Populate only if the main value (T, N or M) is not null:

Destination Field Source Field Logic Comment
measurement_id
person_id ptid
measurement_concept_id trim (stage_prefix||' '|| t_prefix|| 'T'||t|| t_suffix) trim (stage_prefix||' '||n_prefix||'N'||n|| n_suffix) trim (stage_prefix||' '|| m_prefix||'M'||m|| m_suffix) Before concatenation NULL values to be replaced with ‘’ Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCTNM’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero each T, N and M category makes a separate measurement entry
measurement_date note_date
measurement_source_value See measurement_concept_id
value_as_number
unit_concept_id
value_as_concept_id
meas_event_field_concept_id 1147127 'condition_occurrence.condition_occurrence_id'
measurement_event_id condition_occurrence.condition_occurrence_id fill by CONDITION.condition_id using neoplasm_histology_key OR if neoplasm_histology_key is null, use (ptid, note_date) this should be analyzed better.
measurement_source_concept_id
measurement_datetime
measurement_time
measurement_type_concept_id 32882 Standard algorithm from EHR
operator_concept_id
range_low
range_high
provider_id
visit_occurrence_id
visit_detail_id
unit_source_value
unit_source_concept_id
value_source_value

Reading from onc_tumor_grade

alt_text

Destination Field Source Field Logic Comment
measurement_id
person_id ptid
measurement_concept_id ‘Tumor grade: ’|| tumor_grade Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCTGR’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero each category makes a separate measurement entry
measurement_date note_date
measurement_source_value ‘Tumor grade: ’|| tumor_grade
value_as_number
unit_concept_id
value_as_concept_id
meas_event_field_concept_id 1147127 'condition_occurrence.condition_occurrence_id'
measurement_event_id condition_occurrence.condition_occurrence_id fill by CONDITION.condition_id using neoplasm_histology_key OR if neoplasm_histology_key is null, use (ptid, note_date) this should be analyzed better.
measurement_source_concept_id 0
measurement_datetime
measurement_time
measurement_type_concept_id 32882 Standard algorithm from EHR
operator_concept_id
range_low
range_high
provider_id
visit_occurrence_id
visit_detail_id
unit_source_value
unit_source_concept_id
value_source_value

Reading from onc_characteristic

Rule: Populate only with temporal_status ='current' and exclude characteristics that are the part of a condition or NULL:

Note, the char_temporal_status always accompanies its respective characteristic.

If onc_characteristic.histology_characteristic = ‘in situ’ and at the same day onc_characteristic.histology_characteristic = ‘invasive’, do not use these onc_characteristic entries in the ETL.

So, in SQL we can say (use similar logic to extract the concepts for mapping):
Select ptid, note_date, neoplasm_characteristic
from onc_characteristic
where neoplasm_char_temporal_status ='current'
and neoplasm_characteristic not in ('in situ', 'malignant', 'benign')
and neoplasm_characteristic !=''

Union –- deals with duplicates when neoplasm_characteristic = histology_characteristic

Select ptid, note_date, histology_characteristic
from onc_characteristic
where histology_char_temporal_status ='current'
and histology_characteristic not in ('in situ', 'malignant', 'benign')
and histology_characteristic !=''

alt_text

Destination Field Source Field Logic Comment
measurement_id
person_id ptid
measurement_concept_id neoplasm_characteristic

histology_characteristic

Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters:

Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCCHR’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL

If there is no mapping available, set concept_id to zero

Each field creates its own entry in measurement table except the cases when neoplasm_characteristic = histology_characteristic
measurement_date note_date
measurement_source_value neoplasm_characteristic

histology_characteristic

Each field creates its own entry in measurement table except the cases when neoplasm_characteristic = histology_characteristic
value_as_number
unit_concept_id
value_as_concept_id when the neoplasm_characteristic = ‘metastatic’ or histology_characteristic = ‘metastatic’ then

4181412 -- Present

meas_event_field_concept_id 1147127 'condition_occurrence.condition_occurrence_id'
measurement_event_id condition_occurrence.condition_occurrence_id fill by CONDITION.condition_id using neoplasm_histology_key OR if neoplasm_histology_key is null, use (ptid, note_date) this should be analyzed better.
measurement_source_concept_id 0
measurement_datetime
measurement_time
measurement_type_concept_id 32882 Standard algorithm from EHR
operator_concept_id
range_low
range_high
provider_id
visit_occurrence_id
visit_detail_id
unit_source_value
unit_source_concept_id
value_source_value

Reading from onc_biomarker

Problem 1. (to be discussed with the WG)

There are both numeric (1+, 2+, 3+) and narrative results such as negative, equivocal, positive.

Can we interpret 1+ as negative, 2+ as equivocal, 3+ as positive?

For example, 2+ has an equivalent number of equivocal and (positive+negative)

The query:

select  narrative_result, count(*) from onc_biomarker where  numeric_result ='2 +'
and biomarker in ('erb-b2 receptor tyrosine kinase 2 (ERBB2 or HER2/neu)', 'estrogen receptor (ER)')
group by  narrative_result
order by count(*) desc

alt_text

alt_text

  • Rule: Migrate only the rows where temporal_status=’current’ and biomarker is not NULL and with results (either numeric_result or narrative_result are populated).

    Don’t worry about The rows with biomarker_temporal_status is null, it’s ONLY for rows including test_name-s, like ‘Flow Cytometry’, etc. And we don’t map these as OMOP genomics model doesn’t support this. And it has small scientific Impact

Destination Field Source Field Logic Comment
measurement_id
person_id ptid
measurement_concept_id Biomarker||’ ‘||

gene_characteristics

Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters:

Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCBM’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL

If there is no mapping available, set concept_id to zero

Source names are different from the NLP_biomarker table, so a new source vocabulary_id is created
measurement_date note_date
measurement_source_value Biomarker, gene_characteristics Biomarker || ‘ ‘|| gene_characteristics
value_as_number numeric_result Extract number from numeric_result except of ‘num +’ cases, e.g. ‘1 +’, these will be mapped to concept.

When result is given as an interval, for example, 14-16%, create 2 rows in measurement table, one with >=14%, another with <=16% (populating operator_concept_id, value_as_number, unit_concept_id)

unit_concept_id numeric_result Extract ‘Muts/Mb’, ‘%’ from numeric_result and map using STCM with the source_vocabulary_id =’JNJ UNITS’
value_as_concept_id narrative_result|| ‘ ‘ || modifier_narrative

numeric_result \ where numeric_result like ‘num +’

Use the SOURCE_TO_STANDARD query with case insensitive match to map the code to standard concept(s) with the following filters:

Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_LABRES’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL

If there is no mapping available, set concept_id to zero.

meas_event_field_concept_id 1147127 'condition_occurrence.condition_occurrence_id'
measurement_event_id condition_occurrence.condition_occurrence_id Connect with the Condition table through onc_neopls_histology using (neoplasm_histology_key) or using (ptid, note_date)
measurement_source_concept_id 0
measurement_datetime
measurement_time
measurement_type_concept_id 32882 Standard algorithm from EHR
operator_concept_id numeric_result Extract operators from numeric_result and map using CONCEPT.name where domain_id =’Meas Value Operator’
range_low
range_high
provider_id
visit_occurrence_id
visit_detail_id
unit_source_value numeric_result Extract ‘Muts/Mb’, ‘%’ from numeric_result
unit_source_concept_id 0
value_source_value narrative_result

modifier_narrative

Numeric_result ||’ ‘||

narrative_result ||’ ‘||

modifier_narrative

If some field is null, use other fields to concat

Reading from onc_evaluation_system

alt_text

  • Rule: Migrate only the rows where temporal_status=’current’,

We omit rows with a NULL temporal_status since they also have a NULL system_name, and we don’t want to guess both the temporal_status and the system_name.

Destination Field Source Field Comment
measurement_id
person_id ptid
measurement_concept_id concept obtained using the rules:

Rules 1 – 3 note: do not populate if the result (column name in red) is null

  1. when system_name like 'Nottingham%'

    then system_name and related columns make new concepts:

    system_name ||' grade tubular '|| grade_tubular system_name ||' grade nuclear '|| grade_nuclear system_name ||' grade mitotic '|| grade_mitotic

    For example, “Nottingham Score grade tubular 3”

  2. when system_name like 'Gleason%'

    then system_name and related columns make new concepts:

    system_name ||' grade primary '|| grade_primary system_name ||' grade secondary '|| grade_secondary system_name ||' grade tertiary '|| grade_tertiary system_name ||' result numeric '|| result_numeric

    For example, “Gleason score grade primary 2”

  3. When system_name in (

    'FIGO grade',

    'Durie/Salmon Stage',

    'International Staging System Stage',

    'RISS Stage',

    'Rai Stage',

    'FIGO Stage',

    'Binet Stage')

    Then system_name||' ' ||result_numeric

    Rule 4 note: do not populate when result_numeric is null

  4. when system_name in ('ECOG performance status',

    'Karnofsky Performance Status',

    'Oncotype DX recurrence score', 'Recurrence score' )

    Then system_name

In the other cases do not populate CDM, for example entries with system = 'Performance Status'are excluded because we don’t know what type of 'Performance Status it is.

Use the  SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: \ \ Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCEVS’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL \ \ If there is no mapping available, set concept_id to zero.
measurement_date note_date
measurement_source_value See the rules in measurement_concept_id
value_as_number result_numeric Values like 0/5, 1/5 – 5/5 are parsed to respective 0, 1, 5

Those stand for ECOG score out of 5

Populate when system_name in ('ECOG performance status',

'Performance Status',

'Karnofsky Performance Status',

'Oncotype DX recurrence score',

'Recurrence score')

In the other cases result_numeric precoordinates with the other fields and maps to measurement_concept_id

unit_concept_id
value_as_concept_id result_narrative Use the  SOURCE_TO_STANDARD query with case insensitive match to map the code to standard concept(s) with the following filters :

Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_LABRES’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL

If there is no mapping available, set concept_id to zero.

meas_event_field_concept_id 1147127 condition_occurrence.condition_occurrence_id
measurement_event_id condition_occurrence.condition_occurrence_id Connect with the Condition table through onc_neopls_histology using (neoplasm_histology_key) or using (ptid, note_date)
measurement_source_concept_id 0
measurement_datetime
measurement_time
measurement_type_concept_id 32882 Standard algorithm from EHR
operator_concept_id
range_low
range_high
provider_id
visit_occurrence_id
visit_detail_id
unit_source_value
unit_source_concept_id
value_source_value result_narrative

Reading from onc_tumor_size

alt_text

Destination Field Source Field Logic Comment
measurement_id
person_id ptid
measurement_event_id neoplasm_histology_key fill by CONDITION.condition_id using neoplasm_histology_key
measurement_concept_id Rule 1: 36768255

Rule 2: 36768664

Rule 1: Largest Dimension of Tumor

Rule 2: Dimension of Tumor

measurement_date note_date
measurement_source_value Rule 1: Largest Dimension of Tumor

Rule 2: Dimension of Tumor

value_as_number Rule 1: Greatest (tumor_size_1, tumor_size_2, tumor_size_3)

Rule 2: another size, not defined as greatest

See query here
unit_concept_id 8582 centimeter
unit_source_value ‘cm’
value_as_concept_id
meas_event_field_concept_id
measurement_source_concept_id
measurement_datetime
measurement_time
measurement_type_concept_id 32882 Standard algorithm from EHR
operator_concept_id
range_low
range_high
provider_id
visit_occurrence_id
visit_detail_id
unit_source_concept_id
value_source_value

Table name: episode

Reading from onc_lines_of_therapy

alt_text

Destination Field Source Field Logic Comment
episode_id Auto-increment
person_id ptid
episode_concept_id 32531 Treatment Regimen
episode_start_date initiation
episode_start_datetime
episode_end_date last_administration
episode_end_datetime
episode_parent_id The episode corresponding to onc_lines_of_therapy.cancer_type. (can be connected to the actual diagnosis, to investigate after generation of CONDITION occurrence as described above)

episode_concept_id = 32533 (Disease Episode)

episode_object = cancer_type mapped to a standard condition concept

Can be implemented in the next run
episode_number lot
episode_object_concept_id regimen_name Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters:

Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCLOT’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL

If there is no mapping available, set concept_id to zero or Consider creating new concepts.

episode_type_concept_id 32882 Standard algorithm from EHR
episode_source_value regimen_name
episode_source_concept_id NULL

Table name: episode_event

Connects episode table and drug_exposure

connect with the drug_exposure, with a query like this:

#

Table name: observation

Reading from onc_treatment_response

alt_text

Destination Field Source Field Logic Comment
observation_id
person_id ptid
observation_date note_date
observation_concept_id 0 Do not even create a source concept for it.
observation_datetime
observation_type_concept_id 32882 Standard algorithm from EHR
value_as_number
value_as_string
value_as_concept_id
qualifier_concept_id
unit_concept_id
provider_id
visit_occurrence_id
visit_detail_id
observation_source_value treatment_response || ' to ' || treatment Populate it as is in the first run, if someone wants to use it the data will be available in the CDM
observation_source_concept_id 0
unit_source_value
qualifier_source_value
value_source_value
observation_event_id Event of a drug regimen the response is recorded to Can be done in the next iteration
obs_event_field_concept_id EPISODE Can be done in the next iteration

Reading from onc_tumor_progression

This table has to be mapped to an Episode table, but in the first version it can sit in the Observation.

alt_text

Destination Field Source Field Logic Comment
observation_id
person_id ptid
observation_date note_date
observation_event_id neoplasm_histology_key Connect with the corresponding condition
observation_source_value progression
observation_concept_id
observation_datetime
observation_type_concept_id 32882 Standard algorithm from EHR
value_as_number
value_as_string
value_as_concept_id
qualifier_concept_id
unit_concept_id
provider_id
visit_occurrence_id
visit_detail_id
observation_source_concept_id
unit_source_value
qualifier_source_value
value_source_value
obs_event_field_concept_id

Useful queries

Get Condition concepts for the mapping

select distinct
case when neoplasm_characteristic = histology_characteristic then
regexp_replace (trim (coalesce (histology_characteristic, '')||' '||coalesce (direction, '')||' '|| coalesce (neoplasm_type, '')||' '|| coalesce (histology, '')), ' +',' ')
else regexp_replace (trim (coalesce (neoplasm_characteristic, '') ||' '|| coalesce (histology_characteristic, '')||' '||coalesce (direction, '')||' '|| coalesce (neoplasm_type, '')||' '
|| coalesce (histology, 'neoplasm')), ' +',' ') end as source_code,
0 as source_concept_id,
'JNJ_OPTUM_EHR_ONCNHC'   as source_vocabulary_id,
case when neoplasm_characteristic = histology_characteristic then
regexp_replace (trim (coalesce (histology_characteristic, '')||' '||coalesce (direction, '')||' '|| coalesce (neoplasm_type, '')||' '|| coalesce (histology, '')), ' +',' ')
else regexp_replace (trim (coalesce (neoplasm_characteristic, '') ||' '|| coalesce (histology_characteristic, '')||' '||coalesce (direction, '')||' '|| coalesce (neoplasm_type, '')||' '
|| coalesce (histology, 'neoplasm')), ' +',' ') end as source_code_description,
0 as target_concept_id,
to_date ('10-10-2022', 'dd-MM-yyyy') as valid_start_date,
to_date ('31-12-2099', 'dd-MM-yyyy') as valid_end_date,
null as invalid_reason
from onc_neoplasm_histology nh
left join onc_characteristic ch on  nh.neoplasm_histology_key= ch.neoplasm_histology_key And (ch.neoplasm_char_temporal_status ='current' OR ch.histology_char_temporal_status ='current')
and (neoplasm_characteristic in ('in situ','malignant', 'benign') or histology_characteristic in ('in situ', 'malignant', 'benign') )
where (neoplasm_temporal_status  ='current' OR  histology_temporal_status  ='current')
AND (neoplasm_qualifier ='actual' OR histology_qualifier ='actual')

Extracting the Largest Dimension of Tumor

--convert tumor size
--convert tumor size
create view
 ts_values as
 with
ts_clean as (
select * from (
select ptid, note_date, neoplasm_histology_key,
case when tumor_size_unit_1 ='mm' then tumor_size_1::float/10  when tumor_size_unit_1 ='cm' then tumor_size_1::float  end as tumor_size_1 ,
case when tumor_size_unit_2 ='mm' then tumor_size_2::float/10  when tumor_size_unit_2 ='cm' then tumor_size_2::float end as tumor_size_2,
case when tumor_size_unit_3 ='mm' then tumor_size_3::float/10  when tumor_size_unit_3 ='cm' then tumor_size_3::float end as tumor_size_3
 from (
select ptid, note_date, neoplasm_histology_key, tumor_size_1, coalesce (tumor_size_unit_1,tumor_size_unit_2,tumor_size_unit_3) as tumor_size_unit_1,
tumor_size_2, coalesce (tumor_size_unit_2,tumor_size_unit_3) as tumor_size_unit_2 , tumor_size_3, tumor_size_unit_3
from native_optum_ehr_v2438.onc_tumor_size)
)
where tumor_size_1 is not null -- when unit is empty it becomes null or when the first value is null, it's still looks like a missing data
)
select  CASE
    WHEN tumor_size_1 >= coalesce (tumor_size_2,0) AND tumor_size_1 >= coalesce (tumor_size_3,0) THEN tumor_size_1
    WHEN coalesce (tumor_size_2,0) >= tumor_size_1 AND coalesce (tumor_size_2,0) >= coalesce (tumor_size_3,0) THEN tumor_size_2
    ELSE tumor_size_3
    END as max_value,
    ptid, note_date, neoplasm_histology_key, tumor_size_1, tumor_size_2, tumor_size_3
    from ts_clean
;
--rule 1
    select  ptid, note_date, neoplasm_histology_key, max_value as value_as_number, 36768255 as measurement_concept_id, 8582 as unit_as_concept_id from ts_values
;
--rule 2
    select  ptid, note_date, neoplasm_histology_key, tumor_size_1 as value_as_number, 36768664 as measurement_concept_id, 8582 as unit_as_concept_id from ts_values
    where tumor_size_1 !=max_value
    union
    select  ptid, note_date, neoplasm_histology_key, tumor_size_2 as value_as_number, 36768664 as measurement_concept_id, 8582 as unit_as_concept_id from ts_values
    where tumor_size_2 !=max_value
    union
    select  ptid, note_date, neoplasm_histology_key, tumor_size_3 as value_as_number, 36768664 as measurement_concept_id, 8582 as unit_as_concept_id from ts_values
    where tumor_size_3 !=max_value;
--draft of a query that merges onc_neoplasm_histology when one of the fields is empty and at the same date this field is populated
with bb as (
(
select distinct
b.neoplasm_histology_key
from onc_neoplasm_histology a
join onc_neoplasm_histology b on a.ptid =b.ptid
and a.note_date = b.note_date
and a.neoplasm_type ='' and b.neoplasm_type !='' and b.neoplasm_qualifier ='actual' and b.neoplasm_temporal_status ='current'
and a.histology !='' and b.histology='' and a.histology_qualifier ='actual' and a.histology_temporal_status ='current'
),
aa as (
select distinct
a.neoplasm_histology_key
from onc_neoplasm_histology a
join onc_neoplasm_histology b on a.ptid =b.ptid
and a.note_date = b.note_date
and a.neoplasm_type ='' and b.neoplasm_type !='' and b.neoplasm_qualifier ='actual' and b.neoplasm_temporal_status ='current'
and a.histology !='' and b.histology='' and a.histology_qualifier ='actual' and a.histology_temporal_status ='current'
),
aa_bb as (
select distinct
a.neoplasm_histology_key ||'/'||b.neoplasm_histology_key , --list colums needed, that usually the table has as a result of merge
from onc_neoplasm_histology a
join onc_neoplasm_histology b on a.ptid =b.ptid
and a.note_date = b.note_date
and a.neoplasm_type ='' and b.neoplasm_type !='' and b.neoplasm_qualifier ='actual' and b.neoplasm_temporal_status ='current'
and a.histology !='' and b.histology='' and a.histology_qualifier ='actual' and a.histology_temporal_status ='current'
)
select * from onc_neoplasm_histology where neoplasm_histology_key not in (select aa.neoplasm_histology_key) and neoplasm_histology_key not in (select bb.neoplasm_histology_key)
union
select * from aa_bb

Connect Drug Exposure and lines of therapy

--this is just one patient, need to check how it works with all the patients after we map Episodes
select ma.drug_name, admin_date, lt.* from medication_administrations ma
join onc_lines_of_therapy lt on lt.ptid = ma.ptid and ma.admin_date between lt.initiation and lt.last_administration
where ma.ptid = 'PT306269791'
and ma.drug_name ~'BEVACIZUMAB|FLUOROURACIL|LEUCOVORIN|OXALIPLATIN|IRINOTECAN|CETUXIMAB|PANITUMUMAB|SODIUM IODINE I-131' -- maybe
order by admin_date

Future development

To connect onc_treatment_response and the actual treatment

We need to connect Episode table made from the Lines of therapy to the Observational table made from treatment_response, we can use the query like this: (it returns 18007 out 74698 onc_treatment_response patients has, so these tables are not very consistent)

THUS it is a lower priority

select count(distinct ptid) 

from onc_treatment_response r 

join onc_lines_of_therapy olot using (ptid) 

where r.note_date between olot.initiation and olot.last_administration +30 

and treatment not like ‘%radiotherapy%’ – exclude these as we work with the drugs 

The treatment response can be stored in Observation table and connected through the observation_event_id to the EPISODE with drug regimen.

Add the Lymph Nodes table to the cancer modifiers

Add empty system name entries

*If the system name is not documented in the table it means, the system name was not mentioned in the notes; Null SYSTEM_NAME with grades for GRADE_TUBULAR, GRADE_NUCLEAR and GRADE_MITOTIC are typically associated with mammary neoplasms, but also can be applied to other NEOPLASM_TYPES.

Justification of decisions

Omit entries where histology_characteristic = ‘in situ’ and invasive at the same day

Since there are mostly same diagnoses associated with both modifiers, we treat these entries as erroneous and remove such entries.

Answer from OPTUM:

There could be multiple reasons that you see more than one characteristics for a neoplasm or histology.

Using the examples provided, ‘in situ’ and ‘invasive’ were mentioned in the same notes but the NLP model didn’t accurately capture the context to make sense of them. For example, the temporality for ‘in situ’ might actually be ‘history’ or a more granular term ‘5 months ago’.

Another reason is one of the terms belongs to another cancer if the patient has multiple cancers. For example on row 3 of the stats table, the patients being counted have both ‘ductal carcinoma’ and ‘carcinoma’ without a linked neoplasm. While ‘ductal carcinoma’ indicates breast cancer, ‘carcinoma’ might be for other cancers.

The workaround we recommend is using ICD code to corroborate. ‘in situ’ is a ‘D%’ code and ‘invasive’ is ‘C%’. So when a patient had been coded as ‘C%’ while having both ‘in situ’ and ‘invasive’ mentions, they are more likely already an ‘invasive’.

I’ve noticed  1 308 343 entries or 86 550 distinct patients when onc_characteristic.histology_characteristic = 'in situ' and at the same day it's equal to 'invasive'

select –1308343
count(1)
from onc_characteristic a
join onc_characteristic b using(ptid , note_date)
where a.histology_characteristic ='in situ' and b.histology_characteristic = 'invasive'
and (a.neoplasm_char_temporal_status  ='current' OR  a.histology_char_temporal_status  ='current') and (b.neoplasm_char_temporal_status  ='current' OR  b.histology_char_temporal_status  ='current')
;

And if we look at the diagnoses associated, it’s the same diagnoses mostly:

select oa.neoplasm_type , oa.histology , ob.neoplasm_type , ob.histology, count(1)
from onc_characteristic a
join onc_characteristic b using(ptid , note_date)
join onc_neoplasm_histology oa on oa.neoplasm_histology_key =a.neoplasm_histology_key
join onc_neoplasm_histology ob on ob.neoplasm_histology_key =b.neoplasm_histology_key
where a.histology_characteristic ='in situ' and b.histology_characteristic = 'invasive'
and (a.neoplasm_char_temporal_status  ='current' OR  a.histology_char_temporal_status  ='current') and (b.neoplasm_char_temporal_status  ='current' OR  b.histology_char_temporal_status  ='current')
group by oa.neoplasm_type , oa.histology , ob.neoplasm_type , ob.histology
order by count(1) desc
;

Here are some top diagnoses, the left associated with ‘in situ’, the right one associated with ‘invasive’ characteristic.

neoplasm_type histology neoplasm_type histology count
[NULL] ductal carcinoma [NULL] ductal carcinoma

475,240

[NULL] ductal carcinoma mammary ductal carcinoma

204,608

[NULL] ductal carcinoma [NULL] carcinoma

148,016

mammary ductal carcinoma [NULL] ductal carcinoma

42,496

[NULL] lobular carcinoma [NULL] lobular carcinoma

41,146

[NULL] lobular carcinoma [NULL] ductal carcinoma

34,411

mammary ductal carcinoma mammary ductal carcinoma

27,113

[NULL] lobular carcinoma [NULL] carcinoma

25,848

[NULL] ductal carcinoma [NULL] lobular carcinoma

22,801

[NULL] lobular carcinoma mammary lobular carcinoma

18,299

[NULL] ductal carcinoma mammary carcinoma

17,361

[NULL] carcinoma [NULL] carcinoma

16,622

[NULL] carcinoma [NULL] ductal carcinoma

16,081

[NULL] lobular carcinoma mammary ductal carcinoma

13,753


Please contact Clair Blacketer (https://github.com/clairblacketer) if you have any questions