OPTUM EHR Oncology Module ETL documentation

Background

The Optum® Enriched Oncology Data set is a group of tables that can supplement the Optum® de-identified Electronic Health Record dataset. It contains additional cancer-specific information on a subset of the EHR population that has at least one solid or non-solid tumor diagnosis. In addition to structured data obtained from medical records, the enriched oncology tables provide clinical information extracted from provider notes using natural language processing (NLP) machine learning models (for example, tumor progression, histology, etc.).

Optum oncology data initiatives include enriching Optum EHR data by extracting essential information from the oncology patient’s medical records and making it usable for researchers. Specific oncology concepts important for understanding the progression of the disease are often not available in structured formats, particularly the tumor, node, and metastasis (TNM) values, stage information and biomarkers.

As of 2022, there are approximately 1.9 million patients with at least one solid tumor ICD-9 or ICD-10 diagnosis included in the data set. Given the potential for research, the dataset was converted to the OMOP CDM to allow integration with standard OHDSI methods and tools.

How to run this conversion

This conversion is run after the main corpus of the data is converted, then the results are inserted in CDM tables.

Overall logic – linkage between source and CDM tables

alt_text

Table name: condition_occurrence

Reading from onc_neoplasm_histology, Onc_characteristic

Onc_neoplasm_histology contains neoplasm type (topography) and histology, which are the typical axes for the OMOP cancer diagnosis, it also has ‘direction’ field corresponding to the laterality which is a part of diagnosis in SNOMED as well.

Onc_characteristic table has neoplasm_characteristic and histology_characteristic fields that mostly describe typical cancer modifiers but also have ‘in situ’, ‘malignant’, ‘benign’ values that are a part of a diagnosis in OMOP.

If onc_characteristic.histology_characteristic = ‘in situ’ and at the same day onc_characteristic.histology_characteristic = ‘invasive’, do not use these onc_characteristic entries in the ETL.

Thus, Condition concept is a result of a precordination of 5 columns:

Onc_neoplasm_histology.neoplasm_type
Onc_neoplasm_histology.histology
Onc_neoplasm_histology.direction
Onc_characteristic.neoplasm_characteristic
Onc_characteristic.histology_characteristic

We put only positive actual values into OMOP CDM:

Read from onc_neoplasm_histology where (neoplasm_temporal_status =’current’ OR histology_temporal_status =’current’)

AND (neoplasm_qualifier =’actual’ OR histology_qualifier =’actual’)

Read from the Onc_characteristic where (ch.neoplasm_char_temporal_status =’current’ OR histology_char_temporal_status =’current’)

Note, one of the histology or neoplasm column can have empty value, but the other one is always filled, so we use OR in these cases.

Once table is generated, de-dup it by (condition_source_value, person_id , condition_start_date)

You also have to create a crosswalk between condition_occurrence_id to neoplasm_histology_key with 1:1 to 1:M condition_occurrence_id: neoplasm_histology_key

Direction with o’clock goes to measurement

Direction with laterality goes to Condition

onc_neoplasm_histology diagram

alt_text

onc_characteristic diagram

alt_text

Destination Field	Source Field	Logic	Comment
condition_occurrence_id
person_id	ptid
condition_concept_id	Precoordination of onc_neoplasm_histology: neoplasm_type direction histology + onc_characteristic: neoplasm_characteristic histology_characteristic See the script how to put them together	Mapping of the output of the script that concatenates these 5 fields and get rid of duplicates. Same logic is used to map the source data. Note, for now not merge only histology or only neoplasm entries using a query because several cancers can be presents at one day Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCNHC’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero
condition_start_date	onc_neoplasm_histology .note_date onc_neoplasm_histology .neoplasm_dx_date	When `neoplasm_dx_date` =’exact’ then to_date (neoplasm_dx_date, ‘yyyyMMdd’) When `neoplasm_dx_date` =’incomplete’ and day is unknown and `neoplasm_dx_date` `(`yyMM) < note_date (yyMM) then to_date (neoplasm_dx_date, ‘yyyyMM(dd =15)’ When `neoplasm_dx_date` =’incomplete’ and day and month are unknown and `neoplasm_dx_date` `(`yy) < note_date (yy) then to_date (neoplasm_dx_date, ‘yyyy(MM =06 dd =30)’ Else note_date
condition_start_datetime
condition_end_date
condition_end_datetime
condition_type_concept_id	32882		Standard algorithm from EHR
condition_status_concept_id
stop_reason
provider_id
visit_occurrence_id
visit_detail_id
condition_source_value	See the logic in condition_concept_id
condition_source_concept_id	0
condition_status_source_value		NULL

Table name: measurement

Reading from onc_stage

Where stage!=''

alt_text

Destination Field	Source Field	Logic	Comment
measurement_id
person_id	ptid
measurement_concept_id	`replace ( 'stage '\|\| stage_prefix \|\| ' '\|\| stage, ' ', ' ')`	Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCSTG’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL \ If there is no mapping available, set concept_id to zero
measurement_date	note_date
measurement_source_value	`replace ( 'stage '\|\| stage_prefix \|\| ' '\|\| stage, ' ', ' ')`
value_as_number
unit_concept_id
value_as_concept_id
meas_event_field_concept_id	`1147127`	`'condition_occurrence.condition_occurrence_id'`
measurement_event_id		fill by CONDITION.condition_id using neoplasm_histology_key OR if neoplasm_histology_key is null, use (ptid, note_date) to connect with the onc_neoplasm_histology table
measurement_source_concept_id	0
measurement_datetime
measurement_time
measurement_type_concept_id	32882		Standard algorithm from EHR
operator_concept_id
range_low
range_high
provider_id
visit_occurrence_id
visit_detail_id
unit_source_value
unit_source_concept_id
value_source_value

Reading from onc_metastatic_location

alt_text

Rule: populate CDM only when mets_temporal_status = ‘current’, history of measurement is not supported by the model. Rows where metastasis_location is NULL (metastasis_location=’’) are included, they mean that metastasis is present but the exact location is unknown.

Destination Field	Source Field	Logic	Comment
measurement_id
person_id	ptid
measurement_concept_id	‘Metastasis to ‘\|\| metastasis_location	Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCMET’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero
measurement_date	note_date metastasis_dx_date	When `metastasis_dx_date_type`=’exact’ then to_date (metastasis_dx_date, ‘yyyyMMdd’) When `metastasis_dx_date_type`=’incomplete’ and day is unknown and `metastasis_dx_date_type(`yyMM) < note_date (yyMM) \ then to_date (metastasis_dx_date, ‘yyyyMM(dd =15)’ When `metastasis_dx_date_type`=’incomplete’ and day and month are unknown and `metastasis_dx_date_type(`yy) < note_date (yy) \ then to_date (metastasis_dx_date, ‘yyyy(MM =06 dd =30)’ Else note_date
measurement_source_value	‘Metastasis to ‘\|\| metastasis_location
value_as_number
unit_concept_id
value_as_concept_id	mets_qualifier	When mets_qualifier = ‘actual’ then 4181412 -- Present When mets_qualifier =’absent’ then 4132135 -- Absent
meas_event_field_concept_id	`1147127`	`'condition_occurrence.condition_occurrence_id'`
measurement_event_id		fill by CONDITION.condition_id using neoplasm_histology_key OR if neoplasm_histology_key is null, use (ptid, note_date), this should be analyzed better, maybe we do not connect the events at all if neoplasm_histology_key is null
measurement_source_concept_id	0
measurement_datetime
measurement_time
measurement_type_concept_id	32882		Standard algorithm from EHR
operator_concept_id
range_low
range_high
provider_id
visit_occurrence_id
visit_detail_id
unit_source_value
unit_source_concept_id
value_source_value	mets_qualifier	mets_qualifier

Reading from onc_tumor_node_metastasis

alt_text

Populate only if the main value (T, N or M) is not null:

Destination Field	Source Field	Logic	Comment
measurement_id
person_id	ptid
measurement_concept_id	`trim (stage_prefix\|\|' '\|\| t_prefix\|\| 'T'\|\|t\|\| t_suffix)` `trim (stage_prefix\|\|' '\|\|n_prefix\|\|'N'\|\|n\|\| n_suffix)` `trim (stage_prefix\|\|' '\|\| m_prefix\|\|'M'\|\|m\|\| m_suffix)` Before concatenation NULL values to be replaced with ‘’	Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCTNM’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero	each T, N and M category makes a separate measurement entry
measurement_date	note_date
measurement_source_value	See measurement_concept_id
value_as_number
unit_concept_id
value_as_concept_id
meas_event_field_concept_id	`1147127`		`'condition_occurrence.condition_occurrence_id'`
measurement_event_id	`condition_occurrence.condition_occurrence_id`		fill by CONDITION.condition_id using neoplasm_histology_key OR if neoplasm_histology_key is null, use (ptid, note_date) this should be analyzed better.
measurement_source_concept_id
measurement_datetime
measurement_time
measurement_type_concept_id	32882		Standard algorithm from EHR
operator_concept_id
range_low
range_high
provider_id
visit_occurrence_id
visit_detail_id
unit_source_value
unit_source_concept_id
value_source_value

Reading from onc_tumor_grade

alt_text

Destination Field	Source Field	Logic	Comment
measurement_id
person_id	ptid
measurement_concept_id	‘Tumor grade: ’\|\| tumor_grade	Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCTGR’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero	each category makes a separate measurement entry
measurement_date	note_date
measurement_source_value	‘Tumor grade: ’\|\| tumor_grade
value_as_number
unit_concept_id
value_as_concept_id
meas_event_field_concept_id	`1147127`		`'condition_occurrence.condition_occurrence_id'`
measurement_event_id	`condition_occurrence.condition_occurrence_id`		fill by CONDITION.condition_id using neoplasm_histology_key OR if neoplasm_histology_key is null, use (ptid, note_date) this should be analyzed better.
measurement_source_concept_id	0
measurement_datetime
measurement_time
measurement_type_concept_id	32882		Standard algorithm from EHR
operator_concept_id
range_low
range_high
provider_id
visit_occurrence_id
visit_detail_id
unit_source_value
unit_source_concept_id
value_source_value

Reading from onc_characteristic

Rule: Populate only with temporal_status ='current' and exclude characteristics that are the part of a condition or NULL:

Note, the char_temporal_status always accompanies its respective characteristic.

So, in SQL we can say (use similar logic to extract the concepts for mapping):
Select ptid, note_date, neoplasm_characteristic
from onc_characteristic
where neoplasm_char_temporal_status ='current'
and neoplasm_characteristic not in ('in situ', 'malignant', 'benign')
and neoplasm_characteristic !=''

Union –- deals with duplicates when neoplasm_characteristic = histology_characteristic

Select ptid, note_date, histology_characteristic
from onc_characteristic
where histology_char_temporal_status ='current'
and histology_characteristic not in ('in situ', 'malignant', 'benign')
and histology_characteristic !=''

alt_text

Destination Field	Source Field	Logic	Comment
measurement_id
person_id	ptid
measurement_concept_id	neoplasm_characteristic histology_characteristic	Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCCHR’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero	Each field creates its own entry in measurement table except the cases when neoplasm_characteristic = histology_characteristic
measurement_date	note_date
measurement_source_value	neoplasm_characteristic histology_characteristic		Each field creates its own entry in measurement table except the cases when neoplasm_characteristic = histology_characteristic
value_as_number
unit_concept_id
value_as_concept_id		when the neoplasm_characteristic = ‘metastatic’ or histology_characteristic = ‘metastatic’ then 4181412 -- Present
meas_event_field_concept_id	`1147127`		`'condition_occurrence.condition_occurrence_id'`
measurement_event_id	`condition_occurrence.condition_occurrence_id`		fill by CONDITION.condition_id using neoplasm_histology_key OR if neoplasm_histology_key is null, use (ptid, note_date) this should be analyzed better.
measurement_source_concept_id	0
measurement_datetime
measurement_time
measurement_type_concept_id	32882		Standard algorithm from EHR
operator_concept_id
range_low
range_high
provider_id
visit_occurrence_id
visit_detail_id
unit_source_value
unit_source_concept_id
value_source_value

Reading from onc_biomarker

Problem 1. (to be discussed with the WG)

There are both numeric (1+, 2+, 3+) and narrative results such as negative, equivocal, positive.

Can we interpret 1+ as negative, 2+ as equivocal, 3+ as positive?

For example, 2+ has an equivalent number of equivocal and (positive+negative)

The query:

select  narrative_result, count(*) from onc_biomarker where  numeric_result ='2 +'
and biomarker in ('erb-b2 receptor tyrosine kinase 2 (ERBB2 or HER2/neu)', 'estrogen receptor (ER)')
group by  narrative_result
order by count(*) desc

alt_text

Rule: Migrate only the rows where temporal_status=’current’ and biomarker is not NULL and with results (either numeric_result or narrative_result are populated).

Don’t worry about The rows with biomarker_temporal_status is null, it’s ONLY for rows including test_name-s, like ‘Flow Cytometry’, etc. And we don’t map these as OMOP genomics model doesn’t support this. And it has small scientific Impact

Destination Field	Source Field	Logic	Comment
measurement_id
person_id	ptid
measurement_concept_id	Biomarker\|\|’ ‘\|\| gene_characteristics	Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCBM’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero	Source names are different from the NLP_biomarker table, so a new source vocabulary_id is created
measurement_date	note_date
measurement_source_value	Biomarker, gene_characteristics	Biomarker \|\| ‘ ‘\|\| gene_characteristics
value_as_number	numeric_result	Extract number from numeric_result except of ‘num +’ cases, e.g. ‘1 +’, these will be mapped to concept. When result is given as an interval, for example, 14-16%, create 2 rows in measurement table, one with >=14%, another with <=16% (populating operator_concept_id, value_as_number, unit_concept_id)
unit_concept_id	numeric_result	Extract ‘Muts/Mb’, ‘%’ from numeric_result and map using STCM with the source_vocabulary_id =’JNJ UNITS’
value_as_concept_id	narrative_result\|\| ‘ ‘ \|\| modifier_narrative numeric_result \ where numeric_result like ‘num +’	Use the SOURCE_TO_STANDARD query with case insensitive match to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_LABRES’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero.
meas_event_field_concept_id	`1147127`	`'condition_occurrence.condition_occurrence_id'`
measurement_event_id	`condition_occurrence.condition_occurrence_id`	Connect with the Condition table through onc_neopls_histology using (neoplasm_histology_key) or using (ptid, note_date)
measurement_source_concept_id	0
measurement_datetime
measurement_time
measurement_type_concept_id	32882		Standard algorithm from EHR
operator_concept_id	numeric_result	Extract operators from numeric_result and map using CONCEPT.name where domain_id =’Meas Value Operator’
range_low
range_high
provider_id
visit_occurrence_id
visit_detail_id
unit_source_value	numeric_result	Extract ‘Muts/Mb’, ‘%’ from numeric_result
unit_source_concept_id		0
value_source_value	narrative_result modifier_narrative	Numeric_result \|\|’ ‘\|\| narrative_result \|\|’ ‘\|\| modifier_narrative	If some field is null, use other fields to concat

Reading from onc_evaluation_system

alt_text

Rule: Migrate only the rows where temporal_status=’current’,

We omit rows with a NULL temporal_status since they also have a NULL system_name, and we don’t want to guess both the temporal_status and the system_name.

Destination Field	Source Field	Comment
measurement_id
person_id	ptid
measurement_concept_id	`concept obtained using the rules:` `Rules 1 – 3 note: do not populate if the result (column name in red) is null` `when system_name like 'Nottingham%'` `then system_name and related columns make new concepts:` `system_name \|\|' grade tubular '\|\| grade_tubular system_name \|\|' grade nuclear '\|\| grade_nuclear system_name \|\|' grade mitotic '\|\| grade_mitotic` For example, “Nottingham Score grade tubular 3” `when system_name like 'Gleason%'` `then` `system_name and related columns make new concepts:` `system_name \|\|' grade primary '\|\| grade_primary system_name \|\|' grade secondary '\|\| grade_secondary system_name \|\|' grade tertiary '\|\| grade_tertiary system_name \|\|' result numeric '\|\| result_numeric` For example, “Gleason score grade primary 2” `When system_name in (` `'FIGO grade',` `'Durie/Salmon Stage',` `'International Staging System Stage',` `'RISS Stage',` `'Rai Stage',` `'FIGO Stage',` `'Binet Stage')` `Then system_name\|\|' ' \|\|result_numeric` Rule 4 note: do not populate when result_numeric is null `when system_name in ('ECOG performance status',` `'Karnofsky Performance Status',` `'Oncotype DX recurrence score', 'Recurrence score' )` `Then system_name` In the other cases do not populate CDM, for example entries with system = `'Performance Status'`are excluded because we don’t know what type of 'Performance Status it is.		Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: \ \ Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCEVS’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL \ \ If there is no mapping available, set concept_id to zero.
measurement_date	note_date
measurement_source_value	See the rules in measurement_concept_id
value_as_number	`result_numeric`	Values like 0/5, 1/5 – 5/5 are parsed to respective 0, 1, 5 Those stand for ECOG score out of 5	Populate when `system_name in ('ECOG performance status',` `'Performance Status',` `'Karnofsky Performance Status',` `'Oncotype DX recurrence score',` `'Recurrence score')` In the other cases result_numeric precoordinates with the other fields and maps to measurement_concept_id
unit_concept_id
value_as_concept_id	`result_narrative`		Use the SOURCE_TO_STANDARD query with case insensitive match to map the code to standard concept(s) with the following filters : Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_LABRES’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero.
meas_event_field_concept_id	`1147127`		`condition_occurrence.condition_occurrence_id`
measurement_event_id	`condition_occurrence.condition_occurrence_id`		Connect with the Condition table through onc_neopls_histology using (neoplasm_histology_key) or using (ptid, note_date)
measurement_source_concept_id	0
measurement_datetime
measurement_time
measurement_type_concept_id	32882		Standard algorithm from EHR
operator_concept_id
range_low
range_high
provider_id
visit_occurrence_id
visit_detail_id
unit_source_value
unit_source_concept_id
value_source_value	`result_narrative`

Reading from onc_tumor_size

alt_text

Destination Field	Source Field	Logic	Comment
measurement_id
person_id	ptid
measurement_event_id	neoplasm_histology_key		fill by CONDITION.condition_id using neoplasm_histology_key
measurement_concept_id	Rule 1: 36768255 Rule 2: 36768664		Rule 1: Largest Dimension of Tumor Rule 2: Dimension of Tumor
measurement_date	note_date
measurement_source_value			Rule 1: Largest Dimension of Tumor Rule 2: Dimension of Tumor
value_as_number	Rule 1: Greatest (tumor_size_1, tumor_size_2, tumor_size_3) Rule 2: another size, not defined as greatest		See query here
unit_concept_id	8582		centimeter
unit_source_value	‘cm’
value_as_concept_id
meas_event_field_concept_id
measurement_source_concept_id
measurement_datetime
measurement_time
measurement_type_concept_id	32882		Standard algorithm from EHR
operator_concept_id
range_low
range_high
provider_id
visit_occurrence_id
visit_detail_id
unit_source_concept_id
value_source_value

Table name: episode

Reading from onc_lines_of_therapy

alt_text

Destination Field	Source Field	Logic	Comment
episode_id		Auto-increment
person_id	ptid
episode_concept_id	32531		Treatment Regimen
episode_start_date	initiation
episode_start_datetime
episode_end_date	last_administration
episode_end_datetime
episode_parent_id		The episode corresponding to onc_lines_of_therapy.cancer_type. (can be connected to the actual diagnosis, to investigate after generation of CONDITION occurrence as described above) episode_concept_id = 32533 (Disease Episode) episode_object = cancer_type mapped to a standard condition concept	Can be implemented in the next run
episode_number	lot
episode_object_concept_id	regimen_name	Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCLOT’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero or Consider creating new concepts.
episode_type_concept_id	32882		Standard algorithm from EHR
episode_source_value		regimen_name
episode_source_concept_id		NULL

Table name: episode_event

Connects episode table and drug_exposure

connect with the drug_exposure, with a query like this:

Table name: observation

Reading from onc_treatment_response

alt_text

Destination Field	Source Field	Logic	Comment
observation_id
person_id	ptid
observation_date	note_date
observation_concept_id	0		Do not even create a source concept for it.
observation_datetime
observation_type_concept_id	32882		Standard algorithm from EHR
value_as_number
value_as_string
value_as_concept_id
qualifier_concept_id
unit_concept_id
provider_id
visit_occurrence_id
visit_detail_id
observation_source_value	`treatment_response \|\| ' to ' \|\| treatment`		Populate it as is in the first run, if someone wants to use it the data will be available in the CDM
observation_source_concept_id	0
unit_source_value
qualifier_source_value
value_source_value
observation_event_id	Event of a drug regimen the response is recorded to		Can be done in the next iteration
obs_event_field_concept_id	EPISODE		Can be done in the next iteration

Reading from onc_tumor_progression

This table has to be mapped to an Episode table, but in the first version it can sit in the Observation.

alt_text

Destination Field	Source Field	Logic	Comment
observation_id
person_id	ptid
observation_date	note_date
observation_event_id	neoplasm_histology_key		Connect with the corresponding condition
observation_source_value	progression
observation_concept_id
observation_datetime
observation_type_concept_id	32882		Standard algorithm from EHR
value_as_number
value_as_string
value_as_concept_id
qualifier_concept_id
unit_concept_id
provider_id
visit_occurrence_id
visit_detail_id
observation_source_concept_id
unit_source_value
qualifier_source_value
value_source_value
obs_event_field_concept_id

Useful queries

Get Condition concepts for the mapping

select distinct
case when neoplasm_characteristic = histology_characteristic then
regexp_replace (trim (coalesce (histology_characteristic, '')||' '||coalesce (direction, '')||' '|| coalesce (neoplasm_type, '')||' '|| coalesce (histology, '')), ' +',' ')
else regexp_replace (trim (coalesce (neoplasm_characteristic, '') ||' '|| coalesce (histology_characteristic, '')||' '||coalesce (direction, '')||' '|| coalesce (neoplasm_type, '')||' '
|| coalesce (histology, 'neoplasm')), ' +',' ') end as source_code,
0 as source_concept_id,
'JNJ_OPTUM_EHR_ONCNHC'   as source_vocabulary_id,
case when neoplasm_characteristic = histology_characteristic then
regexp_replace (trim (coalesce (histology_characteristic, '')||' '||coalesce (direction, '')||' '|| coalesce (neoplasm_type, '')||' '|| coalesce (histology, '')), ' +',' ')
else regexp_replace (trim (coalesce (neoplasm_characteristic, '') ||' '|| coalesce (histology_characteristic, '')||' '||coalesce (direction, '')||' '|| coalesce (neoplasm_type, '')||' '
|| coalesce (histology, 'neoplasm')), ' +',' ') end as source_code_description,
0 as target_concept_id,
to_date ('10-10-2022', 'dd-MM-yyyy') as valid_start_date,
to_date ('31-12-2099', 'dd-MM-yyyy') as valid_end_date,
null as invalid_reason
from onc_neoplasm_histology nh
left join onc_characteristic ch on  nh.neoplasm_histology_key= ch.neoplasm_histology_key And (ch.neoplasm_char_temporal_status ='current' OR ch.histology_char_temporal_status ='current')
and (neoplasm_characteristic in ('in situ','malignant', 'benign') or histology_characteristic in ('in situ', 'malignant', 'benign') )
where (neoplasm_temporal_status  ='current' OR  histology_temporal_status  ='current')
AND (neoplasm_qualifier ='actual' OR histology_qualifier ='actual')

Extracting the Largest Dimension of Tumor

--convert tumor size
--convert tumor size
create view
 ts_values as
 with
ts_clean as (
select * from (
select ptid, note_date, neoplasm_histology_key,
case when tumor_size_unit_1 ='mm' then tumor_size_1::float/10  when tumor_size_unit_1 ='cm' then tumor_size_1::float  end as tumor_size_1 ,
case when tumor_size_unit_2 ='mm' then tumor_size_2::float/10  when tumor_size_unit_2 ='cm' then tumor_size_2::float end as tumor_size_2,
case when tumor_size_unit_3 ='mm' then tumor_size_3::float/10  when tumor_size_unit_3 ='cm' then tumor_size_3::float end as tumor_size_3
 from (
select ptid, note_date, neoplasm_histology_key, tumor_size_1, coalesce (tumor_size_unit_1,tumor_size_unit_2,tumor_size_unit_3) as tumor_size_unit_1,
tumor_size_2, coalesce (tumor_size_unit_2,tumor_size_unit_3) as tumor_size_unit_2 , tumor_size_3, tumor_size_unit_3
from native_optum_ehr_v2438.onc_tumor_size)
)
where tumor_size_1 is not null -- when unit is empty it becomes null or when the first value is null, it's still looks like a missing data
)
select  CASE
    WHEN tumor_size_1 >= coalesce (tumor_size_2,0) AND tumor_size_1 >= coalesce (tumor_size_3,0) THEN tumor_size_1
    WHEN coalesce (tumor_size_2,0) >= tumor_size_1 AND coalesce (tumor_size_2,0) >= coalesce (tumor_size_3,0) THEN tumor_size_2
    ELSE tumor_size_3
    END as max_value,
    ptid, note_date, neoplasm_histology_key, tumor_size_1, tumor_size_2, tumor_size_3
    from ts_clean
;
--rule 1
    select  ptid, note_date, neoplasm_histology_key, max_value as value_as_number, 36768255 as measurement_concept_id, 8582 as unit_as_concept_id from ts_values
;
--rule 2
    select  ptid, note_date, neoplasm_histology_key, tumor_size_1 as value_as_number, 36768664 as measurement_concept_id, 8582 as unit_as_concept_id from ts_values
    where tumor_size_1 !=max_value
    union
    select  ptid, note_date, neoplasm_histology_key, tumor_size_2 as value_as_number, 36768664 as measurement_concept_id, 8582 as unit_as_concept_id from ts_values
    where tumor_size_2 !=max_value
    union
    select  ptid, note_date, neoplasm_histology_key, tumor_size_3 as value_as_number, 36768664 as measurement_concept_id, 8582 as unit_as_concept_id from ts_values
    where tumor_size_3 !=max_value;
--draft of a query that merges onc_neoplasm_histology when one of the fields is empty and at the same date this field is populated
with bb as (
(
select distinct
b.neoplasm_histology_key
from onc_neoplasm_histology a
join onc_neoplasm_histology b on a.ptid =b.ptid
and a.note_date = b.note_date
and a.neoplasm_type ='' and b.neoplasm_type !='' and b.neoplasm_qualifier ='actual' and b.neoplasm_temporal_status ='current'
and a.histology !='' and b.histology='' and a.histology_qualifier ='actual' and a.histology_temporal_status ='current'
),
aa as (
select distinct
a.neoplasm_histology_key
from onc_neoplasm_histology a
join onc_neoplasm_histology b on a.ptid =b.ptid
and a.note_date = b.note_date
and a.neoplasm_type ='' and b.neoplasm_type !='' and b.neoplasm_qualifier ='actual' and b.neoplasm_temporal_status ='current'
and a.histology !='' and b.histology='' and a.histology_qualifier ='actual' and a.histology_temporal_status ='current'
),
aa_bb as (
select distinct
a.neoplasm_histology_key ||'/'||b.neoplasm_histology_key , --list colums needed, that usually the table has as a result of merge
from onc_neoplasm_histology a
join onc_neoplasm_histology b on a.ptid =b.ptid
and a.note_date = b.note_date
and a.neoplasm_type ='' and b.neoplasm_type !='' and b.neoplasm_qualifier ='actual' and b.neoplasm_temporal_status ='current'
and a.histology !='' and b.histology='' and a.histology_qualifier ='actual' and a.histology_temporal_status ='current'
)
select * from onc_neoplasm_histology where neoplasm_histology_key not in (select aa.neoplasm_histology_key) and neoplasm_histology_key not in (select bb.neoplasm_histology_key)
union
select * from aa_bb

Connect Drug Exposure and lines of therapy

--this is just one patient, need to check how it works with all the patients after we map Episodes
select ma.drug_name, admin_date, lt.* from medication_administrations ma
join onc_lines_of_therapy lt on lt.ptid = ma.ptid and ma.admin_date between lt.initiation and lt.last_administration
where ma.ptid = 'PT306269791'
and ma.drug_name ~'BEVACIZUMAB|FLUOROURACIL|LEUCOVORIN|OXALIPLATIN|IRINOTECAN|CETUXIMAB|PANITUMUMAB|SODIUM IODINE I-131' -- maybe
order by admin_date

Future development

To connect onc_treatment_response and the actual treatment

We need to connect Episode table made from the Lines of therapy to the Observational table made from treatment_response, we can use the query like this: (it returns 18007 out 74698 onc_treatment_response patients has, so these tables are not very consistent)

THUS it is a lower priority

select count(distinct ptid)

from onc_treatment_response r

join onc_lines_of_therapy olot using (ptid)

where r.note_date between olot.initiation and olot.last_administration +30

and treatment not like ‘%radiotherapy%’ – exclude these as we work with the drugs

The treatment response can be stored in Observation table and connected through the observation_event_id to the EPISODE with drug regimen.

Add the Lymph Nodes table to the cancer modifiers

Add empty system name entries

*If the system name is not documented in the table it means, the system name was not mentioned in the notes; Null SYSTEM_NAME with grades for GRADE_TUBULAR, GRADE_NUCLEAR and GRADE_MITOTIC are typically associated with mammary neoplasms, but also can be applied to other NEOPLASM_TYPES.

Justification of decisions

Omit entries where histology_characteristic = ‘in situ’ and invasive at the same day

Since there are mostly same diagnoses associated with both modifiers, we treat these entries as erroneous and remove such entries.

Answer from OPTUM:

There could be multiple reasons that you see more than one characteristics for a neoplasm or histology.

Using the examples provided, ‘in situ’ and ‘invasive’ were mentioned in the same notes but the NLP model didn’t accurately capture the context to make sense of them. For example, the temporality for ‘in situ’ might actually be ‘history’ or a more granular term ‘5 months ago’.

Another reason is one of the terms belongs to another cancer if the patient has multiple cancers. For example on row 3 of the stats table, the patients being counted have both ‘ductal carcinoma’ and ‘carcinoma’ without a linked neoplasm. While ‘ductal carcinoma’ indicates breast cancer, ‘carcinoma’ might be for other cancers.

The workaround we recommend is using ICD code to corroborate. ‘in situ’ is a ‘D%’ code and ‘invasive’ is ‘C%’. So when a patient had been coded as ‘C%’ while having both ‘in situ’ and ‘invasive’ mentions, they are more likely already an ‘invasive’.

I’ve noticed 1 308 343 entries or 86 550 distinct patients when onc_characteristic.histology_characteristic = 'in situ' and at the same day it's equal to 'invasive'

select –1308343
count(1)
from onc_characteristic a
join onc_characteristic b using(ptid , note_date)
where a.histology_characteristic ='in situ' and b.histology_characteristic = 'invasive'
and (a.neoplasm_char_temporal_status  ='current' OR  a.histology_char_temporal_status  ='current') and (b.neoplasm_char_temporal_status  ='current' OR  b.histology_char_temporal_status  ='current')
;

And if we look at the diagnoses associated, it’s the same diagnoses mostly:

select oa.neoplasm_type , oa.histology , ob.neoplasm_type , ob.histology, count(1)
from onc_characteristic a
join onc_characteristic b using(ptid , note_date)
join onc_neoplasm_histology oa on oa.neoplasm_histology_key =a.neoplasm_histology_key
join onc_neoplasm_histology ob on ob.neoplasm_histology_key =b.neoplasm_histology_key
where a.histology_characteristic ='in situ' and b.histology_characteristic = 'invasive'
and (a.neoplasm_char_temporal_status  ='current' OR  a.histology_char_temporal_status  ='current') and (b.neoplasm_char_temporal_status  ='current' OR  b.histology_char_temporal_status  ='current')
group by oa.neoplasm_type , oa.histology , ob.neoplasm_type , ob.histology
order by count(1) desc
;

Here are some top diagnoses, the left associated with ‘in situ’, the right one associated with ‘invasive’ characteristic.

neoplasm_type	histology	neoplasm_type	histology	count
[NULL]	ductal carcinoma	[NULL]	ductal carcinoma	475,240
[NULL]	ductal carcinoma	mammary	ductal carcinoma	204,608
[NULL]	ductal carcinoma	[NULL]	carcinoma	148,016
mammary	ductal carcinoma	[NULL]	ductal carcinoma	42,496
[NULL]	lobular carcinoma	[NULL]	lobular carcinoma	41,146
[NULL]	lobular carcinoma	[NULL]	ductal carcinoma	34,411
mammary	ductal carcinoma	mammary	ductal carcinoma	27,113
[NULL]	lobular carcinoma	[NULL]	carcinoma	25,848
[NULL]	ductal carcinoma	[NULL]	lobular carcinoma	22,801
[NULL]	lobular carcinoma	mammary	lobular carcinoma	18,299
[NULL]	ductal carcinoma	mammary	carcinoma	17,361
[NULL]	carcinoma	[NULL]	carcinoma	16,622
[NULL]	carcinoma	[NULL]	ductal carcinoma	16,081
[NULL]	lobular carcinoma	mammary	ductal carcinoma	13,753