OPTUM EHR Oncology Module ETL documentation
Background
The Optum® Enriched Oncology Data set is a group of tables that can supplement the Optum® de-identified Electronic Health Record dataset. It contains additional cancer-specific information on a subset of the EHR population that has at least one solid or non-solid tumor diagnosis. In addition to structured data obtained from medical records, the enriched oncology tables provide clinical information extracted from provider notes using natural language processing (NLP) machine learning models (for example, tumor progression, histology, etc.).
Optum oncology data initiatives include enriching Optum EHR data by extracting essential information from the oncology patient’s medical records and making it usable for researchers. Specific oncology concepts important for understanding the progression of the disease are often not available in structured formats, particularly the tumor, node, and metastasis (TNM) values, stage information and biomarkers.
As of 2022, there are approximately 1.9 million patients with at least one solid tumor ICD-9 or ICD-10 diagnosis included in the data set. Given the potential for research, the dataset was converted to the OMOP CDM to allow integration with standard OHDSI methods and tools.
How to run this conversion
This conversion is run after the main corpus of the data is converted, then the results are inserted in CDM tables.
Overall logic – linkage between source and CDM tables
Table name: condition_occurrence
Reading from onc_neoplasm_histology, Onc_characteristic
Onc_neoplasm_histology contains neoplasm type (topography) and histology, which are the typical axes for the OMOP cancer diagnosis, it also has ‘direction’ field corresponding to the laterality which is a part of diagnosis in SNOMED as well.
Onc_characteristic table has neoplasm_characteristic and histology_characteristic fields that mostly describe typical cancer modifiers but also have ‘in situ’, ‘malignant’, ‘benign’ values that are a part of a diagnosis in OMOP.
If onc_characteristic.histology_characteristic = ‘in situ’ and at the same day onc_characteristic.histology_characteristic = ‘invasive’, do not use these onc_characteristic entries in the ETL.
Thus, Condition concept is a result of a precordination of 5 columns:
Onc_neoplasm_histology.neoplasm_type
Onc_neoplasm_histology.histology
Onc_neoplasm_histology.direction
Onc_characteristic.neoplasm_characteristic
Onc_characteristic.histology_characteristic
We put only positive actual values into OMOP CDM:
Read from onc_neoplasm_histology where (neoplasm_temporal_status =’current’ OR histology_temporal_status =’current’)
AND (neoplasm_qualifier =’actual’ OR histology_qualifier =’actual’)
Read from the Onc_characteristic where (ch.neoplasm_char_temporal_status =’current’ OR histology_char_temporal_status =’current’)
Note, one of the histology or neoplasm column can have empty value, but the other one is always filled, so we use OR in these cases.
Once table is generated, de-dup it by (condition_source_value, person_id , condition_start_date)
You also have to create a crosswalk between condition_occurrence_id to neoplasm_histology_key with 1:1 to 1:M condition_occurrence_id: neoplasm_histology_key
Direction with o’clock goes to measurement
Direction with laterality goes to Condition
onc_neoplasm_histology diagram
onc_characteristic diagram
Destination Field | Source Field | Logic | Comment |
condition_occurrence_id | |||
person_id | ptid | ||
condition_concept_id | Precoordination of onc_neoplasm_histology: neoplasm_type + onc_characteristic: neoplasm_characteristic histology_characteristic See the script how to put them together | Mapping of the output of the script that concatenates these 5 fields and get rid of duplicates. Same logic is used to map the source data. Note, for now not merge only histology or only neoplasm entries using a query because several cancers can be presents at one day Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCNHC’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero | |
condition_start_date | onc_neoplasm_histology .note_date onc_neoplasm_histology .neoplasm_dx_date | When neoplasm_dx_date =’exact’ then to_date (neoplasm_dx_date, ‘yyyyMMdd’) When When Else note_date | |
condition_start_datetime | |||
condition_end_date | |||
condition_end_datetime | |||
condition_type_concept_id | 32882 | Standard algorithm from EHR | |
condition_status_concept_id | |||
stop_reason | |||
provider_id | |||
visit_occurrence_id | |||
visit_detail_id | |||
condition_source_value | See the logic in condition_concept_id | ||
condition_source_concept_id | 0 | ||
condition_status_source_value | NULL |
Table name: measurement
Reading from onc_stage
Where stage!=''
Destination Field | Source Field | Logic | Comment |
measurement_id | |||
person_id | ptid | ||
measurement_concept_id | replace ( 'stage '|| stage_prefix || ' '|| stage, ' ', ' ') | Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCSTG’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL \ If there is no mapping available, set concept_id to zero | |
measurement_date | note_date | ||
measurement_source_value | replace ( 'stage '|| stage_prefix || ' '|| stage, ' ', ' ') | ||
value_as_number | |||
unit_concept_id | |||
value_as_concept_id | |||
meas_event_field_concept_id | 1147127 | 'condition_occurrence.condition_occurrence_id' | |
measurement_event_id | fill by CONDITION.condition_id using neoplasm_histology_key OR if neoplasm_histology_key is null, use (ptid, note_date) to connect with the onc_neoplasm_histology table | ||
measurement_source_concept_id | 0 | ||
measurement_datetime | |||
measurement_time | |||
measurement_type_concept_id | 32882 | Standard algorithm from EHR | |
operator_concept_id | |||
range_low | |||
range_high | |||
provider_id | |||
visit_occurrence_id | |||
visit_detail_id | |||
unit_source_value | |||
unit_source_concept_id | |||
value_source_value |
Reading from onc_metastatic_location
Rule: populate CDM only when mets_temporal_status = ‘current’, history of measurement is not supported by the model. Rows where metastasis_location is NULL (metastasis_location=’’) are included, they mean that metastasis is present but the exact location is unknown.
Destination Field | Source Field | Logic | Comment |
measurement_id | |||
person_id | ptid | ||
measurement_concept_id | ‘Metastasis to ‘|| metastasis_location | Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCMET’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero | |
measurement_date | note_date metastasis_dx_date | When metastasis_dx_date_type =’exact’ then to_date (metastasis_dx_date, ‘yyyyMMdd’) When When Else note_date | |
measurement_source_value | ‘Metastasis to ‘|| metastasis_location | ||
value_as_number | |||
unit_concept_id | |||
value_as_concept_id | mets_qualifier | When mets_qualifier = ‘actual’ then 4181412 -- Present When mets_qualifier =’absent’ then 4132135 -- Absent | |
meas_event_field_concept_id | 1147127 | 'condition_occurrence.condition_occurrence_id' | |
measurement_event_id | fill by CONDITION.condition_id using neoplasm_histology_key OR if neoplasm_histology_key is null, use (ptid, note_date), this should be analyzed better, maybe we do not connect the events at all if neoplasm_histology_key is null | ||
measurement_source_concept_id | 0 | ||
measurement_datetime | |||
measurement_time | |||
measurement_type_concept_id | 32882 | Standard algorithm from EHR | |
operator_concept_id | |||
range_low | |||
range_high | |||
provider_id | |||
visit_occurrence_id | |||
visit_detail_id | |||
unit_source_value | |||
unit_source_concept_id | |||
value_source_value | mets_qualifier | mets_qualifier |
Reading from onc_tumor_node_metastasis
Populate only if the main value (T, N or M) is not null:
Destination Field | Source Field | Logic | Comment |
measurement_id | |||
person_id | ptid | ||
measurement_concept_id | trim (stage_prefix||' '|| t_prefix|| 'T'||t|| t_suffix) trim (stage_prefix||' '||n_prefix||'N'||n|| n_suffix) trim (stage_prefix||' '|| m_prefix||'M'||m|| m_suffix) Before concatenation NULL values to be replaced with ‘’ | Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCTNM’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero | each T, N and M category makes a separate measurement entry |
measurement_date | note_date | ||
measurement_source_value | See measurement_concept_id | ||
value_as_number | |||
unit_concept_id | |||
value_as_concept_id | |||
meas_event_field_concept_id | 1147127 | 'condition_occurrence.condition_occurrence_id' | |
measurement_event_id | condition_occurrence.condition_occurrence_id | fill by CONDITION.condition_id using neoplasm_histology_key OR if neoplasm_histology_key is null, use (ptid, note_date) this should be analyzed better. | |
measurement_source_concept_id | |||
measurement_datetime | |||
measurement_time | |||
measurement_type_concept_id | 32882 | Standard algorithm from EHR | |
operator_concept_id | |||
range_low | |||
range_high | |||
provider_id | |||
visit_occurrence_id | |||
visit_detail_id | |||
unit_source_value | |||
unit_source_concept_id | |||
value_source_value |
Reading from onc_tumor_grade
Destination Field | Source Field | Logic | Comment |
measurement_id | |||
person_id | ptid | ||
measurement_concept_id | ‘Tumor grade: ’|| tumor_grade | Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCTGR’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero | each category makes a separate measurement entry |
measurement_date | note_date | ||
measurement_source_value | ‘Tumor grade: ’|| tumor_grade | ||
value_as_number | |||
unit_concept_id | |||
value_as_concept_id | |||
meas_event_field_concept_id | 1147127 | 'condition_occurrence.condition_occurrence_id' | |
measurement_event_id | condition_occurrence.condition_occurrence_id | fill by CONDITION.condition_id using neoplasm_histology_key OR if neoplasm_histology_key is null, use (ptid, note_date) this should be analyzed better. | |
measurement_source_concept_id | 0 | ||
measurement_datetime | |||
measurement_time | |||
measurement_type_concept_id | 32882 | Standard algorithm from EHR | |
operator_concept_id | |||
range_low | |||
range_high | |||
provider_id | |||
visit_occurrence_id | |||
visit_detail_id | |||
unit_source_value | |||
unit_source_concept_id | |||
value_source_value |
Reading from onc_characteristic
Rule: Populate only with temporal_status ='current'
and
exclude characteristics that are the part of a condition or NULL:
Note, the char_temporal_status always accompanies its respective characteristic.
If onc_characteristic.histology_characteristic = ‘in situ’ and at the same day onc_characteristic.histology_characteristic = ‘invasive’, do not use these onc_characteristic entries in the ETL.
So, in SQL we can say (use similar logic to extract the concepts for mapping):
Select ptid, note_date, neoplasm_characteristic
from onc_characteristic
where neoplasm_char_temporal_status ='current'
and neoplasm_characteristic not in ('in situ', 'malignant', 'benign')
and neoplasm_characteristic !=''
Union –- deals with duplicates when
neoplasm_characteristic = histology_characteristic
Select ptid, note_date, histology_characteristic
from onc_characteristic
where histology_char_temporal_status ='current'
and histology_characteristic not in ('in situ', 'malignant', 'benign')
and histology_characteristic !=''
Destination Field | Source Field | Logic | Comment |
measurement_id | |||
person_id | ptid | ||
measurement_concept_id | neoplasm_characteristic histology_characteristic | Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCCHR’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero | Each field creates its own entry in measurement table except the cases when neoplasm_characteristic = histology_characteristic |
measurement_date | note_date | ||
measurement_source_value | neoplasm_characteristic histology_characteristic | Each field creates its own entry in measurement table except the cases when neoplasm_characteristic = histology_characteristic | |
value_as_number | |||
unit_concept_id | |||
value_as_concept_id | when the neoplasm_characteristic = ‘metastatic’ or histology_characteristic = ‘metastatic’ then 4181412 -- Present | ||
meas_event_field_concept_id | 1147127 | 'condition_occurrence.condition_occurrence_id' | |
measurement_event_id | condition_occurrence.condition_occurrence_id | fill by CONDITION.condition_id using neoplasm_histology_key OR if neoplasm_histology_key is null, use (ptid, note_date) this should be analyzed better. | |
measurement_source_concept_id | 0 | ||
measurement_datetime | |||
measurement_time | |||
measurement_type_concept_id | 32882 | Standard algorithm from EHR | |
operator_concept_id | |||
range_low | |||
range_high | |||
provider_id | |||
visit_occurrence_id | |||
visit_detail_id | |||
unit_source_value | |||
unit_source_concept_id | |||
value_source_value |
Reading from onc_biomarker
Problem 1. (to be discussed with the WG)
There are both numeric (1+, 2+, 3+) and narrative results such as negative, equivocal, positive.
Can we interpret 1+ as negative, 2+ as equivocal, 3+ as positive?
For example, 2+ has an equivalent number of equivocal and (positive+negative)
The query:
select narrative_result, count(*) from onc_biomarker where numeric_result ='2 +'
and biomarker in ('erb-b2 receptor tyrosine kinase 2 (ERBB2 or HER2/neu)', 'estrogen receptor (ER)')
group by narrative_result
order by count(*) desc
-
Rule: Migrate only the rows where temporal_status=’current’ and biomarker is not NULL and with results (either numeric_result or narrative_result are populated).
Don’t worry about The rows with biomarker_temporal_status is null, it’s ONLY for rows including test_name-s, like ‘Flow Cytometry’, etc. And we don’t map these as OMOP genomics model doesn’t support this. And it has small scientific Impact
Destination Field | Source Field | Logic | Comment |
measurement_id | |||
person_id | ptid | ||
measurement_concept_id | Biomarker||’ ‘|| gene_characteristics | Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCBM’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero | Source names are different from the NLP_biomarker table, so a new source vocabulary_id is created |
measurement_date | note_date | ||
measurement_source_value | Biomarker, gene_characteristics | Biomarker || ‘ ‘|| gene_characteristics | |
value_as_number | numeric_result | Extract number from numeric_result except of ‘num +’ cases, e.g. ‘1 +’, these will be mapped to concept. When result is given as an interval, for example, 14-16%, create 2 rows in measurement table, one with >=14%, another with <=16% (populating operator_concept_id, value_as_number, unit_concept_id) | |
unit_concept_id | numeric_result | Extract ‘Muts/Mb’, ‘%’ from numeric_result and map using STCM with the source_vocabulary_id =’JNJ UNITS’ | |
value_as_concept_id | narrative_result|| ‘ ‘ || modifier_narrative numeric_result \ where numeric_result like ‘num +’
| Use the SOURCE_TO_STANDARD query with case insensitive match to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_LABRES’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero. | |
meas_event_field_concept_id | 1147127 | 'condition_occurrence.condition_occurrence_id' | |
measurement_event_id | condition_occurrence.condition_occurrence_id | Connect with the Condition table through onc_neopls_histology using (neoplasm_histology_key) or using (ptid, note_date) | |
measurement_source_concept_id | 0 | ||
measurement_datetime | |||
measurement_time | |||
measurement_type_concept_id | 32882 | Standard algorithm from EHR | |
operator_concept_id | numeric_result | Extract operators from numeric_result and map using CONCEPT.name where domain_id =’Meas Value Operator’ | |
range_low | |||
range_high | |||
provider_id | |||
visit_occurrence_id | |||
visit_detail_id | |||
unit_source_value | numeric_result | Extract ‘Muts/Mb’, ‘%’ from numeric_result | |
unit_source_concept_id | 0 | ||
value_source_value | narrative_result modifier_narrative | Numeric_result ||’ ‘|| narrative_result ||’ ‘|| modifier_narrative | If some field is null, use other fields to concat |
Reading from onc_evaluation_system
- Rule: Migrate only the rows where temporal_status=’current’,
We omit rows with a NULL temporal_status since they also have a NULL system_name, and we don’t want to guess both the temporal_status and the system_name.
Destination Field | Source Field | Comment | |
measurement_id | |||
person_id | ptid | ||
measurement_concept_id | concept obtained using the rules:
In the other cases do not populate CDM, for example entries with system = | Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: \ \ Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCEVS’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL \ \ If there is no mapping available, set concept_id to zero. | |
measurement_date | note_date | ||
measurement_source_value | See the rules in measurement_concept_id | ||
value_as_number | result_numeric | Values like 0/5, 1/5 – 5/5 are parsed to respective 0, 1, 5 Those stand for ECOG score out of 5 | Populate when system_name in ('ECOG performance status', In the other cases result_numeric precoordinates with the other fields and maps to measurement_concept_id |
unit_concept_id | |||
value_as_concept_id | result_narrative | Use the SOURCE_TO_STANDARD query with case insensitive match to map the code to standard concept(s) with the following filters : Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_LABRES’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero. | |
meas_event_field_concept_id | 1147127 | condition_occurrence.condition_occurrence_id | |
measurement_event_id | condition_occurrence.condition_occurrence_id | Connect with the Condition table through onc_neopls_histology using (neoplasm_histology_key) or using (ptid, note_date) | |
measurement_source_concept_id | 0 | ||
measurement_datetime | |||
measurement_time | |||
measurement_type_concept_id | 32882 | Standard algorithm from EHR | |
operator_concept_id | |||
range_low | |||
range_high | |||
provider_id | |||
visit_occurrence_id | |||
visit_detail_id | |||
unit_source_value | |||
unit_source_concept_id | |||
value_source_value | result_narrative |
Reading from onc_tumor_size
Destination Field | Source Field | Logic | Comment |
measurement_id | |||
person_id | ptid | ||
measurement_event_id | neoplasm_histology_key | fill by CONDITION.condition_id using neoplasm_histology_key | |
measurement_concept_id | Rule 1: 36768255 Rule 2: 36768664 | Rule 1: Largest Dimension of Tumor Rule 2: Dimension of Tumor | |
measurement_date | note_date | ||
measurement_source_value | Rule 1: Largest Dimension of Tumor Rule 2: Dimension of Tumor | ||
value_as_number | Rule 1: Greatest (tumor_size_1, tumor_size_2, tumor_size_3) Rule 2: another size, not defined as greatest | See query here | |
unit_concept_id | 8582 | centimeter | |
unit_source_value | ‘cm’ | ||
value_as_concept_id | |||
meas_event_field_concept_id | |||
measurement_source_concept_id | |||
measurement_datetime | |||
measurement_time | |||
measurement_type_concept_id | 32882 | Standard algorithm from EHR | |
operator_concept_id | |||
range_low | |||
range_high | |||
provider_id | |||
visit_occurrence_id | |||
visit_detail_id | |||
unit_source_concept_id | |||
value_source_value |
Table name: episode
Reading from onc_lines_of_therapy
Destination Field | Source Field | Logic | Comment |
episode_id | Auto-increment | ||
person_id | ptid | ||
episode_concept_id | 32531 | Treatment Regimen | |
episode_start_date | initiation | ||
episode_start_datetime | |||
episode_end_date | last_administration | ||
episode_end_datetime | |||
episode_parent_id | The episode corresponding to onc_lines_of_therapy.cancer_type. (can be connected to the actual diagnosis, to investigate after generation of CONDITION occurrence as described above) episode_concept_id = 32533 (Disease Episode) episode_object = cancer_type mapped to a standard condition concept | Can be implemented in the next run | |
episode_number | lot | ||
episode_object_concept_id | regimen_name | Use the SOURCE_TO_STANDARD query to map the code to standard concept(s) with the following filters: Where source_vocabulary_id = ‘JNJ_OPTUM_EHR_ONCLOT’ and Target_standard_concept = ‘S’ and target_invalid_reason is NULL If there is no mapping available, set concept_id to zero or Consider creating new concepts. | |
episode_type_concept_id | 32882 | Standard algorithm from EHR | |
episode_source_value | regimen_name | ||
episode_source_concept_id | NULL |
Table name: episode_event
Connects episode table and drug_exposure
connect with the drug_exposure, with a query like this:
#
Table name: observation
Reading from onc_treatment_response
Destination Field | Source Field | Logic | Comment |
observation_id | |||
person_id | ptid | ||
observation_date | note_date | ||
observation_concept_id | 0 | Do not even create a source concept for it. | |
observation_datetime | |||
observation_type_concept_id | 32882 | Standard algorithm from EHR | |
value_as_number | |||
value_as_string | |||
value_as_concept_id | |||
qualifier_concept_id | |||
unit_concept_id | |||
provider_id | |||
visit_occurrence_id | |||
visit_detail_id | |||
observation_source_value | treatment_response || ' to ' || treatment | Populate it as is in the first run, if someone wants to use it the data will be available in the CDM | |
observation_source_concept_id | 0 | ||
unit_source_value | |||
qualifier_source_value | |||
value_source_value | |||
observation_event_id | Event of a drug regimen the response is recorded to | Can be done in the next iteration | |
obs_event_field_concept_id | EPISODE | Can be done in the next iteration |
Reading from onc_tumor_progression
This table has to be mapped to an Episode table, but in the first version it can sit in the Observation.
Destination Field | Source Field | Logic | Comment |
observation_id | |||
person_id | ptid | ||
observation_date | note_date | ||
observation_event_id | neoplasm_histology_key | Connect with the corresponding condition | |
observation_source_value | progression | ||
observation_concept_id | |||
observation_datetime | |||
observation_type_concept_id | 32882 | Standard algorithm from EHR | |
value_as_number | |||
value_as_string | |||
value_as_concept_id | |||
qualifier_concept_id | |||
unit_concept_id | |||
provider_id | |||
visit_occurrence_id | |||
visit_detail_id | |||
observation_source_concept_id | |||
unit_source_value | |||
qualifier_source_value | |||
value_source_value | |||
obs_event_field_concept_id |
Useful queries
Get Condition concepts for the mapping
select distinct
case when neoplasm_characteristic = histology_characteristic then
regexp_replace (trim (coalesce (histology_characteristic, '')||' '||coalesce (direction, '')||' '|| coalesce (neoplasm_type, '')||' '|| coalesce (histology, '')), ' +',' ')
else regexp_replace (trim (coalesce (neoplasm_characteristic, '') ||' '|| coalesce (histology_characteristic, '')||' '||coalesce (direction, '')||' '|| coalesce (neoplasm_type, '')||' '
|| coalesce (histology, 'neoplasm')), ' +',' ') end as source_code,
0 as source_concept_id,
'JNJ_OPTUM_EHR_ONCNHC' as source_vocabulary_id,
case when neoplasm_characteristic = histology_characteristic then
regexp_replace (trim (coalesce (histology_characteristic, '')||' '||coalesce (direction, '')||' '|| coalesce (neoplasm_type, '')||' '|| coalesce (histology, '')), ' +',' ')
else regexp_replace (trim (coalesce (neoplasm_characteristic, '') ||' '|| coalesce (histology_characteristic, '')||' '||coalesce (direction, '')||' '|| coalesce (neoplasm_type, '')||' '
|| coalesce (histology, 'neoplasm')), ' +',' ') end as source_code_description,
0 as target_concept_id,
to_date ('10-10-2022', 'dd-MM-yyyy') as valid_start_date,
to_date ('31-12-2099', 'dd-MM-yyyy') as valid_end_date,
null as invalid_reason
from onc_neoplasm_histology nh
left join onc_characteristic ch on nh.neoplasm_histology_key= ch.neoplasm_histology_key And (ch.neoplasm_char_temporal_status ='current' OR ch.histology_char_temporal_status ='current')
and (neoplasm_characteristic in ('in situ','malignant', 'benign') or histology_characteristic in ('in situ', 'malignant', 'benign') )
where (neoplasm_temporal_status ='current' OR histology_temporal_status ='current')
AND (neoplasm_qualifier ='actual' OR histology_qualifier ='actual')
Extracting the Largest Dimension of Tumor
--convert tumor size
--convert tumor size
create view
ts_values as
with
ts_clean as (
select * from (
select ptid, note_date, neoplasm_histology_key,
case when tumor_size_unit_1 ='mm' then tumor_size_1::float/10 when tumor_size_unit_1 ='cm' then tumor_size_1::float end as tumor_size_1 ,
case when tumor_size_unit_2 ='mm' then tumor_size_2::float/10 when tumor_size_unit_2 ='cm' then tumor_size_2::float end as tumor_size_2,
case when tumor_size_unit_3 ='mm' then tumor_size_3::float/10 when tumor_size_unit_3 ='cm' then tumor_size_3::float end as tumor_size_3
from (
select ptid, note_date, neoplasm_histology_key, tumor_size_1, coalesce (tumor_size_unit_1,tumor_size_unit_2,tumor_size_unit_3) as tumor_size_unit_1,
tumor_size_2, coalesce (tumor_size_unit_2,tumor_size_unit_3) as tumor_size_unit_2 , tumor_size_3, tumor_size_unit_3
from native_optum_ehr_v2438.onc_tumor_size)
)
where tumor_size_1 is not null -- when unit is empty it becomes null or when the first value is null, it's still looks like a missing data
)
select CASE
WHEN tumor_size_1 >= coalesce (tumor_size_2,0) AND tumor_size_1 >= coalesce (tumor_size_3,0) THEN tumor_size_1
WHEN coalesce (tumor_size_2,0) >= tumor_size_1 AND coalesce (tumor_size_2,0) >= coalesce (tumor_size_3,0) THEN tumor_size_2
ELSE tumor_size_3
END as max_value,
ptid, note_date, neoplasm_histology_key, tumor_size_1, tumor_size_2, tumor_size_3
from ts_clean
;
--rule 1
select ptid, note_date, neoplasm_histology_key, max_value as value_as_number, 36768255 as measurement_concept_id, 8582 as unit_as_concept_id from ts_values
;
--rule 2
select ptid, note_date, neoplasm_histology_key, tumor_size_1 as value_as_number, 36768664 as measurement_concept_id, 8582 as unit_as_concept_id from ts_values
where tumor_size_1 !=max_value
union
select ptid, note_date, neoplasm_histology_key, tumor_size_2 as value_as_number, 36768664 as measurement_concept_id, 8582 as unit_as_concept_id from ts_values
where tumor_size_2 !=max_value
union
select ptid, note_date, neoplasm_histology_key, tumor_size_3 as value_as_number, 36768664 as measurement_concept_id, 8582 as unit_as_concept_id from ts_values
where tumor_size_3 !=max_value;
--draft of a query that merges onc_neoplasm_histology when one of the fields is empty and at the same date this field is populated
with bb as (
(
select distinct
b.neoplasm_histology_key
from onc_neoplasm_histology a
join onc_neoplasm_histology b on a.ptid =b.ptid
and a.note_date = b.note_date
and a.neoplasm_type ='' and b.neoplasm_type !='' and b.neoplasm_qualifier ='actual' and b.neoplasm_temporal_status ='current'
and a.histology !='' and b.histology='' and a.histology_qualifier ='actual' and a.histology_temporal_status ='current'
),
aa as (
select distinct
a.neoplasm_histology_key
from onc_neoplasm_histology a
join onc_neoplasm_histology b on a.ptid =b.ptid
and a.note_date = b.note_date
and a.neoplasm_type ='' and b.neoplasm_type !='' and b.neoplasm_qualifier ='actual' and b.neoplasm_temporal_status ='current'
and a.histology !='' and b.histology='' and a.histology_qualifier ='actual' and a.histology_temporal_status ='current'
),
aa_bb as (
select distinct
a.neoplasm_histology_key ||'/'||b.neoplasm_histology_key , --list colums needed, that usually the table has as a result of merge
from onc_neoplasm_histology a
join onc_neoplasm_histology b on a.ptid =b.ptid
and a.note_date = b.note_date
and a.neoplasm_type ='' and b.neoplasm_type !='' and b.neoplasm_qualifier ='actual' and b.neoplasm_temporal_status ='current'
and a.histology !='' and b.histology='' and a.histology_qualifier ='actual' and a.histology_temporal_status ='current'
)
select * from onc_neoplasm_histology where neoplasm_histology_key not in (select aa.neoplasm_histology_key) and neoplasm_histology_key not in (select bb.neoplasm_histology_key)
union
select * from aa_bb
Connect Drug Exposure and lines of therapy
--this is just one patient, need to check how it works with all the patients after we map Episodes
select ma.drug_name, admin_date, lt.* from medication_administrations ma
join onc_lines_of_therapy lt on lt.ptid = ma.ptid and ma.admin_date between lt.initiation and lt.last_administration
where ma.ptid = 'PT306269791'
and ma.drug_name ~'BEVACIZUMAB|FLUOROURACIL|LEUCOVORIN|OXALIPLATIN|IRINOTECAN|CETUXIMAB|PANITUMUMAB|SODIUM IODINE I-131' -- maybe
order by admin_date
Future development
To connect onc_treatment_response and the actual treatment
We need to connect Episode table made from the Lines of therapy to the Observational table made from treatment_response, we can use the query like this: (it returns 18007 out 74698 onc_treatment_response patients has, so these tables are not very consistent)
THUS it is a lower priority
select count(distinct ptid)
from onc_treatment_response r
join onc_lines_of_therapy olot using (ptid)
where r.note_date between olot.initiation and olot.last_administration +30
and treatment not like ‘%radiotherapy%’ – exclude these as we work with the drugs
The treatment response can be stored in Observation table and connected through the observation_event_id to the EPISODE with drug regimen.
Add the Lymph Nodes table to the cancer modifiers
Add empty system name entries
*If the system name is not documented in the table it means, the system name was not mentioned in the notes; Null SYSTEM_NAME with grades for GRADE_TUBULAR, GRADE_NUCLEAR and GRADE_MITOTIC are typically associated with mammary neoplasms, but also can be applied to other NEOPLASM_TYPES.
Justification of decisions
Omit entries where histology_characteristic = ‘in situ’ and invasive at the same day
Since there are mostly same diagnoses associated with both modifiers, we treat these entries as erroneous and remove such entries.
Answer from OPTUM:
There could be multiple reasons that you see more than one characteristics for a neoplasm or histology.
Using the examples provided, ‘in situ’ and ‘invasive’ were mentioned in the same notes but the NLP model didn’t accurately capture the context to make sense of them. For example, the temporality for ‘in situ’ might actually be ‘history’ or a more granular term ‘5 months ago’.
Another reason is one of the terms belongs to another cancer if the patient has multiple cancers. For example on row 3 of the stats table, the patients being counted have both ‘ductal carcinoma’ and ‘carcinoma’ without a linked neoplasm. While ‘ductal carcinoma’ indicates breast cancer, ‘carcinoma’ might be for other cancers.
The workaround we recommend is using ICD code to corroborate. ‘in situ’ is a ‘D%’ code and ‘invasive’ is ‘C%’. So when a patient had been coded as ‘C%’ while having both ‘in situ’ and ‘invasive’ mentions, they are more likely already an ‘invasive’.
I’ve noticed 1 308 343 entries or 86 550 distinct patients when onc_characteristic.histology_characteristic = 'in situ' and at the same day it's equal to 'invasive'
select –1308343
count(1)
from onc_characteristic a
join onc_characteristic b using(ptid , note_date)
where a.histology_characteristic ='in situ' and b.histology_characteristic = 'invasive'
and (a.neoplasm_char_temporal_status ='current' OR a.histology_char_temporal_status ='current') and (b.neoplasm_char_temporal_status ='current' OR b.histology_char_temporal_status ='current')
;
And if we look at the diagnoses associated, it’s the same diagnoses mostly:
select oa.neoplasm_type , oa.histology , ob.neoplasm_type , ob.histology, count(1)
from onc_characteristic a
join onc_characteristic b using(ptid , note_date)
join onc_neoplasm_histology oa on oa.neoplasm_histology_key =a.neoplasm_histology_key
join onc_neoplasm_histology ob on ob.neoplasm_histology_key =b.neoplasm_histology_key
where a.histology_characteristic ='in situ' and b.histology_characteristic = 'invasive'
and (a.neoplasm_char_temporal_status ='current' OR a.histology_char_temporal_status ='current') and (b.neoplasm_char_temporal_status ='current' OR b.histology_char_temporal_status ='current')
group by oa.neoplasm_type , oa.histology , ob.neoplasm_type , ob.histology
order by count(1) desc
;
Here are some top diagnoses, the left associated with ‘in situ’, the right one associated with ‘invasive’ characteristic.
neoplasm_type | histology | neoplasm_type | histology | count |
[NULL] | ductal carcinoma | [NULL] | ductal carcinoma | 475,240 |
[NULL] | ductal carcinoma | mammary | ductal carcinoma | 204,608 |
[NULL] | ductal carcinoma | [NULL] | carcinoma | 148,016 |
mammary | ductal carcinoma | [NULL] | ductal carcinoma | 42,496 |
[NULL] | lobular carcinoma | [NULL] | lobular carcinoma | 41,146 |
[NULL] | lobular carcinoma | [NULL] | ductal carcinoma | 34,411 |
mammary | ductal carcinoma | mammary | ductal carcinoma | 27,113 |
[NULL] | lobular carcinoma | [NULL] | carcinoma | 25,848 |
[NULL] | ductal carcinoma | [NULL] | lobular carcinoma | 22,801 |
[NULL] | lobular carcinoma | mammary | lobular carcinoma | 18,299 |
[NULL] | ductal carcinoma | mammary | carcinoma | 17,361 |
[NULL] | carcinoma | [NULL] | carcinoma | 16,622 |
[NULL] | carcinoma | [NULL] | ductal carcinoma | 16,081 |
[NULL] | lobular carcinoma | mammary | ductal carcinoma | 13,753 |