vignettes/checks/plausibleValueHigh.Rmd
plausibleValueHigh.Rmd
Level: FIELD
Context: Verification
Category: Plausibility
Subcategory: Atemporal
Severity: Characterization ✔
The number and percent of records with a value in the @cdmFieldName field of the @cdmTableName table greater than @plausibleValueHigh.
PERSON.day_of_birth
(compared to 31)PERSON.month_of_birth
(compared to 12)PERSON.year_of_birth
(compared to this year + 1
year)DRUG_EXPOSURE.refills
(compared to 24)DRUG_EXPOSURE.days_supply
(compared to 365)DRUG_EXPOSURE.quantity
(compared to 1095)This check counts the number of records that have a value in the specified field that is higher than some expected value. Failures of this check might represent true data anomalies, but especially in the case when the failure percentage is high, something may be afoot in the ETL pipeline.
Use this query to inspect rows with an implausibly high value:
The investigation approach may differ by the field being checked. For
example, for CONDITION_OCURRENCE.condition_start_date
you
might look how much it differs in average, to find a clue as to what
happened:
SELECT
MEDIAN(DATEDIFF(day, condition_start_date, current_date))
FROM condition_occurrence
WHERE condition_start_date > current_date
;
Or the discrepancy be associated with specific attributes:
SELECT
co.condition_concept_id,
co.condition_type_concept_id,
co.condition_status_concept_id,
COUNT(1)
FROM condition_occurrence co
WHERE condition_start_date > current_date
GROUP BY co.condition_concept_id, co.condition_type_concept_id, co.condition_status_concept_id
ORDER BY COUNT(1) DESC
;
There might be several different causes of future dates: typos in the source data, wrong data format used in the conversion, timezone issues in the ETL environment and/or database, etc.
For the DRUG_EXPOSURE
values, there might be be typos,
data processing bugs (for example, if days supply is calculated), or
rare true cases when a prescription deviated from standard industry
practices.
If the issue is determined to be related to ETL logic, it must be
fixed. If it’s a source data issue, work with your data partners and
users to determine the best remediation approach. PERSON
rows with invalid birth dates should be removed from the CDM, as any
analysis relying on age will be negatively impacted. Other implausible
values should be explainable based on your understanding of the source
data if they are to be retained. In some cases event rows may need to be
dropped from the CDM if the implausible value is unexplainable and could
cause downstream quality issues. Be sure to clearly document any data
removal logic in your ETL specification.
The implication of a failure of this check depends on the count of errors and your need for the impacted columns. If it’s a small count, it might just be noise in the data which will unlikely impact an analysis. If the count is large, however, proceed carefully - events with future dates will likely be excluded from your analysis, and drugs with inflated supply values could throw off any analysis considering duration or patterns of treatment.