Summary
Level: Field check
Context: Validation
Category: Conformance
Subcategory: Relational
Severity: Fatal 💀
Description
The number and percent of records with a NULL value in the
cdmFieldName of the cdmTableName that
is considered not nullable
Definition
This check is meant to ensure that all NOT NULL constraints specified
in the CDM version are followed.
-
Numerator: The number of rows with a NULL value in the
column
-
Denominator: The total number of rows in the table
-
Related CDM Convention(s): “Required” flag in CDM table
specs
-
CDM Fields/Tables: By default, this check runs on all
Required fields in the CDM
-
Default Threshold Value: 0%
User Guidance
A failure in this check means that NULL values have ended up in a
column which should not contain any NULL values. There is a wide variety
of potential causes for this issue depending on the column in question;
your source data; and your ETL code. Regardless of its cause, it is
mandatory to fix the issue by ensuring there are no failures of this
check – OHDSI tools/analyses expect required columns to be non-NULL in
all rows.
Violated rows query
SELECT
'@cdmTableName.@cdmFieldName' AS violating_field,
cdmTable.*
FROM @schema.@cdmTableName cdmTable
WHERE cdmTable.@cdmFieldName IS NULL
ETL Developers
Recommended actions:
- To catch this issue further upstream, consider adding a not-null
constraint on the column in your database (if possible)
- Fill in the missing values:
- In some columns, placeholder values are acceptable to replace
missing values. For example, in rows for which there is no _source_value
or no standard concept mapping, the value 0 should be placed in the
_concept_id column
- Similarly, the CDM documentation suggests derivation/imputation
strategies for certain columns. For example, the visit_end_date column
is required but several options for deriving a placeholder are provided:
https://ohdsi.github.io/CommonDataModel/cdm54.html#VISIT_OCCURRENCE.
Consult the documentation for similar conventions on other columns
- For missing values in columns in which it is not acceptable to add a
placeholder or derived value (i.e. primary & foreign keys other than
concept IDs), there is likely a corresponding ETL error which needs to
be fixed
- If you are unable to fill in the missing value for a record
according to the CDM conventions, it is best to remove the record from
your database. It is recommended to document this action for data users,
especially if you need to do this for more than a handful of records
and/or if there is a pattern to the missing data
Data Users
This is a critical failure as it can impact joins and calculations
involving required fields which assume all values are non-NULL. Events
missing a concept, start date, or person ID will not be able to be
included in cohorts. Rows missing a primary key violate fundamental
database integrity principles and could cause a host of downstream
issues. It is also possible that some tools or analysis code have
assumptions around the availability of data in required columns which
may throw errors due to missing values.
If your data provider is unable or unwilling to address the issue and
only a small proportion of rows are affected, proceed at your own risk
with the dataset. If you do so, it is a best practice to interrogate
whether the affected rows could have played any role in your analysis.
If a large proportion of rows are affected, the dataset should not be
used until the issue is fixed.