Summary

Level: Field check
Context: Verification
Category: Conformance
Subcategory: Relational
Severity: Fatal 💀

Description

The number and percent of records that have a duplicate value in the cdmFieldName field of the cdmTableName.

Definition

This check will make sure that all primary keys as specified in the CDM version are truly unique keys. While this issue should generally be prevented by primary key database constraints, some database management systems such as Redshift do not enforce these constraints.

  • Numerator: The number of values in the column that appear in more than 1 row
  • Denominator: The total number of rows in the table
  • Related CDM Convention(s): Primary Key flag in CDM table specs
  • CDM Fields/Tables: By default, this check runs on all primary key columns in the CDM
  • Default Threshold Value: 0%

User Guidance

Multiple values for a primary key must be corrected. Failure to have unique values for a primary key will result in incorrect results being returned for queries that use these fields. This is especially true for joins - joins on columns where multiple records are found where a single record is assumed will result in inflation of the result set (“fanning”). Also, some analytic frameworks may raise errors if more than one record is found for an entity expected to be unique.

Violated rows query

SELECT 
  '@cdmTableName.@cdmFieldName' AS violating_field,  
  cdmTable.*,
  COUNT_BIG(*) OVER (PARTITION BY @cdmTableName.@cdmFieldName) AS dupe_count
FROM @cdmDatabaseSchema.@cdmTableName
WHERE dupe_count > 1
ORDER BY dupe_count DESC;

ETL Developers

In some cases, a primary key error could arise from a 1:1 relationship modeled in the CDM that is modeled as a 1:n relationship in the source system. For example, a single person could have multiple patient identifiers in a source system. In most cases the multiple records need to be collapsed into a single record.

Deduplication and merging of duplicate patient datasets is a non-trivial process, and the intent of the multiple patient records needs be ascertained prior to making design decisions. For example, multiple records could exist for the same patient in a claims system who was covered by the insurer during one period as a member of a first group and then later re-entered the system as new member of a different group (e.g. new employer). In other cases multiple records could indicate updates to the original record and the latest record could be considered the “correct” information.

Data Users

Whenever possible, the ETL developer / data provider should be involved in resolving a primary key error as this represents a critical failure in the ETL process. Depending on the nature of the error, you may be able to remove duplicate rows from a table to resolve the error; however, proceed at your own risk as these duplicates could be the sign of a deeper issue that needs to be resolved further upstream.