This article describes how to use DQD to generate only the SQL that executes all DataQualityDashboard checks, without actually executing them. There are a few main advantages of running DQD in Sql-only mode:
sqlOnlyIncrementalInsert = TRUE
and
sqlOnlyUnionCount > 1
, multiple checks are unioned
within a cte in the output SQL query to speed performance. When testing
on Spark, this resulted in a 10x or higher performance gain.
sqlOnlyUnionCount
and/or tune
database parameters such as indexing and parallelismThe new sqlOnlyIncrementalInsert
mode generates SQL
queries that will actually populate a DQD results table in your database
with the results of the checks. There are currently some differences in
the result when running these queries, compared to a normal DQD run:
sqlOnlyUnionCount
> 1, if one check
results in an error, multiple checks might fail (since the queries are
unioned in ctes).not_applicable
is not evaluated. A check
fails or passes.Running DQD with sqlOnly = TRUE
and
sqlOnlyIncrementalInsert = FALSE
will generate SQL queries
that can be run to generate the result of each DQ check, but which will
not write the results back to the database.
A few things to note:
connectionDetails
object is needed where only
the dbms
is used during SQL-only execution.
SqlRender
(see example
below).sqlOnlyUnionCount
determines the number of check sqls
to union in a single query. A smaller number gives more control and
progress information, a higher number typically gives a higher
performance. Here, 100 is used.
library(DataQualityDashboard)
# ConnectionDetails object needed for sql dialect
dbmsConnectionDetails <- DatabaseConnector::createConnectionDetails(
dbms = "sql server", # can be rendered to any dbms upon execution
pathToDriver = "/"
)
# Database parameters that are pre-filled in the written queries
# Use @-syntax if creating a template-sql at execution-time (e.g. "@cdmDatabaseSchema")
cdmDatabaseSchema <- "@cdmDatabaseSchema" # the fully qualified database schema name of the CDM
resultsDatabaseSchema <- "@resultsDatabaseSchema" # the fully qualified database schema name of the results schema (that you can write to)
writeTableName <- "@writeTableName"
sqlFolder <- "./results_sql_only"
cdmSourceName <- "Synthea"
sqlOnly <- TRUE
sqlOnlyIncrementalInsert <- TRUE # this will generate an insert SQL query for each check type that will compute check results and insert them into a database table
sqlOnlyUnionCount <- 100 # this unions up to 100 queries in each insert query
verboseMode <- TRUE
cdmVersion <- "5.4"
checkLevels <- c("TABLE", "FIELD", "CONCEPT")
tablesToExclude <- c()
checkNames <- c()
# Run DQD with sqlOnly=TRUE and sqlOnlyIncrementalInsert=TRUE. This will create a sql file for each check type in the output folder
DataQualityDashboard::executeDqChecks(
connectionDetails = dbmsConnectionDetails,
cdmDatabaseSchema = cdmDatabaseSchema,
resultsDatabaseSchema = resultsDatabaseSchema,
writeTableName = writeTableName,
cdmSourceName = cdmSourceName,
sqlOnly = sqlOnly,
sqlOnlyUnionCount = sqlOnlyUnionCount,
sqlOnlyIncrementalInsert = sqlOnlyIncrementalInsert,
outputFolder = sqlFolder,
checkLevels = checkLevels,
verboseMode = verboseMode,
cdmVersion = cdmVersion,
tablesToExclude = tablesToExclude,
checkNames = checkNames
)
After running above code, you will end up with a number of sql files in the specified output directory:
TABLE|FIELD|CONCEPT_<check_name>.sql
.ddlDqdResults.sql
with the result table creation
query.The queries can then be run in any SQL client, making sure to run
ddlDqdResults.sql
first. The order of the check queries is
not important, and can even be run in parallel. This will run the check,
and store the result in the specified writeTableName
. In
order to show this result in the DQD Dashboard Shiny app, this table has
to be exported and converted to the .json format. See below for example
code of how this can be achieved.
Below code snippet shows how you can run the generated queries on an
OMOP CDM database using OHDSI R packages, and display the results in the
DQD Dashboard. Note that this approach uses two non-exported DQD
functions (.summarizeResults
,
.writeResultsToJson
) that are not tested for this purpose.
In the future we plan to expand support for incremental-insert mode with
a more robust set of public functions. Please reach out with feedback on
our GitHub
page if you’d like to have input on the development of this new
feature!
library(DatabaseConnector)
cdmSourceName <- "<YourSourceName>"
sqlFolder <- "./results_sql_only"
jsonOutputFolder <- sqlFolder
jsonOutputFile <- "sql_only_results.json"
dbms <- Sys.getenv("DBMS")
server <- Sys.getenv("DB_SERVER")
port <- Sys.getenv("DB_PORT")
user <- Sys.getenv("DB_USER")
password <- Sys.getenv("DB_PASSWORD")
pathToDriver <- Sys.getenv("PATH_TO_DRIVER")
connectionDetails <- DatabaseConnector::createConnectionDetails(
dbms = dbms,
server = server,
port = port,
user = user,
password = password,
pathToDriver = pathToDriver
)
cdmDatabaseSchema <- '<YourCdmSchemaName>'
resultsDatabaseSchema <- '<YourResultsSchemaName>'
writeTableName <- 'dqd_results' # or whatever you want to name your results table
c <- DatabaseConnector::connect(connectionDetails)
# Create results table
ddlFile <- file.path(sqlFolder, "ddlDqdResults.sql")
DatabaseConnector::renderTranslateExecuteSql(
connection = c,
sql = readChar(ddlFile, file.info(ddlFile)$size),
resultsDatabaseSchema = resultsDatabaseSchema,
writeTableName = writeTableName
)
# Run checks
dqdSqlFiles <- Sys.glob(file.path(sqlFolder, "*.sql"))
for (dqdSqlFile in dqdSqlFiles) {
if (dqdSqlFile == ddlFile) {
next
}
print(dqdSqlFile)
tryCatch(
expr = {
DatabaseConnector::renderTranslateExecuteSql(
connection = c,
sql = readChar(dqdSqlFile, file.info(dqdSqlFile)$size),
cdmDatabaseSchema = cdmDatabaseSchema,
resultsDatabaseSchema = resultsDatabaseSchema,
writeTableName = writeTableName
)
},
error = function(e) {
print(sprintf("Writing table failed for check %s with error %s", dqdSqlFile, e$message))
}
)
}
# Extract results table to JSON file for viewing or secondary use
DataQualityDashboard::writeDBResultsToJson(
c,
connectionDetails,
resultsDatabaseSchema,
cdmDatabaseSchema,
writeTableName,
jsonOutputFolder,
jsonOutputFile
)
jsonFilePath <- R.utils::getAbsolutePath(file.path(jsonOutputFolder, jsonOutputFile))
DataQualityDashboard::viewDqDashboard(jsonFilePath)