AI-Assisted Quality Control for Clinical Programming
🔍 QC Validation Toolkit
AI-Assisted Quality Control for Clinical Programming
GitHub Copilot Integration for Automated QC Procedures
This toolkit demonstrates comprehensive QC procedures using R with GitHub Copilot assistance for clinical programming validation.
Overview
Quality Control (QC) is a critical component of clinical programming that ensures data integrity, regulatory compliance, and statistical accuracy. This toolkit provides:
Automated data validation procedures
AI-assisted error detection using GitHub Copilot
Standardized QC report generation
Cross-domain validation checks
Required Libraries
library(dplyr)library(tibble)library(stringr)library(lubridate)library(tidyr)library(purrr)cat("=== AI-Assisted Clinical Data Quality Control ===\n")
=== AI-Assisted Clinical Data Quality Control ===
cat("Comprehensive QC procedures with GitHub Copilot integration\n\n")
Comprehensive QC procedures with GitHub Copilot integration
Mock Clinical Data with Intentional Issues
We’ll create test data with various quality issues to demonstrate our QC procedures.
set.seed(2024)# Create ADSL with various data quality issuesadsl_qc <-tibble(STUDYID ="ABC-123",USUBJID =c(paste0("ABC-123-", sprintf("%03d", 1:18)),"ABC-123-019", # Duplicate will be added"ABC-123-019"# Duplicate subject ),SUBJID =c(sprintf("%03d", 1:18), "019", "019"),AGE =c(sample(18:75, 18, replace =TRUE), 65, NA), # Missing ageSEX =c(sample(c("M", "F"), 18, replace =TRUE), "X", "F"), # Invalid sex codeRACE =c(sample(c("WHITE", "BLACK OR AFRICAN AMERICAN", "ASIAN"), 18, replace =TRUE), "white", "WHITE"), # Inconsistent caseTRT01P =c(sample(c("Placebo", "Study Drug 5mg", "Study Drug 10mg"), 18, replace =TRUE),"PLACEBO", "Study Drug 5mg"), # Inconsistent treatment nameTRT01PN =c(sample(c(0, 5, 10), 18, replace =TRUE), 0, 5),TRTSDT =c(rep(as.Date("2024-01-15"), 18),as.Date("2023-12-01"), # Treatment start before study startas.Date("2024-01-15") ),TRTEDT =c(rep(as.Date("2024-03-15"), 18),as.Date("2024-03-15"),NA# Missing treatment end date ),SAFFL =c(rep("Y", 19), ""), # Missing safety flagITTFL =c(rep("Y", 18), "N", "Y"),COMPLFL =c(rep("Y", 17), "N", "Y", "Y"),DTHFL =c(rep("N", 19), "Y"),DTHDTC =c(rep("", 19), "2024-02-28"),DTHDT =c(rep(as.Date(NA), 19), as.Date("2024-02-28")))# Display the problematic dataprint("ADSL Data with Quality Issues:")
[1] "ADSL Data with Quality Issues:"
print(adsl_qc)
# A tibble: 20 × 16
STUDYID USUBJID SUBJID AGE SEX RACE TRT01P TRT01PN TRTSDT TRTEDT
<chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <date> <date>
1 ABC-123 ABC-12… 001 19 F ASIAN Place… 5 2024-01-15 2024-03-15
2 ABC-123 ABC-12… 002 54 F ASIAN Study… 0 2024-01-15 2024-03-15
3 ABC-123 ABC-12… 003 62 M WHITE Place… 0 2024-01-15 2024-03-15
4 ABC-123 ABC-12… 004 34 M BLAC… Study… 0 2024-01-15 2024-03-15
5 ABC-123 ABC-12… 005 62 M ASIAN Place… 10 2024-01-15 2024-03-15
6 ABC-123 ABC-12… 006 49 M WHITE Study… 0 2024-01-15 2024-03-15
7 ABC-123 ABC-12… 007 58 M WHITE Place… 10 2024-01-15 2024-03-15
8 ABC-123 ABC-12… 008 75 F ASIAN Study… 0 2024-01-15 2024-03-15
9 ABC-123 ABC-12… 009 51 M WHITE Study… 5 2024-01-15 2024-03-15
10 ABC-123 ABC-12… 010 46 F BLAC… Place… 5 2024-01-15 2024-03-15
11 ABC-123 ABC-12… 011 28 F WHITE Study… 0 2024-01-15 2024-03-15
12 ABC-123 ABC-12… 012 33 F WHITE Place… 0 2024-01-15 2024-03-15
13 ABC-123 ABC-12… 013 46 F WHITE Study… 5 2024-01-15 2024-03-15
14 ABC-123 ABC-12… 014 31 F BLAC… Study… 10 2024-01-15 2024-03-15
15 ABC-123 ABC-12… 015 51 F BLAC… Place… 5 2024-01-15 2024-03-15
16 ABC-123 ABC-12… 016 43 M WHITE Place… 10 2024-01-15 2024-03-15
17 ABC-123 ABC-12… 017 61 F ASIAN Study… 5 2024-01-15 2024-03-15
18 ABC-123 ABC-12… 018 67 F WHITE Study… 10 2024-01-15 2024-03-15
19 ABC-123 ABC-12… 019 65 X white PLACE… 0 2023-12-01 2024-03-15
20 ABC-123 ABC-12… 019 NA F WHITE Study… 5 2024-01-15 NA
# ℹ 6 more variables: SAFFL <chr>, ITTFL <chr>, COMPLFL <chr>, DTHFL <chr>,
# DTHDTC <chr>, DTHDT <date>
=== Duplicate Subject Check ===
❌ ISSUE FOUND: Duplicate subjects detected
Number of duplicate records: 2
Duplicate subjects:
# A tibble: 2 × 4
USUBJID SUBJID AGE SEX
<chr> <chr> <dbl> <chr>
1 ABC-123-019 019 65 X
2 ABC-123-019 019 NA F
2. Missing Critical Variables Check
check_missing_critical <-function(data, critical_vars =c("USUBJID", "AGE", "SEX", "TRTSDT")) {cat("\n=== Missing Critical Variables Check ===\n") issues <-list()for (var in critical_vars) {if (var %in%names(data)) { missing_count <-sum(is.na(data[[var]]) | data[[var]] =="", na.rm =TRUE) missing_pct <-round(missing_count /nrow(data) *100, 2)if (!is.na(missing_count) && missing_count >0) {cat("❌", var, ":", missing_count, "missing values (", missing_pct, "%)\n")# Show which subjects have missing values missing_subjects <- data %>%filter(is.na(!!sym(var)) |!!sym(var) =="") %>%select(USUBJID, !!sym(var)) issues[[var]] <-list(count = missing_count,percentage = missing_pct,subjects = missing_subjects ) } else {cat("✅", var, ": No missing values\n") } } else {cat("⚠️", var, ": Variable not found in dataset\n") } }return(issues)}# Run missing values checkmissing_result <-check_missing_critical(adsl_qc)
=== Missing Critical Variables Check ===
✅ USUBJID : No missing values
❌ AGE : 1 missing values ( 5 %)
✅ SEX : No missing values
✅ TRTSDT : No missing values
3. Controlled Terminology Validation
check_controlled_terminology <-function(data) {cat("\n=== Controlled Terminology Validation ===\n")# Define expected controlled terminology valid_sex <-c("M", "F") valid_race <-c("WHITE", "BLACK OR AFRICAN AMERICAN", "ASIAN", "AMERICAN INDIAN OR ALASKA NATIVE", "NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER") valid_flags <-c("Y", "N", "") issues <-list()# Check SEX invalid_sex <- data %>%filter(!SEX %in% valid_sex) %>%select(USUBJID, SEX)if (nrow(invalid_sex) >0) {cat("❌ SEX: Invalid values found\n")print(invalid_sex) issues$SEX <- invalid_sex } else {cat("✅ SEX: All values valid\n") }# Check RACE (case sensitivity) invalid_race <- data %>%filter(!RACE %in% valid_race) %>%select(USUBJID, RACE)if (nrow(invalid_race) >0) {cat("❌ RACE: Invalid/inconsistent values found\n")print(invalid_race) issues$RACE <- invalid_race } else {cat("✅ RACE: All values valid\n") }# Check flags flag_vars <-c("SAFFL", "ITTFL", "COMPLFL", "DTHFL")for (flag_var in flag_vars) {if (flag_var %in%names(data)) { invalid_flags <- data %>%filter(!is.na(!!sym(flag_var)) &!(!!sym(flag_var) %in% valid_flags)) %>%select(USUBJID, !!sym(flag_var))if (nrow(invalid_flags) >0) {cat("❌", flag_var, ": Invalid flag values found\n")print(invalid_flags) issues[[flag_var]] <- invalid_flags } else {cat("✅", flag_var, ": All flag values valid\n") } } }return(issues)}# Run controlled terminology checkct_result <-check_controlled_terminology(adsl_qc)
=== Controlled Terminology Validation ===
❌ SEX: Invalid values found
# A tibble: 1 × 2
USUBJID SEX
<chr> <chr>
1 ABC-123-019 X
❌ RACE: Invalid/inconsistent values found
# A tibble: 1 × 2
USUBJID RACE
<chr> <chr>
1 ABC-123-019 white
✅ SAFFL : All flag values valid
✅ ITTFL : All flag values valid
✅ COMPLFL : All flag values valid
✅ DTHFL : All flag values valid
4. Date Logic Validation
check_date_logic <-function(data) {cat("\n=== Date Logic Validation ===\n") issues <-list() study_start <-as.Date("2024-01-01") # Define study start date# Check treatment start vs study start early_treatment <- data %>%filter(!is.na(TRTSDT) & TRTSDT < study_start) %>%select(USUBJID, TRTSDT)if (nrow(early_treatment) >0) {cat("❌ Treatment start before study start:\n")print(early_treatment) issues$early_treatment <- early_treatment } else {cat("✅ All treatment start dates valid\n") }# Check treatment end vs treatment start invalid_treatment_duration <- data %>%filter(!is.na(TRTSDT) &!is.na(TRTEDT) & TRTEDT < TRTSDT) %>%select(USUBJID, TRTSDT, TRTEDT)if (nrow(invalid_treatment_duration) >0) {cat("❌ Treatment end before treatment start:\n")print(invalid_treatment_duration) issues$invalid_duration <- invalid_treatment_duration } else {cat("✅ All treatment durations valid\n") }# Check death date logic death_logic_issues <- data %>%filter(DTHFL =="Y"& (is.na(DTHDT) | DTHDT =="")) %>%select(USUBJID, DTHFL, DTHDT)if (nrow(death_logic_issues) >0) {cat("❌ Death flag Y but missing death date:\n")print(death_logic_issues) issues$death_logic <- death_logic_issues } else {cat("✅ Death date logic consistent\n") }return(issues)}# Run date logic checkdate_result <-check_date_logic(adsl_qc)
=== Date Logic Validation ===
❌ Treatment start before study start:
# A tibble: 1 × 2
USUBJID TRTSDT
<chr> <date>
1 ABC-123-019 2023-12-01
✅ All treatment durations valid
✅ Death date logic consistent
5. Age Validation
check_age_validation <-function(data) {cat("\n=== Age Validation ===\n") issues <-list()# Check age range invalid_age <- data %>%filter(!is.na(AGE) & (AGE <18| AGE >100)) %>%select(USUBJID, AGE)if (nrow(invalid_age) >0) {cat("❌ Age outside expected range (18-100):\n")print(invalid_age) issues$age_range <- invalid_age } else {cat("✅ All ages within expected range\n") }# Age distribution summarycat("\nAge Distribution Summary:\n") age_summary <- data %>%filter(!is.na(AGE)) %>%summarise(n =n(),mean_age =round(mean(AGE), 1),median_age =median(AGE),min_age =min(AGE),max_age =max(AGE),sd_age =round(sd(AGE), 1) )print(age_summary)return(issues)}# Run age validationage_result <-check_age_validation(adsl_qc)
=== Age Validation ===
✅ All ages within expected range
Age Distribution Summary:
# A tibble: 1 × 6
n mean_age median_age min_age max_age sd_age
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 19 49.2 51 19 75 15
Cross-Domain Validation
# Mock additional domain for cross-validationadae_qc <-tibble(USUBJID =c(sample(adsl_qc$USUBJID[1:15], 10), # Some subjects from ADSL"ABC-123-999"# Subject not in ADSL ),AEDECOD =sample(c("Headache", "Nausea", "Dizziness", "Fatigue"), 11, replace =TRUE),AESEV =sample(c("MILD", "MODERATE", "SEVERE"), 11, replace =TRUE),AEREL =sample(c("RELATED", "NOT RELATED", "POSSIBLY RELATED"), 11, replace =TRUE))check_cross_domain <-function(adsl_data, ae_data) {cat("\n=== Cross-Domain Validation ===\n") issues <-list()# Check for AE subjects not in ADSL ae_subjects <-unique(ae_data$USUBJID) adsl_subjects <-unique(adsl_data$USUBJID) orphan_ae <-setdiff(ae_subjects, adsl_subjects)if (length(orphan_ae) >0) {cat("❌ AE subjects not found in ADSL:\n")cat(paste(orphan_ae, collapse =", "), "\n") issues$orphan_ae <- orphan_ae } else {cat("✅ All AE subjects found in ADSL\n") }# Check for ADSL subjects with safety flag Y but no AEs safety_subjects <- adsl_data %>%filter(SAFFL =="Y") %>%pull(USUBJID) subjects_no_ae <-setdiff(safety_subjects, ae_subjects)cat("Safety population subjects without AEs:", length(subjects_no_ae), "\n")if (length(subjects_no_ae) >5) { # Only show first 5cat("Examples:", paste(head(subjects_no_ae, 5), collapse =", "), "\n") }return(issues)}# Run cross-domain checkcross_result <-check_cross_domain(adsl_qc, adae_qc)
=== Cross-Domain Validation ===
❌ AE subjects not found in ADSL:
ABC-123-999
Safety population subjects without AEs: 9
Examples: ABC-123-005, ABC-123-010, ABC-123-011, ABC-123-013, ABC-123-015
Here are effective GitHub Copilot prompts for clinical data QC:
# Example Copilot prompts for QC programming:# 1. Generate validation function# Prompt: "Create a function to validate SDTM dates against FDA requirements"# 2. Check data completeness # Prompt: "Write R code to check missing values in clinical trial dataset by treatment group"# 3. Controlled terminology validation# Prompt: "Create validation rules for CDISC controlled terminology in R using dplyr"# 4. Cross-domain referential integrity# Prompt: "Write function to check that all subjects in AE domain exist in DM domain"# 5. Statistical outlier detection# Prompt: "Identify statistical outliers in vital signs data using R"