Module 6 Theory — SDTM Programming with sdtm.oak
🎯 Learning Objectives
By the end of this module, you will:
- Understand CDISC SDTM standards and domain structures
- Learn to use the sdtm.oak package for SDTM data creation
- Master metadata reading and domain specification
- Create standardized SDTM domains (DM, AE, VS, LB)
- Export datasets to XPT format for regulatory submission
- Use GitHub Copilot in RStudio to assist with SDTM programming
📚 1. Understanding CDISC SDTM
What is SDTM?
The Study Data Tabulation Model (SDTM) is the CDISC standard for organizing and formatting clinical trial data for regulatory submission. It defines:
- Domain Structure: How data should be organized (DM, AE, VS, LB, etc.)
- Variable Names and Labels: Standardized naming conventions
- Controlled Terminology: Consistent coding of values
- Data Relationships: How domains connect via keys like USUBJID
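To make the last point concrete, here is a minimal base R sketch of how USUBJID ties domains together. The two small data frames (dm, ae) and their values are made up for illustration. The check lists AE subjects that have no DM record, and the merge attaches a DM variable to each event.

```r
# Hypothetical miniature domains; values are illustrative only
dm <- data.frame(
  USUBJID = c("001-001", "001-002"),
  ARMCD   = c("PBO", "TRT"),
  stringsAsFactors = FALSE
)
ae <- data.frame(
  USUBJID = c("001-001", "001-002", "001-009"),
  AETERM  = c("HEADACHE", "FATIGUE", "NAUSEA"),
  stringsAsFactors = FALSE
)

# Referential check: every AE subject should also appear in DM
orphan_subjects <- setdiff(ae$USUBJID, dm$USUBJID)

# Attach the planned arm to each adverse event via the shared key
ae_with_arm <- merge(ae, dm, by = "USUBJID", all.x = TRUE)
```

Here orphan_subjects contains "001-009", the one AE subject missing from DM, and that subject's ARMCD is NA after the merge.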
Core SDTM Domains
| Domain | Description | Key Variables |
|---|---|---|
| DM | Demographics | USUBJID, AGE, SEX, RACE, ARMCD |
| AE | Adverse Events | USUBJID, AETERM, AEDECOD, AESTDTC, AESEV |
| VS | Vital Signs | USUBJID, VSTESTCD, VSORRES, VSSTRESC, VSDTC |
| LB | Laboratory | USUBJID, LBTESTCD, LBORRES, LBSTRESC, LBDTC |
| EX | Exposure | USUBJID, EXTRT, EXDOSE, EXSTDTC, EXENDTC |
🛠️ 2. Introduction to sdtm.oak
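The sdtm.oak package provides tools for creating SDTM datasets in R. Its focus is the mapping step, offering:
- Controlled Terminology Handling: Read a controlled terminology specification and recode collected values against it
- Mapping Algorithms: Reusable functions that map raw variables to SDTM target variables
- Derivations: Helpers for common derived values such as ISO 8601 dates, sequence numbers, and study days
Compliance checks and XPT export are not part of sdtm.oak; this module handles them with custom validation functions and haven::write_xpt.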
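As a rough illustration of what metadata-driven mapping means, the sketch below recodes collected values through a small controlled-terminology lookup. It is plain base R rather than the sdtm.oak API, and the names ct_sex and recode_ct are hypothetical.

```r
# Hypothetical controlled-terminology lookup: collected value -> submission value
ct_sex <- c("Female" = "F", "Male" = "M", "Unknown" = "U")

# Recode collected values; anything not in the lookup becomes NA so it can be reviewed
recode_ct <- function(collected, lookup) {
  unname(lookup[collected])
}

raw_sex  <- c("Female", "Male", "female", "Unknown")
sex_sdtm <- recode_ct(raw_sex, ct_sex)

# Collected values that did not match any term in the lookup
unmapped <- raw_sex[is.na(sex_sdtm)]
```

Returning NA for unmatched values (here the lowercase "female") keeps mapping gaps visible instead of silently passing raw text into the SDTM variable.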
Installation and Setup
# Install sdtm.oak from CRAN
install.packages("sdtm.oak")
# or install the development version from GitHub
# remotes::install_github("pharmaverse/sdtm.oak")
library(sdtm.oak)
library(dplyr)
library(stringr) # str_extract(), str_to_upper(), str_detect()
library(lubridate) # ymd()
library(haven)
library(tibble)
🏗️ 3. Creating Demographics (DM) Domain
Basic DM Domain Structure
# Sample raw demographics data
raw_demo <- tibble(
subject_id = c("001-001", "001-002", "001-003", "001-004"),
age = c(25, 45, 67, 52),
sex = c("F", "M", "F", "M"),
race = c("WHITE", "BLACK", "ASIAN", "WHITE"),
treatment_arm = c("Placebo", "Active", "Placebo", "Active"),
ref_start_date = c("2024-01-15", "2024-01-16", "2024-01-17", "2024-01-18")
)
# Create SDTM DM domain
dm_sdtm <- raw_demo %>%
transmute(
STUDYID = "STUDY001",
DOMAIN = "DM",
USUBJID = subject_id,
SUBJID = str_extract(subject_id, "\\d+$"),
RFSTDTC = ref_start_date,
RFXSTDTC = ref_start_date,
RFXENDTC = "", # To be populated when known
RFPENDTC = "", # To be populated when known
DTHDTC = "", # Date of death if applicable
DTHFL = "", # Death flag
SITEID = str_extract(subject_id, "^\\d+"), # Site prefix without the trailing hyphen
AGE = age,
AGEU = "YEARS",
SEX = sex,
RACE = race,
ARMCD = case_when(
treatment_arm == "Placebo" ~ "PBO",
treatment_arm == "Active" ~ "TRT",
TRUE ~ ""
),
ARM = treatment_arm,
ACTARMCD = ARMCD, # Actual arm (same as planned for now)
ACTARM = ARM
) %>%
# Add collection-date variables (permissible in DM, not required)
mutate(
DMDTC = "", # Date of demographics collection
DMDY = NA_real_ # Study day of demographics collection
)
print("SDTM DM Domain:")
print(dm_sdtm)
🚨 4. Creating Adverse Events (AE) Domain
AE Domain with Study Days
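SDTM study days have no day 0: dates on or after RFSTDTC count up from day 1, and dates before it count back from day -1. A minimal base R sketch of that rule follows, using a hypothetical helper named study_day that only handles complete YYYY-MM-DD dates.

```r
# Hypothetical helper: SDTM study day relative to a reference start date
study_day <- function(dtc, ref_dtc) {
  date <- as.Date(dtc, format = "%Y-%m-%d")
  ref  <- as.Date(ref_dtc, format = "%Y-%m-%d")
  diff_days <- as.numeric(date - ref)
  # No day 0: the reference date is day 1, the day before it is day -1
  ifelse(diff_days >= 0, diff_days + 1, diff_days)
}

# Reference date, five days later, one day earlier, and a missing date
days <- study_day(c("2024-01-15", "2024-01-20", "2024-01-14", NA), "2024-01-15")
```

Missing or partial dates fail to parse and come back as NA rather than as a misleading number.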
# Sample raw AE data
raw_ae <- tibble(
subject_id = c("001-001", "001-001", "001-002", "001-003"),
ae_term = c("Headache", "Nausea", "Fatigue", "Dizziness"),
ae_start_date = c("2024-01-20", "2024-01-25", "2024-01-18", "2024-01-22"),
ae_end_date = c("2024-01-22", "2024-01-26", "2024-01-20", "2024-01-23"),
severity = c("MILD", "MODERATE", "MILD", "SEVERE"),
serious = c("N", "N", "N", "Y"),
outcome = c("RECOVERED", "RECOVERED", "RECOVERED", "ONGOING")
)
# Create SDTM AE domain
ae_sdtm <- raw_ae %>%
left_join(dm_sdtm %>% select(USUBJID, RFSTDTC),
by = c("subject_id" = "USUBJID")) %>%
transmute(
STUDYID = "STUDY001",
DOMAIN = "AE",
USUBJID = subject_id,
AESEQ = row_number(),
AETERM = str_to_upper(ae_term),
AEDECOD = AETERM, # In real scenario, map to MedDRA
AEBODSYS = case_when( # Body system (simplified mapping)
str_detect(AETERM, "HEADACHE|DIZZINESS") ~ "NERVOUS SYSTEM DISORDERS",
str_detect(AETERM, "NAUSEA") ~ "GASTROINTESTINAL DISORDERS",
str_detect(AETERM, "FATIGUE") ~ "GENERAL DISORDERS",
TRUE ~ "OTHER"
),
AESEV = severity,
AESER = serious,
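AEOUT = case_when( # Map raw outcomes to CDISC controlled terminology
outcome == "RECOVERED" ~ "RECOVERED/RESOLVED",
outcome == "ONGOING" ~ "NOT RECOVERED/NOT RESOLVED",
TRUE ~ outcome
),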
AESTDTC = ae_start_date,
AEENDTC = ae_end_date,
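# Calculate study days: SDTM has no day 0, so add 1 only on or after RFSTDTC
AESTDY = case_when(
ymd(ae_start_date) >= ymd(RFSTDTC) ~
as.numeric(ymd(ae_start_date) - ymd(RFSTDTC)) + 1,
ymd(ae_start_date) < ymd(RFSTDTC) ~
as.numeric(ymd(ae_start_date) - ymd(RFSTDTC)),
TRUE ~ NA_real_ # Missing or unparseable dates
),
AEENDY = case_when(
ymd(ae_end_date) >= ymd(RFSTDTC) ~
as.numeric(ymd(ae_end_date) - ymd(RFSTDTC)) + 1,
ymd(ae_end_date) < ymd(RFSTDTC) ~
as.numeric(ymd(ae_end_date) - ymd(RFSTDTC)),
TRUE ~ NA_real_
)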
)
print("SDTM AE Domain:")
print(ae_sdtm)
🔬 5. Creating Vital Signs (VS) Domain
VS Domain Example
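A --SEQ variable such as VSSEQ only has to be unique within a subject, and a common convention is to number records per subject after sorting by the domain keys. The base R sketch below shows that convention on a hypothetical data frame. The sort order is an assumption for illustration, not something this module prescribes.

```r
# Hypothetical VS records in arbitrary input order
vs <- data.frame(
  USUBJID  = c("001-002", "001-001", "001-001", "001-002"),
  VSTESTCD = c("SYSBP", "SYSBP", "DIABP", "DIABP"),
  VSDTC    = c("2024-01-15", "2024-01-15", "2024-01-15", "2024-01-15"),
  stringsAsFactors = FALSE
)

# Sort by subject, test code, and date so the numbering is reproducible
vs <- vs[order(vs$USUBJID, vs$VSTESTCD, vs$VSDTC), ]
rownames(vs) <- NULL

# Number records 1, 2, ... within each subject
vs$VSSEQ <- ave(seq_len(nrow(vs)), vs$USUBJID, FUN = seq_along)
```

Sorting first means a rerun on the same data assigns the same sequence numbers, which keeps references to individual records stable.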
# Sample raw vital signs data
raw_vs <- tibble(
subject_id = rep(c("001-001", "001-002"), each = 6),
visit = rep(c("Baseline", "Week 2", "Week 4"), each = 2, times = 2),
test = rep(c("Systolic BP", "Diastolic BP"), times = 6),
result = c(120, 80, 118, 78, 122, 82, 135, 85, 130, 82, 128, 80),
test_date = rep(c("2024-01-15", "2024-01-29", "2024-02-12"), each = 2, times = 2)
)
# Create SDTM VS domain
vs_sdtm <- raw_vs %>%
left_join(dm_sdtm %>% select(USUBJID, RFSTDTC),
by = c("subject_id" = "USUBJID")) %>%
transmute(
STUDYID = "STUDY001",
DOMAIN = "VS",
USUBJID = subject_id,
VSSEQ = row_number(),
VSTESTCD = case_when(
test == "Systolic BP" ~ "SYSBP",
test == "Diastolic BP" ~ "DIABP",
TRUE ~ ""
),
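VSTEST = case_when( # CDISC test names matching VSTESTCD
test == "Systolic BP" ~ "Systolic Blood Pressure",
test == "Diastolic BP" ~ "Diastolic Blood Pressure",
TRUE ~ test
),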
VSCAT = "VITAL SIGNS",
VSORRES = as.character(result),
VSORRESU = "mmHg",
VSSTRESC = as.character(result),
VSSTRESN = result,
VSSTRESU = "mmHg",
VSSTAT = "",
VSREASND = "",
VSDTC = test_date,
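VSDY = case_when( # No day 0: a date one day before RFSTDTC is day -1
ymd(test_date) >= ymd(RFSTDTC) ~
as.numeric(ymd(test_date) - ymd(RFSTDTC)) + 1,
ymd(test_date) < ymd(RFSTDTC) ~
as.numeric(ymd(test_date) - ymd(RFSTDTC)),
TRUE ~ NA_real_
),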
VISIT = visit,
VISITNUM = case_when(
visit == "Baseline" ~ 1,
visit == "Week 2" ~ 2,
visit == "Week 4" ~ 3,
TRUE ~ NA_real_
)
)
print("SDTM VS Domain:")
print(vs_sdtm)
📊 6. Data Validation and Quality Checks
Basic SDTM Validation
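Because --DTC variables hold ISO 8601 text, one useful extra check is whether each value is a real date. The sketch below accepts only blanks and complete YYYY-MM-DD dates. SDTM also allows partial dates and times, which this hypothetical helper (invalid_dtc) deliberately does not handle.

```r
# Hypothetical helper: flag --DTC values that are neither blank nor a valid YYYY-MM-DD date
invalid_dtc <- function(dtc) {
  is_blank  <- is.na(dtc) | dtc == ""
  has_shape <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", dtc)
  # Parse only well-shaped strings; impossible dates such as 2024-02-30 parse to NA
  parsed <- as.Date(ifelse(has_shape, dtc, NA_character_), format = "%Y-%m-%d")
  !is_blank & (!has_shape | is.na(parsed))
}

flags <- invalid_dtc(c("2024-01-15", "", "15JAN2024", "2024-02-30", NA))
```

Only the third and fourth values are flagged: one has the wrong layout and the other is not a calendar date.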
# Function to validate SDTM domain structure
validate_sdtm_domain <- function(data, domain_type) {
# Common required variables for all domains
common_vars <- c("STUDYID", "DOMAIN", "USUBJID")
# Domain-specific required variables
domain_vars <- switch(domain_type,
"DM" = c(common_vars, "SUBJID", "RFSTDTC", "AGE", "SEX"),
"AE" = c(common_vars, "AESEQ", "AETERM", "AESTDTC"),
"VS" = c(common_vars, "VSSEQ", "VSTESTCD", "VSORRES"),
common_vars
)
# Check for required variables
missing_vars <- setdiff(domain_vars, names(data))
# Check for duplicate keys
key_vars <- switch(domain_type,
"DM" = "USUBJID",
"AE" = c("USUBJID", "AESEQ"),
"VS" = c("USUBJID", "VSTESTCD", "VSDTC"),
"USUBJID"
)
duplicates <- if (all(key_vars %in% names(data))) {
data %>%
group_by(across(all_of(key_vars))) %>%
filter(n() > 1) %>%
nrow()
} else {
NA_integer_ # Key variables are missing, so duplicates cannot be checked
}
# Return validation report
list(
domain = domain_type,
missing_required_vars = missing_vars,
duplicate_records = duplicates,
total_records = nrow(data),
validation_passed = length(missing_vars) == 0 && isTRUE(duplicates == 0)
)
}
# Validate our domains
dm_validation <- validate_sdtm_domain(dm_sdtm, "DM")
ae_validation <- validate_sdtm_domain(ae_sdtm, "AE")
vs_validation <- validate_sdtm_domain(vs_sdtm, "VS")
print("DM Validation:")
print(dm_validation)
📦 7. Export to XPT Format
Creating Submission-Ready Files
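Submission XPT files use the version 5 transport format, which limits variable names to 8 characters, variable labels to 40 characters, and character values to 200 bytes. A pre-export check along those lines might look like the sketch below. The function name check_xpt_v5 is hypothetical, and the name pattern (uppercase letters and digits, starting with a letter) reflects the usual submission convention as an assumption.

```r
# Hypothetical pre-export check against XPT version 5 limits
check_xpt_v5 <- function(data) {
  vars <- names(data)
  # Variable labels are read from each column's "label" attribute, if present
  labels <- vapply(data, function(x) {
    lbl <- attr(x, "label")
    if (is.null(lbl)) "" else as.character(lbl)
  }, character(1))
  # Longest non-missing character value per column, in bytes
  max_bytes <- vapply(data, function(x) {
    if (is.character(x) && any(!is.na(x))) {
      as.numeric(max(nchar(x[!is.na(x)], type = "bytes")))
    } else {
      0
    }
  }, numeric(1))

  list(
    bad_names   = vars[!grepl("^[A-Z][A-Z0-9]{0,7}$", vars)],
    long_labels = vars[nchar(labels) > 40],
    long_values = vars[max_bytes > 200]
  )
}

demo <- data.frame(USUBJID = "001-001", TREATMENTARM = "Placebo",
                   stringsAsFactors = FALSE)
attr(demo$USUBJID, "label") <- "Unique Subject Identifier"
res <- check_xpt_v5(demo)
```

In this example res$bad_names contains "TREATMENTARM", which is longer than 8 characters, while the other two lists are empty.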
# Export SDTM domains to XPT format for regulatory submission
export_sdtm_to_xpt <- function(data, domain_name, output_dir = "sdtm") {
# Create output directory if it doesn't exist
if (!dir.exists(output_dir)) {
dir.create(output_dir, recursive = TRUE)
}
# Create file path
file_path <- file.path(output_dir, paste0(tolower(domain_name), ".xpt"))
# Add SDTM dataset label
attr(data, "label") <- paste("SDTM", toupper(domain_name), "Domain")
# Export to XPT version 5, the transport format used for regulatory submission
haven::write_xpt(data, file_path, version = 5)
cat("Exported", nrow(data), "records to", file_path, "\n")
return(file_path)
}
# Export all domains
dm_file <- export_sdtm_to_xpt(dm_sdtm, "dm")
ae_file <- export_sdtm_to_xpt(ae_sdtm, "ae")
vs_file <- export_sdtm_to_xpt(vs_sdtm, "vs")
🤖 8. GitHub Copilot in RStudio for SDTM Programming
Effective Copilot Prompts for SDTM
| Comment Prompt | Expected Copilot Suggestion |
|---|---|
| # Create SDTM DM domain from raw demographics | Domain structure with required variables |
| # Calculate study days for adverse events | Date arithmetic with RFSTDTC |
| # Map raw lab values to SDTM LB structure | Standard LB domain variables |
| # Validate SDTM domain for missing variables | Validation function with checks |
| # Export multiple domains to XPT format | Loop with haven::write_xpt |
SDTM Programming Best Practices with Copilot
# Good: Descriptive comment for domain creation
# Create SDTM AE domain with proper sequencing and study day calculation
# Good: Specify CDISC compliance
# Map adverse event terms to MedDRA preferred terms following CDISC conventions
# Good: Include validation requirements
# Validate AE domain for required variables and duplicate USUBJID/AESEQ combinations
✅ Summary: SDTM Programming Workflow
| Step | Task | R Tools |
|---|---|---|
| 1. Planning | Define domain structure and metadata | CDISC specifications |
| 2. Data Mapping | Map raw data to SDTM variables | dplyr, case_when |
| 3. Domain Creation | Build standardized domains | sdtm.oak, transmute |
| 4. Validation | Check compliance and quality | Custom validation functions |
| 5. Export | Create XPT files for submission | haven::write_xpt |
| 6. Documentation | Document mappings and assumptions | Comments, metadata |
🎯 Next Steps
In the demo and exercise, you'll practice:
- Creating multiple SDTM domains from raw clinical data
- Implementing proper variable mappings and derivations
- Using validation functions to ensure data quality
- Exporting domains to XPT format for regulatory submission
- Leveraging GitHub Copilot in RStudio for efficient SDTM programming