Module 3 Theory β Joins and Summaries
π Module 3 β Joins & Data Summaries
π― Learning Objectives
By the end of this module, you will:
- Understand tidy data principles and apply them to clinical datasets (R4DS Ch. 12)
- Reshape clinical data using
pivot_longer()andpivot_wider()for analysis and reporting - Master relational data operations with join functions for combining clinical domains (R4DS Ch. 13)
- Use
group_byandsummariseto create summary statistics by treatment groups - Generate frequency tables using
countandnfunctions
- Practice clinical scenarios: combining DM, AE, and VS domains with tidy data principles
- Handle missing data and edge cases in join operations
οΏ½ 1. Tidy Data Principles for Clinical Studies (R4DS Ch. 12)
What Makes Clinical Data βTidyβ?
In clinical programming, tidy data follows three key principles:
- Each variable forms a column (USUBJID, AVAL, PARAM)
- Each observation forms a row (one lab result per row, one AE per row)
- Each type of observational unit forms a table (DM, AE, VS are separate)
Clinical Data: Wide vs Long Format
library(tidyverse)
# Wide format (common in clinical datasets) - NOT TIDY
vital_signs_wide <- tibble(
USUBJID = c("001-001", "001-002", "001-003"),
SYSBP_BL = c(120, 135, 115),
DIABP_BL = c(80, 85, 75),
PULSE_BL = c(72, 78, 68),
SYSBP_W4 = c(118, 130, 112),
DIABP_W4 = c(78, 82, 72),
PULSE_W4 = c(70, 75, 65)
)
# Long format (tidy) - BETTER for analysis
vital_signs_long <- vital_signs_wide %>%
pivot_longer(
cols = -USUBJID,
names_to = c("PARAM", "VISIT"),
names_sep = "_",
values_to = "AVAL"
) %>%
mutate(
PARAM = case_when(
PARAM == "SYSBP" ~ "Systolic Blood Pressure",
PARAM == "DIABP" ~ "Diastolic Blood Pressure",
PARAM == "PULSE" ~ "Pulse Rate"
),
VISIT = case_when(
VISIT == "BL" ~ "Baseline",
VISIT == "W4" ~ "Week 4"
)
)Reshaping Clinical Data with pivot_longer() and pivot_wider()
# Convert wide lab data to long (CDISC BDS structure)
lab_wide <- tibble(
USUBJID = c("001-001", "001-002", "001-003"),
ALT_BL = c(25, 30, 28),
AST_BL = c(22, 28, 26),
ALT_W2 = c(27, 32, 30),
AST_W2 = c(24, 30, 28)
)
# Make it tidy (long format)
lab_long <- lab_wide %>%
pivot_longer(
cols = -USUBJID,
names_to = c("PARAM", "VISIT"),
names_sep = "_",
values_to = "AVAL"
)
# Convert back to wide for reporting (if needed)
lab_summary_wide <- lab_long %>%
group_by(USUBJID, PARAM) %>%
summarise(
Baseline = first(AVAL[VISIT == "BL"]),
Week2 = first(AVAL[VISIT == "W2"]),
Change = Week2 - Baseline,
.groups = "drop"
) %>%
pivot_wider(
names_from = PARAM,
values_from = c(Baseline, Week2, Change),
names_sep = "_"
)π 2. Relational Data in Clinical Studies (R4DS Ch. 13)
Understanding Keys in Clinical Data
Every clinical database has primary keys and foreign keys that define relationships:
- Primary Key: Uniquely identifies each row (USUBJID in DM, USUBJID+AESEQ in AE)
- Foreign Key: Links to another table (USUBJID in AE links to USUBJID in DM)
# Check for primary key violations (should return no duplicates)
dm %>% count(USUBJID) %>% filter(n > 1) # Should be empty
ae %>% count(USUBJID, AESEQ) %>% filter(n > 1) # Should be emptyCommon Clinical Join Scenarios:
- Demographics + Adverse Events: Add subject characteristics to AE data
- Demographics + Lab Values: Add baseline characteristics to lab results
- ADSL + BDS domains: Add subject-level analysis variables to event data
- Exposure + Efficacy: Combine treatment information with outcomes
Types of Joins (R4DS Ch. 13):
| Join Type | Description | When to Use | Clinical Example |
|---|---|---|---|
left_join() |
Keep all rows from left dataset | Most common - preserve main dataset | Add DM to AE (keep all AEs) |
inner_join() |
Keep only matching rows | When you need complete data only | Safety population analysis |
right_join() |
Keep all rows from right dataset | Less common in clinical programming | Rarely used |
full_join() |
Keep all rows from both datasets | When you need complete picture | Data reconciliation |
semi_join() |
Keep rows in x that have match in y | Filter to matching subjects | ITT population |
anti_join() |
Keep rows in x that have NO match in y | Find missing data | Screen failures |
π§ 3. Basic Join Operations
Setting Up Sample Data
library(dplyr)
library(tibble)
# Demographics data
dm <- tibble(
USUBJID = c("001-001", "001-002", "001-003", "001-004", "001-005"),
AGE = c(45, 67, 52, 71, 34),
SEX = c("M", "F", "M", "F", "M"),
ARMCD = c("TRT", "PBO", "TRT", "TRT", "PBO")
)
# Adverse Events data
ae <- tibble(
USUBJID = c("001-001", "001-001", "001-002", "001-004", "001-006"),
AEDECOD = c("HEADACHE", "NAUSEA", "FATIGUE", "DIZZINESS", "HEADACHE"),
AESEV = c("MILD", "MODERATE", "MILD", "SEVERE", "MILD")
)left_join() - Most Common in Clinical Programming
# Add demographics to adverse events (preserve all AE records)
ae_with_demo <- ae %>%
left_join(dm, by = "USUBJID")
# Result: AE data with subject characteristics added
# Note: Subject 001-006 will have NA for demo variables (not in DM)inner_join() - Complete Cases Only
# Keep only AEs for subjects with complete demographics
ae_complete <- ae %>%
inner_join(dm, by = "USUBJID")
# Result: Excludes AE for subject 001-006 (not in DM)Filtering Joins (R4DS Ch. 13)
# semi_join: Find subjects who had any adverse events
subjects_with_ae <- dm %>%
semi_join(ae, by = "USUBJID")
# anti_join: Find subjects who had NO adverse events
subjects_without_ae <- dm %>%
anti_join(ae, by = "USUBJID")
# Useful for population flags in ADSL
adsl <- dm %>%
mutate(
SAFFL = ifelse(USUBJID %in% ae$USUBJID, "Y", "N"), # Safety population
ITTFL = "Y" # Intent-to-treat (all randomized subjects)
)π 4. Group By and Summarise Operations
Basic Summarisation
# Count adverse events by treatment group
ae_summary <- ae_with_demo %>%
group_by(ARMCD) %>%
summarise(
n_subjects = n_distinct(USUBJID),
n_events = n(),
.groups = "drop"
)
# Count events by severity
severity_summary <- ae_with_demo %>%
group_by(AESEV) %>%
summarise(
count = n(),
percent = round(100 * n() / nrow(ae_with_demo), 1),
.groups = "drop"
)Age-Based Analysis (Elderly vs Non-Elderly)
# First, create elderly flag in demographics
dm <- dm %>%
mutate(ELDERLY = ifelse(AGE >= 65, "Y", "N"))
# Join and summarise AEs by elderly status
ae_by_elderly <- ae %>%
left_join(dm, by = "USUBJID") %>%
group_by(ELDERLY, AESEV) %>%
summarise(
n_events = n(),
n_subjects = n_distinct(USUBJID),
.groups = "drop"
)Multiple Grouping Variables
# Summarise by treatment and elderly status
complex_summary <- ae %>%
left_join(dm, by = "USUBJID") %>%
mutate(ELDERLY = ifelse(AGE >= 65, "Y", "N")) %>%
group_by(ARMCD, ELDERLY, AESEV) %>%
summarise(
n_events = n(),
n_subjects = n_distinct(USUBJID),
mean_age = round(mean(AGE, na.rm = TRUE), 1),
.groups = "drop"
) %>%
arrange(ARMCD, ELDERLY, AESEV)π’ 5. Frequency Tables with count()
The count() function is a shortcut for group_by() + summarise(n = n()):
# Simple frequency table
ae_counts <- ae_with_demo %>%
count(AEDECOD, sort = TRUE)
# Cross-tabulation
ae_cross_tab <- ae_with_demo %>%
count(ARMCD, AESEV) %>%
tidyr::pivot_wider(names_from = AESEV, values_from = n, values_fill = 0)
# Count with proportions
ae_with_prop <- ae_with_demo %>%
count(AEDECOD) %>%
mutate(
percent = round(100 * n / sum(n), 1),
label = paste0(n, " (", percent, "%)")
)π¨ 6. Clinical Example: Adverse Events Analysis
Letβs work through a complete adverse events analysis:
# Complete AE analysis pipeline
ae_analysis <- ae %>%
left_join(dm, by = "USUBJID") %>%
mutate(
ELDERLY = ifelse(AGE >= 65, "Y", "N"),
SEVERE_AE = ifelse(AESEV == "SEVERE", "Y", "N")
) %>%
group_by(ARMCD, ELDERLY) %>%
summarise(
n_subjects = n_distinct(USUBJID),
n_total_aes = n(),
n_severe_aes = sum(SEVERE_AE == "Y", na.rm = TRUE),
severe_rate = round(100 * n_severe_aes / n_total_aes, 1),
aes_per_subject = round(n_total_aes / n_subjects, 1),
.groups = "drop"
) %>%
arrange(desc(severe_rate))
print(ae_analysis)β οΈ 7. Handling Missing Data in Joins
# Check for subjects in AE but not in DM
ae_subjects <- unique(ae$USUBJID)
dm_subjects <- unique(dm$USUBJID)
missing_in_dm <- setdiff(ae_subjects, dm_subjects)
missing_in_ae <- setdiff(dm_subjects, ae_subjects)
cat("Subjects in AE but not DM:", missing_in_dm, "\n")
cat("Subjects in DM but not AE:", missing_in_ae, "\n")
# Handle missing demographic data explicitly
ae_safe <- ae %>%
left_join(dm, by = "USUBJID") %>%
mutate(
ELDERLY = case_when(
is.na(AGE) ~ "Unknown",
AGE >= 65 ~ "Y",
TRUE ~ "N"
),
ARMCD = if_else(is.na(ARMCD), "Unknown", ARMCD)
)π€ 8. GitHub Copilot for Joins and Summaries
Effective Prompts:
| Comment Prompt | Expected Copilot Suggestion |
|---|---|
# Join demographics to adverse events |
left_join(dm, by = "USUBJID") |
# Count AEs by treatment and severity |
group_by(ARMCD, AESEV) %>% summarise(n = n()) |
# Calculate AE rates by elderly status |
group_by(ELDERLY) %>% summarise(rate = n()/n_distinct(USUBJID)) |
# Create cross-tabulation of treatment by outcome |
count(ARMCD, OUTCOME) %>% pivot_wider(...) |
π Module Summary
By completing this module, you should now be able to:
β
Perform joins using left_join, inner_join, and other dplyr join functions
β
Create summaries using group_by and summarise for clinical analysis
β
Generate frequency tables with count() and n() functions
β
Analyze adverse events by elderly vs non-elderly populations
β
Handle missing data appropriately in join operations
π Next Steps:
- Practice with the demo exercises
- Try adverse event analysis on your own data
- Prepare for Module 4: Date & Text Handling
π‘ Key Takeaways
- left_join() is most common in clinical programming - preserves your main dataset
- group_by() + summarise() creates powerful analysis summaries
- count() is a shortcut for simple frequency tables
- Handle missing data explicitly when joining datasets
- n_distinct() counts unique subjects, n() counts records
- Elderly flag analysis is a common clinical programming pattern
Ready to learn about dates and text handling? Letβs move to Module 4!