Module 3 Theory — Joins and Summaries

🔗 Module 3 — Joins & Data Summaries

🎯 Learning Objectives

By the end of this module, you will:

Understand tidy data principles and apply them to clinical datasets (R4DS Ch. 12)
Reshape clinical data using pivot_longer() and pivot_wider() for analysis and reporting
Master relational data operations with join functions for combining clinical domains (R4DS Ch. 13)
Use group_by and summarise to create summary statistics by treatment groups
Generate frequency tables using count and n functions
Practice clinical scenarios: combining DM, AE, and VS domains with tidy data principles
Handle missing data and edge cases in join operations

� 1. Tidy Data Principles for Clinical Studies (R4DS Ch. 12)

What Makes Clinical Data “Tidy”?

In clinical programming, tidy data follows three key principles:

Each variable forms a column (USUBJID, AVAL, PARAM)
Each observation forms a row (one lab result per row, one AE per row)
Each type of observational unit forms a table (DM, AE, VS are separate)

Clinical Data: Wide vs Long Format

library(tidyverse)

# Wide format (common in clinical datasets) - NOT TIDY
vital_signs_wide <- tibble(
  USUBJID = c("001-001", "001-002", "001-003"),
  SYSBP_BL = c(120, 135, 115),
  DIABP_BL = c(80, 85, 75),
  PULSE_BL = c(72, 78, 68),
  SYSBP_W4 = c(118, 130, 112),
  DIABP_W4 = c(78, 82, 72),
  PULSE_W4 = c(70, 75, 65)
)

# Long format (tidy) - BETTER for analysis
vital_signs_long <- vital_signs_wide %>%
  pivot_longer(
    cols = -USUBJID,
    names_to = c("PARAM", "VISIT"),
    names_sep = "_",
    values_to = "AVAL"
  ) %>%
  mutate(
    PARAM = case_when(
      PARAM == "SYSBP" ~ "Systolic Blood Pressure",
      PARAM == "DIABP" ~ "Diastolic Blood Pressure", 
      PARAM == "PULSE" ~ "Pulse Rate"
    ),
    VISIT = case_when(
      VISIT == "BL" ~ "Baseline",
      VISIT == "W4" ~ "Week 4"
    )
  )

Reshaping Clinical Data with `pivot_longer()` and `pivot_wider()`

# Convert wide lab data to long (CDISC BDS structure)
lab_wide <- tibble(
  USUBJID = c("001-001", "001-002", "001-003"),
  ALT_BL = c(25, 30, 28),
  AST_BL = c(22, 28, 26),
  ALT_W2 = c(27, 32, 30),
  AST_W2 = c(24, 30, 28)
)

# Make it tidy (long format)
lab_long <- lab_wide %>%
  pivot_longer(
    cols = -USUBJID,
    names_to = c("PARAM", "VISIT"),
    names_sep = "_",
    values_to = "AVAL"
  )

# Convert back to wide for reporting (if needed)
lab_summary_wide <- lab_long %>%
  group_by(USUBJID, PARAM) %>%
  summarise(
    Baseline = first(AVAL[VISIT == "BL"]),
    Week2 = first(AVAL[VISIT == "W2"]),
    Change = Week2 - Baseline,
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from = PARAM,
    values_from = c(Baseline, Week2, Change),
    names_sep = "_"
  )

🔗 2. Relational Data in Clinical Studies (R4DS Ch. 13)

Understanding Keys in Clinical Data

Every clinical database has primary keys and foreign keys that define relationships:

Primary Key: Uniquely identifies each row (USUBJID in DM, USUBJID+AESEQ in AE)
Foreign Key: Links to another table (USUBJID in AE links to USUBJID in DM)

# Check for primary key violations (should return no duplicates)
dm %>% count(USUBJID) %>% filter(n > 1)  # Should be empty
ae %>% count(USUBJID, AESEQ) %>% filter(n > 1)  # Should be empty

Common Clinical Join Scenarios:

Demographics + Adverse Events: Add subject characteristics to AE data
Demographics + Lab Values: Add baseline characteristics to lab results
ADSL + BDS domains: Add subject-level analysis variables to event data
Exposure + Efficacy: Combine treatment information with outcomes

Types of Joins (R4DS Ch. 13):

Join Type	Description	When to Use	Clinical Example
`left_join()`	Keep all rows from left dataset	Most common - preserve main dataset	Add DM to AE (keep all AEs)
`inner_join()`	Keep only matching rows	When you need complete data only	Safety population analysis
`right_join()`	Keep all rows from right dataset	Less common in clinical programming	Rarely used
`full_join()`	Keep all rows from both datasets	When you need complete picture	Data reconciliation
`semi_join()`	Keep rows in x that have match in y	Filter to matching subjects	ITT population
`anti_join()`	Keep rows in x that have NO match in y	Find missing data	Screen failures

🔧 3. Basic Join Operations

Setting Up Sample Data

library(dplyr)
library(tibble)

# Demographics data
dm <- tibble(
  USUBJID = c("001-001", "001-002", "001-003", "001-004", "001-005"),
  AGE = c(45, 67, 52, 71, 34),
  SEX = c("M", "F", "M", "F", "M"),
  ARMCD = c("TRT", "PBO", "TRT", "TRT", "PBO")
)

# Adverse Events data  
ae <- tibble(
  USUBJID = c("001-001", "001-001", "001-002", "001-004", "001-006"),
  AEDECOD = c("HEADACHE", "NAUSEA", "FATIGUE", "DIZZINESS", "HEADACHE"),
  AESEV = c("MILD", "MODERATE", "MILD", "SEVERE", "MILD")
)

`left_join()` - Most Common in Clinical Programming

# Add demographics to adverse events (preserve all AE records)
ae_with_demo <- ae %>%
  left_join(dm, by = "USUBJID")

# Result: AE data with subject characteristics added
# Note: Subject 001-006 will have NA for demo variables (not in DM)

`inner_join()` - Complete Cases Only

# Keep only AEs for subjects with complete demographics
ae_complete <- ae %>%
  inner_join(dm, by = "USUBJID")

# Result: Excludes AE for subject 001-006 (not in DM)

Filtering Joins (R4DS Ch. 13)

# semi_join: Find subjects who had any adverse events
subjects_with_ae <- dm %>%
  semi_join(ae, by = "USUBJID")

# anti_join: Find subjects who had NO adverse events  
subjects_without_ae <- dm %>%
  anti_join(ae, by = "USUBJID")

# Useful for population flags in ADSL
adsl <- dm %>%
  mutate(
    SAFFL = ifelse(USUBJID %in% ae$USUBJID, "Y", "N"),  # Safety population
    ITTFL = "Y"  # Intent-to-treat (all randomized subjects)
  )

📈 4. Group By and Summarise Operations

Basic Summarisation

# Count adverse events by treatment group
ae_summary <- ae_with_demo %>%
  group_by(ARMCD) %>%
  summarise(
    n_subjects = n_distinct(USUBJID),
    n_events = n(),
    .groups = "drop"
  )

# Count events by severity
severity_summary <- ae_with_demo %>%
  group_by(AESEV) %>%
  summarise(
    count = n(),
    percent = round(100 * n() / nrow(ae_with_demo), 1),
    .groups = "drop"
  )

Age-Based Analysis (Elderly vs Non-Elderly)

# First, create elderly flag in demographics
dm <- dm %>%
  mutate(ELDERLY = ifelse(AGE >= 65, "Y", "N"))

# Join and summarise AEs by elderly status
ae_by_elderly <- ae %>%
  left_join(dm, by = "USUBJID") %>%
  group_by(ELDERLY, AESEV) %>%
  summarise(
    n_events = n(),
    n_subjects = n_distinct(USUBJID),
    .groups = "drop"
  )

Multiple Grouping Variables

# Summarise by treatment and elderly status
complex_summary <- ae %>%
  left_join(dm, by = "USUBJID") %>%
  mutate(ELDERLY = ifelse(AGE >= 65, "Y", "N")) %>%
  group_by(ARMCD, ELDERLY, AESEV) %>%
  summarise(
    n_events = n(),
    n_subjects = n_distinct(USUBJID),
    mean_age = round(mean(AGE, na.rm = TRUE), 1),
    .groups = "drop"
  ) %>%
  arrange(ARMCD, ELDERLY, AESEV)

🔢 5. Frequency Tables with count()

The count() function is a shortcut for group_by() + summarise(n = n()):

# Simple frequency table
ae_counts <- ae_with_demo %>%
  count(AEDECOD, sort = TRUE)

# Cross-tabulation  
ae_cross_tab <- ae_with_demo %>%
  count(ARMCD, AESEV) %>%
  tidyr::pivot_wider(names_from = AESEV, values_from = n, values_fill = 0)

# Count with proportions
ae_with_prop <- ae_with_demo %>%
  count(AEDECOD) %>%
  mutate(
    percent = round(100 * n / sum(n), 1),
    label = paste0(n, " (", percent, "%)")
  )

🚨 6. Clinical Example: Adverse Events Analysis

Let’s work through a complete adverse events analysis:

# Complete AE analysis pipeline
ae_analysis <- ae %>%
  left_join(dm, by = "USUBJID") %>%
  mutate(
    ELDERLY = ifelse(AGE >= 65, "Y", "N"),
    SEVERE_AE = ifelse(AESEV == "SEVERE", "Y", "N")
  ) %>%
  group_by(ARMCD, ELDERLY) %>%
  summarise(
    n_subjects = n_distinct(USUBJID),
    n_total_aes = n(),
    n_severe_aes = sum(SEVERE_AE == "Y", na.rm = TRUE),
    severe_rate = round(100 * n_severe_aes / n_total_aes, 1),
    aes_per_subject = round(n_total_aes / n_subjects, 1),
    .groups = "drop"
  ) %>%
  arrange(desc(severe_rate))

print(ae_analysis)

⚠️ 7. Handling Missing Data in Joins

# Check for subjects in AE but not in DM
ae_subjects <- unique(ae$USUBJID)
dm_subjects <- unique(dm$USUBJID)

missing_in_dm <- setdiff(ae_subjects, dm_subjects)
missing_in_ae <- setdiff(dm_subjects, ae_subjects)

cat("Subjects in AE but not DM:", missing_in_dm, "\n")
cat("Subjects in DM but not AE:", missing_in_ae, "\n")

# Handle missing demographic data explicitly
ae_safe <- ae %>%
  left_join(dm, by = "USUBJID") %>%
  mutate(
    ELDERLY = case_when(
      is.na(AGE) ~ "Unknown",
      AGE >= 65 ~ "Y", 
      TRUE ~ "N"
    ),
    ARMCD = if_else(is.na(ARMCD), "Unknown", ARMCD)
  )

🤖 8. GitHub Copilot for Joins and Summaries

Effective Prompts:

Comment Prompt	Expected Copilot Suggestion
`# Join demographics to adverse events`	`left_join(dm, by = "USUBJID")`
`# Count AEs by treatment and severity`	`group_by(ARMCD, AESEV) %>% summarise(n = n())`
`# Calculate AE rates by elderly status`	`group_by(ELDERLY) %>% summarise(rate = n()/n_distinct(USUBJID))`
`# Create cross-tabulation of treatment by outcome`	`count(ARMCD, OUTCOME) %>% pivot_wider(...)`

📝 Module Summary

By completing this module, you should now be able to:

✅ Perform joins using left_join, inner_join, and other dplyr join functions
✅ Create summaries using group_by and summarise for clinical analysis
✅ Generate frequency tables with count() and n() functions
✅ Analyze adverse events by elderly vs non-elderly populations
✅ Handle missing data appropriately in join operations

🚀 Next Steps:

Practice with the demo exercises
Try adverse event analysis on your own data
Prepare for Module 4: Date & Text Handling

💡 Key Takeaways

left_join() is most common in clinical programming - preserves your main dataset
group_by() + summarise() creates powerful analysis summaries
count() is a shortcut for simple frequency tables
Handle missing data explicitly when joining datasets
n_distinct() counts unique subjects, n() counts records
Elderly flag analysis is a common clinical programming pattern

Ready to learn about dates and text handling? Let’s move to Module 4!