Module 2 Theory β€” Data Manipulation

πŸ”„ Module 2 β€” Data Transformation & Manipulation

🎯 Learning Objectives

By the end of this module, you will:

  • Master the pipe operator (%>%) for readable code workflows (R4DS Ch. 18)
  • Transform data using dplyr’s core verbs: filter, select, mutate, arrange, summarise (R4DS Ch. 5)
  • Understand tibbles vs data.frames and work with modern data structures (R4DS Ch. 10)
  • Import clinical data from Excel, CSV, and SAS files into R (R4DS Ch. 11)
  • Apply clinical programming patterns with CDISC SDTM/ADAM standards
  • Compare R and SAS approaches for data manipulation in clinical trials

πŸ”— 1. Pipes: The Foundation of Readable Code (R4DS Ch. 18)

The Power of %>% (Pipe Operator)

The pipe operator chains operations together, making code read like English:

library(tidyverse)  # Loads dplyr, tibble, and pipe operator

# Instead of nested functions (hard to read):
result <- arrange(select(filter(dm, AGE >= 65), USUBJID, AGE), AGE)

# Use pipes for readable workflow (left-to-right):
result <- dm %>%
  filter(AGE >= 65) %>%     # First, filter elderly subjects
  select(USUBJID, AGE) %>%  # Then, select relevant columns  
  arrange(AGE)              # Finally, sort by age

Clinical Programming Workflow with Pipes

# CDISC SDTM Demographics processing pipeline
sdtm_dm <- raw_dm %>%
  # Step 1: Data cleaning
  filter(!is.na(SUBJID)) %>%
  
  # Step 2: Standardize formats
  mutate(
    USUBJID = paste0("STUDY001-", sprintf("%03d", SUBJID)),
    RFSTDT = ymd(RFSTDTC)
  ) %>%
  
  # Step 3: Clinical derivations
  mutate(
    AGEGR1 = case_when(
      AGE < 65 ~ "< 65 years",
      AGE >= 65 ~ ">= 65 years"
    ),
    ELDERLY = ifelse(AGE >= 65, "Y", "N")
  ) %>%
  
  # Step 4: Final dataset preparation
  select(STUDYID, USUBJID, AGE, SEX, RFSTDT, AGEGR1, ELDERLY) %>%
  arrange(USUBJID)

πŸ—ƒοΈ 2. Tibbles: Modern Data Frames (R4DS Ch. 10)

What are Tibbles?

Tibbles are enhanced data frames that are central to the tidyverse approach. They provide better printing, stricter subsetting, and never convert strings to factors automatically.

library(tibble)

# Create a tibble (preferred over data.frame)
dm <- tibble(
  USUBJID = c("001-001", "001-002", "001-003"),
  AGE = c(45, 67, 52),
  SEX = c("M", "F", "M"),
  RFSTDTC = c("2024-01-15", "2024-01-16", "2024-01-17")
)

# View structure - tibbles show data types and dimensions
glimpse(dm)

Tibbles vs Data.frames for Clinical Programming

Feature data.frame tibble Clinical Benefit
Printing Shows all rows Shows first 10 rows Better for large clinical datasets
Strings Converts to factors Keeps as characters Preserves CDISC text values
Subsetting Drops to vector Always returns tibble Consistent data structure
Column names Allows spaces Requires backticks for spaces Enforces CDISC naming standards

R Data Types for Clinical Programming:

Data Type Description Clinical Examples
character Text strings USUBJID, SEX, AEDECOD
numeric Numbers (integer/double) AGE, DOSE, LAB values
logical TRUE/FALSE Flags, eligibility criteria
Date Date objects RFSTDT, AESTDT
factor Categorical with levels Treatment arms, severity grades

πŸ“₯ 3. Data Import for Clinical Studies (R4DS Ch. 11)

Clinical studies typically involve multiple data sources that need to be imported and combined:

Common Clinical Data Formats

library(readr)    # CSV files
library(readxl)   # Excel files  
library(haven)    # SAS/SPSS/Stata files

# Import CSV data (common for EDC exports)
dm_csv <- read_csv("data/demographics.csv")

# Import Excel data (common for lab data, listings)
lab_excel <- read_excel("data/laboratory_data.xlsx", sheet = "Lab Results")

# Import SAS data (legacy clinical data)
dm_sas <- read_sas("data/dm.sas7bdat")

# Import XPT files (regulatory submissions)
dm_xpt <- read_xpt("data/dm.xpt")

Best Practices for Clinical Data Import

# 1. Always check data after import
dm <- read_csv("demographics.csv")
glimpse(dm)        # Check structure
problems(dm)       # Check parsing issues

# 2. Handle missing values explicitly  
dm <- read_csv("demographics.csv", na = c("", "NA", "NULL", "."))

# 3. Specify column types for clinical data
dm <- read_csv("demographics.csv",
  col_types = cols(
    USUBJID = col_character(),
    AGE = col_double(),
    SEX = col_character(),
    RFSTDTC = col_character()  # Import dates as character first
  )
)

# 4. Convert dates after import (safer)
dm <- dm %>%
  mutate(RFSTDT = ymd(RFSTDTC))

πŸ”§ 4. Data Transformation (R4DS Ch. 5)

The Five Key Verbs of Data Transformation

All data manipulation in clinical programming can be broken down into five key operations:

  1. filter() - Pick observations by their values (rows)
  2. select() - Pick variables by their names (columns)
  3. mutate() - Create new variables with functions of existing variables
  4. arrange() - Reorder the rows
  5. summarise() - Collapse many values down to a single summary

filter() - Subset Rows (like SAS WHERE clause)

# SAS: WHERE AGE >= 18;
# R equivalent:
adults <- dm %>% 
  filter(AGE >= 18)

# Multiple conditions
eligible <- dm %>%
  filter(AGE >= 18 & SEX == "F")

select() - Choose Columns (like KEEP statement)

# SAS: KEEP USUBJID AGE SEX;
# R equivalent:
dm_subset <- dm %>%
  select(USUBJID, AGE, SEX)

# Drop columns (like DROP statement)
dm_no_sex <- dm %>%
  select(-SEX)

mutate() - Create/Modify Variables (like assignment statements)

# SAS: ELDERLY = (AGE >= 65);
# R equivalent:
dm <- dm %>%
  mutate(
    ELDERLY = ifelse(AGE >= 65, "Y", "N"),
    AGEYR = AGE,  # Rename/copy
    AGE_MONTHS = AGE * 12
  )

arrange() - Sort Data (like PROC SORT)

# SAS: PROC SORT DATA=dm; BY USUBJID AGE; RUN;
# R equivalent:
dm_sorted <- dm %>%
  arrange(USUBJID, AGE)

# Descending order
dm_sorted_desc <- dm %>%
  arrange(desc(AGE))

summarise() - Create Summary Statistics (like PROC MEANS)

# SAS: PROC MEANS DATA=dm; VAR AGE; BY SEX; RUN;
# R equivalent:
age_stats <- dm %>%
  group_by(SEX) %>%
  summarise(
    n = n(),                    # Count observations
    mean_age = mean(AGE),       # Mean age
    median_age = median(AGE),   # Median age
    sd_age = sd(AGE),          # Standard deviation
    min_age = min(AGE),        # Minimum
    max_age = max(AGE),        # Maximum
    .groups = "drop"           # Remove grouping
  )

# Clinical trial summaries
trial_summary <- dm %>%
  group_by(ARM, SEX) %>%
  summarise(
    subjects = n(),
    mean_age = round(mean(AGE), 1),
    elderly_pct = round(100 * mean(AGE >= 65), 1),
    .groups = "drop"
  )

πŸ”„ 5. R vs SAS DATA Step Comparison

Creating Conditional Variables

SAS DATA Step:

DATA dm;
  SET raw_dm;

  /* Create age groups */
  IF AGE < 65 THEN ELDERLY = "N";
  ELSE ELDERLY = "Y";

  /* Multiple conditions */
  IF AGE < 40 THEN AGEGRP = "Young";
  ELSE IF AGE < 65 THEN AGEGRP = "Middle";
  ELSE AGEGRP = "Elderly";
RUN;

R dplyr equivalent:

dm <- raw_dm %>%
  mutate(
    ELDERLY = ifelse(AGE < 65, "N", "Y"),
    
    AGEGRP = case_when(
      AGE < 40 ~ "Young",
      AGE < 65 ~ "Middle", 
      TRUE ~ "Elderly"  # Default case (like ELSE)
    )
  )

Subsetting and Processing

SAS:

DATA adults;
  SET dm;
  WHERE AGE >= 18;

  AGEGRP = PUT(AGE, AGEGRP.);
  KEEP USUBJID AGE SEX AGEGRP;
RUN;

R:

adults <- dm %>%
  filter(AGE >= 18) %>%
  mutate(AGEGRP = case_when(
    AGE < 40 ~ "Young",
    AGE < 65 ~ "Middle",
    TRUE ~ "Elderly"
  )) %>%
  select(USUBJID, AGE, SEX, AGEGRP)

πŸš€ 4. Practical Example: Deriving Elderly Flag

Let’s work through a complete example of deriving an elderly flag, a common clinical programming task:

library(dplyr)
library(tibble)

# Sample demographics data
dm <- tibble(
  USUBJID = c("001-001", "001-002", "001-003", "001-004", "001-005"),
  AGE = c(45, 67, 52, 71, 28),
  SEX = c("M", "F", "M", "F", "F"),
  RFSTDTC = c("2024-01-15", "2024-01-16", "2024-01-17", "2024-01-18", "2024-01-19")
)

# Step 1: Create elderly flag (age >= 65)
dm <- dm %>%
  mutate(ELDERLY = ifelse(AGE >= 65, "Y", "N"))

# Step 2: Create more detailed age groups
dm <- dm %>%
  mutate(
    AGEGRP = case_when(
      AGE < 40 ~ "Young Adult",
      AGE >= 40 & AGE < 65 ~ "Middle Age", 
      AGE >= 65 ~ "Elderly"
    ),
    AGEGRPN = case_when(
      AGE < 40 ~ 1,
      AGE >= 40 & AGE < 65 ~ 2,
      AGE >= 65 ~ 3
    )
  )

# View results
dm

πŸƒβ€β™‚οΈ 5. Best Practices for Data Wrangling

1. Readable Code Structure

# Good: Clear pipe chain with meaningful intermediate steps
dm_processed <- dm %>%
  filter(AGE >= 18) %>%          # Include only adults
  mutate(ELDERLY = ifelse(AGE >= 65, "Y", "N")) %>%  # Create flag
  arrange(USUBJID) %>%           # Sort by subject
  select(USUBJID, AGE, SEX, ELDERLY)  # Keep relevant variables

# Less ideal: Everything in one complex pipe
dm_processed <- dm %>% filter(AGE >= 18) %>% mutate(ELDERLY = ifelse(AGE >= 65, "Y", "N"), AGEGRP = case_when(AGE < 40 ~ "Young", TRUE ~ "Old")) %>% arrange(USUBJID)

2. Consistent Naming Conventions

# Use CDISC-compliant variable names
dm <- dm %>%
  mutate(
    AGEFL = ifelse(AGE >= 65, "Y", "N"),  # Flag with FL suffix
    AGEGR1 = case_when(                   # Grouping with GR + number
      AGE < 65 ~ "< 65",
      AGE >= 65 ~ ">= 65"
    )
  )

3. Handle Missing Values Explicitly

dm <- dm %>%
  mutate(
    ELDERLY = case_when(
      is.na(AGE) ~ NA_character_,  # Handle missing ages
      AGE >= 65 ~ "Y",
      TRUE ~ "N"
    )
  )

πŸ€– 6. GitHub Copilot for Data Wrangling

Effective Prompts for Clinical Data:

Comment Prompt Expected Copilot Suggestion
# Create elderly flag where age >= 65 mutate(ELDERLY = ifelse(AGE >= 65, "Y", "N"))
# Filter subjects enrolled after 2024-01-01 filter(as.Date(RFSTDTC) > as.Date("2024-01-01"))
# Create BMI categories using WHO standards mutate(BMICAT = case_when(...))
# Sort by subject ID and visit date arrange(USUBJID, as.Date(VISIT_DATE))

πŸ“ Module Summary

By completing this module, you should now be able to:

βœ… Understand tibbles and R data types for clinical programming
βœ… Use dplyr verbs (filter, select, mutate, arrange) effectively
βœ… Compare R and SAS approaches to data manipulation
βœ… Create clinical flags like elderly indicators
βœ… Write readable, maintainable data wrangling code

πŸš€ Next Steps:

  • Practice with the demo exercises
  • Try deriving elderly flags on your own data
  • Prepare for Module 3: Joins and Summaries

πŸ’‘ Key Takeaways

  1. dplyr provides intuitive verbs that map well to SAS DATA step concepts
  2. Pipe operator (%>%) creates readable, step-by-step data transformations
  3. case_when() is your friend for complex conditional logic
  4. Consistent naming and explicit missing value handling are essential
  5. GitHub Copilot can accelerate coding when given clear, descriptive comments

Ready to learn about joins and summaries? Let’s move to Module 3!