Module 2 Theory — Data Manipulation

🔄 Module 2 — Data Transformation & Manipulation

🎯 Learning Objectives

By the end of this module, you will:

Master the pipe operator (%>%) for readable code workflows (R4DS Ch. 18)
Transform data using dplyr’s core verbs: filter, select, mutate, arrange, summarise (R4DS Ch. 5)
Understand tibbles vs data.frames and work with modern data structures (R4DS Ch. 10)
Import clinical data from Excel, CSV, and SAS files into R (R4DS Ch. 11)
Apply clinical programming patterns with CDISC SDTM/ADAM standards
Compare R and SAS approaches for data manipulation in clinical trials

🔗 1. Pipes: The Foundation of Readable Code (R4DS Ch. 18)

The Power of %>% (Pipe Operator)

The pipe operator chains operations together, making code read like English:

library(tidyverse)  # Loads dplyr, tibble, and pipe operator

# Instead of nested functions (hard to read):
result <- arrange(select(filter(dm, AGE >= 65), USUBJID, AGE), AGE)

# Use pipes for readable workflow (left-to-right):
result <- dm %>%
  filter(AGE >= 65) %>%     # First, filter elderly subjects
  select(USUBJID, AGE) %>%  # Then, select relevant columns  
  arrange(AGE)              # Finally, sort by age

Clinical Programming Workflow with Pipes

# CDISC SDTM Demographics processing pipeline
sdtm_dm <- raw_dm %>%
  # Step 1: Data cleaning
  filter(!is.na(SUBJID)) %>%
  
  # Step 2: Standardize formats
  mutate(
    USUBJID = paste0("STUDY001-", sprintf("%03d", SUBJID)),
    RFSTDT = ymd(RFSTDTC)
  ) %>%
  
  # Step 3: Clinical derivations
  mutate(
    AGEGR1 = case_when(
      AGE < 65 ~ "< 65 years",
      AGE >= 65 ~ ">= 65 years"
    ),
    ELDERLY = ifelse(AGE >= 65, "Y", "N")
  ) %>%
  
  # Step 4: Final dataset preparation
  select(STUDYID, USUBJID, AGE, SEX, RFSTDT, AGEGR1, ELDERLY) %>%
  arrange(USUBJID)

🗃️ 2. Tibbles: Modern Data Frames (R4DS Ch. 10)

What are Tibbles?

Tibbles are enhanced data frames that are central to the tidyverse approach. They provide better printing, stricter subsetting, and never convert strings to factors automatically.

library(tibble)

# Create a tibble (preferred over data.frame)
dm <- tibble(
  USUBJID = c("001-001", "001-002", "001-003"),
  AGE = c(45, 67, 52),
  SEX = c("M", "F", "M"),
  RFSTDTC = c("2024-01-15", "2024-01-16", "2024-01-17")
)

# View structure - tibbles show data types and dimensions
glimpse(dm)

Tibbles vs Data.frames for Clinical Programming

Feature	data.frame	tibble	Clinical Benefit
Printing	Shows all rows	Shows first 10 rows	Better for large clinical datasets
Strings	Converts to factors	Keeps as characters	Preserves CDISC text values
Subsetting	Drops to vector	Always returns tibble	Consistent data structure
Column names	Allows spaces	Requires backticks for spaces	Enforces CDISC naming standards

R Data Types for Clinical Programming:

Data Type	Description	Clinical Examples
`character`	Text strings	USUBJID, SEX, AEDECOD
`numeric`	Numbers (integer/double)	AGE, DOSE, LAB values
`logical`	TRUE/FALSE	Flags, eligibility criteria
`Date`	Date objects	RFSTDT, AESTDT
`factor`	Categorical with levels	Treatment arms, severity grades

📥 3. Data Import for Clinical Studies (R4DS Ch. 11)

Clinical studies typically involve multiple data sources that need to be imported and combined:

Common Clinical Data Formats

library(readr)    # CSV files
library(readxl)   # Excel files  
library(haven)    # SAS/SPSS/Stata files

# Import CSV data (common for EDC exports)
dm_csv <- read_csv("data/demographics.csv")

# Import Excel data (common for lab data, listings)
lab_excel <- read_excel("data/laboratory_data.xlsx", sheet = "Lab Results")

# Import SAS data (legacy clinical data)
dm_sas <- read_sas("data/dm.sas7bdat")

# Import XPT files (regulatory submissions)
dm_xpt <- read_xpt("data/dm.xpt")

Best Practices for Clinical Data Import

# 1. Always check data after import
dm <- read_csv("demographics.csv")
glimpse(dm)        # Check structure
problems(dm)       # Check parsing issues

# 2. Handle missing values explicitly  
dm <- read_csv("demographics.csv", na = c("", "NA", "NULL", "."))

# 3. Specify column types for clinical data
dm <- read_csv("demographics.csv",
  col_types = cols(
    USUBJID = col_character(),
    AGE = col_double(),
    SEX = col_character(),
    RFSTDTC = col_character()  # Import dates as character first
  )
)

# 4. Convert dates after import (safer)
dm <- dm %>%
  mutate(RFSTDT = ymd(RFSTDTC))

🔧 4. Data Transformation (R4DS Ch. 5)

The Five Key Verbs of Data Transformation

All data manipulation in clinical programming can be broken down into five key operations:

filter() - Pick observations by their values (rows)
select() - Pick variables by their names (columns)
mutate() - Create new variables with functions of existing variables
arrange() - Reorder the rows
summarise() - Collapse many values down to a single summary

`filter()` - Subset Rows (like SAS WHERE clause)

# SAS: WHERE AGE >= 18;
# R equivalent:
adults <- dm %>% 
  filter(AGE >= 18)

# Multiple conditions
eligible <- dm %>%
  filter(AGE >= 18 & SEX == "F")

`select()` - Choose Columns (like KEEP statement)

# SAS: KEEP USUBJID AGE SEX;
# R equivalent:
dm_subset <- dm %>%
  select(USUBJID, AGE, SEX)

# Drop columns (like DROP statement)
dm_no_sex <- dm %>%
  select(-SEX)

`mutate()` - Create/Modify Variables (like assignment statements)

# SAS: ELDERLY = (AGE >= 65);
# R equivalent:
dm <- dm %>%
  mutate(
    ELDERLY = ifelse(AGE >= 65, "Y", "N"),
    AGEYR = AGE,  # Rename/copy
    AGE_MONTHS = AGE * 12
  )

`arrange()` - Sort Data (like PROC SORT)

# SAS: PROC SORT DATA=dm; BY USUBJID AGE; RUN;
# R equivalent:
dm_sorted <- dm %>%
  arrange(USUBJID, AGE)

# Descending order
dm_sorted_desc <- dm %>%
  arrange(desc(AGE))

`summarise()` - Create Summary Statistics (like PROC MEANS)

# SAS: PROC MEANS DATA=dm; VAR AGE; BY SEX; RUN;
# R equivalent:
age_stats <- dm %>%
  group_by(SEX) %>%
  summarise(
    n = n(),                    # Count observations
    mean_age = mean(AGE),       # Mean age
    median_age = median(AGE),   # Median age
    sd_age = sd(AGE),          # Standard deviation
    min_age = min(AGE),        # Minimum
    max_age = max(AGE),        # Maximum
    .groups = "drop"           # Remove grouping
  )

# Clinical trial summaries
trial_summary <- dm %>%
  group_by(ARM, SEX) %>%
  summarise(
    subjects = n(),
    mean_age = round(mean(AGE), 1),
    elderly_pct = round(100 * mean(AGE >= 65), 1),
    .groups = "drop"
  )

🔄 5. R vs SAS DATA Step Comparison

Creating Conditional Variables

SAS DATA Step:

DATA dm;
  SET raw_dm;

  /* Create age groups */
  IF AGE < 65 THEN ELDERLY = "N";
  ELSE ELDERLY = "Y";

  /* Multiple conditions */
  IF AGE < 40 THEN AGEGRP = "Young";
  ELSE IF AGE < 65 THEN AGEGRP = "Middle";
  ELSE AGEGRP = "Elderly";
RUN;

R dplyr equivalent:

dm <- raw_dm %>%
  mutate(
    ELDERLY = ifelse(AGE < 65, "N", "Y"),
    
    AGEGRP = case_when(
      AGE < 40 ~ "Young",
      AGE < 65 ~ "Middle", 
      TRUE ~ "Elderly"  # Default case (like ELSE)
    )
  )

Subsetting and Processing

SAS:

DATA adults;
  SET dm;
  WHERE AGE >= 18;

  AGEGRP = PUT(AGE, AGEGRP.);
  KEEP USUBJID AGE SEX AGEGRP;
RUN;

adults <- dm %>%
  filter(AGE >= 18) %>%
  mutate(AGEGRP = case_when(
    AGE < 40 ~ "Young",
    AGE < 65 ~ "Middle",
    TRUE ~ "Elderly"
  )) %>%
  select(USUBJID, AGE, SEX, AGEGRP)

🚀 4. Practical Example: Deriving Elderly Flag

Let’s work through a complete example of deriving an elderly flag, a common clinical programming task:

library(dplyr)
library(tibble)

# Sample demographics data
dm <- tibble(
  USUBJID = c("001-001", "001-002", "001-003", "001-004", "001-005"),
  AGE = c(45, 67, 52, 71, 28),
  SEX = c("M", "F", "M", "F", "F"),
  RFSTDTC = c("2024-01-15", "2024-01-16", "2024-01-17", "2024-01-18", "2024-01-19")
)

# Step 1: Create elderly flag (age >= 65)
dm <- dm %>%
  mutate(ELDERLY = ifelse(AGE >= 65, "Y", "N"))

# Step 2: Create more detailed age groups
dm <- dm %>%
  mutate(
    AGEGRP = case_when(
      AGE < 40 ~ "Young Adult",
      AGE >= 40 & AGE < 65 ~ "Middle Age", 
      AGE >= 65 ~ "Elderly"
    ),
    AGEGRPN = case_when(
      AGE < 40 ~ 1,
      AGE >= 40 & AGE < 65 ~ 2,
      AGE >= 65 ~ 3
    )
  )

# View results
dm

🏃‍♂️ 5. Best Practices for Data Wrangling

1. Readable Code Structure

# Good: Clear pipe chain with meaningful intermediate steps
dm_processed <- dm %>%
  filter(AGE >= 18) %>%          # Include only adults
  mutate(ELDERLY = ifelse(AGE >= 65, "Y", "N")) %>%  # Create flag
  arrange(USUBJID) %>%           # Sort by subject
  select(USUBJID, AGE, SEX, ELDERLY)  # Keep relevant variables

# Less ideal: Everything in one complex pipe
dm_processed <- dm %>% filter(AGE >= 18) %>% mutate(ELDERLY = ifelse(AGE >= 65, "Y", "N"), AGEGRP = case_when(AGE < 40 ~ "Young", TRUE ~ "Old")) %>% arrange(USUBJID)

2. Consistent Naming Conventions

# Use CDISC-compliant variable names
dm <- dm %>%
  mutate(
    AGEFL = ifelse(AGE >= 65, "Y", "N"),  # Flag with FL suffix
    AGEGR1 = case_when(                   # Grouping with GR + number
      AGE < 65 ~ "< 65",
      AGE >= 65 ~ ">= 65"
    )
  )

3. Handle Missing Values Explicitly

dm <- dm %>%
  mutate(
    ELDERLY = case_when(
      is.na(AGE) ~ NA_character_,  # Handle missing ages
      AGE >= 65 ~ "Y",
      TRUE ~ "N"
    )
  )

🤖 6. GitHub Copilot for Data Wrangling

Effective Prompts for Clinical Data:

Comment Prompt	Expected Copilot Suggestion
`# Create elderly flag where age >= 65`	`mutate(ELDERLY = ifelse(AGE >= 65, "Y", "N"))`
`# Filter subjects enrolled after 2024-01-01`	`filter(as.Date(RFSTDTC) > as.Date("2024-01-01"))`
`# Create BMI categories using WHO standards`	`mutate(BMICAT = case_when(...))`
`# Sort by subject ID and visit date`	`arrange(USUBJID, as.Date(VISIT_DATE))`

📝 Module Summary

By completing this module, you should now be able to:

✅ Understand tibbles and R data types for clinical programming
✅ Use dplyr verbs (filter, select, mutate, arrange) effectively
✅ Compare R and SAS approaches to data manipulation
✅ Create clinical flags like elderly indicators
✅ Write readable, maintainable data wrangling code

🚀 Next Steps:

Practice with the demo exercises
Try deriving elderly flags on your own data
Prepare for Module 3: Joins and Summaries

💡 Key Takeaways

dplyr provides intuitive verbs that map well to SAS DATA step concepts
Pipe operator (%>%) creates readable, step-by-step data transformations
case_when() is your friend for complex conditional logic
Consistent naming and explicit missing value handling are essential
GitHub Copilot can accelerate coding when given clear, descriptive comments

Ready to learn about joins and summaries? Let’s move to Module 3!