Module 2 Theory β Data Manipulation
π Module 2 β Data Transformation & Manipulation
π― Learning Objectives
By the end of this module, you will:
- Master the pipe operator (
%>%) for readable code workflows (R4DS Ch. 18) - Transform data using dplyrβs core verbs:
filter,select,mutate,arrange,summarise(R4DS Ch. 5) - Understand tibbles vs data.frames and work with modern data structures (R4DS Ch. 10)
- Import clinical data from Excel, CSV, and SAS files into R (R4DS Ch. 11)
- Apply clinical programming patterns with CDISC SDTM/ADAM standards
- Compare R and SAS approaches for data manipulation in clinical trials
π 1. Pipes: The Foundation of Readable Code (R4DS Ch. 18)
The Power of %>% (Pipe Operator)
The pipe operator chains operations together, making code read like English:
library(tidyverse) # Loads dplyr, tibble, and pipe operator
# Instead of nested functions (hard to read):
result <- arrange(select(filter(dm, AGE >= 65), USUBJID, AGE), AGE)
# Use pipes for readable workflow (left-to-right):
result <- dm %>%
filter(AGE >= 65) %>% # First, filter elderly subjects
select(USUBJID, AGE) %>% # Then, select relevant columns
arrange(AGE) # Finally, sort by ageClinical Programming Workflow with Pipes
# CDISC SDTM Demographics processing pipeline
sdtm_dm <- raw_dm %>%
# Step 1: Data cleaning
filter(!is.na(SUBJID)) %>%
# Step 2: Standardize formats
mutate(
USUBJID = paste0("STUDY001-", sprintf("%03d", SUBJID)),
RFSTDT = ymd(RFSTDTC)
) %>%
# Step 3: Clinical derivations
mutate(
AGEGR1 = case_when(
AGE < 65 ~ "< 65 years",
AGE >= 65 ~ ">= 65 years"
),
ELDERLY = ifelse(AGE >= 65, "Y", "N")
) %>%
# Step 4: Final dataset preparation
select(STUDYID, USUBJID, AGE, SEX, RFSTDT, AGEGR1, ELDERLY) %>%
arrange(USUBJID)ποΈ 2. Tibbles: Modern Data Frames (R4DS Ch. 10)
What are Tibbles?
Tibbles are enhanced data frames that are central to the tidyverse approach. They provide better printing, stricter subsetting, and never convert strings to factors automatically.
library(tibble)
# Create a tibble (preferred over data.frame)
dm <- tibble(
USUBJID = c("001-001", "001-002", "001-003"),
AGE = c(45, 67, 52),
SEX = c("M", "F", "M"),
RFSTDTC = c("2024-01-15", "2024-01-16", "2024-01-17")
)
# View structure - tibbles show data types and dimensions
glimpse(dm)Tibbles vs Data.frames for Clinical Programming
| Feature | data.frame | tibble | Clinical Benefit |
|---|---|---|---|
| Printing | Shows all rows | Shows first 10 rows | Better for large clinical datasets |
| Strings | Converts to factors | Keeps as characters | Preserves CDISC text values |
| Subsetting | Drops to vector | Always returns tibble | Consistent data structure |
| Column names | Allows spaces | Requires backticks for spaces | Enforces CDISC naming standards |
R Data Types for Clinical Programming:
| Data Type | Description | Clinical Examples |
|---|---|---|
character |
Text strings | USUBJID, SEX, AEDECOD |
numeric |
Numbers (integer/double) | AGE, DOSE, LAB values |
logical |
TRUE/FALSE | Flags, eligibility criteria |
Date |
Date objects | RFSTDT, AESTDT |
factor |
Categorical with levels | Treatment arms, severity grades |
π₯ 3. Data Import for Clinical Studies (R4DS Ch. 11)
Clinical studies typically involve multiple data sources that need to be imported and combined:
Common Clinical Data Formats
library(readr) # CSV files
library(readxl) # Excel files
library(haven) # SAS/SPSS/Stata files
# Import CSV data (common for EDC exports)
dm_csv <- read_csv("data/demographics.csv")
# Import Excel data (common for lab data, listings)
lab_excel <- read_excel("data/laboratory_data.xlsx", sheet = "Lab Results")
# Import SAS data (legacy clinical data)
dm_sas <- read_sas("data/dm.sas7bdat")
# Import XPT files (regulatory submissions)
dm_xpt <- read_xpt("data/dm.xpt")Best Practices for Clinical Data Import
# 1. Always check data after import
dm <- read_csv("demographics.csv")
glimpse(dm) # Check structure
problems(dm) # Check parsing issues
# 2. Handle missing values explicitly
dm <- read_csv("demographics.csv", na = c("", "NA", "NULL", "."))
# 3. Specify column types for clinical data
dm <- read_csv("demographics.csv",
col_types = cols(
USUBJID = col_character(),
AGE = col_double(),
SEX = col_character(),
RFSTDTC = col_character() # Import dates as character first
)
)
# 4. Convert dates after import (safer)
dm <- dm %>%
mutate(RFSTDT = ymd(RFSTDTC))π§ 4. Data Transformation (R4DS Ch. 5)
The Five Key Verbs of Data Transformation
All data manipulation in clinical programming can be broken down into five key operations:
filter()- Pick observations by their values (rows)select()- Pick variables by their names (columns)
mutate()- Create new variables with functions of existing variablesarrange()- Reorder the rowssummarise()- Collapse many values down to a single summary
filter() - Subset Rows (like SAS WHERE clause)
# SAS: WHERE AGE >= 18;
# R equivalent:
adults <- dm %>%
filter(AGE >= 18)
# Multiple conditions
eligible <- dm %>%
filter(AGE >= 18 & SEX == "F")select() - Choose Columns (like KEEP statement)
# SAS: KEEP USUBJID AGE SEX;
# R equivalent:
dm_subset <- dm %>%
select(USUBJID, AGE, SEX)
# Drop columns (like DROP statement)
dm_no_sex <- dm %>%
select(-SEX)mutate() - Create/Modify Variables (like assignment statements)
# SAS: ELDERLY = (AGE >= 65);
# R equivalent:
dm <- dm %>%
mutate(
ELDERLY = ifelse(AGE >= 65, "Y", "N"),
AGEYR = AGE, # Rename/copy
AGE_MONTHS = AGE * 12
)arrange() - Sort Data (like PROC SORT)
# SAS: PROC SORT DATA=dm; BY USUBJID AGE; RUN;
# R equivalent:
dm_sorted <- dm %>%
arrange(USUBJID, AGE)
# Descending order
dm_sorted_desc <- dm %>%
arrange(desc(AGE))summarise() - Create Summary Statistics (like PROC MEANS)
# SAS: PROC MEANS DATA=dm; VAR AGE; BY SEX; RUN;
# R equivalent:
age_stats <- dm %>%
group_by(SEX) %>%
summarise(
n = n(), # Count observations
mean_age = mean(AGE), # Mean age
median_age = median(AGE), # Median age
sd_age = sd(AGE), # Standard deviation
min_age = min(AGE), # Minimum
max_age = max(AGE), # Maximum
.groups = "drop" # Remove grouping
)
# Clinical trial summaries
trial_summary <- dm %>%
group_by(ARM, SEX) %>%
summarise(
subjects = n(),
mean_age = round(mean(AGE), 1),
elderly_pct = round(100 * mean(AGE >= 65), 1),
.groups = "drop"
)π 5. R vs SAS DATA Step Comparison
Creating Conditional Variables
SAS DATA Step:
DATA dm;
SET raw_dm;
/* Create age groups */
IF AGE < 65 THEN ELDERLY = "N";
ELSE ELDERLY = "Y";
/* Multiple conditions */
IF AGE < 40 THEN AGEGRP = "Young";
ELSE IF AGE < 65 THEN AGEGRP = "Middle";
ELSE AGEGRP = "Elderly";
RUN;
R dplyr equivalent:
dm <- raw_dm %>%
mutate(
ELDERLY = ifelse(AGE < 65, "N", "Y"),
AGEGRP = case_when(
AGE < 40 ~ "Young",
AGE < 65 ~ "Middle",
TRUE ~ "Elderly" # Default case (like ELSE)
)
)Subsetting and Processing
SAS:
DATA adults;
SET dm;
WHERE AGE >= 18;
AGEGRP = PUT(AGE, AGEGRP.);
KEEP USUBJID AGE SEX AGEGRP;
RUN;
R:
adults <- dm %>%
filter(AGE >= 18) %>%
mutate(AGEGRP = case_when(
AGE < 40 ~ "Young",
AGE < 65 ~ "Middle",
TRUE ~ "Elderly"
)) %>%
select(USUBJID, AGE, SEX, AGEGRP)π 4. Practical Example: Deriving Elderly Flag
Letβs work through a complete example of deriving an elderly flag, a common clinical programming task:
library(dplyr)
library(tibble)
# Sample demographics data
dm <- tibble(
USUBJID = c("001-001", "001-002", "001-003", "001-004", "001-005"),
AGE = c(45, 67, 52, 71, 28),
SEX = c("M", "F", "M", "F", "F"),
RFSTDTC = c("2024-01-15", "2024-01-16", "2024-01-17", "2024-01-18", "2024-01-19")
)
# Step 1: Create elderly flag (age >= 65)
dm <- dm %>%
mutate(ELDERLY = ifelse(AGE >= 65, "Y", "N"))
# Step 2: Create more detailed age groups
dm <- dm %>%
mutate(
AGEGRP = case_when(
AGE < 40 ~ "Young Adult",
AGE >= 40 & AGE < 65 ~ "Middle Age",
AGE >= 65 ~ "Elderly"
),
AGEGRPN = case_when(
AGE < 40 ~ 1,
AGE >= 40 & AGE < 65 ~ 2,
AGE >= 65 ~ 3
)
)
# View results
dmπββοΈ 5. Best Practices for Data Wrangling
1. Readable Code Structure
# Good: Clear pipe chain with meaningful intermediate steps
dm_processed <- dm %>%
filter(AGE >= 18) %>% # Include only adults
mutate(ELDERLY = ifelse(AGE >= 65, "Y", "N")) %>% # Create flag
arrange(USUBJID) %>% # Sort by subject
select(USUBJID, AGE, SEX, ELDERLY) # Keep relevant variables
# Less ideal: Everything in one complex pipe
dm_processed <- dm %>% filter(AGE >= 18) %>% mutate(ELDERLY = ifelse(AGE >= 65, "Y", "N"), AGEGRP = case_when(AGE < 40 ~ "Young", TRUE ~ "Old")) %>% arrange(USUBJID)2. Consistent Naming Conventions
# Use CDISC-compliant variable names
dm <- dm %>%
mutate(
AGEFL = ifelse(AGE >= 65, "Y", "N"), # Flag with FL suffix
AGEGR1 = case_when( # Grouping with GR + number
AGE < 65 ~ "< 65",
AGE >= 65 ~ ">= 65"
)
)3. Handle Missing Values Explicitly
dm <- dm %>%
mutate(
ELDERLY = case_when(
is.na(AGE) ~ NA_character_, # Handle missing ages
AGE >= 65 ~ "Y",
TRUE ~ "N"
)
)π€ 6. GitHub Copilot for Data Wrangling
Effective Prompts for Clinical Data:
| Comment Prompt | Expected Copilot Suggestion |
|---|---|
# Create elderly flag where age >= 65 |
mutate(ELDERLY = ifelse(AGE >= 65, "Y", "N")) |
# Filter subjects enrolled after 2024-01-01 |
filter(as.Date(RFSTDTC) > as.Date("2024-01-01")) |
# Create BMI categories using WHO standards |
mutate(BMICAT = case_when(...)) |
# Sort by subject ID and visit date |
arrange(USUBJID, as.Date(VISIT_DATE)) |
π Module Summary
By completing this module, you should now be able to:
β
Understand tibbles and R data types for clinical programming
β
Use dplyr verbs (filter, select, mutate, arrange) effectively
β
Compare R and SAS approaches to data manipulation
β
Create clinical flags like elderly indicators
β
Write readable, maintainable data wrangling code
π Next Steps:
- Practice with the demo exercises
- Try deriving elderly flags on your own data
- Prepare for Module 3: Joins and Summaries
π‘ Key Takeaways
- dplyr provides intuitive verbs that map well to SAS DATA step concepts
- Pipe operator (%>%) creates readable, step-by-step data transformations
- case_when() is your friend for complex conditional logic
- Consistent naming and explicit missing value handling are essential
- GitHub Copilot can accelerate coding when given clear, descriptive comments
Ready to learn about joins and summaries? Letβs move to Module 3!