Module 5 Theory — Functions, Vectors & Iteration
🎯 Learning Objectives
By the end of this module, you will:
- Master function creation, arguments, and environments (R4DS Ch. 19)
- Understand vector types, testing, and coercion in R (R4DS Ch. 20)
- Learn iteration techniques with for loops and functional programming (R4DS Ch. 21)
- Apply functions and iteration to clinical programming workflows
- Translate SAS macros to R functions using modern R techniques
- Use GitHub Copilot in RStudio to assist with function development and debugging
🔧 1. Functions (R4DS Chapter 19)
Functions are the fundamental building blocks of R programming. They reduce duplication, make code more readable, and help catch errors.
When to Write a Function
Following the DRY principle (Don’t Repeat Yourself), consider writing a function whenever you’ve copied and pasted code more than twice.
# Instead of repeating this pattern:
df$a <- (df$a - min(df$a, na.rm = TRUE)) / (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / (max(df$b, na.rm = TRUE) - min(df$b, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
# Write a function:
rescale01 <- function(x) {
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
Function Components
Every R function has three key components:
- Arguments (formals()): the list of arguments that control how you call the function
- Body (body()): the code inside the function
- Environment (environment()): the data structure that determines how the function finds the values associated with names
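As a quick illustration of the three components, the sketch below inspects a small helper and then shows the environment at work. The names `cm_to_m`, `cm_to_m_v2`, and `cm_per_m` are hypothetical, chosen only for this example.

```r
# Hypothetical helper (not part of the module): convert centimetres to metres
cm_to_m <- function(height_cm, digits = 2) {
  round(height_cm / 100, digits)
}

arg_names <- names(formals(cm_to_m))  # "height_cm" "digits"
fn_body   <- body(cm_to_m)            # the unevaluated code between the braces
fn_env    <- environment(cm_to_m)     # the environment where the function was created

# The environment matters for names that are not arguments:
# `cm_per_m` is looked up in the function's environment when the body runs
cm_per_m <- 100
cm_to_m_v2 <- function(height_cm) height_cm / cm_per_m
cm_to_m_v2(175)  # 1.75
```

Relying on a value outside the function, as `cm_to_m_v2` does, is usually best avoided in production code; passing it as an argument keeps the function self-contained.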
# Clinical example: Age category function
library(dplyr)  # provides case_when()
create_age_category <- function(age) {
case_when(
is.na(age) ~ "Unknown",
age < 18 ~ "Pediatric",
age >= 18 & age < 65 ~ "Adult",
age >= 65 ~ "Elderly"
)
}
# Examine function components
formals(create_age_category)
body(create_age_category)
environment(create_age_category)
Function Arguments
Matching Arguments by Position and Name
# Clinical example: BMI calculation with multiple argument types
calculate_bmi <- function(weight, height, unit = "metric", round_digits = 1) {
if (unit == "metric") {
bmi <- weight / (height / 100)^2
} else if (unit == "imperial") {
bmi <- (weight * 703) / height^2
} else {
stop("unit must be 'metric' or 'imperial'")
}
round(bmi, digits = round_digits)
}
# Different ways to call the function:
calculate_bmi(70, 175) # positional
calculate_bmi(weight = 70, height = 175) # named
calculate_bmi(70, 175, unit = "metric") # mixed
calculate_bmi(70, 175, round_digits = 2) # skip middle argument
The ... (dot-dot-dot) Argument
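The `...` argument collects any extra arguments so that they can be forwarded to another function. The sketch below uses only base R to show the mechanism before the dplyr example; `summarise_lab`, its `stat` parameter, and `count_extra_args` are hypothetical names for illustration.

```r
# Hypothetical wrapper: apply a summary function and pass extra arguments through `...`
summarise_lab <- function(x, stat = mean, ...) {
  stat(x, ...)  # everything in `...` is handed on unchanged
}

glucose <- c(90, 95, NA, 110)

# na.rm is not an argument of summarise_lab(), so it travels through `...`
mean_glucose   <- summarise_lab(glucose, na.rm = TRUE)
median_glucose <- summarise_lab(glucose, stat = median, na.rm = TRUE)
q90_glucose    <- summarise_lab(glucose, stat = quantile, probs = 0.9, na.rm = TRUE)

# list(...) lets a function inspect what was passed
count_extra_args <- function(...) length(list(...))
count_extra_args(1, "a", TRUE)  # 3
```

One caveat: a misspelled argument name passed through `...` may be forwarded silently, so typos can go unnoticed.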
# Clinical example: Flexible summary function
clinical_summary <- function(data, ..., na.rm = TRUE) {
data %>%
summarise(
across(c(...), list(
mean = ~ mean(.x, na.rm = na.rm),
sd = ~ sd(.x, na.rm = na.rm),
min = ~ min(.x, na.rm = na.rm),
max = ~ max(.x, na.rm = na.rm)
))
)
}
# Usage
demographics %>%
clinical_summary(AGE, WEIGHT, HEIGHT)
Return Values and Environment
Functions can return values explicitly with return() or implicitly (last expression):
# Explicit return
calculate_study_day <- function(event_date, ref_date) {
if (is.na(event_date) || is.na(ref_date)) {
return(NA_real_)
}
days <- as.numeric(event_date - ref_date)
if (event_date >= ref_date) {
return(days + 1)
} else {
return(days)
}
}
# Implicit return (preferred when possible)
calculate_study_day_v2 <- function(event_date, ref_date) {
case_when(
is.na(event_date) | is.na(ref_date) ~ NA_real_,
event_date >= ref_date ~ as.numeric(event_date - ref_date) + 1,
TRUE ~ as.numeric(event_date - ref_date)
)
}
2. Vectors (R4DS Chapter 20)
Vectors are the building blocks of R. Understanding them deeply will help you write better functions and avoid common errors.
Vector Basics
R has two types of vectors:
- Atomic vectors (6 types): logical, integer, double, character, complex, raw
- Lists (recursive vectors)
# Creating vectors for clinical data
logical_vec <- c(TRUE, FALSE, TRUE) # Treatment response
integer_vec <- c(1L, 2L, 3L) # Visit numbers
double_vec <- c(1.5, 2.7, 3.9) # Lab values
character_vec <- c("M", "F", "M") # Gender
# Check types
typeof(logical_vec)
typeof(integer_vec)
typeof(double_vec)
typeof(character_vec)
Important Vector Properties
# Clinical example with adverse events
ae_severity <- c("MILD", "MODERATE", "SEVERE", "MILD", "SEVERE")
# Length
length(ae_severity)
# Names
names(ae_severity) <- paste0("AE_", 1:5)
ae_severity
# Dimensions (for matrices/arrays)
lab_matrix <- matrix(c(120, 80, 135, 85, 110, 75), nrow = 3)
dim(lab_matrix)
Vector Testing and Coercion
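When `c()` combines different atomic types, R coerces them to the most flexible type present, in the order logical, integer, double, character. A minimal sketch of that rule, with made-up values:

```r
type_lgl_int <- typeof(c(TRUE, 1L))   # "integer"
type_int_dbl <- typeof(c(1L, 2.5))    # "double"
type_dbl_chr <- typeof(c(2.5, "a"))   # "character"

# Logical-to-numeric coercion is what makes counts and proportions work
responder     <- c(TRUE, FALSE, TRUE, TRUE)
n_responders  <- sum(responder)   # 3
response_rate <- mean(responder)  # 0.75
```

This is why a single stray text value in a numeric column turns the whole vector into character.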
# Testing vector types
test_values <- c(1, 2, 3, "4", 5)
is.character(test_values) # TRUE - coerced to character
is.numeric(test_values) # FALSE
# Explicit coercion
lab_values <- c("120", "85", "normal", "95")
as.numeric(lab_values) # Warning for "normal"
# Safe coercion for clinical data
safe_as_numeric <- function(x) {
suppressWarnings(as.numeric(x))
}
clean_lab_values <- safe_as_numeric(lab_values)
clean_lab_values # c(120, 85, NA, 95)
Subsetting Vectors
subject_ids <- c("001-001", "001-002", "001-003", "001-004", "001-005")
# Positive integers select elements
subject_ids[c(1, 3, 5)]
# Negative integers exclude elements
subject_ids[-c(2, 4)]
# Logical vectors select TRUE elements
ages <- c(25, 67, 45, 72, 34)
elderly <- ages >= 65
subject_ids[elderly]
# Named vectors
vital_signs <- c(sbp = 120, dbp = 80, hr = 72)
vital_signs["sbp"]
vital_signs[c("sbp", "dbp")]
🔄 3. SAS Macro to R Function Translation
Understanding vectors helps us write better functions for clinical programming.
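One practical consequence: `if` tests a single value, so a function built around `if` handles one subject at a time, while `ifelse()` (or `case_when()`) works element by element on a whole vector. The sketch below contrasts the two; both function names are hypothetical.

```r
# Scalar version: `if` needs a length-one condition
flag_elderly_scalar <- function(age) {
  if (is.na(age)) return("U")
  if (age >= 65) "Y" else "N"
}

# Vectorized version: ifelse() evaluates every element
flag_elderly <- function(age) {
  ifelse(is.na(age), "U", ifelse(age >= 65, "Y", "N"))
}

ages <- c(25, 67, NA, 72)

flags_vectorized <- flag_elderly(ages)                               # "N" "Y" "U" "Y"
flags_one_by_one <- vapply(ages, flag_elderly_scalar, character(1))  # same result

# flag_elderly_scalar(ages) stops with an error in R >= 4.2.0,
# because the condition has length greater than one
```

The SAS DATA step processes one row at a time implicitly; in R the vectorized form plays that role and is usually the one to reach for inside `mutate()`.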
SAS Macro Example: Calculate Study Day
%macro calc_study_day(indata=, outdata=, event_date=, ref_date=);
data &outdata;
set &indata;
if not missing(&event_date) and not missing(&ref_date) then do;
if &event_date >= &ref_date then
study_day = &event_date - &ref_date + 1;
else
study_day = &event_date - &ref_date;
end;
run;
%mend calc_study_day;
R Function Translation (Vectorized)
calc_study_day <- function(data, event_date, ref_date) {
data %>%
mutate(
study_day = case_when(
is.na({{ event_date }}) | is.na({{ ref_date }}) ~ NA_real_,
{{ event_date }} >= {{ ref_date }} ~ as.numeric({{ event_date }} - {{ ref_date }}) + 1,
{{ event_date }} < {{ ref_date }} ~ as.numeric({{ event_date }} - {{ ref_date }})
)
)
}
# Usage with proper vector handling
ae_data <- ae_data %>%
calc_study_day(event_date = AESTDT, ref_date = RFSTDT)
🔄 4. Iteration (R4DS Chapter 21)
Iteration allows you to do the same thing to multiple inputs. R provides two main paradigms for iteration:
- Imperative programming: for loops and while loops
- Functional programming: map() functions from purrr
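As a rough side-by-side, the sketch below computes the same per-variable means with a `for` loop and with a functional. It uses base R's `vapply()` as a stand-in for the purrr functions so that it runs without extra packages; the `vitals` data are made up.

```r
vitals <- list(
  sbp = c(120, 135, 110),
  dbp = c(80, 85, 75),
  hr  = c(72, 64, 80)
)

# Imperative: pre-allocate the output, then fill it one element at a time
means_loop <- numeric(length(vitals))
names(means_loop) <- names(vitals)
for (i in seq_along(vitals)) {
  means_loop[[i]] <- mean(vitals[[i]])
}

# Functional: hand over the function to apply; the looping is done for you
means_fun <- vapply(vitals, mean, numeric(1))

all.equal(means_loop, means_fun)  # TRUE
```

The functional version has no index variable and no pre-allocation step to get wrong, which is the main argument for it.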
For Loops
# Basic for loop structure
clinical_vars <- c("AGE", "WEIGHT", "HEIGHT")
results <- vector("double", length(clinical_vars)) # Pre-allocate
for (i in seq_along(clinical_vars)) {
# Process each clinical variable
results[[i]] <- mean(demographics[[clinical_vars[[i]]]], na.rm = TRUE)
}
names(results) <- clinical_vars
results
Modifying Existing Objects
# Standardize multiple character columns
clinical_data <- tibble(
USUBJID = c("001-001", "001-002"),
AEDECOD = c(" headache ", "NAUSEA "),
CMDECOD = c("aspirin ", " IBUPROFEN")
)
char_cols <- c("AEDECOD", "CMDECOD")
for (col in char_cols) {
clinical_data[[col]] <- str_trim(str_to_upper(clinical_data[[col]]))
}
clinical_data
Unknown Output Length
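When the number of results is not known in advance, it is tempting to grow a vector with `c()` inside the loop, but that copies the vector on every iteration. A common alternative is to collect each result in a pre-allocated list and flatten it once at the end. A minimal sketch of that pattern, using made-up visit data:

```r
# Hypothetical input: each subject has a different number of recorded visit days
visits_by_subject <- list(
  "001-001" = c(1, 8, 15),
  "001-002" = numeric(0),
  "001-003" = c(1, 8)
)

# One list slot per input; each slot can hold a result of any length
late_visits <- vector("list", length(visits_by_subject))
for (i in seq_along(visits_by_subject)) {
  days <- visits_by_subject[[i]]
  late_visits[[i]] <- days[days > 1]  # may have length 0, 1, or more
}

n_late   <- lengths(late_visits)  # 2 0 1
all_late <- unlist(late_visits)   # 8 15 8
```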
# Simulate adverse event occurrences (unknown number of events)
simulate_ae_times <- function() {
n_events <- rpois(1, lambda = 2) # Random number of events
if (n_events == 0) return(numeric())
cumsum(rexp(n_events, rate = 0.1)) # Event times
}
# Collect results using a list
ae_simulations <- vector("list", 100)
for (i in 1:100) {
ae_simulations[[i]] <- simulate_ae_times()
}
# Convert to useful format
ae_data_simulated <- tibble(
simulation = rep(1:100, lengths(ae_simulations)),
event_time = unlist(ae_simulations)
)
While Loops
# Dose escalation algorithm (common in clinical trials)
dose_escalation <- function(starting_dose = 10, max_dose = 100) {
current_dose <- starting_dose
doses <- current_dose
while (current_dose < max_dose) {
# Simulate safety assessment (simplified)
safety_ok <- rbinom(1, 1, prob = 0.8) # 80% chance of safety
if (safety_ok) {
current_dose <- current_dose * 1.5 # Escalate by 50%
} else {
break # Stop escalation if safety issue
}
doses <- c(doses, current_dose)
}
doses[doses <= max_dose] # Return valid doses only
}
dose_escalation()
For Loops vs. Functionals
The functional programming approach is often cleaner and less error-prone:
# For loop approach
means_for <- vector("double", ncol(demographics))
for (i in seq_along(demographics)) {
if (is.numeric(demographics[[i]])) {
means_for[[i]] <- mean(demographics[[i]], na.rm = TRUE)
}
}
# Functional approach with map
library(purrr)
means_functional <- demographics %>%
select(where(is.numeric)) %>%
map_dbl(mean, na.rm = TRUE)🎯 5. Clinical Programming Function Examples
Derive Elderly Flag (Vectorized)
derive_elderly_flag <- function(data, age_var, cutoff = 65) {
data %>%
mutate(
ELDERLY = case_when(
is.na({{ age_var }}) ~ "U",
{{ age_var }} >= cutoff ~ "Y",
{{ age_var }} < cutoff ~ "N"
)
)
}
Process Multiple Studies with Iteration
# Function to process a single study
process_study <- function(study_data) {
study_data %>%
derive_elderly_flag(AGE) %>%
mutate(
BMI = WEIGHT / (HEIGHT / 100)^2,
BMI_CAT = case_when(
BMI < 18.5 ~ "Underweight",
BMI >= 18.5 & BMI < 25 ~ "Normal",
BMI >= 25 & BMI < 30 ~ "Overweight",
BMI >= 30 ~ "Obese",
TRUE ~ "Unknown"
)
)
}
# Apply to multiple studies using map
study_files <- c("study001.csv", "study002.csv", "study003.csv")
all_studies <- study_files %>%
map(read_csv) %>%
map(process_study) %>%
set_names(c("Study 001", "Study 002", "Study 003"))
Batch Processing with Error Handling
# Safe function that handles errors gracefully.
# Reading and processing are wrapped together so that a missing or malformed
# file is caught too, not only errors raised inside process_study()
safe_process_study <- possibly(
  function(path) process_study(read_csv(path, show_col_types = FALSE)),
  otherwise = tibble()
)
# Process multiple files with error handling
results <- study_files %>%
  map(~ {
    cat("Processing", .x, "...\n")
    safe_process_study(.x)
  })
# Check which files processed successfully (failures come back as empty tibbles)
successful <- map_lgl(results, ~ nrow(.x) > 0)
cat("Successfully processed:", sum(successful), "out of", length(study_files), "files\n")
🛡️ 6. Advanced Function Features
Input Validation and Error Handling
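Validation usually sits at the top of a function so that bad input fails early with a clear message. The sketch below shows two base R tools that fit this pattern, `match.arg()` for a fixed set of choices and `stopifnot()` for simple conditions. `convert_weight` is a hypothetical function used only for illustration, and the named-message form of `stopifnot()` assumes R 4.0.0 or later.

```r
# Hypothetical helper: return weight in kilograms; `unit` is the unit of the input
convert_weight <- function(weight, unit = c("kg", "lb")) {
  unit <- match.arg(unit)  # errors unless unit is one of the listed choices; default "kg"
  stopifnot(
    "weight must be numeric" = is.numeric(weight),
    "weight must not be negative" = all(weight >= 0, na.rm = TRUE)
  )
  if (unit == "lb") weight * 0.45359237 else weight
}

convert_weight(c(70, 82.5))       # 70.0 82.5
convert_weight(154, unit = "lb")  # about 69.85

# Invalid input fails with the message supplied above
msg <- tryCatch(convert_weight("70"), error = function(e) conditionMessage(e))
msg  # "weight must be numeric"
```

Compared with a hand-written `if (!unit %in% ...) stop(...)`, `match.arg()` keeps the allowed values in the function signature, where callers can see them.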
derive_bmi_category <- function(data, weight_var, height_var, unit = "metric") {
# Input validation
if (!unit %in% c("metric", "imperial")) {
stop("unit must be either 'metric' or 'imperial'")
}
if (!is.data.frame(data)) {
stop("data must be a data frame")
}
data %>%
mutate(
bmi = if (unit == "metric") {
{{ weight_var }} / ({{ height_var }} / 100)^2
} else {
({{ weight_var }} * 703) / {{ height_var }}^2
},
bmi_category = case_when(
is.na(bmi) ~ "Unknown",
bmi < 18.5 ~ "Underweight",
bmi >= 18.5 & bmi < 25 ~ "Normal",
bmi >= 25 & bmi < 30 ~ "Overweight",
bmi >= 30 ~ "Obese"
)
)
}
Function Factories
Functions that create other functions are called function factories:
# Create functions for different CDISC domains
create_domain_checker <- function(domain_prefix) {
function(subject_id) {
str_detect(subject_id, paste0("^", domain_prefix, "-"))
}
}
# Create specific checkers
is_ae_subject <- create_domain_checker("AE")
is_dm_subject <- create_domain_checker("DM")
is_vs_subject <- create_domain_checker("VS")
# Usage
subject_ids <- c("AE-001", "DM-002", "VS-003", "AE-004")
is_ae_subject(subject_ids) # TRUE FALSE FALSE TRUE
Function Operators
Functions that take functions as input and return functions as output:
# Add clinical data validation to any function
add_validation <- function(f) {
function(data, ...) {
if (!is.data.frame(data)) {
stop("Input must be a data frame")
}
if (nrow(data) == 0) {
warning("Input data frame is empty")
return(data)
}
result <- f(data, ...)
if (!is.data.frame(result)) {
stop("Function must return a data frame")
}
result
}
}
# Apply validation wrapper
safe_derive_elderly <- add_validation(derive_elderly_flag)
# Now the function includes validation
# safe_derive_elderly(demographics, AGE)
Functions with Multiple Return Options
analyze_clinical_data <- function(data, return_type = "summary", group_var = NULL) {
  # group_var is a bare column name, so test the captured expression;
  # is.null(group_var) would try to evaluate the column and fail
  if (!rlang::quo_is_null(rlang::enquo(group_var))) {
    data <- data %>% group_by({{ group_var }})
  }
  # Compute every statistic from the raw data: AGE no longer exists after summarise()
  full_analysis <- data %>%
    summarise(
      n_subjects = n_distinct(USUBJID),
      mean_age = mean(AGE, na.rm = TRUE),
      median_age = median(AGE, na.rm = TRUE),
      age_sd = sd(AGE, na.rm = TRUE),
      age_range = paste(min(AGE, na.rm = TRUE), max(AGE, na.rm = TRUE), sep = "-"),
      .groups = "drop"
    )
  switch(return_type,
    "summary" = full_analysis %>% select(-age_sd, -age_range),
    "detailed" = full_analysis,
    "count_only" = full_analysis$n_subjects[[1]],
    "ages_only" = data$AGE,
    stop("return_type must be 'summary', 'detailed', 'count_only', or 'ages_only'")
  )
}
🎨 7. The map() Family (Functional Programming)
The purrr package provides a consistent set of tools for working with functions and vectors, implementing functional programming concepts.
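To make the idea concrete, the sketch below hand-writes a simplified version of what a `map()`-style function does: loop over the input, apply a function to each element, and return a list of the same length. `simple_map` is a hypothetical teaching helper, not purrr's actual implementation, and the age data are made up.

```r
simple_map <- function(.x, .f, ...) {
  out <- vector("list", length(.x))
  for (i in seq_along(.x)) {
    out[[i]] <- .f(.x[[i]], ...)  # extra arguments are forwarded to .f
  }
  names(out) <- names(.x)  # keep the input names on the output
  out
}

ages_by_study <- list(study1 = c(25, 45, 67), study2 = c(28, NA, 63))
mean_ages <- simple_map(ages_by_study, mean, na.rm = TRUE)
mean_ages$study2  # 45.5
```

Seen this way, the typed variants such as `map_dbl()` differ mainly in the kind of output container they fill.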
Basic map() Functions
library(purrr)
# map() always returns a list
clinical_vars <- list(
study1_ages = c(25, 45, 67, 34, 52),
study2_ages = c(28, 49, 63, 71, 39),
study3_ages = c(44, 55, 29, 67, 42)
)
# Calculate mean age for each study
clinical_vars %>%
map(mean, na.rm = TRUE)
# map_dbl() returns a numeric vector
clinical_vars %>%
map_dbl(mean, na.rm = TRUE)
# map_chr() returns a character vector
clinical_vars %>%
map_chr(~ paste("Mean age:", round(mean(.x), 1)))
Anonymous Functions and Shortcuts
# Three ways to write the same thing:
# 1. Anonymous function (traditional)
map(clinical_vars, function(x) mean(x, na.rm = TRUE))
# 2. Formula shortcut (purrr style)
map(clinical_vars, ~ mean(.x, na.rm = TRUE))
# 3. Pass the function itself, with its extra arguments after it
map(clinical_vars, mean, na.rm = TRUE)
Processing Clinical Data with map()
# Apply standardization to multiple character columns
clinical_data <- tibble(
USUBJID = c("001-001", "001-002", "001-003"),
AEDECOD = c(" headache ", "NAUSEA ", "fatigue"),
CMDECOD = c("aspirin ", " IBUPROFEN", "acetaminophen ")
)
# Standardize all character columns except USUBJID
clinical_data %>%
mutate(
across(c(AEDECOD, CMDECOD), ~ str_trim(str_to_upper(.x)))
)
# Alternative using map approach
char_cols <- clinical_data %>%
select(-USUBJID) %>%
select(where(is.character))
char_cols %>%
map(~ str_trim(str_to_upper(.x)))
map2() and pmap() for Multiple Inputs
# map2() for two inputs
weights <- c(70, 65, 80, 75)
heights <- c(175, 160, 185, 180)
map2_dbl(weights, heights, ~ .x / (.y / 100)^2) # Calculate BMI
# pmap() for multiple inputs
subjects <- list(
weight = c(70, 65, 80, 75),
height = c(175, 160, 185, 180),
age = c(45, 62, 38, 55)
)
# Function that uses all three inputs
health_score <- function(weight, height, age) {
bmi <- weight / (height / 100)^2
age_factor <- ifelse(age > 50, 0.9, 1.0)
round((25 / bmi) * age_factor, 2)
}
pmap_dbl(subjects, health_score)
Walk Functions (for Side Effects)
# Use walk() when you want side effects, not return values
study_summaries <- list(
study1 = tibble(USUBJID = 1:5, AGE = c(45, 62, 38, 55, 49)),
study2 = tibble(USUBJID = 1:4, AGE = c(52, 34, 67, 41)),
study3 = tibble(USUBJID = 1:6, AGE = c(29, 58, 46, 71, 33, 54))
)
# Create a summary report for each study
create_study_report <- function(data, study_name) {
summary_stats <- data %>%
summarise(
n = n(),
mean_age = round(mean(AGE), 1),
median_age = median(AGE)
)
cat("Study:", study_name, "\n")
cat("N:", summary_stats$n, "\n")
cat("Mean age:", summary_stats$mean_age, "\n")
cat("Median age:", summary_stats$median_age, "\n\n")
}
# Use walk2() to iterate and produce side effects
walk2(study_summaries, names(study_summaries), create_study_report)
Predicate Functions
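A predicate is a function that returns a single `TRUE` or `FALSE`; `keep()` and `discard()` use one to decide which elements to retain. The same idea exists in base R as `Filter()`, which the sketch below uses so that it runs without purrr. The `labs` data and the `has_data` helper are made up for illustration.

```r
labs <- list(
  glucose    = c(90, 95, 110),
  creatinine = c(NA, NA, NA),
  hemoglobin = c(13.5, NA, 14.2)
)

# Predicate: does this test have at least one non-missing result?
has_data <- function(x) any(!is.na(x))

kept      <- Filter(has_data, labs)          # plays the role of keep(labs, has_data)
discarded <- Filter(Negate(has_data), labs)  # plays the role of discard(labs, has_data)

names(kept)       # "glucose" "hemoglobin"
names(discarded)  # "creatinine"
```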
# keep() and discard() for filtering based on a condition
lab_values <- list(
glucose = c(90, 95, 110, 85),
creatinine = c(1.1, 0.9, 1.3, 1.0),
invalid_test = c(NA, NA, NA, NA),
hemoglobin = c(13.5, 12.1, 14.2, 13.8)
)
# Keep only non-missing lab tests
valid_labs <- lab_values %>%
keep(~ !all(is.na(.x)))
# Find lab tests with any abnormal values (example thresholds)
abnormal_labs <- lab_values %>%
keep(~ any(.x > 100, na.rm = TRUE)) # Simplified example
Advanced Iteration Patterns
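Reduction collapses a vector to a single value by applying a two-argument function repeatedly, whereas accumulation keeps every intermediate result. The sketch below shows the difference with base R's `Reduce()`; purrr's `reduce()` and `accumulate()` play the same two roles. The counts are made up.

```r
weekly_ae_counts <- c(2, 1, 3, 0)

total_aes   <- Reduce(`+`, weekly_ae_counts)                     # 6
running_aes <- Reduce(`+`, weekly_ae_counts, accumulate = TRUE)  # 2 3 6 6

# For plain sums, the built-in vectorized functions give the same answers
sum(weekly_ae_counts)     # 6
cumsum(weekly_ae_counts)  # 2 3 6 6
```

Reduction earns its keep when the combining step is more than arithmetic, for example joining or stacking a list of data frames.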
# Reduce collapses a vector to a single value
daily_ae_counts <- c(2, 1, 3, 0, 2, 1, 4)
# Total AE count across all days (13); use accumulate() for a running total
total_aes <- reduce(daily_ae_counts, `+`, .init = 0)
# More complex reduction: combine multiple study datasets
study_datasets <- list(
tibble(USUBJID = c("A001", "A002"), ARM = "Treatment"),
tibble(USUBJID = c("B001", "B002"), ARM = "Placebo"),
tibble(USUBJID = c("C001", "C002"), ARM = "Treatment")
)
# Combine all studies into one dataset
combined_data <- reduce(study_datasets, bind_rows)
🤖 8. GitHub Copilot in RStudio for Functions
GitHub Copilot can significantly accelerate function development in clinical programming by suggesting code based on your comments and context.
Effective Copilot Prompts for Functions
| Comment Prompt | Expected Copilot Suggestion |
|---|---|
| # Create function to derive safety flag based on lab values | Function template with safety logic |
| # Function to calculate percent change from baseline | Mathematical formula implementation |
| # Translate SAS macro for lab normal ranges to R function | R function with lab reference logic |
| # Function with error handling for missing clinical data | Try-catch blocks and validation |
| # Vectorized function to process multiple subjects | Function using map() or vectorized operations |
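Whatever Copilot suggests still needs review against the specification. As a reference point for the second prompt in the table, a hand-written percent change from baseline might look like the sketch below. The function name, the argument names, and the choice to return `NA` for a zero or missing baseline are assumptions made for illustration; they are neither Copilot output nor a CDISC rule.

```r
# Percent change from baseline: (value - baseline) / baseline * 100
pct_change_from_baseline <- function(value, baseline) {
  stopifnot(is.numeric(value), is.numeric(baseline))
  # Assumption: a zero or missing baseline gives NA rather than Inf or NaN
  ifelse(is.na(baseline) | baseline == 0,
         NA_real_,
         (value - baseline) / baseline * 100)
}

pct_change <- pct_change_from_baseline(c(110, 90, 50), c(100, 100, 0))
pct_change  # 10 -10 NA
```

Having a known-good version like this makes it easier to judge a generated suggestion, particularly for edge cases such as a zero baseline.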
Copilot Best Practices for Clinical Programming
# Good: Descriptive comment for function purpose
# Create function to standardize adverse event terms and derive severity flags
# Good: Specific parameter expectations
# Function parameters: data (tibble), ae_term_var (column name), severity_var (column name)
# Good: Expected output format
# Returns: tibble with standardized AE terms and derived severity categories
# Good: Include clinical context
# Function should follow CDISC SDTM standards for adverse event processing
# Good: Specify error handling requirements
# Function should handle missing values and invalid severity codes gracefully
Advanced Copilot Techniques
# Use descriptive variable names to guide Copilot
create_sdtm_compliant_function <- function(raw_clinical_data,
adverse_event_term,
start_date,
reference_date) {
# Copilot will better understand the clinical context
# and suggest appropriate SDTM-compliant transformations
}
# Include expected business logic in comments
# Calculate study day where:
# - If event date >= reference date: study_day = event_date - ref_date + 1
# - If event date < reference date: study_day = event_date - ref_date
# - Handle missing dates appropriately
Iterative Development with Copilot
# Start with basic function structure
process_clinical_data <- function(data) {
# Step 1: Data validation
# Step 2: Standardize text fields
# Step 3: Derive calculated variables
# Step 4: Apply business rules
# Step 5: Return processed data
}
# Let Copilot fill in each step, then refine as needed
✅ Summary: Functions, Vectors & Iteration in Clinical Programming
Key Concepts from R4DS Integration
| Concept | Description | Clinical Application |
|---|---|---|
| Function Components | Arguments, body, environment | Standardized clinical derivations |
| Vector Types | Atomic vectors, lists, coercion | Proper data type handling |
| For Loops | Imperative iteration | Processing multiple studies/subjects |
| Functional Programming | map() family functions | Batch processing clinical datasets |
| Function Factories | Functions that create functions | Domain-specific validators |
SAS vs R Functions Comparison
| Aspect | SAS Macros | R Functions |
|---|---|---|
| Syntax | %macro name(params); ... %mend; | name <- function(params) { ... } |
| Scope | Global macro variables | Local function environment |
| Parameters | Macro parameters, referenced with & | Function arguments, matched by position or name, with optional defaults |
| Return Values | Dataset modification | Explicit return() or last expression |
| Error Handling | Limited options | stop(), warning(), try(), possibly() |
| Vectorization | Manual loops | Built-in vector operations |
| Iteration | Manual DATA step loops | map() family, for loops, while loops |
| Reusability | Macro calls | Function calls with piping |
| Debugging | %put statements | print(), cat(), RStudio debugger |
| Testing | Manual verification | Unit tests with testthat |
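The error-handling row is worth a concrete look. The sketch below signals conditions with `stop()` and `warning()` and recovers from them with `tryCatch()`, all from base R; `possibly()` from purrr wraps the same idea for use inside `map()`. `parse_dose` and `safe_parse_dose` are hypothetical helpers that expect a single value.

```r
parse_dose <- function(x) {
  if (!is.character(x)) stop("dose must be a character string")
  value <- suppressWarnings(as.numeric(x))
  if (is.na(value)) warning("could not parse dose: ", x)
  value
}

# Return NA instead of failing, and report what went wrong
safe_parse_dose <- function(x) {
  tryCatch(
    parse_dose(x),
    warning = function(w) {
      message("Warning caught: ", conditionMessage(w))
      NA_real_
    },
    error = function(e) {
      message("Error caught: ", conditionMessage(e))
      NA_real_
    }
  )
}

safe_parse_dose("10.5")  # 10.5
safe_parse_dose("high")  # NA, with a message
safe_parse_dose(10.5)    # NA, with a message
```

Whether a failed parse should become `NA` or halt the run is a study-level decision; the sketch only shows the mechanics.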
Best Practices for Clinical Programming
- Write functions when you copy code more than twice
- Use vectors effectively for efficient data processing
- Choose the right iteration method (for loops vs. map())
- Include error handling for robust clinical applications
- Validate inputs to prevent downstream issues
- Document functions clearly for regulatory compliance
- Use meaningful names that reflect clinical concepts
🎯 Next Steps
In the demo and exercise, you’ll practice:
- Functions (Ch. 19): Creating reusable clinical programming functions
- Vectors (Ch. 20): Understanding data types and efficient vector operations
- Iteration (Ch. 21): Using loops and functional programming for batch processing
- SAS Translation: Converting existing SAS macros to modern R functions
- Error Handling: Building robust functions for clinical data processing
- GitHub Copilot: Leveraging AI assistance for function development
This foundation will prepare you for advanced clinical programming workflows and SDTM/ADaM dataset creation.