Module 5 Theory — Functions, Vectors & Iteration

🎯 Learning Objectives

By the end of this module, you will:

Master function creation, arguments, and environments (R4DS Ch. 19)
Understand vector types, testing, and coercion in R (R4DS Ch. 20)
Learn iteration techniques with for loops and functional programming (R4DS Ch. 21)
Apply functions and iteration to clinical programming workflows
Translate SAS macros to R functions using modern R techniques
Use GitHub Copilot in RStudio to assist with function development and debugging

🔧 1. Functions (R4DS Chapter 19)

Functions are the fundamental building blocks of R programming. They reduce duplication, make code more readable, and confine each fix to one place, which cuts down on copy-and-paste errors.

When to Write a Function

Following the DRY principle (Don’t Repeat Yourself), consider writing a function whenever you’ve copied and pasted code more than twice.

# Instead of repeating this pattern:
df$a <- (df$a - min(df$a, na.rm = TRUE)) / (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / (max(df$b, na.rm = TRUE) - min(df$b, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))

# Write a function:
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)

Function Components

Every R function has three key components:

Arguments (formals()): the list of arguments that control how you call the function
Body (body()): the code inside the function
Environment (environment()): the data structure that determines how the function finds the values associated with names

# Clinical example: Age category function
create_age_category <- function(age) {
  case_when(
    is.na(age) ~ "Unknown",
    age < 18 ~ "Pediatric",
    age >= 18 & age < 65 ~ "Adult", 
    age >= 65 ~ "Elderly"
  )
}

# Examine function components
formals(create_age_category)
body(create_age_category)
environment(create_age_category)

Function Arguments

Matching Arguments by Position and Name

# Clinical example: BMI calculation with multiple argument types
calculate_bmi <- function(weight, height, unit = "metric", round_digits = 1) {
  if (unit == "metric") {
    bmi <- weight / (height / 100)^2
  } else if (unit == "imperial") {
    bmi <- (weight * 703) / height^2
  } else {
    stop("unit must be 'metric' or 'imperial'")
  }
  
  round(bmi, digits = round_digits)
}

# Different ways to call the function:
calculate_bmi(70, 175)                           # positional
calculate_bmi(weight = 70, height = 175)         # named
calculate_bmi(70, 175, unit = "metric")          # mixed
calculate_bmi(70, 175, round_digits = 2)         # skip middle argument

The `...` (dot-dot-dot) Argument

# Clinical example: Flexible summary function
clinical_summary <- function(data, ..., na.rm = TRUE) {
  data %>%
    summarise(
      across(c(...), list(
        mean = ~ mean(.x, na.rm = na.rm),
        sd = ~ sd(.x, na.rm = na.rm),
        min = ~ min(.x, na.rm = na.rm),
        max = ~ max(.x, na.rm = na.rm)
      ))
    )
}

# Usage
demographics %>%
  clinical_summary(AGE, WEIGHT, HEIGHT)
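
To see `...` in isolation, here is a smaller base-R sketch. The helper `describe_values()` and the weights are hypothetical and not part of the module's code; the helper simply forwards any extra arguments it receives to `mean()` and `sd()`.

```r
# Hypothetical helper: forwards extra arguments (e.g. na.rm) via `...`
describe_values <- function(x, ...) {
  c(mean = mean(x, ...), sd = sd(x, ...))
}

weights <- c(70, 65, NA, 80)   # made-up weights in kg

without_na_rm <- describe_values(weights)                # NA for both: na.rm defaults to FALSE
with_na_rm    <- describe_values(weights, na.rm = TRUE)  # na.rm reaches mean() and sd()

without_na_rm
with_na_rm
```

Because the wrapper never names `na.rm` itself, any argument that both `mean()` and `sd()` accept can be passed through without changing the wrapper.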

Return Values and Environment

Functions can return values explicitly with return() or implicitly (last expression):

# Explicit return (scalar inputs only: if() and || need length-1 conditions)
calculate_study_day <- function(event_date, ref_date) {
  if (is.na(event_date) || is.na(ref_date)) {
    return(NA_real_)
  }
  
  days <- as.numeric(event_date - ref_date)
  if (event_date >= ref_date) {
    return(days + 1)
  } else {
    return(days)
  }
}

# Implicit return (preferred when possible)
calculate_study_day_v2 <- function(event_date, ref_date) {
  case_when(
    is.na(event_date) | is.na(ref_date) ~ NA_real_,
    event_date >= ref_date ~ as.numeric(event_date - ref_date) + 1,
    TRUE ~ as.numeric(event_date - ref_date)
  )
}

2. Vectors (R4DS Chapter 20)

Vectors are the building blocks of R. Understanding them deeply will help you write better functions and avoid common errors.

Vector Basics

R has two types of vectors:

Atomic vectors (6 types): logical, integer, double, character, complex, raw
Lists (recursive vectors)

# Creating vectors for clinical data
logical_vec <- c(TRUE, FALSE, TRUE)     # Treatment response
integer_vec <- c(1L, 2L, 3L)           # Visit numbers  
double_vec <- c(1.5, 2.7, 3.9)         # Lab values
character_vec <- c("M", "F", "M")       # Gender

# Check types
typeof(logical_vec)
typeof(integer_vec)
typeof(double_vec) 
typeof(character_vec)
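
The examples above cover atomic vectors only. As a rough sketch of the second kind, a list, the block below stores one subject's record; the field names and values are invented for illustration.

```r
# A list can mix types and lengths; values are made up for illustration
subject <- list(
  USUBJID = "001-001",
  AGE     = 67L,
  SBP     = c(120, 135, 110)   # repeated measurements
)

typeof(subject)          # "list"
typeof(subject$AGE)      # "integer"
length(subject)          # 3 elements, regardless of their individual lengths

# `[` returns a smaller list; `[[` extracts the element itself
sbp_as_list   <- subject["SBP"]
sbp_as_vector <- subject[["SBP"]]
mean(sbp_as_vector)
```

The difference between `[` and `[[` matters most inside functions: `mean(subject["SBP"])` would fail to produce a number because it receives a list rather than a numeric vector.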

Important Vector Properties

# Clinical example with adverse events
ae_severity <- c("MILD", "MODERATE", "SEVERE", "MILD", "SEVERE")

# Length
length(ae_severity)

# Names
names(ae_severity) <- paste0("AE_", 1:5)
ae_severity

# Dimensions (for matrices/arrays)
lab_matrix <- matrix(c(120, 80, 135, 85, 110, 75), nrow = 3)
dim(lab_matrix)

Vector Testing and Coercion

# Testing vector types
test_values <- c(1, 2, 3, "4", 5)
is.character(test_values)    # TRUE - coerced to character
is.numeric(test_values)      # FALSE

# Explicit coercion
lab_values <- c("120", "85", "normal", "95")
as.numeric(lab_values)       # Warning for "normal"

# Safe coercion for clinical data
safe_as_numeric <- function(x) {
  suppressWarnings(as.numeric(x))
}

clean_lab_values <- safe_as_numeric(lab_values)
clean_lab_values                     # c(120, 85, NA, 95)
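
Suppressing the warning also hides which values failed to convert. One possible refinement, sketched below, reports the offending entries before returning; `as_numeric_report()` is a hypothetical name, not an existing function. The last lines illustrate R's implicit coercion order (logical, integer, double, character).

```r
# Hypothetical helper: convert to numeric and report what could not be converted
as_numeric_report <- function(x) {
  converted <- suppressWarnings(as.numeric(x))
  failed <- !is.na(x) & is.na(converted)   # non-missing input that became NA
  if (any(failed)) {
    message("Could not convert: ", paste(unique(x[failed]), collapse = ", "))
  }
  converted
}

lab_values <- c("120", "85", "normal", "95")
reported <- as_numeric_report(lab_values)
reported

# Implicit coercion moves toward the more general type
typeof(c(TRUE, 1L))        # "integer"
typeof(c(1L, 2.5))         # "double"
typeof(c(2.5, "high"))     # "character"
sum(c(TRUE, FALSE, TRUE))  # logicals count as 1/0, so this is 2
```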

Subsetting Vectors

subject_ids <- c("001-001", "001-002", "001-003", "001-004", "001-005")

# Positive integers select elements
subject_ids[c(1, 3, 5)]

# Negative integers exclude elements  
subject_ids[-c(2, 4)]

# Logical vectors select TRUE elements
ages <- c(25, 67, 45, 72, 34)
elderly <- ages >= 65
subject_ids[elderly]

# Named vectors
vital_signs <- c(sbp = 120, dbp = 80, hr = 72)
vital_signs["sbp"]
vital_signs[c("sbp", "dbp")]

🔄 3. SAS Macro to R Function Translation

Understanding vectors helps us write better functions for clinical programming.

SAS Macro Example: Calculate Study Day

%macro calc_study_day(indata=, outdata=, event_date=, ref_date=);
  data &outdata;
    set &indata;
    if not missing(&event_date) and not missing(&ref_date) then do;
      if &event_date >= &ref_date then
        study_day = &event_date - &ref_date + 1;
      else
        study_day = &event_date - &ref_date;
    end;
  run;
%mend calc_study_day;

R Function Translation (Vectorized)

calc_study_day <- function(data, event_date, ref_date) {
  data %>%
    mutate(
      study_day = case_when(
        is.na({{ event_date }}) | is.na({{ ref_date }}) ~ NA_real_,
        {{ event_date }} >= {{ ref_date }} ~ as.numeric({{ event_date }} - {{ ref_date }}) + 1,
        {{ event_date }} < {{ ref_date }} ~ as.numeric({{ event_date }} - {{ ref_date }})
      )
    )
}

# Usage with proper vector handling
ae_data <- ae_data %>%
  calc_study_day(event_date = AESTDT, ref_date = RFSTDT)
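
Since `ae_data` is not defined in this section, the study-day rule can be checked on a few toy dates. The sketch below is a base-R restatement of the same rule, not the function above; `study_day()` and the dates are assumptions made for illustration.

```r
# Base-R version of the same rule, applied to whole vectors at once
study_day <- function(event_date, ref_date) {
  days <- as.numeric(event_date - ref_date)
  ifelse(days >= 0, days + 1, days)   # NA dates propagate as NA
}

ref   <- as.Date("2024-01-10")
event <- as.Date(c("2024-01-10", "2024-01-12", "2024-01-09", NA))

days_out <- study_day(event, ref)
days_out   # 1 3 -1 NA
```

Note that there is no study day 0: the reference date itself is day 1 and the day before it is day -1, which is what both the SAS macro and the R translation encode.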

🔄 4. Iteration (R4DS Chapter 21)

Iteration allows you to do the same thing to multiple inputs. R provides two main paradigms for iteration:

Imperative programming: for loops and while loops
Functional programming: map() functions from purrr

For Loops

# Basic for loop structure
clinical_vars <- c("AGE", "WEIGHT", "HEIGHT")
results <- vector("double", length(clinical_vars))  # Pre-allocate

for (i in seq_along(clinical_vars)) {
  # Process each clinical variable
  results[[i]] <- mean(demographics[[clinical_vars[[i]]]], na.rm = TRUE)
}
names(results) <- clinical_vars
results

Modifying Existing Objects

# Standardize multiple character columns
clinical_data <- tibble(
  USUBJID = c("001-001", "001-002"),
  AEDECOD = c("  headache ", "NAUSEA  "),
  CMDECOD = c("aspirin  ", "  IBUPROFEN")
)

char_cols <- c("AEDECOD", "CMDECOD")

for (col in char_cols) {
  clinical_data[[col]] <- str_trim(str_to_upper(clinical_data[[col]]))
}
clinical_data

Unknown Output Length

# Simulate adverse event occurrences (unknown number of events)
simulate_ae_times <- function() {
  n_events <- rpois(1, lambda = 2)  # Random number of events
  if (n_events == 0) return(numeric())
  
  cumsum(rexp(n_events, rate = 0.1))  # Event times
}

# Collect results using a list
ae_simulations <- vector("list", 100)
for (i in 1:100) {
  ae_simulations[[i]] <- simulate_ae_times()
}

# Convert to useful format
ae_data_simulated <- tibble(
  simulation = rep(1:100, lengths(ae_simulations)),
  event_time = unlist(ae_simulations)
)

While Loops

# Dose escalation algorithm (common in clinical trials)
dose_escalation <- function(starting_dose = 10, max_dose = 100) {
  current_dose <- starting_dose
  doses <- current_dose
  
  while (current_dose < max_dose) {
    # Simulate safety assessment (simplified)
    safety_ok <- rbinom(1, 1, prob = 0.8)  # 80% chance of safety
    
    if (safety_ok) {
      current_dose <- current_dose * 1.5  # Escalate by 50%
    } else {
      break  # Stop escalation if safety issue
    }
    
    doses <- c(doses, current_dose)
  }
  
  doses[doses <= max_dose]  # Return valid doses only
}

dose_escalation()

For Loops vs. Functionals

The functional programming approach is often cleaner and less error-prone:

# For loop approach
means_for <- rep(NA_real_, ncol(demographics))  # NA marks non-numeric columns
names(means_for) <- names(demographics)
for (i in seq_along(demographics)) {
  if (is.numeric(demographics[[i]])) {
    means_for[[i]] <- mean(demographics[[i]], na.rm = TRUE)
  }
}

# Functional approach with map
library(purrr)
means_functional <- demographics %>%
  select(where(is.numeric)) %>%
  map_dbl(mean, na.rm = TRUE)
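
`demographics` is not defined in this section, so the sketch below builds a toy stand-in (invented values) and shows the same select-then-summarise pattern with base R's `Filter()` and `vapply()`. This is offered only as a point of comparison that runs without extra packages.

```r
# Toy stand-in for `demographics`; column values are invented
demographics <- data.frame(
  USUBJID = c("001-001", "001-002", "001-003"),
  AGE     = c(25, 67, NA),
  WEIGHT  = c(70, 82, 64)
)

numeric_cols <- Filter(is.numeric, demographics)   # keep numeric columns only
means_base <- vapply(numeric_cols, mean, numeric(1), na.rm = TRUE)
means_base
```

Like `map_dbl()`, `vapply()` declares the expected result type (`numeric(1)`) and fails loudly if a column returns something else, which is one reason functionals tend to be less error-prone than hand-written loops.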

🎯 5. Clinical Programming Function Examples

Derive Elderly Flag (Vectorized)

derive_elderly_flag <- function(data, age_var, cutoff = 65) {
  data %>%
    mutate(
      ELDERLY = case_when(
        is.na({{ age_var }}) ~ "U",
        {{ age_var }} >= cutoff ~ "Y",
        {{ age_var }} < cutoff ~ "N"
      )
    )
}

Process Multiple Studies with Iteration

# Function to process a single study
process_study <- function(study_data) {
  study_data %>%
    derive_elderly_flag(AGE) %>%
    mutate(
      BMI = WEIGHT / (HEIGHT / 100)^2,
      BMI_CAT = case_when(
        BMI < 18.5 ~ "Underweight",
        BMI >= 18.5 & BMI < 25 ~ "Normal",
        BMI >= 25 & BMI < 30 ~ "Overweight",
        BMI >= 30 ~ "Obese",
        TRUE ~ "Unknown"
      )
    )
}

# Apply to multiple studies using map
study_files <- c("study001.csv", "study002.csv", "study003.csv")

all_studies <- study_files %>%
  map(read_csv) %>%
  map(process_study) %>%
  set_names(c("Study 001", "Study 002", "Study 003"))

Batch Processing with Error Handling

# Safe wrapper: a failure while reading or processing returns an empty tibble
read_and_process <- function(path) {
  read_csv(path, show_col_types = FALSE) %>%
    process_study()
}
safe_read_and_process <- possibly(read_and_process, otherwise = tibble())

# Process multiple files with error handling
results <- study_files %>%
  map(~ {
    cat("Processing", .x, "...\n")
    safe_read_and_process(.x)
  })

# Check which files processed successfully
successful <- map_lgl(results, ~ nrow(.x) > 0)
cat("Successfully processed:", sum(successful), "out of", length(study_files), "files\n")
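
To make the idea behind `possibly()` concrete, here is a rough base-R sketch of a wrapper in the same spirit, built on `tryCatch()`. It is not purrr's implementation; `with_fallback()` and `parse_age()` are hypothetical names used only for this illustration.

```r
# Hypothetical wrapper in the spirit of purrr::possibly(); not purrr's implementation
with_fallback <- function(f, otherwise) {
  function(...) {
    tryCatch(
      f(...),
      error = function(e) {
        message("Recovered from error: ", conditionMessage(e))
        otherwise   # value returned in place of the error
      }
    )
  }
}

# A function that fails on some input
parse_age <- function(x) {
  if (!is.numeric(x)) stop("age must be numeric")
  x
}

safe_parse_age <- with_fallback(parse_age, otherwise = NA_real_)

ok_result  <- safe_parse_age(42)        # 42
bad_result <- safe_parse_age("forty")   # NA, with a message instead of an error
```

In a batch job the fallback value lets the remaining files run, but the failures should still be logged and reviewed rather than silently dropped.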

🛡️ 6. Advanced Function Features

Input Validation and Error Handling

derive_bmi_category <- function(data, weight_var, height_var, unit = "metric") {
  # Input validation
  if (!unit %in% c("metric", "imperial")) {
    stop("unit must be either 'metric' or 'imperial'")
  }
  
  if (!is.data.frame(data)) {
    stop("data must be a data frame")
  }
  
  data %>%
    mutate(
      bmi = if (unit == "metric") {
        {{ weight_var }} / ({{ height_var }} / 100)^2
      } else {
        ({{ weight_var }} * 703) / {{ height_var }}^2
      },
      bmi_category = case_when(
        is.na(bmi) ~ "Unknown",
        bmi < 18.5 ~ "Underweight",
        bmi >= 18.5 & bmi < 25 ~ "Normal",
        bmi >= 25 & bmi < 30 ~ "Overweight",
        bmi >= 30 ~ "Obese"
      )
    )
}

Function Factories

Functions that create other functions are called function factories:

# Create functions for different CDISC domains
create_domain_checker <- function(domain_prefix) {
  function(subject_id) {
    str_detect(subject_id, paste0("^", domain_prefix, "-"))
  }
}

# Create specific checkers
is_ae_subject <- create_domain_checker("AE")
is_dm_subject <- create_domain_checker("DM")
is_vs_subject <- create_domain_checker("VS")

# Usage
subject_ids <- c("AE-001", "DM-002", "VS-003", "AE-004")
is_ae_subject(subject_ids)  # TRUE FALSE FALSE TRUE
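
A factory works because each returned function keeps a reference to the environment in which it was created. The sketch below makes that visible with a base-R variant of the pattern; `create_prefix_checker()` is a hypothetical name, and `startsWith()` stands in for `str_detect()` so the block runs without stringr.

```r
# Same factory pattern using base R's startsWith() instead of stringr
create_prefix_checker <- function(prefix) {
  force(prefix)  # evaluate now so the closure captures the intended value
  function(ids) startsWith(ids, paste0(prefix, "-"))
}

is_ae_id <- create_prefix_checker("AE")
is_dm_id <- create_prefix_checker("DM")

# Each closure carries its own `prefix` in its enclosing environment
environment(is_ae_id)$prefix   # "AE"
environment(is_dm_id)$prefix   # "DM"

ae_hits <- is_ae_id(c("AE-001", "DM-002"))
ae_hits   # TRUE FALSE
```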

Function Operators

Functions that take functions as input and return functions as output:

# Add clinical data validation to any function
add_validation <- function(f) {
  function(data, ...) {
    if (!is.data.frame(data)) {
      stop("Input must be a data frame")
    }
    if (nrow(data) == 0) {
      warning("Input data frame is empty")
      return(data)
    }
    
    result <- f(data, ...)
    
    if (!is.data.frame(result)) {
      stop("Function must return a data frame")
    }
    
    result
  }
}

# Apply validation wrapper
safe_derive_elderly <- add_validation(derive_elderly_flag)

# Now the function includes validation
# safe_derive_elderly(demographics, AGE)

Functions with Multiple Return Options

analyze_clinical_data <- function(data, return_type = "summary", group_var = NULL) {
  # Capture group_var without evaluating it, so a bare column name such as ARM
  # can be tested for NULL safely
  group_quo <- rlang::enquo(group_var)
  if (!rlang::quo_is_null(group_quo)) {
    data <- data %>% group_by(!!group_quo)
  }

  # Compute every statistic here, while AGE is still available
  detailed_analysis <- data %>%
    summarise(
      n_subjects = n_distinct(USUBJID),
      mean_age = mean(AGE, na.rm = TRUE),
      median_age = median(AGE, na.rm = TRUE),
      age_sd = sd(AGE, na.rm = TRUE),
      age_range = paste(min(AGE, na.rm = TRUE), max(AGE, na.rm = TRUE), sep = "-"),
      .groups = "drop"
    )

  switch(return_type,
    "summary" = detailed_analysis %>% select(-age_sd, -age_range),
    "detailed" = detailed_analysis,
    "count_only" = detailed_analysis$n_subjects[[1]],
    "ages_only" = data$AGE,
    stop("return_type must be 'summary', 'detailed', 'count_only', or 'ages_only'")
  )
}

🎨 7. The `map()` Family (Functional Programming)

The purrr package provides a consistent set of tools for working with functions and vectors, implementing functional programming concepts.

Basic `map()` Functions

library(purrr)

# map() always returns a list
clinical_vars <- list(
  study1_ages = c(25, 45, 67, 34, 52),
  study2_ages = c(28, 49, 63, 71, 39),
  study3_ages = c(44, 55, 29, 67, 42)
)

# Calculate mean age for each study
clinical_vars %>% 
  map(mean, na.rm = TRUE)

# map_dbl() returns a numeric vector
clinical_vars %>% 
  map_dbl(mean, na.rm = TRUE)

# map_chr() returns a character vector  
clinical_vars %>% 
  map_chr(~ paste("Mean age:", round(mean(.x), 1)))

Anonymous Functions and Shortcuts

# Three ways to write the same thing:

# 1. Anonymous function (traditional)
map(clinical_vars, function(x) mean(x, na.rm = TRUE))

# 2. Formula shortcut (purrr style)
map(clinical_vars, ~ mean(.x, na.rm = TRUE))

# 3. Named function, with extra arguments passed through `...`
map(clinical_vars, mean, na.rm = TRUE)

Processing Clinical Data with `map()`

# Apply standardization to multiple character columns
clinical_data <- tibble(
  USUBJID = c("001-001", "001-002", "001-003"),
  AEDECOD = c("  headache ", "NAUSEA  ", "fatigue"),
  CMDECOD = c("aspirin  ", "  IBUPROFEN", "acetaminophen  ")
)

# Standardize all character columns except USUBJID
clinical_data %>%
  mutate(
    across(c(AEDECOD, CMDECOD), ~ str_trim(str_to_upper(.x)))
  )

# Alternative using map approach
char_cols <- clinical_data %>% 
  select(-USUBJID) %>% 
  select(where(is.character))

char_cols %>%
  map(~ str_trim(str_to_upper(.x)))

`map2()` and `pmap()` for Multiple Inputs

# map2() for two inputs
weights <- c(70, 65, 80, 75)
heights <- c(175, 160, 185, 180)

map2_dbl(weights, heights, ~ .x / (.y / 100)^2)  # Calculate BMI

# pmap() for multiple inputs
subjects <- list(
  weight = c(70, 65, 80, 75),
  height = c(175, 160, 185, 180),
  age = c(45, 62, 38, 55)
)

# Function that uses all three inputs
health_score <- function(weight, height, age) {
  bmi <- weight / (height / 100)^2
  age_factor <- ifelse(age > 50, 0.9, 1.0)
  round((25 / bmi) * age_factor, 2)
}

pmap_dbl(subjects, health_score)

Walk Functions (for Side Effects)

# Use walk() when you want side effects, not return values
study_summaries <- list(
  study1 = tibble(USUBJID = 1:5, AGE = c(45, 62, 38, 55, 49)),
  study2 = tibble(USUBJID = 1:4, AGE = c(52, 34, 67, 41)),
  study3 = tibble(USUBJID = 1:6, AGE = c(29, 58, 46, 71, 33, 54))
)

# Create a summary report for each study
create_study_report <- function(data, study_name) {
  summary_stats <- data %>%
    summarise(
      n = n(),
      mean_age = round(mean(AGE), 1),
      median_age = median(AGE)
    )
  
  cat("Study:", study_name, "\n")
  cat("N:", summary_stats$n, "\n")
  cat("Mean age:", summary_stats$mean_age, "\n")
  cat("Median age:", summary_stats$median_age, "\n\n")
}

# Use walk2() to iterate and produce side effects
walk2(study_summaries, names(study_summaries), create_study_report)

Predicate Functions

# keep() and discard() for filtering based on a condition
lab_values <- list(
  glucose = c(90, 95, 110, 85),
  creatinine = c(1.1, 0.9, 1.3, 1.0),
  invalid_test = c(NA, NA, NA, NA),
  hemoglobin = c(13.5, 12.1, 14.2, 13.8)
)

# Keep only non-missing lab tests
valid_labs <- lab_values %>%
  keep(~ !all(is.na(.x)))

# Find lab tests with any abnormal values (example thresholds)
abnormal_labs <- lab_values %>%
  keep(~ any(.x > 100, na.rm = TRUE))  # Simplified example

Advanced Iteration Patterns

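# reduce() collapses a vector to a single value; accumulate() keeps each step
daily_ae_counts <- c(2, 1, 3, 0, 2, 1, 4)

# Total AE count (a single number: 13)
total_aes <- reduce(daily_ae_counts, `+`)

# Cumulative AE count by day (2, 3, 6, 6, 8, 9, 13)
cumulative_aes <- accumulate(daily_ae_counts, `+`)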

# More complex reduction: combine multiple study datasets
study_datasets <- list(
  tibble(USUBJID = c("A001", "A002"), ARM = "Treatment"),
  tibble(USUBJID = c("B001", "B002"), ARM = "Placebo"),
  tibble(USUBJID = c("C001", "C002"), ARM = "Treatment")
)

# Combine all studies into one dataset
combined_data <- reduce(study_datasets, bind_rows)

🤖 8. GitHub Copilot in RStudio for Functions

GitHub Copilot can significantly accelerate function development in clinical programming by suggesting code based on your comments and context.

Effective Copilot Prompts for Functions

Comment Prompt	Expected Copilot Suggestion
`# Create function to derive safety flag based on lab values`	Function template with safety logic
`# Function to calculate percent change from baseline`	Mathematical formula implementation
`# Translate SAS macro for lab normal ranges to R function`	R function with lab reference logic
`# Function with error handling for missing clinical data`	`tryCatch()` blocks and validation
`# Vectorized function to process multiple subjects`	Function using map() or vectorized operations
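
As a point of comparison for the second prompt in the table, the block below is a hand-written sketch of a percent-change function, not actual Copilot output. The name `pct_change_from_baseline()` and the rule of returning `NA` for a zero or missing baseline are assumptions; a real study would take that rule from its analysis plan.

```r
# Hand-written sketch of percent change from baseline (not Copilot output)
pct_change_from_baseline <- function(value, baseline) {
  # Undefined when baseline is 0 or missing, so return NA in those cases
  ifelse(is.na(baseline) | baseline == 0,
         NA_real_,
         (value - baseline) / baseline * 100)
}

pct <- pct_change_from_baseline(
  value    = c(110, 90, 50, 5),
  baseline = c(100, 100, NA, 0)
)
pct   # 10 -10 NA NA
```

Having a reference like this on hand makes it easier to review a suggestion: edge cases such as a zero baseline are exactly where generated code tends to need checking.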

Copilot Best Practices for Clinical Programming

# Good: Descriptive comment for function purpose
# Create function to standardize adverse event terms and derive severity flags

# Good: Specific parameter expectations  
# Function parameters: data (tibble), ae_term_var (column name), severity_var (column name)

# Good: Expected output format
# Returns: tibble with standardized AE terms and derived severity categories

# Good: Include clinical context
# Function should follow CDISC SDTM standards for adverse event processing

# Good: Specify error handling requirements
# Function should handle missing values and invalid severity codes gracefully

Advanced Copilot Techniques

# Use descriptive variable names to guide Copilot
create_sdtm_compliant_function <- function(raw_clinical_data, 
                                          adverse_event_term, 
                                          start_date, 
                                          reference_date) {
  # Copilot will better understand the clinical context
  # and suggest appropriate SDTM-compliant transformations
}

# Include expected business logic in comments
# Calculate study day where: 
# - If event date >= reference date: study_day = event_date - ref_date + 1
# - If event date < reference date: study_day = event_date - ref_date  
# - Handle missing dates appropriately

Iterative Development with Copilot

# Start with basic function structure
process_clinical_data <- function(data) {
  # Step 1: Data validation
  
  # Step 2: Standardize text fields
  
  # Step 3: Derive calculated variables
  
  # Step 4: Apply business rules
  
  # Step 5: Return processed data
}

# Let Copilot fill in each step, then refine as needed
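
For orientation, here is one way the five steps might look once filled in and reviewed. This is a base-R sketch under stated assumptions: the column names (`USUBJID`, `AETERM`, `AGE`), the age cutoff of 65, and the rule of dropping records without a subject ID are all invented for illustration.

```r
process_clinical_data <- function(data) {
  # Step 1: Data validation (assumed required columns)
  stopifnot(is.data.frame(data),
            all(c("USUBJID", "AETERM", "AGE") %in% names(data)))

  # Step 2: Standardize text fields
  data$AETERM <- toupper(trimws(data$AETERM))

  # Step 3: Derive calculated variables
  data$ELDERLY <- ifelse(is.na(data$AGE), "U",
                         ifelse(data$AGE >= 65, "Y", "N"))

  # Step 4: Apply business rules (assumed rule: drop records without a subject ID)
  data <- data[!is.na(data$USUBJID), , drop = FALSE]

  # Step 5: Return processed data
  data
}

# Toy input with made-up values
raw <- data.frame(
  USUBJID = c("001-001", NA, "001-003"),
  AETERM  = c("  headache ", "Nausea", "fatigue  "),
  AGE     = c(70, 40, NA)
)
processed <- process_clinical_data(raw)
processed
```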

✅ Summary: Functions, Vectors & Iteration in Clinical Programming

Key Concepts from R4DS Integration

Concept	Description	Clinical Application
Function Components	Arguments, body, environment	Standardized clinical derivations
Vector Types	Atomic vectors, lists, coercion	Proper data type handling
For Loops	Imperative iteration	Processing multiple studies/subjects
Functional Programming	`map()` family functions	Batch processing clinical datasets
Function Factories	Functions that create functions	Domain-specific validators

SAS vs R Functions Comparison

Aspect	SAS Macros	R Functions
Syntax	`%macro name(params); ... %mend;`	`name <- function(params) { ... }`
Scope	Global macro variables	Local function environment
Parameters	Macro parameters with `&`	Function arguments referenced by name (dynamically typed)
Return Values	Dataset modification	Explicit `return()` or last expression
Error Handling	Limited options	`stop()`, `warning()`, `try()`, `possibly()`
Vectorization	Manual loops	Built-in vector operations
Iteration	Manual DATA step loops	`map()` family, `for` loops, `while` loops
Reusability	Macro calls	Function calls with piping
Debugging	`%put` statements	`print()`, `cat()`, RStudio debugger
Testing	Manual verification	Unit tests with `testthat`

Best Practices for Clinical Programming

Write functions when you copy code more than twice
Use vectors effectively for efficient data processing
Choose the right iteration method (for loops vs. map())
Include error handling for robust clinical applications
Validate inputs to prevent downstream issues
Document functions clearly for regulatory compliance
Use meaningful names that reflect clinical concepts
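
Several of these practices can be combined in a single small function. The sketch below documents its arguments with roxygen2-style comments, validates inputs up front, and uses a descriptive name; `flag_elderly()` is a hypothetical helper written for this illustration.

```r
#' Derive an elderly flag from age
#'
#' @param age Numeric vector of ages in years.
#' @param cutoff Single number; ages at or above it are flagged "Y".
#' @return Character vector of "Y", "N", or "U" (age missing).
flag_elderly <- function(age, cutoff = 65) {
  # Validate inputs before doing any work
  if (!is.numeric(age)) stop("age must be numeric")
  if (!is.numeric(cutoff) || length(cutoff) != 1) {
    stop("cutoff must be a single number")
  }

  ifelse(is.na(age), "U", ifelse(age >= cutoff, "Y", "N"))
}

flags <- flag_elderly(c(25, 65, NA))
flags   # "N" "Y" "U"
```

The `#'` lines are ordinary comments to R itself; they only become help pages when the function lives in a package processed by roxygen2.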

🎯 Next Steps

In the demo and exercise, you’ll practice:

Functions (Ch. 19): Creating reusable clinical programming functions
Vectors (Ch. 20): Understanding data types and efficient vector operations
Iteration (Ch. 21): Using loops and functional programming for batch processing
SAS Translation: Converting existing SAS macros to modern R functions
Error Handling: Building robust functions for clinical data processing
GitHub Copilot: Leveraging AI assistance for function development

This foundation will prepare you for advanced clinical programming workflows and SDTM/ADaM dataset creation.