Module 5 Theory — Functions, Vectors & Iteration

🔧 Module 5 — Functions, Vectors & Iteration

🎯 Learning Objectives

By the end of this module, you will:

  • Master function creation, arguments, and environments (R4DS Ch. 19)
  • Understand vector types, testing, and coercion in R (R4DS Ch. 20)
  • Learn iteration techniques with for loops and functional programming (R4DS Ch. 21)
  • Apply functions and iteration to clinical programming workflows
  • Translate SAS macros to R functions using modern R techniques
  • Use GitHub Copilot in RStudio to assist with function development and debugging

🔧 1. Functions (R4DS Chapter 19)

Functions are the fundamental building blocks of R programming. They reduce duplication, make code more readable, and help catch errors.

When to Write a Function

Following the DRY principle (Don’t Repeat Yourself), consider writing a function whenever you’ve copied and pasted code more than twice.

# Instead of repeating this pattern:
df$a <- (df$a - min(df$a, na.rm = TRUE)) / (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / (max(df$b, na.rm = TRUE) - min(df$b, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))

# Write a function:
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
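
As a quick check of what rescale01() does, the sketch below applies an equivalent version to a small made-up vector (the values are illustrative, not study data). The minimum maps to 0, the maximum to 1, and missing values stay missing.

```r
# Same logic as rescale01() above, written with range() to avoid repeating min()/max()
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)  # c(min, max), ignoring NA
  (x - rng[1]) / (rng[2] - rng[1])
}

lab_example <- c(10, 15, NA, 20)  # illustrative values only
rescaled <- rescale01(lab_example)
print(rescaled)                   # 0, 0.5, NA, 1
```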

Function Components

Every R function has three key components:

  1. Arguments (formals()): the list of arguments that control how you call the function
  2. Body (body()): the code inside the function
  3. Environment (environment()): the data structure that determines how the function finds the values associated with names

# Clinical example: Age category function (case_when() comes from dplyr)
library(dplyr)

create_age_category <- function(age) {
  case_when(
    is.na(age) ~ "Unknown",
    age < 18 ~ "Pediatric",
    age >= 18 & age < 65 ~ "Adult", 
    age >= 65 ~ "Elderly"
  )
}

# Examine function components
formals(create_age_category)
body(create_age_category)
environment(create_age_category)
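
To see what these three accessors return, here is a minimal base R sketch on a small helper. The function add_days() is hypothetical and exists only for this illustration.

```r
# Hypothetical helper used only to inspect the three components
add_days <- function(date, days = 1) {
  date + days
}

arg_names <- names(formals(add_days))  # "date" "days"
default_days <- formals(add_days)$days # default value of `days`: 1
body(add_days)                         # the code between the braces
environment(add_days)                  # the environment where the function was defined
```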

Function Arguments

Matching Arguments by Position and Name

# Clinical example: BMI calculation with multiple argument types
calculate_bmi <- function(weight, height, unit = "metric", round_digits = 1) {
  if (unit == "metric") {
    bmi <- weight / (height / 100)^2
  } else if (unit == "imperial") {
    bmi <- (weight * 703) / height^2
  } else {
    stop("unit must be 'metric' or 'imperial'")
  }
  
  round(bmi, digits = round_digits)
}

# Different ways to call the function:
calculate_bmi(70, 175)                           # positional
calculate_bmi(weight = 70, height = 175)         # named
calculate_bmi(70, 175, unit = "metric")          # mixed
calculate_bmi(70, 175, round_digits = 2)         # skip middle argument
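
One possible variation, sketched here under the name calculate_bmi2() (a hypothetical name, not part of the module), lists the allowed units in the default and lets base R's match.arg() do the validation. The example weights and heights are made up.

```r
calculate_bmi2 <- function(weight, height,
                           unit = c("metric", "imperial"),
                           round_digits = 1) {
  # match.arg() picks the first choice by default and errors on anything not listed
  unit <- match.arg(unit)
  bmi <- switch(unit,
    metric = weight / (height / 100)^2,
    imperial = (weight * 703) / height^2
  )
  round(bmi, digits = round_digits)
}

bmi_metric <- calculate_bmi2(70, 175)                       # 22.9
bmi_imperial <- calculate_bmi2(154, 69, unit = "imperial")  # 22.7
```

Note that match.arg() also accepts unambiguous partial matches such as "met", which may or may not be desirable in validated code.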

The ... (dot-dot-dot) Argument

# Clinical example: Flexible summary function
clinical_summary <- function(data, ..., na.rm = TRUE) {
  data %>%
    summarise(
      across(c(...), list(
        mean = ~ mean(.x, na.rm = na.rm),
        sd = ~ sd(.x, na.rm = na.rm),
        min = ~ min(.x, na.rm = na.rm),
        max = ~ max(.x, na.rm = na.rm)
      ))
    )
}

# Usage
demographics %>%
  clinical_summary(AGE, WEIGHT, HEIGHT)
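
The ... argument can also forward options to other functions. The base R sketch below uses a hypothetical helper, describe_values(), with made-up blood pressure values to show na.rm travelling through ... to mean() and sd().

```r
describe_values <- function(x, ...) {
  # Whatever the caller supplies in ... is handed on unchanged
  c(mean = mean(x, ...), sd = sd(x, ...))
}

sbp <- c(120, 130, NA, 110)                  # illustrative values
stats_default <- describe_values(sbp)        # both NA, because na.rm defaults to FALSE
stats <- describe_values(sbp, na.rm = TRUE)  # mean 120, sd 10
```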

Return Values and Environment

Functions can return values explicitly with return() or implicitly (last expression):

# Explicit return (scalar inputs only: || and if () need length-one conditions)
calculate_study_day <- function(event_date, ref_date) {
  if (is.na(event_date) || is.na(ref_date)) {
    return(NA_real_)
  }
  
  days <- as.numeric(event_date - ref_date)
  if (event_date >= ref_date) {
    return(days + 1)
  } else {
    return(days)
  }
}

# Implicit return (preferred when possible)
calculate_study_day_v2 <- function(event_date, ref_date) {
  case_when(
    is.na(event_date) | is.na(ref_date) ~ NA_real_,
    event_date >= ref_date ~ as.numeric(event_date - ref_date) + 1,
    TRUE ~ as.numeric(event_date - ref_date)
  )
}
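
The study day rule can be checked on a few dates without any packages. This is a base R sketch with made-up dates; study_day() is a hypothetical name chosen for the illustration.

```r
study_day <- function(event_date, ref_date) {
  days <- as.numeric(event_date - ref_date)
  # There is no day 0: dates on or after the reference date are shifted by one
  ifelse(days >= 0, days + 1, days)
}

ref <- as.Date("2024-01-10")
events <- as.Date(c("2024-01-10", "2024-01-09", "2024-01-15", NA))
study_days <- study_day(events, ref)  # 1, -1, 6, NA
```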

2. Vectors (R4DS Chapter 20)

Vectors are the building blocks of R. Understanding them deeply will help you write better functions and avoid common errors.

Vector Basics

R has two types of vectors:

  1. Atomic vectors (6 types): logical, integer, double, character, complex, raw
  2. Lists (recursive vectors)

# Creating vectors for clinical data
logical_vec <- c(TRUE, FALSE, TRUE)     # Treatment response
integer_vec <- c(1L, 2L, 3L)           # Visit numbers  
double_vec <- c(1.5, 2.7, 3.9)         # Lab values
character_vec <- c("M", "F", "M")       # Gender

# Check types
typeof(logical_vec)
typeof(integer_vec)
typeof(double_vec) 
typeof(character_vec)

Important Vector Properties

# Clinical example with adverse events
ae_severity <- c("MILD", "MODERATE", "SEVERE", "MILD", "SEVERE")

# Length
length(ae_severity)

# Names
names(ae_severity) <- paste0("AE_", 1:5)
ae_severity

# Dimensions (for matrices/arrays)
lab_matrix <- matrix(c(120, 80, 135, 85, 110, 75), nrow = 3)
dim(lab_matrix)

Vector Testing and Coercion

# Testing vector types
test_values <- c(1, 2, 3, "4", 5)
is.character(test_values)    # TRUE - coerced to character
is.numeric(test_values)      # FALSE

# Explicit coercion
lab_values <- c("120", "85", "normal", "95")
as.numeric(lab_values)       # Warning for "normal"

# Safe coercion for clinical data
safe_as_numeric <- function(x) {
  suppressWarnings(as.numeric(x))
}

clean_lab_values <- safe_as_numeric(lab_values)
clean_lab_values                     # c(120, 85, NA, 95)
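
Suppressing the warning also hides which values failed to convert. One possible refinement, sketched below with a hypothetical helper coerce_with_report() and made-up values, returns the offending entries alongside the numeric result so they can be reviewed.

```r
coerce_with_report <- function(x) {
  values <- suppressWarnings(as.numeric(x))
  # Entries that were present in the input but became NA after coercion
  failed <- !is.na(x) & is.na(values)
  list(values = values, failed = x[failed])
}

raw_labs <- c("120", "85", "normal", NA, "95")  # illustrative values
checked <- coerce_with_report(raw_labs)
checked$values  # 120, 85, NA, NA, 95
checked$failed  # "normal"
```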

Subsetting Vectors

subject_ids <- c("001-001", "001-002", "001-003", "001-004", "001-005")

# Positive integers select elements
subject_ids[c(1, 3, 5)]

# Negative integers exclude elements  
subject_ids[-c(2, 4)]

# Logical vectors select TRUE elements
ages <- c(25, 67, 45, 72, 34)
elderly <- ages >= 65
subject_ids[elderly]

# Named vectors
vital_signs <- c(sbp = 120, dbp = 80, hr = 72)
vital_signs["sbp"]
vital_signs[c("sbp", "dbp")]
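
Two subsetting details tend to matter with clinical data: a missing value in a logical index produces an NA element, and [[ ]] drops names where [ ] keeps them. A small base R sketch with made-up values:

```r
ids <- c("001-001", "001-002", "001-003")
ages_with_na <- c(25, NA, 72)  # age missing for the second subject

with_na <- ids[ages_with_na >= 65]            # NA, "001-003": the missing age yields NA
without_na <- ids[which(ages_with_na >= 65)]  # "001-003": which() keeps only TRUE positions

vitals <- c(sbp = 120, dbp = 80, hr = 72)
vitals["sbp"]    # named vector of length 1
vitals[["sbp"]]  # bare value 120, name dropped
```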

🔄 3. SAS Macro to R Function Translation

Understanding vectors helps us write better functions for clinical programming.

SAS Macro Example: Calculate Study Day

%macro calc_study_day(indata=, outdata=, event_date=, ref_date=);
  data &outdata;
    set &indata;
    if not missing(&event_date) and not missing(&ref_date) then do;
      if &event_date >= &ref_date then
        study_day = &event_date - &ref_date + 1;
      else
        study_day = &event_date - &ref_date;
    end;
  run;
%mend calc_study_day;

R Function Translation (Vectorized)

calc_study_day <- function(data, event_date, ref_date) {
  data %>%
    mutate(
      study_day = case_when(
        is.na({{ event_date }}) | is.na({{ ref_date }}) ~ NA_real_,
        {{ event_date }} >= {{ ref_date }} ~ as.numeric({{ event_date }} - {{ ref_date }}) + 1,
        {{ event_date }} < {{ ref_date }} ~ as.numeric({{ event_date }} - {{ ref_date }})
      )
    )
}

# Usage with proper vector handling
ae_data <- ae_data %>%
  calc_study_day(event_date = AESTDT, ref_date = RFSTDT)
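
To try the translated function without a real dataset, the sketch below repeats its definition and runs it on a three-row tibble. The subject IDs and dates are invented for illustration, and dplyr is assumed to be installed.

```r
library(dplyr)

# Same definition as above, repeated so the example runs on its own
calc_study_day <- function(data, event_date, ref_date) {
  data %>%
    mutate(
      study_day = case_when(
        is.na({{ event_date }}) | is.na({{ ref_date }}) ~ NA_real_,
        {{ event_date }} >= {{ ref_date }} ~ as.numeric({{ event_date }} - {{ ref_date }}) + 1,
        {{ event_date }} < {{ ref_date }} ~ as.numeric({{ event_date }} - {{ ref_date }})
      )
    )
}

ae_example <- tibble(
  USUBJID = c("001-001", "001-002", "001-003"),
  AESTDT = as.Date(c("2024-03-05", "2024-02-27", NA)),
  RFSTDT = as.Date("2024-03-01")
)

ae_with_day <- calc_study_day(ae_example, event_date = AESTDT, ref_date = RFSTDT)
ae_with_day$study_day  # 5, -3, NA
```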

🔄 4. Iteration (R4DS Chapter 21)

Iteration allows you to do the same thing to multiple inputs. R provides two main paradigms for iteration:

  1. Imperative programming: for loops and while loops
  2. Functional programming: map() functions from purrr

For Loops

# Basic for loop structure
clinical_vars <- c("AGE", "WEIGHT", "HEIGHT")
results <- vector("double", length(clinical_vars))  # Pre-allocate

for (i in seq_along(clinical_vars)) {
  # Process each clinical variable
  results[[i]] <- mean(demographics[[clinical_vars[[i]]]], na.rm = TRUE)
}
names(results) <- clinical_vars
results
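
The loop above assumes a demographics data frame already exists. The sketch below supplies a tiny made-up one so the loop can be run end to end, and compares it with base R's vapply(), which enforces that every result is a single double.

```r
demographics <- data.frame(  # illustrative data only
  AGE = c(30, 40, NA),
  WEIGHT = c(60, 70, 80),
  HEIGHT = c(160, 170, 180)
)
clinical_vars <- c("AGE", "WEIGHT", "HEIGHT")

results <- vector("double", length(clinical_vars))  # pre-allocate
for (i in seq_along(clinical_vars)) {
  results[[i]] <- mean(demographics[[clinical_vars[[i]]]], na.rm = TRUE)
}
names(results) <- clinical_vars

# Same computation without an explicit loop
results_vapply <- vapply(demographics[clinical_vars], mean, numeric(1), na.rm = TRUE)
results  # AGE 35, WEIGHT 70, HEIGHT 170
```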

Modifying Existing Objects

# Standardize multiple character columns
clinical_data <- tibble(
  USUBJID = c("001-001", "001-002"),
  AEDECOD = c("  headache ", "NAUSEA  "),
  CMDECOD = c("aspirin  ", "  IBUPROFEN")
)

char_cols <- c("AEDECOD", "CMDECOD")

for (col in char_cols) {
  clinical_data[[col]] <- str_trim(str_to_upper(clinical_data[[col]]))
}
clinical_data

Unknown Output Length

# Simulate adverse event occurrences (unknown number of events)
simulate_ae_times <- function() {
  n_events <- rpois(1, lambda = 2)  # Random number of events
  if (n_events == 0) return(numeric())
  
  cumsum(rexp(n_events, rate = 0.1))  # Event times
}

# Collect results using a list
ae_simulations <- vector("list", 100)
for (i in 1:100) {
  ae_simulations[[i]] <- simulate_ae_times()
}

# Convert to useful format
ae_data_simulated <- tibble(
  simulation = rep(1:100, lengths(ae_simulations)),
  event_time = unlist(ae_simulations)
)

While Loops

# Dose escalation algorithm (common in clinical trials)
dose_escalation <- function(starting_dose = 10, max_dose = 100) {
  current_dose <- starting_dose
  doses <- current_dose
  
  while (current_dose < max_dose) {
    # Simulate safety assessment (simplified)
    safety_ok <- rbinom(1, 1, prob = 0.8)  # 80% chance of safety
    
    if (safety_ok) {
      current_dose <- current_dose * 1.5  # Escalate by 50%
    } else {
      break  # Stop escalation if safety issue
    }
    
    doses <- c(doses, current_dose)
  }
  
  doses[doses <= max_dose]  # Return valid doses only
}

dose_escalation()
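
Because dose_escalation() draws random numbers, its output changes from run to run. For testing, one option is to pass the safety assessment in as a function. The sketch below is a hypothetical variant (dose_escalation_with() and both safety rules are invented for illustration) that also checks the next dose against max_dose before escalating, so no filtering is needed at the end.

```r
dose_escalation_with <- function(is_safe, starting_dose = 10, max_dose = 100) {
  doses <- starting_dose
  current_dose <- starting_dose
  # Escalate only while the next dose stays within the cap and the current dose is safe
  while (current_dose * 1.5 <= max_dose && is_safe(current_dose)) {
    current_dose <- current_dose * 1.5
    doses <- c(doses, current_dose)
  }
  doses
}

always_safe <- function(dose) TRUE          # hypothetical rule: never a safety signal
safe_below_30 <- function(dose) dose < 30   # hypothetical rule: signal at 30 or above

all_doses <- dose_escalation_with(always_safe)     # 10, 15, 22.5, 33.75, 50.625, 75.9375
early_stop <- dose_escalation_with(safe_below_30)  # 10, 15, 22.5, 33.75
```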

For Loops vs. Functionals

The functional programming approach is often cleaner and less error-prone:

# For loop approach (non-numeric columns stay NA rather than a misleading 0)
means_for <- rep(NA_real_, ncol(demographics))
for (i in seq_along(demographics)) {
  if (is.numeric(demographics[[i]])) {
    means_for[[i]] <- mean(demographics[[i]], na.rm = TRUE)
  }
}

# Functional approach with map
library(purrr)
means_functional <- demographics %>%
  select(where(is.numeric)) %>%
  map_dbl(mean, na.rm = TRUE)

🎯 5. Clinical Programming Function Examples

Derive Elderly Flag (Vectorized)

derive_elderly_flag <- function(data, age_var, cutoff = 65) {
  data %>%
    mutate(
      ELDERLY = case_when(
        is.na({{ age_var }}) ~ "U",
        {{ age_var }} >= cutoff ~ "Y",
        {{ age_var }} < cutoff ~ "N"
      )
    )
}

Process Multiple Studies with Iteration

# Function to process a single study
process_study <- function(study_data) {
  study_data %>%
    derive_elderly_flag(AGE) %>%
    mutate(
      BMI = WEIGHT / (HEIGHT / 100)^2,
      BMI_CAT = case_when(
        BMI < 18.5 ~ "Underweight",
        BMI >= 18.5 & BMI < 25 ~ "Normal",
        BMI >= 25 & BMI < 30 ~ "Overweight",
        BMI >= 30 ~ "Obese",
        TRUE ~ "Unknown"
      )
    )
}

# Apply to multiple studies using map
study_files <- c("study001.csv", "study002.csv", "study003.csv")

all_studies <- study_files %>%
  map(read_csv) %>%
  map(process_study) %>%
  set_names(c("Study 001", "Study 002", "Study 003"))

Batch Processing with Error Handling

# Safe wrapper: an error while reading OR processing a file returns an empty tibble
safe_read_and_process <- possibly(
  function(path) {
    read_csv(path, show_col_types = FALSE) %>%
      process_study()
  },
  otherwise = tibble()
)

# Process multiple files with error handling
results <- study_files %>%
  map(~ {
    cat("Processing", .x, "...\n")
    safe_read_and_process(.x)
  })

# Check which files processed successfully
successful <- map_lgl(results, ~ nrow(.x) > 0)
cat("Successfully processed:", sum(successful), "out of", length(study_files), "files\n")
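
The idea behind possibly() can be sketched in base R with tryCatch(). The wrapper with_fallback() and the parser parse_dose() below are hypothetical names used only to illustrate the pattern; this is not how purrr implements it.

```r
with_fallback <- function(f, otherwise) {
  function(...) {
    # Any error raised by f() is replaced with the fallback value
    tryCatch(f(...), error = function(e) otherwise)
  }
}

parse_dose <- function(x) {
  if (!grepl("^[0-9.]+$", x)) stop("not a number: ", x)
  as.numeric(x)
}
safe_parse_dose <- with_fallback(parse_dose, otherwise = NA_real_)

parsed <- vapply(c("10", "ten", "2.5"), safe_parse_dose, numeric(1))
unname(parsed)  # 10, NA, 2.5
```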

🛡️ 6. Advanced Function Features

Input Validation and Error Handling

derive_bmi_category <- function(data, weight_var, height_var, unit = "metric") {
  # Input validation
  if (!unit %in% c("metric", "imperial")) {
    stop("unit must be either 'metric' or 'imperial'")
  }
  
  if (!is.data.frame(data)) {
    stop("data must be a data frame")
  }
  
  data %>%
    mutate(
      bmi = if (unit == "metric") {
        {{ weight_var }} / ({{ height_var }} / 100)^2
      } else {
        ({{ weight_var }} * 703) / {{ height_var }}^2
      },
      bmi_category = case_when(
        is.na(bmi) ~ "Unknown",
        bmi < 18.5 ~ "Underweight",
        bmi >= 18.5 & bmi < 25 ~ "Normal",
        bmi >= 25 & bmi < 30 ~ "Overweight",
        bmi >= 30 ~ "Obese"
      )
    )
}
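
Validation often also needs to confirm that the expected columns are present before any derivation runs. A minimal base R sketch, using a hypothetical helper check_required_columns() and a made-up one-row data frame:

```r
check_required_columns <- function(data, required) {
  if (!is.data.frame(data)) {
    stop("data must be a data frame")
  }
  missing_cols <- setdiff(required, names(data))
  if (length(missing_cols) > 0) {
    stop("missing required column(s): ", paste(missing_cols, collapse = ", "))
  }
  invisible(data)  # return the input so the check can sit inside a pipeline
}

vs <- data.frame(USUBJID = "001-001", WEIGHT = 70)  # illustrative data, no HEIGHT
msg <- tryCatch(
  check_required_columns(vs, c("WEIGHT", "HEIGHT")),
  error = function(e) conditionMessage(e)
)
msg  # "missing required column(s): HEIGHT"
```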

Function Factories

Functions that create other functions are called function factories:

# Create functions for different CDISC domains
create_domain_checker <- function(domain_prefix) {
  force(domain_prefix)  # evaluate now so later changes to the argument's source cannot leak in
  function(subject_id) {
    str_detect(subject_id, paste0("^", domain_prefix, "-"))
  }
}

# Create specific checkers
is_ae_subject <- create_domain_checker("AE")
is_dm_subject <- create_domain_checker("DM")
is_vs_subject <- create_domain_checker("VS")

# Usage
subject_ids <- c("AE-001", "DM-002", "VS-003", "AE-004")
is_ae_subject(subject_ids)  # TRUE FALSE FALSE TRUE
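
The same factory pattern works for numeric checks. The base R sketch below builds range checkers; make_range_checker() is a hypothetical name and the limits are illustrative, not clinical guidance.

```r
make_range_checker <- function(low, high) {
  force(low)   # evaluate the arguments when the checker is created
  force(high)
  function(x) !is.na(x) & x >= low & x <= high
}

is_normal_sbp <- make_range_checker(90, 120)
is_normal_hr <- make_range_checker(60, 100)

sbp_ok <- is_normal_sbp(c(85, 110, NA, 120))  # FALSE TRUE FALSE TRUE
hr_ok <- is_normal_hr(c(55, 72))              # FALSE TRUE
```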

Function Operators

Functions that take functions as input and return functions as output:

# Add clinical data validation to any function
add_validation <- function(f) {
  function(data, ...) {
    if (!is.data.frame(data)) {
      stop("Input must be a data frame")
    }
    if (nrow(data) == 0) {
      warning("Input data frame is empty")
      return(data)
    }
    
    result <- f(data, ...)
    
    if (!is.data.frame(result)) {
      stop("Function must return a data frame")
    }
    
    result
  }
}

# Apply validation wrapper
safe_derive_elderly <- add_validation(derive_elderly_flag)

# Now the function includes validation
# safe_derive_elderly(demographics, AGE)
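
Another small function operator, sketched in base R: a wrapper that reports row counts before and after a step. The names with_row_report() and drop_missing_age() are hypothetical, and the data are made up.

```r
with_row_report <- function(f, label) {
  force(f)
  force(label)
  function(data, ...) {
    result <- f(data, ...)
    message(label, ": ", nrow(data), " rows in, ", nrow(result), " rows out")
    result
  }
}

drop_missing_age <- function(data) data[!is.na(data$AGE), , drop = FALSE]
reported_drop <- with_row_report(drop_missing_age, "drop_missing_age")

dm <- data.frame(USUBJID = c("001", "002", "003"), AGE = c(34, NA, 71))
dm_clean <- reported_drop(dm)  # message: drop_missing_age: 3 rows in, 2 rows out
```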

Functions with Multiple Return Options

analyze_clinical_data <- function(data, return_type = "summary", group_var = NULL) {
  # Capture group_var unevaluated: is.null(group_var) would try to evaluate an
  # unquoted column name such as ARM and fail with "object not found"
  if (!rlang::quo_is_null(rlang::enquo(group_var))) {
    data <- data %>% group_by({{ group_var }})
  }

  # One row per group, or a single row when the data are ungrouped.
  # The SD and range are computed here because AGE no longer exists after summarise().
  detailed <- data %>%
    summarise(
      n_subjects = n_distinct(USUBJID),
      mean_age = mean(AGE, na.rm = TRUE),
      median_age = median(AGE, na.rm = TRUE),
      age_sd = sd(AGE, na.rm = TRUE),
      age_range = paste(min(AGE, na.rm = TRUE), max(AGE, na.rm = TRUE), sep = "-"),
      .groups = "drop"
    )

  switch(return_type,
    "summary" = detailed %>% select(-age_sd, -age_range),
    "detailed" = detailed,
    "count_only" = n_distinct(data$USUBJID),  # overall subject count
    "ages_only" = data$AGE,
    stop("return_type must be 'summary', 'detailed', 'count_only', or 'ages_only'")
  )
}

🎨 7. The map() Family (Functional Programming)

The purrr package provides a consistent set of tools for working with functions and vectors, implementing functional programming concepts.

Basic map() Functions

library(purrr)

# map() always returns a list
clinical_vars <- list(
  study1_ages = c(25, 45, 67, 34, 52),
  study2_ages = c(28, 49, 63, 71, 39),
  study3_ages = c(44, 55, 29, 67, 42)
)

# Calculate mean age for each study
clinical_vars %>% 
  map(mean, na.rm = TRUE)

# map_dbl() returns a numeric vector
clinical_vars %>% 
  map_dbl(mean, na.rm = TRUE)

# map_chr() returns a character vector  
clinical_vars %>% 
  map_chr(~ paste("Mean age:", round(mean(.x), 1)))
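
For readers coming from base R, map() and map_dbl() behave much like lapply() and vapply(). The sketch below repeats the same made-up ages and computes the per-study means with base functions only.

```r
clinical_vars <- list(
  study1_ages = c(25, 45, 67, 34, 52),
  study2_ages = c(28, 49, 63, 71, 39),
  study3_ages = c(44, 55, 29, 67, 42)
)

means_list <- lapply(clinical_vars, mean, na.rm = TRUE)             # list, like map()
means_dbl <- vapply(clinical_vars, mean, numeric(1), na.rm = TRUE)  # double vector, like map_dbl()
means_dbl  # study1_ages 44.6, study2_ages 50.0, study3_ages 47.4
```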

Anonymous Functions and Shortcuts

# Three ways to write the same thing:

# 1. Anonymous function (traditional)
map(clinical_vars, function(x) mean(x, na.rm = TRUE))

# 2. Formula shortcut (purrr style)
map(clinical_vars, ~ mean(.x, na.rm = TRUE))

# 3. Existing function passed by name, with extra arguments forwarded through ...
map(clinical_vars, mean, na.rm = TRUE)

Processing Clinical Data with map()

# Apply standardization to multiple character columns
clinical_data <- tibble(
  USUBJID = c("001-001", "001-002", "001-003"),
  AEDECOD = c("  headache ", "NAUSEA  ", "fatigue"),
  CMDECOD = c("aspirin  ", "  IBUPROFEN", "acetaminophen  ")
)

# Standardize all character columns except USUBJID
clinical_data %>%
  mutate(
    across(c(AEDECOD, CMDECOD), ~ str_trim(str_to_upper(.x)))
  )

# Alternative using map(): returns a named list of cleaned vectors, not a tibble
char_cols <- clinical_data %>% 
  select(-USUBJID) %>% 
  select(where(is.character))

char_cols %>%
  map(~ str_trim(str_to_upper(.x)))

map2() and pmap() for Multiple Inputs

# map2() for two inputs
weights <- c(70, 65, 80, 75)
heights <- c(175, 160, 185, 180)

map2_dbl(weights, heights, ~ .x / (.y / 100)^2)  # Calculate BMI

# pmap() for multiple inputs
subjects <- list(
  weight = c(70, 65, 80, 75),
  height = c(175, 160, 185, 180),
  age = c(45, 62, 38, 55)
)

# Function that uses all three inputs
health_score <- function(weight, height, age) {
  bmi <- weight / (height / 100)^2
  age_factor <- ifelse(age > 50, 0.9, 1.0)
  round((25 / bmi) * age_factor, 2)
}

pmap_dbl(subjects, health_score)
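
Pairwise iteration is mainly needed when the function is not vectorized. For plain arithmetic such as BMI, R's built-in vectorization gives the same result, as this base R sketch suggests (mapply() stands in for map2_dbl() here).

```r
weights <- c(70, 65, 80, 75)
heights <- c(175, 160, 185, 180)

# Element-by-element, one pair at a time
bmi_pairwise <- mapply(function(w, h) w / (h / 100)^2, weights, heights)

# Whole vectors at once
bmi_vectorized <- weights / (heights / 100)^2

same_result <- isTRUE(all.equal(bmi_pairwise, bmi_vectorized))  # TRUE
```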

Walk Functions (for Side Effects)

# Use walk() when you want side effects, not return values
study_summaries <- list(
  study1 = tibble(USUBJID = 1:5, AGE = c(45, 62, 38, 55, 49)),
  study2 = tibble(USUBJID = 1:4, AGE = c(52, 34, 67, 41)),
  study3 = tibble(USUBJID = 1:6, AGE = c(29, 58, 46, 71, 33, 54))
)

# Create a summary report for each study
create_study_report <- function(data, study_name) {
  summary_stats <- data %>%
    summarise(
      n = n(),
      mean_age = round(mean(AGE), 1),
      median_age = median(AGE)
    )
  
  cat("Study:", study_name, "\n")
  cat("N:", summary_stats$n, "\n")
  cat("Mean age:", summary_stats$mean_age, "\n")
  cat("Median age:", summary_stats$median_age, "\n\n")
}

# Use walk2() to iterate and produce side effects
walk2(study_summaries, names(study_summaries), create_study_report)

Predicate Functions

# keep() and discard() for filtering based on a condition
lab_values <- list(
  glucose = c(90, 95, 110, 85),
  creatinine = c(1.1, 0.9, 1.3, 1.0),
  invalid_test = c(NA, NA, NA, NA),
  hemoglobin = c(13.5, 12.1, 14.2, 13.8)
)

# Keep only non-missing lab tests
valid_labs <- lab_values %>%
  keep(~ !all(is.na(.x)))

# Find lab tests with any abnormal values (example thresholds)
abnormal_labs <- lab_values %>%
  keep(~ any(.x > 100, na.rm = TRUE))  # Simplified example
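
Base R's Filter() follows the same predicate idea as keep(). The sketch below repeats the made-up lab values and shows which tests survive each filter.

```r
lab_values <- list(
  glucose = c(90, 95, 110, 85),
  creatinine = c(1.1, 0.9, 1.3, 1.0),
  invalid_test = c(NA, NA, NA, NA),
  hemoglobin = c(13.5, 12.1, 14.2, 13.8)
)

valid_labs <- Filter(function(x) !all(is.na(x)), lab_values)
abnormal_labs <- Filter(function(x) any(x > 100, na.rm = TRUE), lab_values)

names(valid_labs)     # "glucose" "creatinine" "hemoglobin"
names(abnormal_labs)  # "glucose"
```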

Advanced Iteration Patterns

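# Reduce for collapsing a whole vector into a single value
daily_ae_counts <- c(2, 1, 3, 0, 2, 1, 4)

# Total AE count over the period (one number)
total_aes <- reduce(daily_ae_counts, `+`, .init = 0)

# Running (cumulative) AE count per day: accumulate() keeps every intermediate sum
cumulative_aes <- accumulate(daily_ae_counts, `+`)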

# More complex reduction: combine multiple study datasets
study_datasets <- list(
  tibble(USUBJID = c("A001", "A002"), ARM = "Treatment"),
  tibble(USUBJID = c("B001", "B002"), ARM = "Placebo"),
  tibble(USUBJID = c("C001", "C002"), ARM = "Treatment")
)

# Combine all studies into one dataset
combined_data <- reduce(study_datasets, bind_rows)
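
A reduction returns one final value, while a running total keeps every intermediate result. The base R sketch below shows the difference with Reduce() and its accumulate argument, using the same daily counts.

```r
daily_ae_counts <- c(2, 1, 3, 0, 2, 1, 4)

total_aes <- Reduce(`+`, daily_ae_counts)                       # 13
running_aes <- Reduce(`+`, daily_ae_counts, accumulate = TRUE)  # 2 3 6 6 8 9 13

# For simple sums, cumsum() gives the same running total
matches_cumsum <- isTRUE(all.equal(running_aes, cumsum(daily_ae_counts)))  # TRUE
```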

🤖 8. GitHub Copilot in RStudio for Functions

GitHub Copilot suggests code based on your comments and the surrounding context, which can speed up function development in clinical programming.

Effective Copilot Prompts for Functions

Comment Prompt                                               | Expected Copilot Suggestion
# Create function to derive safety flag based on lab values  | Function template with safety logic
# Function to calculate percent change from baseline         | Mathematical formula implementation
# Translate SAS macro for lab normal ranges to R function    | R function with lab reference logic
# Function with error handling for missing clinical data     | tryCatch() blocks and validation
# Vectorized function to process multiple subjects           | Function using map() or vectorized operations

Copilot Best Practices for Clinical Programming

# Good: Descriptive comment for function purpose
# Create function to standardize adverse event terms and derive severity flags

# Good: Specific parameter expectations  
# Function parameters: data (tibble), ae_term_var (column name), severity_var (column name)

# Good: Expected output format
# Returns: tibble with standardized AE terms and derived severity categories

# Good: Include clinical context
# Function should follow CDISC SDTM standards for adverse event processing

# Good: Specify error handling requirements
# Function should handle missing values and invalid severity codes gracefully

Advanced Copilot Techniques

# Use descriptive variable names to guide Copilot
create_sdtm_compliant_function <- function(raw_clinical_data, 
                                          adverse_event_term, 
                                          start_date, 
                                          reference_date) {
  # Copilot will better understand the clinical context
  # and suggest appropriate SDTM-compliant transformations
}

# Include expected business logic in comments
# Calculate study day where: 
# - If event date >= reference date: study_day = event_date - ref_date + 1
# - If event date < reference date: study_day = event_date - ref_date  
# - Handle missing dates appropriately

Iterative Development with Copilot

# Start with basic function structure
process_clinical_data <- function(data) {
  # Step 1: Data validation
  
  # Step 2: Standardize text fields
  
  # Step 3: Derive calculated variables
  
  # Step 4: Apply business rules
  
  # Step 5: Return processed data
}
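
For orientation, here is one way the skeleton might look once the steps are filled in. It is a hand-written base R sketch, not Copilot output; the column names AETERM and AGE and the rule in step 4 are assumptions made for the example.

```r
process_clinical_data <- function(data) {
  # Step 1: Data validation (assumed required columns)
  stopifnot(is.data.frame(data), all(c("AETERM", "AGE") %in% names(data)))

  # Step 2: Standardize text fields
  data$AETERM <- toupper(trimws(data$AETERM))

  # Step 3: Derive calculated variables
  data$ELDERLY <- ifelse(is.na(data$AGE), "U", ifelse(data$AGE >= 65, "Y", "N"))

  # Step 4: Apply business rules (hypothetical rule: drop records without a term)
  data <- data[!is.na(data$AETERM) & data$AETERM != "", , drop = FALSE]

  # Step 5: Return processed data
  data
}

raw <- data.frame(AETERM = c(" headache", "NAUSEA ", ""), AGE = c(70, NA, 40))
processed <- process_clinical_data(raw)
processed$AETERM   # "HEADACHE" "NAUSEA"
processed$ELDERLY  # "Y" "U"
```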

# Let Copilot fill in each step, then refine as needed

✅ Summary: Functions, Vectors & Iteration in Clinical Programming

Key Concepts from R4DS Integration

Concept                | Description                     | Clinical Application
Function Components    | Arguments, body, environment    | Standardized clinical derivations
Vector Types           | Atomic vectors, lists, coercion | Proper data type handling
For Loops              | Imperative iteration            | Processing multiple studies/subjects
Functional Programming | map() family functions          | Batch processing clinical datasets
Function Factories     | Functions that create functions | Domain-specific validators

SAS vs R Functions Comparison

Aspect         | SAS Macros                      | R Functions
Syntax         | %macro name(params); ... %mend; | name <- function(params) { ... }
Scope          | Global macro variables          | Local function environment
Parameters     | Macro parameters with &         | Function arguments, matched by position or name
Return Values  | Dataset modification            | Explicit return() or last expression
Error Handling | Limited options                 | stop(), warning(), try(), tryCatch(), possibly()
Vectorization  | Manual loops                    | Built-in vector operations
Iteration      | Manual DATA step loops          | map() family, for loops, while loops
Reusability    | Macro calls                     | Function calls with piping
Debugging      | %put statements                 | print(), cat(), RStudio debugger
Testing        | Manual verification             | Unit tests with testthat

Best Practices for Clinical Programming

  1. Write functions when you copy code more than twice
  2. Use vectors effectively for efficient data processing
  3. Choose the right iteration method (for loops vs. map())
  4. Include error handling for robust clinical applications
  5. Validate inputs to prevent downstream issues
  6. Document functions clearly for regulatory compliance
  7. Use meaningful names that reflect clinical concepts
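
Several of these practices can be combined in one small function. The sketch below uses roxygen2-style #' comments for documentation (an assumption about tooling; the module does not prescribe a documentation format), validates its inputs, and uses a name that reflects the clinical concept. The function and the values are illustrative.

```r
#' Derive percent change from baseline
#'
#' @param value Numeric vector of post-baseline values.
#' @param baseline Numeric vector of baseline values, same length as `value`.
#' @return Numeric vector; NA where either input is missing or baseline is 0.
pct_change_from_baseline <- function(value, baseline) {
  stopifnot(is.numeric(value), is.numeric(baseline),
            length(value) == length(baseline))
  # A zero baseline would give Inf or NaN, so it is reported as missing instead
  ifelse(baseline == 0, NA_real_, (value - baseline) / baseline * 100)
}

pchg <- pct_change_from_baseline(c(110, 90, 50, NA), c(100, 100, 0, 80))
pchg  # 10, -10, NA, NA
```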

🎯 Next Steps

In the demo and exercise, you’ll practice:

  • Functions (Ch. 19): Creating reusable clinical programming functions
  • Vectors (Ch. 20): Understanding data types and efficient vector operations
  • Iteration (Ch. 21): Using loops and functional programming for batch processing
  • SAS Translation: Converting existing SAS macros to modern R functions
  • Error Handling: Building robust functions for clinical data processing
  • GitHub Copilot: Leveraging AI assistance for function development

This foundation will prepare you for advanced clinical programming workflows and SDTM/ADaM dataset creation.