Module 4 Theory — Date & Text Handling

📅 Module 4 — Dates, Text & Factors

🎯 Learning Objectives

By the end of this module, you will:

Master string manipulation with stringr for clinical text processing (R4DS Ch. 14)
Apply regular expressions for pattern matching in clinical data (R4DS Ch. 15)
Work with factors for categorical clinical data management (R4DS Ch. 16)
Convert dates using lubridate functions (ymd, dmy, mdy) for clinical data
Calculate study days (e.g., AESTDY = AESTDTC - RFSTDTC + 1)
Clean and standardize adverse event terms and medical coding
Handle date/time formats and missing date scenarios in clinical contexts

📅 1. Date Handling with lubridate

Working with dates is crucial in clinical programming for calculating study days, visit windows, and time-to-event analyses.

Installing and Loading lubridate

# Install if needed
install.packages("lubridate")

# Load the package
library(lubridate)
library(dplyr)

Common Date Conversion Functions

| Function | Use Case | Example |
|----|----|----|
| `ymd()` | Year-Month-Day format | `ymd("2024-01-15")` |
| `dmy()` | Day-Month-Year format | `dmy("15/01/2024")` |
| `mdy()` | Month-Day-Year format | `mdy("01/15/2024")` |
| `ymd_hms()` | Date with time | `ymd_hms("2024-01-15 08:30:00")` |

Basic Date Conversion Examples

# Different input formats
date1 <- ymd("2024-01-15")         # ISO format
date2 <- dmy("15/01/2024")         # European format  
date3 <- mdy("01/15/2024")         # US format
date4 <- ymd("20240115")           # Compact format

# All produce the same Date object
print(date1)  # "2024-01-15"
print(date2)  # "2024-01-15"
print(date3)  # "2024-01-15"
print(date4)  # "2024-01-15"
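
Because each parser assumes a fixed component order, the same string can yield different dates depending on which function is used. The sketch below illustrates this with an ambiguous value and shows one possible way to surface strings that fail to parse. The vectors `raw_dates`, `parsed`, and `parse_failed` are made up for illustration.

```r
library(lubridate)

# "01/02/2024" is ambiguous: the result depends on the parser chosen
dmy("01/02/2024")  # read as 1 February 2024
mdy("01/02/2024")  # read as 2 January 2024

# Illustrative input: one valid ISO date, one unparseable string, one missing value
raw_dates <- c("2024-01-15", "invalid", NA)

# quiet = TRUE suppresses the parse warning; failures come back as NA
parsed <- ymd(raw_dates, quiet = TRUE)

# TRUE where a value was supplied but could not be parsed
parse_failed <- !is.na(raw_dates) & is.na(parsed)
parse_failed  # FALSE TRUE FALSE
```

Separating "supplied but unparseable" from "never supplied" keeps genuine data issues from being hidden among ordinary missing values.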

2. Study Day Calculations

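Study day calculations are fundamental in clinical programming. Under the CDISC SDTM convention, Study Day = Event Date - Reference Start Date + 1 for events on or after the reference start date, and Study Day = Event Date - Reference Start Date (no + 1) for earlier events, so there is no Day 0.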

Basic Study Day Calculation

# Sample adverse events data
ae_data <- tibble(
  USUBJID = c("001-001", "001-001", "001-002", "001-002"),
  AEDECOD = c("HEADACHE", "NAUSEA", "FATIGUE", "DIZZINESS"),
  AESTDTC = c("2024-01-20", "2024-01-25", "2024-01-18", "2024-01-22"),
  RFSTDTC = c("2024-01-15", "2024-01-15", "2024-01-16", "2024-01-16")
)

# Calculate AESTDY
ae_data <- ae_data %>%
  mutate(
    AESTDT = ymd(AESTDTC),      # Convert to Date
    RFSTDT = ymd(RFSTDTC),      # Convert to Date
    AESTDY = as.numeric(AESTDT - RFSTDT) + 1  # Study day; + 1 applies because every event here is on or after RFSTDT
  )

print(ae_data)

Handling Different Scenarios

# More complex study day calculation with validation
ae_data <- ae_data %>%
  mutate(
    AESTDY = case_when(
      is.na(AESTDT) | is.na(RFSTDT) ~ NA_real_,  # Handle missing dates
      AESTDT < RFSTDT ~ as.numeric(AESTDT - RFSTDT),  # Before reference date: no + 1, so there is no Day 0
      TRUE ~ as.numeric(AESTDT - RFSTDT) + 1     # On or after reference date: the reference date itself is Day 1
    )
  )
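
The same rule can be wrapped in a small reusable function. The following is a minimal sketch in base R; `derive_study_day()` is a hypothetical helper name, not part of any package, and it assumes both inputs are already `Date` objects.

```r
# Hypothetical helper: study day with no Day 0
derive_study_day <- function(event_date, ref_date) {
  diff_days <- as.numeric(event_date - ref_date)  # difference in whole days
  # Add 1 only when the event falls on or after the reference date;
  # missing inputs propagate as NA
  ifelse(diff_days >= 0, diff_days + 1, diff_days)
}

ref <- as.Date("2024-01-15")
events <- as.Date(c("2024-01-14", "2024-01-15", "2024-01-16", NA))

derive_study_day(events, ref)  # -1, 1, 2, NA
```

Keeping the rule in one function means the day-before and same-day cases only need to be checked in one place.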

📝 3. String Manipulation with stringr (R4DS Ch. 14)

Strings are a fundamental data type in clinical programming. The stringr package, part of the tidyverse, provides a cohesive set of functions designed to make working with strings as easy as possible.

Why stringr for Clinical Programming?

In clinical data, you’ll frequently encounter:

Messy text data: Inconsistent capitalization, extra spaces, typos
Medical terms: Need standardization for analysis and reporting
Coded values: Converting between different coding systems (MedDRA, WHODD)
Free text: Patient narratives, adverse event descriptions
Data validation: Ensuring text follows expected patterns

String Basics

library(stringr)

# Strings can be created with single or double quotes
ae_term1 <- "HEADACHE"
ae_term2 <- 'Nausea'

# Combine strings
paste("Patient", "001", sep = "-")  # "Patient-001"
str_c("Patient", "001", sep = "-") # stringr equivalent

# String length
str_length("ADVERSE EVENT")  # 13

Essential stringr Functions for Clinical Data

| Function | Purpose | Clinical Example |
|----|----|----|
| `str_detect()` | Check if pattern exists | `str_detect(AETERM, "HEADACHE")` |
| `str_replace()` | Replace first match | `str_replace(AETERM, "SEVEAR", "SEVERE")` |
| `str_replace_all()` | Replace all matches | `str_replace_all(AETERM, " ", "_")` |
| `str_trim()` | Remove whitespace | `str_trim(" NAUSEA ")` |
| `str_to_upper()` | Convert to uppercase | `str_to_upper("headache")` |
| `str_to_lower()` | Convert to lowercase | `str_to_lower("HEADACHE")` |
| `str_extract()` | Extract matching pattern | `str_extract(USUBJID, "\\d+")` |
| `str_count()` | Count pattern matches | `str_count(AETERM, "\\w+")` |

Basic String Operations

library(stringr)

# Sample adverse event terms (often messy in real data)
ae_terms <- c("  HEADACHE  ", "nausea", "SEVERE fatigue", "Mild Dizziness")

# Clean and standardize
cleaned_terms <- ae_terms %>%
  str_trim() %>%                    # Remove leading/trailing spaces
  str_to_upper() %>%                # Convert to uppercase
  str_replace_all("\\s+", " ")      # Replace multiple spaces with single space

print(cleaned_terms)
# [1] "HEADACHE" "NAUSEA" "SEVERE FATIGUE" "MILD DIZZINESS"

Clinical Text Processing Examples

# Working with actual clinical data scenarios
ae_data <- ae_data %>%
  mutate(
    # Clean adverse event terms
    AEDECOD_CLEAN = AEDECOD %>%
      str_trim() %>%
      str_to_upper() %>%
      str_replace_all("\\s+", " "),
    
    # Extract severity from combined terms  
    SEVERITY_EXTRACTED = case_when(
      str_detect(AEDECOD, "(?i)mild") ~ "MILD",
      str_detect(AEDECOD, "(?i)moderate") ~ "MODERATE", 
      str_detect(AEDECOD, "(?i)severe") ~ "SEVERE",
      TRUE ~ "UNKNOWN"
    ),
    
    # Create flags based on text patterns
    HEADACHE_FLAG = ifelse(str_detect(AEDECOD, "(?i)headache"), "Y", "N"),
    
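    # Strip punctuation and symbols, then tidy spacing; the same steps apply to
    # medication names (AEDECOD is used here because ae_data has no medication column)
    AEDECOD_ALNUM = str_replace_all(AEDECOD, "[^A-Za-z0-9 ]", "") %>%
      str_squish() %>%
      str_to_upper()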
  )
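
Known misspellings can also be corrected from a lookup rather than one `str_replace()` call per typo. The sketch below passes a named vector to `str_replace_all()`; the correction table `term_fixes` and the sample terms are invented for illustration and are not a real coding dictionary.

```r
library(stringr)

# Hypothetical correction table: misspelling = corrected term
term_fixes <- c("SEVEAR" = "SEVERE", "NAUSIA" = "NAUSEA")

raw_terms <- c("SEVEAR HEADACHE", "NAUSIA", "HEADACHE")

# A named vector applies every pattern/replacement pair to each string
fixed_terms <- str_replace_all(raw_terms, term_fixes)
fixed_terms  # "SEVERE HEADACHE" "NAUSEA" "HEADACHE"
```

The names are treated as regular expressions, so a short misspelling could also match inside a longer word; adding word boundaries (`\\b`) around each pattern is one way to guard against that.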

🔍 4. Regular Expressions (R4DS Ch. 15)

Regular expressions (regex) are a powerful tool for describing patterns in strings. In clinical programming, regex helps you find, extract, and validate complex patterns in text data.

Why Regular Expressions in Clinical Programming?

Subject ID validation: Ensure IDs follow protocol-specified formats
Medical term extraction: Pull out specific parts of complex medical descriptions
Data quality checks: Identify malformed dates, unusual values
Text standardization: Clean and normalize free-text fields
Pattern detection: Find specific symptoms, medications, or conditions in narratives

Basic Regular Expression Patterns

| Pattern | Meaning | Clinical Example |
|----|----|----|
| `\\d` | Any digit | Subject ID: `"\\d{3}-\\d{3}"` matches “001-001” |
| `\\w` | Any word character | `"\\w+"` matches individual words |
| `\\s` | Any whitespace | `"\\s+"` matches spaces, tabs, newlines |
| `^` | Start of string | `"^SUBJECT"` matches strings starting with “SUBJECT” |
| `$` | End of string | `"COMPLETED$"` matches strings ending with “COMPLETED” |
| `+` | One or more | `"\\d+"` matches one or more digits |
| `*` | Zero or more | `"\\s*"` matches any amount of whitespace |
| `?` | Zero or one | `"MILD?"` matches “MIL” or “MILD” |
| `\|` | OR | `"MILD\|MODERATE\|SEVERE"` matches any severity |
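
A few of these patterns are easiest to understand by running them. The short sketch below exercises the anchors, the `?` quantifier, alternation, and `{n}` repetition on made-up strings.

```r
library(stringr)

# ? makes the preceding character optional; ^ and $ anchor the whole string
opt_d <- str_detect(c("MIL", "MILD", "MILDLY"), "^MILD?$")
opt_d  # TRUE TRUE FALSE

# $ requires the match to sit at the end of the string
ends_completed <- str_detect(c("STUDY COMPLETED", "COMPLETED EARLY"), "COMPLETED$")
ends_completed  # TRUE FALSE

# | matches any one alternative; (?i) makes the match case-insensitive
sev <- str_extract("AE: Moderate Nausea", "(?i)(mild|moderate|severe)")
sev  # "Moderate"

# {3} repeats the preceding token exactly three times
id <- str_extract("Subject 001-001 enrolled", "\\d{3}-\\d{3}")
id  # "001-001"
```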

Clinical Data Pattern Examples

# Sample clinical data with patterns to extract
clinical_text <- c(
  "SUBJECT-001-VISIT-001",
  "Patient ID: 12345, DOB: 1985-03-15", 
  "AE: Mild Headache, Grade 2",
  "MEDICATION: Aspirin 81mg daily",
  "Adverse Event occurred on 2024-01-15"
)

# Extract subject IDs ("SUBJECT-" followed by three digits); strings without one return NA
str_extract(clinical_text, "SUBJECT-\\d{3}")

# Extract dates in YYYY-MM-DD format
str_extract(clinical_text, "\\d{4}-\\d{2}-\\d{2}")

# Extract severity levels
str_extract(clinical_text, "(?i)(mild|moderate|severe)")

# Extract medication doses
str_extract(clinical_text, "\\d+mg")

Advanced Pattern Matching for Clinical Data

# Validate USUBJID format (e.g., exactly three digits, a hyphen, three digits)
usubjid_pattern <- "^\\d{3}-\\d{3}$"
valid_ids <- c("001-001", "002-045", "123-456")
invalid_ids <- c("1-1", "ABC-123", "001-001-001")

str_detect(valid_ids, usubjid_pattern)   # All TRUE
str_detect(invalid_ids, usubjid_pattern) # All FALSE

# Extract and clean adverse event severities
ae_terms <- c("Mild headache", "MODERATE Nausea", "severe fatigue", "Headache")

severity_pattern <- "(?i)(mild|moderate|severe)"
extracted_severity <- str_extract(ae_terms, severity_pattern)
cleaned_severity <- str_to_upper(extracted_severity)

print(cleaned_severity)
# [1] "MILD" "MODERATE" "SEVERE" NA
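
In practice the same validation pattern is usually applied to a whole dataset so that problem records can be listed for review. The sketch below assumes a small, invented demographics table `dm`; the format rule is the example pattern from above, not a protocol requirement.

```r
library(dplyr)
library(stringr)

# Illustrative data: the last two IDs are deliberately malformed or missing
dm <- tibble(USUBJID = c("001-001", "002-045", "1-1", NA))

# Keep rows whose ID is missing or does not match the expected format
id_issues <- dm %>%
  filter(is.na(USUBJID) | !str_detect(USUBJID, "^\\d{3}-\\d{3}$"))

id_issues$USUBJID  # "1-1" NA
```

Checking `is.na()` explicitly matters here: `str_detect()` returns `NA` for a missing ID, and `filter()` would otherwise drop that row silently.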

Practical Clinical Regex Applications

# 1. Clean and validate visit dates
visit_dates <- c("2024-01-15", "15/01/2024", "Jan 15, 2024", "invalid")

# Extract ISO format dates
iso_dates <- str_extract(visit_dates, "^\\d{4}-\\d{2}-\\d{2}$")

# 2. Extract medication information
medication_text <- "Patient taking Metformin 500mg twice daily and Lisinopril 10mg once daily"

# Extract all medications and doses
medications <- str_extract_all(medication_text, "\\w+\\s+\\d+mg")[[1]]
print(medications)
# [1] "Metformin 500mg" "Lisinopril 10mg"

# 3. Standardize adverse event terms
ae_raw <- c("mild headache", "SEVERE headache", "Moderate HEADACHE")
ae_clean <- ae_raw %>%
  str_to_upper() %>%
  str_replace_all("\\s+", " ") %>%
  str_trim()
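
When the drug name and dose are needed as separate values, capture groups are one option. The sketch below reuses the medication sentence from the example above with `str_match_all()`; the column names `drug` and `dose_mg` are illustrative choices.

```r
library(stringr)

medication_text <- "Patient taking Metformin 500mg twice daily and Lisinopril 10mg once daily"

# Group 1 captures the drug name, group 2 the numeric dose
# Result columns: 1 = full match, 2 = group 1, 3 = group 2
parts <- str_match_all(medication_text, "(\\w+)\\s+(\\d+)mg")[[1]]

med_table <- data.frame(
  drug    = parts[, 2],
  dose_mg = as.numeric(parts[, 3])
)
med_table
```

This simple pattern assumes single-word drug names and doses written as whole milligrams with no space before the unit; real concomitant medication text typically needs a more tolerant pattern.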

🏷️ 5. Working with Factors (R4DS Ch. 16)

Factors are R’s way of handling categorical data. In clinical programming, factors are essential for:

Treatment groups: Controlling order in analyses and plots
Severity levels: Ensuring proper ordering (Mild < Moderate < Severe)
Visit schedules: Maintaining chronological order of visits
Categorical endpoints: Managing ordered categories for efficacy analyses

Creating and Managing Factors

library(forcats)  # Part of tidyverse for factor manipulation

# Treatment groups with natural ordering
treatment <- c("Placebo", "Low Dose", "High Dose", "Placebo", "High Dose")
treatment_factor <- factor(treatment, 
                          levels = c("Placebo", "Low Dose", "High Dose"))

# Check the levels
levels(treatment_factor)
# [1] "Placebo"   "Low Dose"  "High Dose"

# Adverse event severity with ordered levels
severity <- c("Mild", "Severe", "Moderate", "Mild", "Severe")
severity_factor <- factor(severity,
                         levels = c("Mild", "Moderate", "Severe"),
                         ordered = TRUE)

# Now we can make meaningful comparisons
severity_factor[1] < severity_factor[3]  # TRUE (Mild < Moderate)

Clinical Factor Applications

# Example: Visit factors for proper chronological ordering
visits <- c("Screening", "Baseline", "Week 4", "Week 8", "Week 12", "End of Study")
visit_factor <- factor(visits, levels = visits)  # Preserves order

# Example: Dose response categories
dose_response <- c("No Response", "Partial Response", "Complete Response")
response_factor <- factor(dose_response, 
                         levels = dose_response,
                         ordered = TRUE)

# Using factors in clinical data processing
clinical_df <- tibble(
  USUBJID = c("001-001", "001-002", "001-003"),
  TRT01A = factor(c("Placebo", "Active", "Active"), 
                  levels = c("Placebo", "Active")),
  AESEV = factor(c("MILD", "MODERATE", "SEVERE"),
                 levels = c("MILD", "MODERATE", "SEVERE"),
                 ordered = TRUE)
)

# Factors maintain proper ordering in summaries
clinical_df %>% count(TRT01A)  # Placebo appears first
clinical_df %>% count(AESEV)   # Maintains severity order
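
Two further consequences of declared levels are worth seeing in code: categories with no observations still appear in tabulations, and a reference level can be moved to the front without retyping the full level list. The sketch below uses base R `table()` and `fct_relevel()` from forcats on invented vectors.

```r
library(forcats)

# No "Moderate" events were observed, but the level is still declared
sev <- factor(c("Mild", "Severe", "Mild"),
              levels = c("Mild", "Moderate", "Severe"))

sev_counts <- table(sev)
sev_counts  # Mild 2, Moderate 0, Severe 1

# Without explicit levels, factor() sorts them alphabetically
arm <- factor(c("High Dose", "Placebo", "Low Dose"))
levels(arm)      # "High Dose" "Low Dose" "Placebo"

# Move the reference arm to the front
arm_ref <- fct_relevel(arm, "Placebo")
levels(arm_ref)  # "Placebo" "High Dose" "Low Dose"
```

Zero-count rows are often required in summary tables, which is one reason to declare all expected levels rather than rely on the values present in the data.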

🚧 6. Combining Date and Text Operations

Complete AESTDY Derivation Example

# Comprehensive example: derive AESTDY and clean AE terms
ae_complete <- tibble(
  USUBJID = c("001-001", "001-001", "001-002", "001-003"),
  AEDECOD = c("  Mild HEADACHE  ", "NAUSEA (moderate)", "severe FATIGUE", "Dizziness"),
  AESTDTC = c("2024-01-20", "2024-01-25", "2024-01-18", "2024-01-22"),
  RFSTDTC = c("2024-01-15", "2024-01-15", "2024-01-16", "2024-01-16")
) %>%
  mutate(
    # Date conversions and study day calculation
    # (no Day 0: add 1 only when the event is on or after RFSTDT)
    AESTDT = ymd(AESTDTC),
    RFSTDT = ymd(RFSTDTC),
    AESTDY = as.numeric(AESTDT - RFSTDT) + if_else(AESTDT >= RFSTDT, 1, 0),
    
    # Clean adverse event terms
    AEDECOD_CLEAN = AEDECOD %>%
      str_trim() %>%
      str_to_upper() %>%
      str_replace_all("\\([^)]*\\)", "") %>%  # Remove parentheses and contents
      str_replace_all("\\s+", " ") %>%        # Replace multiple spaces
      str_trim(),                             # Trim again after cleaning
    
    # Extract base term (remove severity qualifiers) 
    AETERM_BASE = AEDECOD_CLEAN %>%
      str_replace_all("^(MILD|MODERATE|SEVERE)\\s+", ""),
    
    # Create study day categories (missing study days stay missing)
    AESTDY_CAT = case_when(
      is.na(AESTDY) ~ NA_character_,
      AESTDY < 0 ~ "Pre-treatment",
      AESTDY <= 7 ~ "Week 1", 
      AESTDY <= 14 ~ "Week 2",
      TRUE ~ "After Week 2"
    )
  )

print(ae_complete)

📅 7. Advanced Date Handling

Working with Time Components

# Handling date-time data
datetime_data <- tibble(
  USUBJID = c("001-001", "001-002"),
  AESTDTC = c("2024-01-20T08:30:00", "2024-01-21T14:15:30"),
  RFSTDTC = c("2024-01-15T09:00:00", "2024-01-16T10:30:00")
) %>%
  mutate(
    # Parse date-time
    AESTDT = ymd_hms(AESTDTC),
    RFSTDT = ymd_hms(RFSTDTC),
    
    # Extract components
    AE_DATE = date(AESTDT),
    AE_TIME = format(AESTDT, "%H:%M:%S"),
    AE_HOUR = hour(AESTDT),
    
    # Calculate study day from dates only (ignoring time)
    AESTDY = as.numeric(date(AESTDT) - date(RFSTDT)) + 1,
    
    # Time-based categories
    TIME_PERIOD = case_when(
      AE_HOUR < 12 ~ "Morning",
      AE_HOUR < 18 ~ "Afternoon", 
      TRUE ~ "Evening"
    )
  )
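
When times are collected, elapsed time and time zone become relevant in addition to the study day. The sketch below computes elapsed hours for the first subject's timestamps and shows that changing the display time zone does not change the instant. The choice of `"UTC"` and `"America/New_York"` is an assumption for illustration; the real zone depends on how the site recorded the data.

```r
library(lubridate)

# Time zone stated explicitly ("UTC" is an assumption for this example)
rf_start <- ymd_hms("2024-01-15T09:00:00", tz = "UTC")
ae_start <- ymd_hms("2024-01-20T08:30:00", tz = "UTC")

# Elapsed time from reference start to AE onset, in hours
elapsed_hours <- as.numeric(difftime(ae_start, rf_start, units = "hours"))
elapsed_hours  # 119.5

# Same instant displayed on a different clock; the underlying time is unchanged
ae_local <- with_tz(ae_start, tzone = "America/New_York")
ae_local == ae_start  # TRUE
```

Requesting `units = "hours"` explicitly avoids depending on the unit that `difftime()` would otherwise choose automatically.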


---

## 🤖 8. GitHub Copilot for Date & Text Operations

### Effective Prompts for Clinical Programming:

| Comment Prompt | Expected Copilot Suggestion |
|----|----|
| `# Calculate study day from AE start date and reference date` | `mutate(AESTDY = as.numeric(ymd(AESTDTC) - ymd(RFSTDTC)) + 1)` |
| `# Clean adverse event terms and convert to uppercase` | `str_trim() %>% str_to_upper()` |
| `# Extract severity from AE term if present` | `str_extract(AEDECOD, "MILD\|MODERATE\|SEVERE")` |
| `# Convert date string to proper Date format` | `ymd(date_string)` or `dmy(date_string)` |
| `# Flag AEs occurring in first week of treatment` | `mutate(WEEK1_AE = ifelse(AESTDY <= 7, "Y", "N"))` |

---

## ⚠️ 9. Common Pitfalls and Best Practices

### Date Handling Best Practices

```r
# ✅ Good practices
ae_data <- ae_data %>%
  mutate(
    # Parse each date once; quiet = TRUE returns NA for failures without warnings
    AESTDT = ymd(AESTDTC, quiet = TRUE),
    RFSTDT = ymd(RFSTDTC, quiet = TRUE),

    # Always handle missing dates explicitly
    AESTDY = case_when(
      is.na(AESTDT) | is.na(RFSTDT) ~ NA_real_,
      AESTDT < RFSTDT ~ as.numeric(AESTDT - RFSTDT),
      TRUE ~ as.numeric(AESTDT - RFSTDT) + 1
    ),

    # Check for missing, unparseable, or impossible dates
    DATE_FLAG = case_when(
      is.na(AESTDT) ~ "MISSING",   # Absent or failed to parse
      AESTDT < ymd("1900-01-01") ~ "INVALID",
      AESTDT > today() ~ "FUTURE",
      TRUE ~ "VALID"
    )
  )

# ❌ Avoid these issues
# Don't assume all dates parse correctly
# Don't forget the +1 in study day calculations for most sponsors
# Don't ignore time zones if working with datetime data
```

String Handling Best Practices

# ✅ Good practices for text cleaning
clean_ae_terms <- function(terms) {
  terms %>%
    str_trim() %>%                           # Remove whitespace
    str_to_upper() %>%                       # Standardize case
    str_replace_all("[^A-Z0-9 ]", "") %>%    # Remove special characters
    str_replace_all("\\s+", " ") %>%         # Normalize spacing (removal can leave double spaces)
    str_trim()                               # Trim again after cleaning
}

📝 Module Summary

By completing this module, you should now be able to:

✅ Convert dates using lubridate functions (ymd, dmy, mdy) for various input formats
✅ Calculate study days accurately with proper handling of missing dates
✅ Manipulate text using stringr functions for data cleaning and standardization
✅ Use regular expressions for advanced pattern matching in clinical data
✅ Work with factors for categorical variables like treatment groups and severity levels
✅ Derive AESTDY and clean adverse event terms in realistic clinical scenarios
✅ Handle edge cases like missing dates, invalid dates, and messy text data

🔗 R4DS Connections:

This module covers essential concepts for clinical programming:

Strings: Essential string manipulation with stringr
Regular expressions: Pattern matching for data validation
Factors: Categorical data handling with forcats

🚀 Next Steps:

Practice with the demo exercises
Try deriving study days with your own clinical data
Apply string patterns and factor management to real clinical datasets
Prepare for Module 5: Functions & Macro Translation

💡 Key Takeaways

lubridate makes date parsing intuitive - ymd(), dmy(), and mdy() each expect one component order but accept most separators
Study day = Event Date - Reference Date + 1 for events on or after the reference date (no + 1 before it, so there is no Day 0)
stringr provides consistent string manipulation with clear, readable function names
Always handle missing data explicitly when working with dates and text
GitHub Copilot can suggest date/text transformations from descriptive comments; check each suggestion against your study's conventions
Combine date and text operations for comprehensive data cleaning pipelines

Ready to learn about writing functions? Let’s move to Module 5!