Module 4 Theory - Date & Text Handling
Module 4 - Dates, Text & Factors
Learning Objectives
By the end of this module, you will:
- Master string manipulation with stringr for clinical text processing (R4DS Ch. 14)
- Apply regular expressions for pattern matching in clinical data (R4DS Ch. 15)
- Work with factors for categorical clinical data management (R4DS Ch. 16)
- Convert dates using lubridate functions (ymd, dmy, mdy) for clinical data
- Calculate study days (e.g., AESTDY = AESTDTC - RFSTDTC + 1)
- Clean and standardize adverse event terms and medical coding
- Handle date/time formats and missing date scenarios in clinical contexts
1. Date Handling with lubridate
Working with dates is crucial in clinical programming for calculating study days, visit windows, and time-to-event analyses.
Installing and Loading lubridate
# Install if needed
install.packages("lubridate")
# Load the package
library(lubridate)
library(dplyr)
Common Date Conversion Functions
| Function | Use Case | Example |
|---|---|---|
| `ymd()` | Year-Month-Day format | `ymd("2024-01-15")` |
| `dmy()` | Day-Month-Year format | `dmy("15/01/2024")` |
| `mdy()` | Month-Day-Year format | `mdy("01/15/2024")` |
| `ymd_hms()` | Date with time | `ymd_hms("2024-01-15 08:30:00")` |
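Clinical date strings are often partial or blank, and the parsers above cannot turn those into complete dates. A minimal sketch of one way to flag and skip such values before parsing, assuming ISO 8601 character input; the vector `dtc` and the column names `is_complete` and `dt` are made up for illustration:

```r
library(dplyr)
library(lubridate)
library(stringr)

# Hypothetical date strings: complete, partial (year-month only), blank, missing
dtc <- c("2024-01-15", "2024-01", "", NA)

dates <- tibble(dtc = dtc) %>%
  mutate(
    # TRUE only for a full YYYY-MM-DD string; NA, blank and partial values become FALSE
    is_complete = !is.na(dtc) & str_detect(dtc, "^\\d{4}-\\d{2}-\\d{2}$"),
    # Parse complete values only; everything else stays NA
    dt = ymd(if_else(is_complete, dtc, NA_character_), quiet = TRUE)
  )

print(dates)
```

Keeping the `is_complete` flag next to the parsed date makes it easy to report which records were left unparsed instead of silently losing them.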
Basic Date Conversion Examples
# Different input formats
date1 <- ymd("2024-01-15") # ISO format
date2 <- dmy("15/01/2024") # European format
date3 <- mdy("01/15/2024") # US format
date4 <- ymd("20240115") # Compact format
# All produce the same Date object
print(date1) # "2024-01-15"
print(date2) # "2024-01-15"
print(date3) # "2024-01-15"
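print(date4) # "2024-01-15"
2. Study Day Calculations
Study day calculations are fundamental in clinical programming. For events on or after the reference start date, Study Day = Event Date - Reference Start Date + 1, so the reference date itself is Day 1. For events before the reference date the +1 is dropped, so the day before is Day -1 and there is no Day 0.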
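A minimal sketch of the calculation, assuming the SDTM convention that there is no Day 0 (the +1 applies only on or after the reference date). The helper name `study_day()` is hypothetical and not part of lubridate or any other package:

```r
library(lubridate)

# Hypothetical helper: study day with no Day 0
study_day <- function(event_date, ref_date) {
  diff_days <- as.numeric(event_date - ref_date)
  # Add 1 only when the event falls on or after the reference date
  ifelse(diff_days >= 0, diff_days + 1, diff_days)
}

ref <- ymd("2024-01-15")
events <- ymd(c("2024-01-14", "2024-01-15", "2024-01-20", NA))

study_day(events, ref)
# [1] -1  1  6 NA
```

The day before the reference date is -1, the reference date itself is 1, and a missing event date stays `NA`.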
Basic Study Day Calculation
# Sample adverse events data
ae_data <- tibble(
USUBJID = c("001-001", "001-001", "001-002", "001-002"),
AEDECOD = c("HEADACHE", "NAUSEA", "FATIGUE", "DIZZINESS"),
AESTDTC = c("2024-01-20", "2024-01-25", "2024-01-18", "2024-01-22"),
RFSTDTC = c("2024-01-15", "2024-01-15", "2024-01-16", "2024-01-16")
)
# Calculate AESTDY
ae_data <- ae_data %>%
mutate(
AESTDT = ymd(AESTDTC), # Convert to Date
RFSTDT = ymd(RFSTDTC), # Convert to Date
AESTDY = as.numeric(AESTDT - RFSTDT) + 1 # Calculate study day
)
print(ae_data)
Handling Different Scenarios
# More complex study day calculation with validation
ae_data <- ae_data %>%
mutate(
AESTDY = case_when(
is.na(AESTDT) | is.na(RFSTDT) ~ NA_real_, # Handle missing dates
AESTDT < RFSTDT ~ as.numeric(AESTDT - RFSTDT), # Negative study days (pre-treatment)
TRUE ~ as.numeric(AESTDT - RFSTDT) + 1 # Positive study days (post-treatment)
)
)
3. String Manipulation with stringr (R4DS Ch. 14)
Strings are a fundamental data type in clinical programming. The stringr package, part of the tidyverse, provides a cohesive set of functions designed to make working with strings as easy as possible.
Why stringr for Clinical Programming?
In clinical data, you'll frequently encounter:
- Messy text data: Inconsistent capitalization, extra spaces, typos
- Medical terms: Need standardization for analysis and reporting
- Coded values: Converting between different coding systems (MedDRA, WHODD)
- Free text: Patient narratives, adverse event descriptions
- Data validation: Ensuring text follows expected patterns
String Basics
library(stringr)
# Strings can be created with single or double quotes
ae_term1 <- "HEADACHE"
ae_term2 <- 'Nausea'
# Combine strings
paste("Patient", "001", sep = "-") # "Patient-001"
str_c("Patient", "001", sep = "-") # stringr equivalent
# String length
str_length("ADVERSE EVENT") # 13
Essential stringr Functions for Clinical Data
| Function | Purpose | Clinical Example |
|---|---|---|
| `str_detect()` | Check if pattern exists | `str_detect(AETERM, "HEADACHE")` |
| `str_replace()` | Replace first match | `str_replace(AETERM, "SEVEAR", "SEVERE")` |
| `str_replace_all()` | Replace all matches | `str_replace_all(AETERM, " ", "_")` |
| `str_trim()` | Remove whitespace | `str_trim(" NAUSEA ")` |
| `str_to_upper()` | Convert to uppercase | `str_to_upper("headache")` |
| `str_to_lower()` | Convert to lowercase | `str_to_lower("HEADACHE")` |
| `str_extract()` | Extract matching pattern | `str_extract(USUBJID, "\\d+")` |
| `str_count()` | Count pattern matches | `str_count(AETERM, "\\w+")` |
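To make the table concrete, here is a small sketch that runs several of these functions on a made-up vector (`aeterm` is illustrative, not a real dataset). Note that matching is case-sensitive unless you ask otherwise:

```r
library(stringr)

# Made-up verbatim terms for illustration
aeterm <- c("  SEVEAR HEADACHE ", "mild nausea", "Headache")

str_detect(aeterm, "HEADACHE")           # TRUE FALSE FALSE (case-sensitive)
str_replace(aeterm, "SEVEAR", "SEVERE")  # only the first element changes
str_trim(aeterm)                         # "SEVEAR HEADACHE" "mild nausea" "Headache"
str_count(str_trim(aeterm), "\\w+")      # 2 2 1 (words per term)
str_extract("001-045", "\\d+")           # "001" (first run of digits only)
```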
Basic String Operations
library(stringr)
# Sample adverse event terms (often messy in real data)
ae_terms <- c(" HEADACHE ", "nausea", "SEVERE fatigue", "Mild Dizziness")
# Clean and standardize
cleaned_terms <- ae_terms %>%
str_trim() %>% # Remove leading/trailing spaces
str_to_upper() %>% # Convert to uppercase
str_replace_all("\\s+", " ") # Replace multiple spaces with single space
print(cleaned_terms)
# [1] "HEADACHE" "NAUSEA" "SEVERE FATIGUE" "MILD DIZZINESS"
Clinical Text Processing Examples
# Working with actual clinical data scenarios
ae_data <- ae_data %>%
mutate(
# Clean adverse event terms
AEDECOD_CLEAN = AEDECOD %>%
str_trim() %>%
str_to_upper() %>%
str_replace_all("\\s+", " "),
# Extract severity from combined terms
SEVERITY_EXTRACTED = case_when(
str_detect(AEDECOD, "(?i)mild") ~ "MILD",
str_detect(AEDECOD, "(?i)moderate") ~ "MODERATE",
str_detect(AEDECOD, "(?i)severe") ~ "SEVERE",
TRUE ~ "UNKNOWN"
),
# Create flags based on text patterns
HEADACHE_FLAG = ifelse(str_detect(AEDECOD, "(?i)headache"), "Y", "N"),
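# Strip special characters and standardize case (the same steps apply to medication names;
# AEDECOD stands in here because ae_data has no medication column)
CONMED_CLEAN = str_replace_all(AEDECOD, "[^A-Za-z0-9 ]", "") %>%
str_trim() %>%
str_to_upper()
)
4. Regular Expressions (R4DS Ch. 15)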
Regular expressions (regex) are a powerful tool for describing patterns in strings. In clinical programming, regex helps you find, extract, and validate complex patterns in text data.
Why Regular Expressions in Clinical Programming?
- Subject ID validation: Ensure IDs follow protocol-specified formats
- Medical term extraction: Pull out specific parts of complex medical descriptions
- Data quality checks: Identify malformed dates, unusual values
- Text standardization: Clean and normalize free-text fields
- Pattern detection: Find specific symptoms, medications, or conditions in narratives
Basic Regular Expression Patterns
| Pattern | Meaning | Clinical Example |
|---|---|---|
| `\\d` | Any digit | Subject ID: `"\\d{3}-\\d{3}"` matches "001-001" |
| `\\w` | Any word character | `"\\w+"` matches individual words |
| `\\s` | Any whitespace | `"\\s+"` matches spaces, tabs, newlines |
| `^` | Start of string | `"^SUBJECT"` matches strings starting with "SUBJECT" |
| `$` | End of string | `"COMPLETED$"` matches strings ending with "COMPLETED" |
| `+` | One or more | `"\\d+"` matches one or more digits |
| `*` | Zero or more | `"\\s*"` matches any amount of whitespace |
| `?` | Zero or one | `"MILD?"` matches "MIL" or "MILD" |
| `\|` | OR | `"MILD\|MODERATE\|SEVERE"` matches any severity |
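The patterns in the table can be tried out with `str_detect()`. The sketch below uses a made-up vector `x`; the backslashes are doubled because R string literals treat a single backslash as an escape character:

```r
library(stringr)

# Made-up status and ID strings for illustration
x <- c("SUBJECT 001-001", "001-001 COMPLETED", "MIL", "MODERATE")

str_detect(x, "\\d{3}-\\d{3}")         # TRUE  TRUE  FALSE FALSE
str_detect(x, "^SUBJECT")              # TRUE  FALSE FALSE FALSE
str_detect(x, "COMPLETED$")            # FALSE TRUE  FALSE FALSE
str_detect(x, "^MILD?$")               # FALSE FALSE TRUE  FALSE (the "D" is optional)
str_detect(x, "MILD|MODERATE|SEVERE")  # FALSE FALSE FALSE TRUE
```

Without the `^` and `$` anchors a pattern can match anywhere in the string, which is why validation checks usually anchor both ends.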
Clinical Data Pattern Examples
# Sample clinical data with patterns to extract
clinical_text <- c(
"SUBJECT-001-VISIT-001",
"Patient ID: 12345, DOB: 1985-03-15",
"AE: Mild Headache, Grade 2",
"MEDICATION: Aspirin 81mg daily",
"Adverse Event occurred on 2024-01-15"
)
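# Extract the subject identifier ("\\d{3}-\\d{3}" would match nothing in this sample,
# because "-VISIT-" separates the two numbers in the first element)
str_extract(clinical_text, "SUBJECT-\\d{3}") # "SUBJECT-001" for the first element, NA elsewhere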
# Extract dates in YYYY-MM-DD format
str_extract(clinical_text, "\\d{4}-\\d{2}-\\d{2}")
# Extract severity levels
str_extract(clinical_text, "(?i)(mild|moderate|severe)")
# Extract medication doses
str_extract(clinical_text, "\\d+mg")
Advanced Pattern Matching for Clinical Data
# Validate USUBJID format (e.g., must be XXX-XXX format)
usubjid_pattern <- "^\\d{3}-\\d{3}$"
valid_ids <- c("001-001", "002-045", "123-456")
invalid_ids <- c("1-1", "ABC-123", "001-001-001")
str_detect(valid_ids, usubjid_pattern) # All TRUE
str_detect(invalid_ids, usubjid_pattern) # All FALSE
# Extract and clean adverse event severities
ae_terms <- c("Mild headache", "MODERATE Nausea", "severe fatigue", "Headache")
severity_pattern <- "(?i)(mild|moderate|severe)"
extracted_severity <- str_extract(ae_terms, severity_pattern)
cleaned_severity <- str_to_upper(extracted_severity)
print(cleaned_severity)
# [1] "MILD" "MODERATE" "SEVERE" NA
Practical Clinical Regex Applications
# 1. Clean and validate visit dates
visit_dates <- c("2024-01-15", "15/01/2024", "Jan 15, 2024", "invalid")
# Extract ISO format dates
iso_dates <- str_extract(visit_dates, "^\\d{4}-\\d{2}-\\d{2}$")
# 2. Extract medication information
medication_text <- "Patient taking Metformin 500mg twice daily and Lisinopril 10mg once daily"
# Extract all medications and doses
medications <- str_extract_all(medication_text, "\\w+\\s+\\d+mg")[[1]]
print(medications)
# [1] "Metformin 500mg" "Lisinopril 10mg"
# 3. Standardize adverse event terms
ae_raw <- c("mild headache", "SEVERE headache", "Moderate HEADACHE")
ae_clean <- ae_raw %>%
str_to_upper() %>%
str_replace_all("\\s+", " ") %>%
str_trim()
5. Working with Factors (R4DS Ch. 16)
Factors are R's way of handling categorical data. In clinical programming, factors are essential for:
- Treatment groups: Controlling order in analyses and plots
- Severity levels: Ensuring proper ordering (Mild < Moderate < Severe)
- Visit schedules: Maintaining chronological order of visits
- Categorical endpoints: Managing ordered categories for efficacy analyses
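As a rough sketch of why level order matters, using made-up values: base R sorts factor levels alphabetically unless told otherwise, and forcats helpers such as `fct_relevel()` and `fct_rev()` can reorder them afterwards. The object names below are illustrative only:

```r
library(forcats)

# Default levels are alphabetical, which puts Placebo last
trt <- factor(c("Low Dose", "Placebo", "High Dose", "Placebo"))
levels(trt)           # "High Dose" "Low Dose" "Placebo"

# fct_relevel() moves the named levels to the front, in the order given
trt_ordered <- fct_relevel(trt, "Placebo", "Low Dose", "High Dose")
levels(trt_ordered)   # "Placebo" "Low Dose" "High Dose"

# fct_rev() reverses the level order, e.g. to list the most severe category first
sev <- factor(c("Severe", "Mild", "Moderate", "Mild"),
              levels = c("Mild", "Moderate", "Severe"))
levels(fct_rev(sev))  # "Severe" "Moderate" "Mild"
```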
Creating and Managing Factors
library(forcats) # Part of tidyverse for factor manipulation
# Treatment groups with natural ordering
treatment <- c("Placebo", "Low Dose", "High Dose", "Placebo", "High Dose")
treatment_factor <- factor(treatment,
levels = c("Placebo", "Low Dose", "High Dose"))
# Check the levels
levels(treatment_factor)
# [1] "Placebo" "Low Dose" "High Dose"
# Adverse event severity with ordered levels
severity <- c("Mild", "Severe", "Moderate", "Mild", "Severe")
severity_factor <- factor(severity,
levels = c("Mild", "Moderate", "Severe"),
ordered = TRUE)
# Now we can make meaningful comparisons
severity_factor[1] < severity_factor[3] # TRUE (Mild < Moderate)
Clinical Factor Applications
# Example: Visit factors for proper chronological ordering
visits <- c("Screening", "Baseline", "Week 4", "Week 8", "Week 12", "End of Study")
visit_factor <- factor(visits, levels = visits) # Preserves order
# Example: Dose response categories
dose_response <- c("No Response", "Partial Response", "Complete Response")
response_factor <- factor(dose_response,
levels = dose_response,
ordered = TRUE)
# Using factors in clinical data processing
clinical_df <- tibble(
USUBJID = c("001-001", "001-002", "001-003"),
TRT01A = factor(c("Placebo", "Active", "Active"),
levels = c("Placebo", "Active")),
AESEV = factor(c("MILD", "MODERATE", "SEVERE"),
levels = c("MILD", "MODERATE", "SEVERE"),
ordered = TRUE)
)
# Factors maintain proper ordering in summaries
clinical_df %>% count(TRT01A) # Placebo appears first
clinical_df %>% count(AESEV) # Maintains severity order
6. Combining Date and Text Operations
Complete AESTDY Derivation Example
# Comprehensive example: derive AESTDY and clean AE terms
ae_complete <- tibble(
USUBJID = c("001-001", "001-001", "001-002", "001-003"),
AEDECOD = c(" Mild HEADACHE ", "NAUSEA (moderate)", "severe FATIGUE", "Dizziness"),
AESTDTC = c("2024-01-20", "2024-01-25", "2024-01-18", "2024-01-22"),
RFSTDTC = c("2024-01-15", "2024-01-15", "2024-01-16", "2024-01-16")
) %>%
mutate(
# Date conversions and study day calculation (no Day 0: +1 only on or after RFSTDT)
AESTDT = ymd(AESTDTC),
RFSTDT = ymd(RFSTDTC),
AESTDY = as.numeric(AESTDT - RFSTDT) + if_else(AESTDT >= RFSTDT, 1, 0),
# Clean adverse event terms
AEDECOD_CLEAN = AEDECOD %>%
str_trim() %>%
str_to_upper() %>%
str_replace_all("\\([^)]*\\)", "") %>% # Remove parentheses and contents
str_replace_all("\\s+", " ") %>% # Replace multiple spaces
str_trim(), # Trim again after cleaning
# Extract base term (remove severity qualifiers)
AETERM_BASE = AEDECOD_CLEAN %>%
str_replace_all("^(MILD|MODERATE|SEVERE)\\s+", ""),
# Create study day categories
AESTDY_CAT = case_when(
AESTDY <= 0 ~ "Pre-treatment",
AESTDY <= 7 ~ "Week 1",
AESTDY <= 14 ~ "Week 2",
TRUE ~ "After Week 2"
)
)
print(ae_complete)
7. Advanced Date Handling
Working with Time Components
# Handling date-time data
datetime_data <- tibble(
USUBJID = c("001-001", "001-002"),
AESTDTC = c("2024-01-20T08:30:00", "2024-01-21T14:15:30"),
RFSTDTC = c("2024-01-15T09:00:00", "2024-01-16T10:30:00")
) %>%
mutate(
# Parse date-time
AESTDT = ymd_hms(AESTDTC),
RFSTDT = ymd_hms(RFSTDTC),
# Extract components
AE_DATE = date(AESTDT),
AE_TIME = format(AESTDT, "%H:%M:%S"),
AE_HOUR = hour(AESTDT),
# Calculate study day from dates only (ignoring time)
AESTDY = as.numeric(date(AESTDT) - date(RFSTDT)) + 1,
# Time-based categories
TIME_PERIOD = case_when(
AE_HOUR < 12 ~ "Morning",
AE_HOUR < 18 ~ "Afternoon",
TRUE ~ "Evening"
)
)
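When date-times carry a time zone, the calendar date (and therefore the study day) can depend on which zone you read the clock in. A small sketch with illustrative values; `ymd_hms()` assumes UTC unless `tz` is supplied, and the zone "Asia/Tokyo" is picked only as an example:

```r
library(lubridate)

# One instant, recorded in UTC (illustrative value)
ae_utc <- ymd_hms("2024-01-20T23:30:00", tz = "UTC")

# with_tz() keeps the same instant but shows it on another clock
ae_tokyo <- with_tz(ae_utc, "Asia/Tokyo")

as_date(ae_utc)    # "2024-01-20"
as_date(ae_tokyo)  # "2024-01-21" (08:30 the next morning in Tokyo)

# The study day shifts by one depending on which calendar date is used
ref <- ymd("2024-01-15")
as.numeric(as_date(ae_utc) - ref) + 1    # 6
as.numeric(as_date(ae_tokyo) - ref) + 1  # 7
```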
---
## 8. GitHub Copilot for Date & Text Operations
### Effective Prompts for Clinical Programming:
| Comment Prompt | Expected Copilot Suggestion |
|----|----|
| `# Calculate study day from AE start date and reference date` | `mutate(AESTDY = as.numeric(ymd(AESTDTC) - ymd(RFSTDTC)) + 1)` |
| `# Clean adverse event terms and convert to uppercase` | `str_trim() %>% str_to_upper()` |
| `# Extract severity from AE term if present` | `str_extract(AEDECOD, "MILD\|MODERATE\|SEVERE")` |
| `# Convert date string to proper Date format` | `ymd(date_string)` or `dmy(date_string)` |
| `# Flag AEs occurring in first week of treatment` | `mutate(WEEK1_AE = ifelse(AESTDY <= 7, "Y", "N"))` |
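
Suggested code is worth running on a few known values before it goes into a derivation. As one hedged example, the severity pattern in the table is case-sensitive, so it misses mixed-case terms. The vector `aedecod` below is made up, and the adjusted call is only one possible fix:

```r
library(stringr)

# Made-up terms for illustration
aedecod <- c("Mild headache", "SEVERE nausea", "Dizziness")

# As suggested: case-sensitive, so "Mild" is missed
as_suggested <- str_extract(aedecod, "MILD|MODERATE|SEVERE")
as_suggested  # NA "SEVERE" NA

# Adjusted: ignore case while matching, then standardize the result
adjusted <- str_to_upper(
  str_extract(aedecod, regex("MILD|MODERATE|SEVERE", ignore_case = TRUE))
)
adjusted      # "MILD" "SEVERE" NA
```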
---
## 9. Common Pitfalls and Best Practices
### Date Handling Best Practices
```r
# Good practices
ae_data <- ae_data %>%
mutate(
# Always handle missing dates explicitly
AESTDY = case_when(
is.na(ymd(AESTDTC)) | is.na(ymd(RFSTDTC)) ~ NA_real_,
TRUE ~ as.numeric(ymd(AESTDTC) - ymd(RFSTDTC)) + 1
),
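# Check for missing or impossible dates (without the first branch, a missing date would be flagged "VALID")
DATE_FLAG = case_when(
is.na(ymd(AESTDTC)) ~ "MISSING",
ymd(AESTDTC) < ymd("1900-01-01") ~ "INVALID",
ymd(AESTDTC) > today() ~ "FUTURE",
TRUE ~ "VALID"
)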
)
# Avoid these issues
# Don't assume all dates parse correctly
# Don't forget the +1 in study day calculations for most sponsors
# Don't ignore time zones if working with datetime data
```
### String Handling Best Practices
```r
# Good practices for text cleaning
clean_ae_terms <- function(terms) {
  terms %>%
    str_to_upper() %>%                     # Standardize case
    str_replace_all("[^A-Z0-9 ]", "") %>%  # Remove special characters
    str_replace_all("\\s+", " ") %>%       # Normalize spacing after removal
    str_trim()                             # Trim leading/trailing spaces
}
```
Module Summary
By completing this module, you should now be able to:
- Convert dates using lubridate functions (ymd, dmy, mdy) for various input formats
- Calculate study days accurately with proper handling of missing dates
- Manipulate text using stringr functions for data cleaning and standardization
- Use regular expressions for advanced pattern matching in clinical data
- Work with factors for categorical variables like treatment groups and severity levels
- Derive AESTDY and clean adverse event terms in realistic clinical scenarios
- Handle edge cases like missing dates, invalid dates, and messy text data
R4DS Connections:
This module draws on three R4DS chapters:
- Strings (Ch. 14): Essential string manipulation with stringr
- Regular expressions (Ch. 15): Pattern matching for data validation
- Factors (Ch. 16): Categorical data handling with forcats
Next Steps:
- Practice with the demo exercises
- Try deriving study days with your own clinical data
- Apply string patterns and factor management to real clinical datasets
- Prepare for Module 5: Functions & Macro Translation
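Key Takeaways
- lubridate makes date parsing intuitive: pick ymd(), dmy(), or mdy() to match the order of the date components, and the separators are handled automatically
- Study day = Event Date - Reference Date + 1 for events on or after the reference date; earlier events use Event Date - Reference Date, so there is no Day 0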
- stringr provides consistent string manipulation with clear, readable function names
- Always handle missing data explicitly when working with dates and text
- GitHub Copilot can draft date/text transformations from descriptive comments; check each suggestion against known values before relying on it
- Combine date and text operations for comprehensive data cleaning pipelines
Ready to learn about writing functions? Let's move to Module 5!