Module 1 Theory: Introduction to R and Data Science

📊 Module 1: Introduction to R

🎯 Learning Objectives

By the end of this module, you will:

  • Understand the R ecosystem and data science workflow (R4DS Ch. 1)
  • Master R workflow basics including code organization and project management (R4DS Ch. 6)
  • Navigate RStudio interface efficiently for clinical programming workflows (R4DS Ch. 6)
  • Apply workflow fundamentals with scripts, directories, and reproducible practices (R4DS Ch. 8)
  • Work with vectors and understand R’s fundamental data structures (R4DS Ch. 4)
  • Load essential packages (tidyverse, haven) for clinical programming
  • Set up clinical programming environment with proper configuration

🌐 1. What is Data Science? (R4DS Ch. 1)

The Data Science Process

Data science is an exciting discipline that allows you to transform raw data into understanding, insight, and knowledge. The R4DS process includes:

Import → Tidy → Transform → Visualize → Model → Communicate
   ↓        ↓        ↓          ↓         ↓         ↓
 Data → Clean → Wrangle → Explore → Predict → Report

In Clinical Programming Context:

  • Import: Read SAS XPT files, CSV data, CRF exports
  • Tidy: Structure data according to CDISC standards (SDTM/ADaM)
  • Transform: Create derived variables, flag values, calculate parameters
  • Visualize: Create safety and efficacy plots, patient profiles
  • Model: Run statistical analyses and exposure-response modeling
  • Communicate: Generate tables, listings, figures (TLFs)
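
The Import, Tidy, Transform, and Communicate steps can be sketched on a tiny in-memory data set. This is a minimal illustration, assuming dplyr is installed; the column names and values are invented for the example and do not come from a real study.

```r
library(dplyr)

# Import: read a small CSV (inline text stands in for a real file)
raw <- read.csv(text = "subject_id,age,sex
001-001,45,M
001-002,71,F
001-003,66,F", stringsAsFactors = FALSE)

# Tidy: rename columns to CDISC-style variable names
dm_demo <- raw %>%
  rename(USUBJID = subject_id, AGE = age, SEX = sex)

# Transform: derive an age-group variable
dm_demo <- dm_demo %>%
  mutate(AGEGR1 = if_else(AGE >= 65, ">= 65 years", "< 65 years"))

# Communicate: count subjects per age group
age_summary <- dm_demo %>%
  count(AGEGR1, name = "N")
print(age_summary)
```

Each step takes the result of the previous one, which is why the pipe operator fits this workflow so naturally.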

R in the Clinical Data Science Ecosystem

R provides a complete toolkit for clinical data science:

  • Data Import: haven (SAS), readr (CSV), readxl (Excel)
  • Data Manipulation: dplyr for data wrangling
  • Data Visualization: ggplot2 for publication-quality graphics
  • Reproducible Reports: R Markdown for integrated analysis and reporting
  • Clinical Standards: admiral, xportr for CDISC compliance

📊 2. Working with Vectors (R4DS Ch. 4)

Understanding R’s Foundation: Vectors

Everything in R is built on vectors. Understanding vectors is crucial for clinical programming.

Atomic Vector Types

# Logical vectors (for flags and indicators)
elderly_flag <- c(TRUE, FALSE, TRUE, FALSE)
safety_population <- c(TRUE, TRUE, FALSE, TRUE)

# Integer vectors (for counts, IDs)
subject_n <- c(1L, 2L, 3L, 4L, 5L)
visit_num <- c(1L, 2L, 3L, 4L)

# Double vectors (for measurements, calculations)
age <- c(45.2, 62.8, 28.1, 71.5)
weight <- c(70.5, 85.2, 55.8, 92.1)

# Character vectors (for categories, IDs)
usubjid <- c("001-001", "001-002", "001-003", "001-004")
treatment <- c("Placebo", "Drug A", "Drug A", "Placebo")
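
To see how R classifies these vectors, the base functions typeof() and length() can be applied to each one. The sketch below repeats a few of the vectors above so it runs on its own, and also shows the coercion that happens when types are mixed inside c().

```r
# Inspect the type and length of each kind of atomic vector
elderly_flag <- c(TRUE, FALSE, TRUE, FALSE)
visit_num <- c(1L, 2L, 3L, 4L)
age <- c(45.2, 62.8, 28.1, 71.5)
usubjid <- c("001-001", "001-002", "001-003", "001-004")

typeof(elderly_flag)  # "logical"
typeof(visit_num)     # "integer"
typeof(age)           # "double"
typeof(usubjid)       # "character"
length(age)           # 4

# Mixing types in c() silently coerces to the most flexible type:
# logical < integer < double < character
mixed_int <- c(TRUE, 2L)
mixed_dbl <- c(1L, 2.5)
mixed_chr <- c(1, "a")
typeof(mixed_int)  # "integer"
typeof(mixed_dbl)  # "double"
typeof(mixed_chr)  # "character"

# Logical values count as 0/1 in arithmetic, so sum() counts TRUE flags
n_elderly <- sum(elderly_flag)  # 2
```

Silent coercion is a common source of surprises when a numeric column arrives with a stray text value, so checking typeof() after import is a cheap safeguard.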

Clinical Programming Applications

# Missing values in clinical data
lab_values <- c(120, 135, NA, 142, NA)  # Missing lab results
is.na(lab_values)  # Check for missing values

# Coercion and type conversion
age_char <- c("45", "62", "28", "71")
age_num <- as.numeric(age_char)  # Convert to numeric

# Vector operations for derived variables
height <- c(175, 180, 162, 178)  # Height in cm (illustrative values)
bmi <- weight / ((height/100)^2)  # Calculate BMI (kg/m^2)
age_group <- ifelse(age >= 65, "Elderly", "Adult")  # Create age groups
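
Missing values propagate through most calculations, so they usually need explicit handling. A minimal base R sketch follows; the "unknown" entry is an invented example of text that cannot be converted.

```r
lab_values <- c(120, 135, NA, 142, NA)

# A summary that touches an NA returns NA by default
mean(lab_values)

# Drop missing values explicitly with na.rm = TRUE
lab_mean <- mean(lab_values, na.rm = TRUE)

# Count and locate the missing results
n_missing <- sum(is.na(lab_values))
which_missing <- which(is.na(lab_values))

# Coercion turns non-numeric text into NA (with a warning),
# so count how many values failed to convert
age_char <- c("45", "62", "unknown", "71")
age_num <- suppressWarnings(as.numeric(age_char))
n_failed <- sum(is.na(age_num) & !is.na(age_char))
```

Comparing the NA count before and after a conversion, as in the last line, separates values that were already missing from values that the conversion lost.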

Vector Attributes and Names

# Named vectors for lookup tables
dose_levels <- c("Low" = 10, "Medium" = 50, "High" = 100)
treatment_codes <- c("Placebo" = "P", "Active" = "A")

# Using names for clinical programming
dose_levels["High"]  # Returns the named element High = 100
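
A named vector can recode an entire column in one step by indexing it with the values to translate. The sketch below assumes a treatment vector whose values match the lookup names; "Drug B" is a deliberately unmatched, invented value.

```r
treatment_codes <- c("Placebo" = "P", "Active" = "A")
treatment <- c("Placebo", "Active", "Active", "Placebo")

# Index the lookup table by name to recode the whole vector at once
trt_code <- unname(treatment_codes[treatment])
trt_code  # "P" "A" "A" "P"

# [ ] keeps the name; [[ ]] returns the bare value
dose_levels <- c("Low" = 10, "Medium" = 50, "High" = 100)
dose_levels["High"]
dose_levels[["High"]]  # 100

# Values that are not in the lookup table map to NA, which is worth checking
unmatched <- treatment_codes[c("Placebo", "Drug B")]
anyNA(unmatched)  # TRUE
```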

🔧 3. R Workflow Basics (R4DS Ch. 6)

What is Workflow?

Workflow encompasses the practices and tools that make your data analysis:

  • Reproducible: Others (and future you) can recreate your work
  • Reliable: Your code works consistently across different environments
  • Shareable: Code and results can be easily communicated

The RStudio IDE for Clinical Programming

RStudio provides a powerful integrated development environment specifically designed for R:

Four Key Panes:

  1. Script Editor (Top-Left): Write R scripts (.R) and R Markdown (.Rmd) files
  2. Console (Bottom-Left): Execute code interactively and see immediate results
  3. Environment/History (Top-Right): View objects in memory and command history
  4. Files/Plots/Packages/Help (Bottom-Right): Navigate project files and view outputs

Essential RStudio Features for Clinical Work:

  • Syntax highlighting: Color-coded R syntax for better readability
  • Auto-completion: Tab completion for functions, arguments, and object names
  • Code folding: Collapse sections of code for better organization
  • Integrated help: Quick access to function documentation
  • Project management: Organize related files and maintain consistent working directories

Code Organization Best Practices

# Use clear, descriptive variable names
elderly_subjects <- dm[dm$AGE >= 65, ]  # Good
old <- dm[dm$AGE >= 65, ]              # Avoid

# Add comments explaining clinical context
# Calculate study day from reference start date
# (the + 1 applies only to dates on or after the reference date)
studyday <- as.numeric(visit_date - rfstdtc) + 1

# Use consistent spacing and indentation
dm_clean <- dm %>%
  filter(!is.na(USUBJID)) %>%          # Remove subjects without ID
  mutate(AGEGR1 = case_when(           # Create age groups per protocol
    AGE < 65 ~ "< 65 years",
    AGE >= 65 ~ ">= 65 years"
  ))
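
The study-day formula above adds 1 to every difference. Under the SDTM convention there is no study day 0, so dates before the reference date are not incremented. A minimal sketch that handles both cases follows; `study_day` is a hypothetical helper name and the dates are invented.

```r
# Hypothetical helper: study day relative to a reference start date
study_day <- function(event_date, ref_date) {
  diff_days <- as.numeric(event_date - ref_date)
  # On or after the reference date: add 1 so the reference date is day 1
  # Before the reference date: no adjustment, so the day before is day -1
  ifelse(diff_days >= 0, diff_days + 1, diff_days)
}

rfstdtc <- as.Date("2024-01-15")
visit_date <- as.Date(c("2024-01-14", "2024-01-15", "2024-01-20"))

study_day(visit_date, rfstdtc)  # -1  1  6
```

Wrapping the rule in a named function keeps the clinical logic in one documented place instead of repeating the arithmetic in every program.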

📂 4. Workflow: Scripts and Projects (R4DS Ch. 8)

Scripts for Reproducible Analysis

Scripts are text files containing R code that can be executed to reproduce your analysis. For clinical programming, well-organized scripts are essential for regulatory compliance and collaboration.

Script Best Practices:

# =============================================================================
# Program: DM Dataset Creation
# Author: [Your Name]  
# Date: 2024-11-10
# Purpose: Create SDTM DM dataset from raw demographics data
# Input: data/raw/demographics.csv
# Output: data/sdtm/dm.xpt
# =============================================================================

# Load required packages
library(tidyverse)
library(haven)
library(lubridate)

# Read raw data
raw_dm <- read_csv("data/raw/demographics.csv")

# Create SDTM DM dataset
dm <- raw_dm %>%
  select(USUBJID = subject_id, AGE = age, SEX = sex) %>%
  mutate(
    DOMAIN = "DM",
    RFSTDTC = "2024-01-15",  # Illustrative constant; normally derived per subject
    ARMCD = "TRT01A",        # Treatment arm code
    ARM = "Treatment A"      # Treatment arm description
  )

# Export to XPT format
write_xpt(dm, "data/sdtm/dm.xpt")
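
After an export it is worth reading the file back to confirm that nothing was lost. The following is a minimal round-trip sketch, assuming haven is installed; the data values are invented and the file is written to a temporary directory rather than to the project.

```r
library(haven)

# Small illustrative data set (values are made up)
dm_demo <- data.frame(
  USUBJID = c("001-001", "001-002"),
  AGE = c(45, 71),
  SEX = c("M", "F"),
  stringsAsFactors = FALSE
)

# Write to a temporary location so the example does not touch project files
xpt_path <- file.path(tempdir(), "dm.xpt")
write_xpt(dm_demo, xpt_path)

# Read the file back and confirm structure and content survived the export
dm_check <- read_xpt(xpt_path)
stopifnot(
  nrow(dm_check) == nrow(dm_demo),
  identical(names(dm_check), names(dm_demo)),
  all(dm_check$USUBJID == dm_demo$USUBJID)
)
```

stopifnot() halts the script at the first failed condition, so a broken export is caught before anything downstream uses the file.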

RStudio Projects (.Rproj)

Projects help organize your clinical programming work and ensure reproducibility:

Benefits of Using Projects:

  • Consistent working directory: No need for setwd()
  • Isolated environments: Each project maintains its own workspace
  • Version control integration: Easy Git integration for regulatory tracking
  • Organized file structure: Keep data, programs, and outputs separated
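
One possible folder layout can be created from R itself. The sketch below builds it under a temporary directory; the folder names follow the paths used in this module, plus `programs` and `output`, which are assumptions rather than a required standard.

```r
# Create a project skeleton under a temporary directory for illustration;
# in practice the root would be the folder that contains the .Rproj file
project_root <- file.path(tempdir(), "study_demo")
folders <- file.path(
  project_root,
  c("data/raw", "data/sdtm", "programs", "output")
)

for (folder in folders) {
  dir.create(folder, recursive = TRUE, showWarnings = FALSE)
}

# Confirm that every folder exists
all(dir.exists(folders))  # TRUE
```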

Working Directory and Paths

Understanding file paths is crucial for reproducible clinical programming:

# Check current working directory
getwd()

# Read files using relative paths (preferred)  
dm <- read_csv("data/raw/demographics.csv")      # Good
dm <- read_csv("C:/Users/me/data/demo.csv")      # Avoid - not portable

# Use here package for robust path handling
library(here)
dm <- read_csv(here("data", "raw", "demographics.csv"))
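
A missing input file is easier to diagnose when the script checks for it before reading. The base R sketch below shows one way to do that; `read_input` is a hypothetical helper, and a temporary file stands in for real data.

```r
# file.path() joins path parts; unlike here(), the result is resolved
# against the current working directory
demo_path <- file.path("data", "raw", "demographics.csv")
demo_path  # "data/raw/demographics.csv"

# Hypothetical helper: stop with a clear message when an input file is absent
read_input <- function(path) {
  if (!file.exists(path)) {
    stop("Input file not found: ", path,
         " (working directory: ", getwd(), ")")
  }
  read.csv(path, stringsAsFactors = FALSE)
}

# Demonstration with a temporary file standing in for real data
tmp_csv <- tempfile(fileext = ".csv")
write.csv(data.frame(subject_id = "001-001", age = 45), tmp_csv,
          row.names = FALSE)
demo <- read_input(tmp_csv)
```

Including the working directory in the error message points straight at the most common cause, which is a script run from outside the project.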

📦 5. Essential Packages for Clinical Programming

The tidyverse is a collection of R packages designed for data science that share an underlying design philosophy, grammar, and data structures.

Installing and Loading the Tidyverse

# Install tidyverse (includes dplyr, ggplot2, tibble, readr, and more)
install.packages("tidyverse")

# Install additional packages used in this course
# (haven and lubridate are also installed as part of tidyverse; here is separate)
install.packages(c("haven", "lubridate", "here"))

# Load packages at start of each session
library(tidyverse)  # Loads core tidyverse packages
library(haven)      # Read/write SAS, SPSS, Stata files
library(lubridate)  # Date/time handling
library(here)       # Robust file path handling
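
Before loading packages on a new machine, it can help to check which ones are missing. A minimal base R sketch follows; `missing_packages` is a hypothetical helper, and it only reports, leaving the install to be run interactively.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("tidyverse", "haven", "lubridate", "here")
to_install <- missing_packages(required)

# Report what is missing; run install.packages(to_install) interactively
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
} else {
  message("All required packages are installed")
}
```

Keeping install.packages() out of the script itself means a program run never changes the package library as a side effect.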

Core Tidyverse Packages for Clinical Work:

| Package | Purpose | Key Functions |
|---------|---------|---------------|
| dplyr | Data manipulation (like SAS DATA step) | filter(), select(), mutate() |
| tibble | Enhanced data frames | tibble(), as_tibble() |
| readr | Reading rectangular data | read_csv(), write_csv() |
| ggplot2 | Data visualization | ggplot(), geom_point(), geom_line() |
| stringr | String manipulation | str_detect(), str_replace() |
| forcats | Factor handling | fct_reorder(), fct_lump() |

Clinical-Specific Packages:

| Package | Purpose | Key Functions |
|---------|---------|---------------|
| haven | Import SAS7BDAT files; import/export SAS XPT files | read_sas(), read_xpt(), write_xpt() |
| lubridate | Date/time manipulation | ymd(), dmy(), today() |
| here | Robust file path construction | here() |
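
The lubridate parsers listed above are named after the order of the date components. A short sketch, assuming lubridate is installed; the dates are invented.

```r
library(lubridate)

# Parse the same calendar date from two different text layouts
iso_date <- ymd("2024-01-15")   # year-month-day, as in ISO 8601
eu_date  <- dmy("15/01/2024")   # day-month-year

iso_date == eu_date  # TRUE

# Parsed values are Date objects, so date arithmetic works directly
visit_date <- ymd("2024-02-01")
days_on_study <- as.numeric(visit_date - iso_date)  # 17

# Unparseable input becomes NA with a warning rather than an error
bad_date <- suppressWarnings(ymd("not a date"))
is.na(bad_date)  # TRUE
```

Because failed parses become NA instead of stopping the script, counting NA values after conversion is a sensible check for partial or malformed dates.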

Package Loading Best Practices

# Good: Load all packages at the top of your script
library(tidyverse)
library(haven)
library(lubridate)

# Check what packages are loaded
search()

# List the packages that make up the tidyverse
tidyverse_packages()

# Check the installed version of a specific package
packageVersion("dplyr")

🤖 6. Getting Help and AI Assistance

Getting Help in R

R has extensive built-in help systems and community resources:

Built-in Help Functions:

# Get help for a specific function
?mean
help(mean)

# Search for functions containing a keyword
??regression
help.search("regression")

# Get examples of function usage
example(mean)

# View function arguments
args(lm)
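
The same information is also available from code, which is useful when you only remember part of a function name or want to check an argument's default. A small base R sketch:

```r
# Find objects on the search path whose names match a pattern
read_matches <- apropos("^read\\.csv")
read_matches

# Inspect a function's argument names programmatically
mean_args <- names(formals(mean.default))
mean_args  # "x" "trim" "na.rm" "..."

# Look up the default value of a single argument
formals(mean.default)$na.rm  # FALSE
```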

RStudio Help Integration:

  • Help tab: Click on functions or use F1 for contextual help
  • Auto-completion: Tab completion shows function arguments
  • Function tooltips: Hover over functions to see signatures

AI-Powered Assistance: GitHub Copilot

GitHub Copilot is an AI pair programmer that accelerates clinical programming by providing intelligent code suggestions.

Setup for RStudio:

  1. GitHub Copilot subscription: Sign up (free for students/eligible users)
  2. RStudio configuration: Enable in Tools → Global Options → Copilot
  3. Sign in: Connect your GitHub account to RStudio

How to Use Copilot Effectively:

Method 1: Comment-Driven Development

# Read a SAS XPT file and convert to tibble
# Copilot may suggest: dm <- read_xpt("dm.xpt") %>% as_tibble()

# Calculate age from birthdate and reference date  
# Copilot may suggest: age <- floor(as.numeric(refdate - birthdate) / 365.25)

Method 2: Function Name Auto-completion

# Start typing a function name and Copilot suggests parameters
dm %>% 
  filter(  # Copilot suggests: AGE >= 18, ARMCD == "TRT01A", etc.

Method 3: Pattern Recognition

# After writing similar code, Copilot learns patterns
dm <- dm %>% mutate(AGEGR1 = case_when(
  AGE < 65 ~ "< 65",
  AGE >= 65 ~ ">= 65"  # Copilot suggests this based on pattern
))

Common Copilot Prompts for Clinical Programming:

| Comment Prompt | Likely Copilot Suggestion |
|----------------|---------------------------|
| # Create elderly flag for age >= 65 | mutate(ELDERLY = ifelse(AGE >= 65, "Y", "N")) |
| # Read SAS transport file | dm <- read_xpt("dm.xpt") |
| # Calculate study day | mutate(AESTDY = as.numeric(AESTDTC - RFSTDTC) + 1) |
| # Convert character date to Date | mutate(date = ymd(date_char)) |

Important: Always review and validate Copilot suggestions. Copilot accelerates coding but doesn’t replace your clinical programming expertise and knowledge of CDISC standards.
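
One way to validate a suggestion is to run it against a case with a known answer. The sketch below tests the age formula from Method 1 on a subject's first birthday and compares it with a birthday-counting version; `age_exact` is a hypothetical reference implementation written for this check, and the dates are invented.

```r
# Age formula of the kind an assistant may suggest (approximate)
age_approx <- function(birthdate, refdate) {
  floor(as.numeric(refdate - birthdate) / 365.25)
}

# Hypothetical reference implementation: count completed birthdays
age_exact <- function(birthdate, refdate) {
  b <- as.POSIXlt(birthdate)
  r <- as.POSIXlt(refdate)
  had_birthday <- (r$mon > b$mon) | (r$mon == b$mon & r$mday >= b$mday)
  (r$year - b$year) - ifelse(had_birthday, 0, 1)
}

# Test case on a first birthday: 365 days have elapsed, which is < 365.25
birthdate <- as.Date("2001-01-15")
refdate   <- as.Date("2002-01-15")

age_approx(birthdate, refdate)  # 0
age_exact(birthdate, refdate)   # 1
```

The two functions disagree on this date, which shows how a plausible-looking suggestion can be off by one at a boundary.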


📝 Module Summary

By completing this module, you should now be able to:

✅ Understand the data science process and how it applies to clinical programming (R4DS Ch. 1)
✅ Work with vectors and understand R’s fundamental data structures (R4DS Ch. 4)
✅ Navigate RStudio efficiently and organize your workflow (R4DS Ch. 6)
✅ Create reproducible scripts and manage projects effectively (R4DS Ch. 8)
✅ Load and use tidyverse packages for clinical data analysis
✅ Apply workflow best practices for clinical programming environments

🚀 Next Steps:

  • Practice vector operations and RStudio navigation with hands-on exercises
  • Set up your first clinical programming project using RStudio Projects
  • Prepare for Module 2: Data Manipulation with the tidyverse

🎯 Key Takeaways

  1. Data science workflow provides a systematic approach to clinical data analysis
  2. Vectors are fundamental - understanding them is essential for R programming
  3. RStudio Projects ensure reproducible and organized clinical programming
  4. Tidyverse packages provide a coherent set of tools for data science
  5. Workflow practices from R4DS apply directly to regulatory clinical programming
  6. AI assistance can accelerate coding, but every suggestion must be reviewed and validated to maintain regulatory compliance

🔗 R4DS Chapter Integration:

  • Chapter 1: Data science process applied to clinical workflows
  • Chapter 4: Vector fundamentals for clinical data manipulation
  • Chapter 6: RStudio workflow for clinical programming
  • Chapter 8: Script and project organization for regulatory compliance

Ready to start manipulating clinical data? Let’s continue to Module 2!