Module 1 Theory — Introduction to R and Data Science

📊 Module 1 — Introduction to R

🎯 Learning Objectives

By the end of this module, you will:

Understand the R ecosystem and data science workflow (R4DS Ch. 1)
Master R workflow basics including code organization and project management (R4DS Ch. 6)
Navigate RStudio interface efficiently for clinical programming workflows (R4DS Ch. 6)
Apply workflow fundamentals with scripts, directories, and reproducible practices (R4DS Ch. 8)
Work with vectors and understand R’s fundamental data structures (R4DS Ch. 4)
Load essential packages (tidyverse, haven) for clinical programming
Set up clinical programming environment with proper configuration

🌐 1. What is Data Science? (R4DS Ch. 1)

The Data Science Process

Data science is an exciting discipline that allows you to transform raw data into understanding, insight, and knowledge. The R4DS process includes:

Import → Tidy → Transform → Visualize → Model → Communicate
   ↓        ↓        ↓          ↓         ↓         ↓
 Data → Clean → Wrangle → Explore → Predict → Report

In Clinical Programming Context:

Import: Read SAS XPT files, CSV data, CRF exports
Tidy: Structure data according to CDISC standards (SDTM/ADaM)
Transform: Create derived variables, flag values, calculate parameters
Visualize: Create safety and efficacy plots, patient profiles
Model: Statistical analyses, exposure-response modeling
Communicate: Generate tables, listings, figures (TLFs)
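The stages above can be sketched end to end in a few lines. This is a minimal illustration rather than a production program: it uses base R only so that it runs without extra packages, the inline text stands in for a raw CSV file, and the column names and values are made up for the example.

```r
# Import: read raw data (inline text stands in for a CSV file on disk)
raw <- read.csv(
  text = "subject_id,age,sex
001-001,45,M
001-002,71,F
001-003,NA,F",
  stringsAsFactors = FALSE
)

# Tidy: rename columns to CDISC-style variable names
dm <- data.frame(
  USUBJID = raw$subject_id,
  AGE = raw$age,
  SEX = raw$sex,
  stringsAsFactors = FALSE
)

# Transform: derive an age group; a missing AGE stays missing
dm$AGEGR1 <- ifelse(dm$AGE >= 65, ">= 65 years", "< 65 years")

# Communicate: a simple frequency count, including missing values
table(dm$AGEGR1, useNA = "ifany")
```

Later modules replace these base R steps with their tidyverse equivalents, but the order of the stages stays the same.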

R in the Clinical Data Science Ecosystem

R provides a complete toolkit for clinical data science:

Data Import: haven (SAS), readr (CSV), readxl (Excel)
Data Manipulation: dplyr for data wrangling
Data Visualization: ggplot2 for publication-quality graphics
Reproducible Reports: R Markdown for integrated analysis and reporting
Clinical Standards: admiral, xportr for CDISC compliance

📊 2. Working with Vectors (R4DS Ch. 4)

Understanding R’s Foundation: Vectors

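Vectors are the building blocks of nearly every R object: even a single value is a vector of length one, and each column of a data frame is a vector. Understanding them is crucial for clinical programming.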

Atomic Vector Types

# Logical vectors (for flags and indicators)
elderly_flag <- c(TRUE, FALSE, TRUE, FALSE)
safety_population <- c(TRUE, TRUE, FALSE, TRUE)

# Integer vectors (for counts, IDs)
subject_n <- c(1L, 2L, 3L, 4L, 5L)
visit_num <- c(1L, 2L, 3L, 4L)

# Double vectors (for measurements, calculations)
age <- c(45.2, 62.8, 28.1, 71.5)
weight <- c(70.5, 85.2, 55.8, 92.1)

# Character vectors (for categories, IDs)
usubjid <- c("001-001", "001-002", "001-003", "001-004")
treatment <- c("Placebo", "Drug A", "Drug A", "Placebo")
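To confirm which atomic type a vector has, `typeof()` is the usual check. The sketch below also shows what happens when types are mixed in one vector, which is a common source of surprises when raw clinical data contains text such as "unknown" in a numeric field. The vectors are shortened copies of the ones above; `mixed_num` and `mixed_chr` are illustrative names.

```r
elderly_flag <- c(TRUE, FALSE, TRUE, FALSE)
subject_n <- c(1L, 2L, 3L)
age <- c(45.2, 62.8)
usubjid <- c("001-001", "001-002")

typeof(elderly_flag)  # "logical"
typeof(subject_n)     # "integer"
typeof(age)           # "double"
typeof(usubjid)       # "character"

# Mixing types coerces everything to the most flexible type
mixed_num <- c(1L, 2.5)          # integer + double becomes double
mixed_chr <- c(45.2, "unknown")  # number + text becomes character
typeof(mixed_num)  # "double"
typeof(mixed_chr)  # "character"

# Logical vectors count as 0/1 in arithmetic, handy for counting flags
sum(elderly_flag)  # 2
```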

Clinical Programming Applications

# Missing values in clinical data
lab_values <- c(120, 135, NA, 142, NA)  # Missing lab results
is.na(lab_values)  # Check for missing values

# Coercion and type conversion
age_char <- c("45", "62", "28", "71")
age_num <- as.numeric(age_char)  # Convert to numeric

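# Vector operations for derived variables
# (uses `age` and `weight` from the previous block; height values are illustrative)
height <- c(175.0, 180.3, 162.5, 185.1)  # Height in cm
bmi <- weight / ((height / 100)^2)  # Calculate BMI (kg/m^2)
age_group <- ifelse(age >= 65, "Elderly", "Adult")  # Create age groups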
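Missing values propagate through most calculations, so summaries need an explicit decision about how to treat them. The following sketch shows the typical pattern. The `age_raw` vector and its "unknown" entry are made up for illustration.

```r
lab_values <- c(120, 135, NA, 142, NA)

mean(lab_values)                # NA: any missing value makes the mean missing
mean(lab_values, na.rm = TRUE)  # about 132.33, computed from the 3 observed values
sum(is.na(lab_values))          # 2 missing results

# Coercion turns unparseable text into NA (R also issues a warning,
# suppressed here to keep the output clean)
age_raw <- c("45", "62", "unknown", "71")
age_num <- suppressWarnings(as.numeric(age_raw))
age_num            # 45 62 NA 71
which(is.na(age_num))  # position 3 needs a data query
```

Counting the missing values before and after a conversion is a quick way to detect records that failed to parse.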

Vector Attributes and Names

# Named vectors for lookup tables
dose_levels <- c("Low" = 10, "Medium" = 50, "High" = 100)
treatment_codes <- c("Placebo" = "P", "Active" = "A")

# Using names for clinical programming
dose_levels["High"]    # Returns the named element: High = 100
dose_levels[["High"]]  # Returns the bare value: 100
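A named vector becomes a lookup table when it is indexed with a whole character vector at once. The sketch below recodes a treatment column this way. The `treatment` values and the "Unknown" entry are illustrative.

```r
treatment_codes <- c("Placebo" = "P", "Active" = "A")
treatment <- c("Placebo", "Active", "Active", "Placebo", "Unknown")

# Index the lookup table by name; unname() drops the names from the result
trt_code <- unname(treatment_codes[treatment])
trt_code  # "P" "A" "A" "P" NA

# Values with no entry in the lookup table come back as NA
treatment[is.na(trt_code)]  # "Unknown"
```

Checking for `NA` after a lookup catches categories that the lookup table does not cover.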

🔧 3. R Workflow Basics (R4DS Ch. 6)

What is Workflow?

Workflow encompasses the practices and tools that make your data analysis:

Reproducible: Others (and future you) can recreate your work
Reliable: Your code works consistently across different environments
Shareable: Code and results can be easily communicated

The RStudio IDE for Clinical Programming

RStudio provides a powerful integrated development environment specifically designed for R:

Four Key Panes:

Script Editor (Top-Left): Write R scripts (.R) and R Markdown (.Rmd) files
Console (Bottom-Left): Execute code interactively and see immediate results
Environment/History (Top-Right): View objects in memory and command history
Files/Plots/Packages/Help (Bottom-Right): Navigate project files and view outputs

Essential RStudio Features for Clinical Work:

Syntax highlighting: Color-coded R syntax for better readability
Auto-completion: Tab completion for functions, arguments, and object names
Code folding: Collapse sections of code for better organization
Integrated help: Quick access to function documentation
Project management: Organize related files and maintain consistent working directories

Code Organization Best Practices

# Use clear, descriptive variable names
elderly_subjects <- dm[dm$AGE >= 65, ]  # Good
old <- dm[dm$AGE >= 65, ]              # Avoid

# Add comments explaining clinical context
# Calculate study day from reference start date
# (assumes both are Date objects; the + 1 applies only on or after the reference date)
studyday <- as.numeric(visit_date - rfstdtc) + 1

# Use consistent spacing and indentation
dm_clean <- dm %>%
  filter(!is.na(USUBJID)) %>%          # Remove subjects without ID
  mutate(AGEGR1 = case_when(           # Create age groups per protocol
    AGE < 65 ~ "< 65 years",
    AGE >= 65 ~ ">= 65 years"
  ))
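The study day calculation in the block above relies on a CDISC convention: there is no study day 0, so 1 is added only for dates on or after the reference date, while earlier dates get the plain (negative) difference. A minimal sketch of that rule follows. The function name `derive_study_day` and the dates are hypothetical.

```r
# Derive study day from two Date vectors following the "no day 0" convention
derive_study_day <- function(event_date, ref_date) {
  diff_days <- as.numeric(difftime(event_date, ref_date, units = "days"))
  # On or after the reference date: add 1; before it: keep the negative difference
  ifelse(diff_days >= 0, diff_days + 1, diff_days)
}

ref_date <- as.Date("2024-01-15")
visit_date <- as.Date(c("2024-01-14", "2024-01-15", "2024-01-20", NA))

derive_study_day(visit_date, ref_date)  # -1  1  6 NA
```

Wrapping the rule in a function keeps the convention in one place instead of repeating the `+ 1` logic in every program.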

📂 4. Workflow: Scripts and Projects (R4DS Ch. 8)

Scripts for Reproducible Analysis

Scripts are text files containing R code that can be executed to reproduce your analysis. For clinical programming, well-organized scripts are essential for regulatory compliance and collaboration.

Script Best Practices:

# =============================================================================
# Program: DM Dataset Creation
# Author: [Your Name]  
# Date: 2024-11-10
# Purpose: Create SDTM DM dataset from raw demographics data
# Input: data/raw/demographics.csv
# Output: data/sdtm/dm.xpt
# =============================================================================

# Load required packages
library(tidyverse)
library(haven)
library(lubridate)

# Read raw data
raw_dm <- read_csv("data/raw/demographics.csv")

# Create SDTM DM dataset
dm <- raw_dm %>%
  select(USUBJID = subject_id, AGE = age, SEX = sex) %>%
  mutate(
    DOMAIN = "DM",
    RFSTDTC = "2024-01-15",  # Protocol-specified reference date
    ARMCD = "TRT01A",        # Treatment arm code
    ARM = "Treatment A"      # Treatment arm description
  )

# Export to XPT format (version 5 is the transport version generally expected for regulatory submissions)
write_xpt(dm, "data/sdtm/dm.xpt", version = 5)

RStudio Projects (.Rproj)

Projects help organize your clinical programming work and ensure reproducibility:

Benefits of Using Projects:

Consistent working directory: No need for setwd()
Isolated environments: Each project maintains its own workspace
Version control integration: Easy Git integration for regulatory tracking
Organized file structure: Keep data, programs, and outputs separated

Recommended Project Structure:

my_clinical_study/
├── my_clinical_study.Rproj
├── data/
│   ├── raw/           # Original data files
│   ├── sdtm/          # SDTM datasets
│   └── adam/          # ADaM datasets
├── programs/
│   ├── sdtm/          # SDTM creation programs
│   ├── adam/          # ADaM creation programs
│   └── tlf/           # Tables, listings, figures
├── outputs/
│   ├── datasets/      # Final datasets
│   ├── tables/        # Analysis tables
│   └── figures/       # Analysis figures
└── docs/              # Documentation, protocols
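The folder layout above can be created from R rather than by hand, which makes it easy to start every study the same way. This is a sketch under stated assumptions: it builds the tree inside a temporary directory so it is safe to run anywhere, and the `.Rproj` file itself is still created through RStudio (File → New Project).

```r
# Build the recommended structure under a temporary directory (safe to run)
project_root <- file.path(tempdir(), "my_clinical_study")

subdirs <- c(
  "data/raw", "data/sdtm", "data/adam",
  "programs/sdtm", "programs/adam", "programs/tlf",
  "outputs/datasets", "outputs/tables", "outputs/figures",
  "docs"
)

for (subdir in subdirs) {
  # recursive = TRUE also creates the parent folders (data/, programs/, outputs/)
  dir.create(file.path(project_root, subdir), recursive = TRUE, showWarnings = FALSE)
}

# List what was created, relative to the project root
list.dirs(project_root, full.names = FALSE)
```

To use it for a real study, replace `tempdir()` with the folder where the project should live.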

Working Directory and Paths

Understanding file paths is crucial for reproducible clinical programming:

# Check current working directory
getwd()

# Read files using relative paths (preferred)  
dm <- read_csv("data/raw/demographics.csv")      # Good
dm <- read_csv("C:/Users/me/data/demo.csv")      # Avoid - not portable

# Use here package for robust path handling
library(here)
dm <- read_csv(here("data", "raw", "demographics.csv"))
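When a relative path fails, the cause is usually that the working directory is not the project root. A small guard at the top of a script makes that failure explicit. The sketch uses base R's `file.path()` so it runs without additional packages; the file name matches the examples above, and the message wording is illustrative.

```r
# Build a relative path from its parts; file.path() joins them with "/"
input_path <- file.path("data", "raw", "demographics.csv")
input_path  # "data/raw/demographics.csv"

# Report a missing input together with the current working directory
if (!file.exists(input_path)) {
  message(
    "Input file not found: ", input_path,
    " (working directory: ", getwd(), ")"
  )
}
```

In a production script, `stop()` in place of `message()` would halt execution before any downstream step runs on missing data.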

📦 5. Essential Packages for Clinical Programming

The tidyverse is a collection of R packages designed for data science that share an underlying design philosophy, grammar, and data structures.

Installing and Loading the Tidyverse

# Install tidyverse (includes dplyr, ggplot2, tibble, readr, and more)
install.packages("tidyverse")

# Install additional clinical-specific packages
install.packages(c("haven", "lubridate", "here"))

# Load packages at start of each session
library(tidyverse)  # Loads core tidyverse packages
library(haven)      # Read/write SAS, SPSS, Stata files
library(lubridate)  # Date/time handling
library(here)       # Robust file path handling

Core Tidyverse Packages for Clinical Work:

Package	Purpose	Key Functions
`dplyr`	Data manipulation (like SAS DATA step)	`filter()`, `select()`, `mutate()`
`tibble`	Enhanced data frames	`tibble()`, `as_tibble()`
`readr`	Reading rectangular data	`read_csv()`, `write_csv()`
`ggplot2`	Data visualization	`ggplot()`, `geom_point()`, `geom_line()`
`stringr`	String manipulation	`str_detect()`, `str_replace()`
`forcats`	Factor handling	`fct_reorder()`, `fct_lump()`

Clinical-Specific Packages:

Package	Purpose	Key Functions
`haven`	Read SAS7BDAT files; read/write SAS XPT transport files	`read_sas()`, `read_xpt()`, `write_xpt()`
`lubridate`	Date/time manipulation	`ymd()`, `dmy()`, `today()`
`here`	Robust file path construction	`here()`

Package Loading Best Practices

# Good: Load all packages at the top of your script
library(tidyverse)
library(haven)
library(lubridate)

# Check what packages are loaded
search()

# List the packages that make up the tidyverse
tidyverse_packages()

# Check the installed version of a specific package
packageVersion("dplyr")
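Before a script calls `library()`, it can be useful to check that every required package is installed, so a missing package is reported once with a clear message. A minimal sketch, assuming the package list used in this module; `required_pkgs` and `missing_pkgs` are illustrative names.

```r
required_pkgs <- c("tidyverse", "haven", "lubridate", "here")

# requireNamespace() returns TRUE if the package is installed and loadable
installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- required_pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Install these packages first: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

This check only reports what is missing. Installing packages stays a deliberate, separate step, which suits controlled environments.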

🤖 6. Getting Help and AI Assistance

Getting Help in R

R has extensive built-in help systems and community resources:

Built-in Help Functions:

# Get help for a specific function
?mean
help(mean)

# Search for functions containing a keyword
??regression
help.search("regression")

# Get examples of function usage
example(mean)

# View function arguments
args(lm)

RStudio Help Integration:

Help tab: Click on functions or use F1 for contextual help
Auto-completion: Tab completion shows function arguments
Function tooltips: Hover over functions to see signatures

AI-Powered Assistance: GitHub Copilot

GitHub Copilot is an AI pair programmer that can speed up clinical programming by suggesting code as you type.

Setup for RStudio:

GitHub Copilot subscription: Sign up (free for students/eligible users)
RStudio configuration: Enable in Tools → Global Options → Copilot
Sign in: Connect your GitHub account to RStudio

How to Use Copilot Effectively:

Method 1: Comment-Driven Development

# Read a SAS XPT file and convert to tibble
# Copilot may suggest: dm <- read_xpt("dm.xpt") %>% as_tibble()
# (read_xpt() already returns a tibble, so as_tibble() is redundant here)

# Calculate age from birthdate and reference date
# Copilot may suggest: age <- floor(as.numeric(refdate - birthdate) / 365.25)

Method 2: Function Name Auto-completion

# Start typing a function name and Copilot suggests parameters
dm %>% 
  filter(  # Copilot suggests: AGE >= 18, ARMCD == "TRT01A", etc.

Method 3: Pattern Recognition

# After you write similar code, Copilot uses it as context for its suggestions
dm <- dm %>% mutate(AGEGR1 = case_when(
  AGE < 65 ~ "< 65",
  AGE >= 65 ~ ">= 65"  # Copilot suggests this based on pattern
))

Common Copilot Prompts for Clinical Programming:

Comment Prompt	Likely Copilot Suggestion
`# Create elderly flag for age >= 65`	`mutate(ELDERLY = ifelse(AGE >= 65, "Y", "N"))`
`# Read SAS transport file`	`dm <- read_xpt("dm.xpt")`
`# Calculate study day`	`mutate(AESTDY = as.numeric(AESTDTC - RFSTDTC) + 1)`
`# Convert character date to Date`	`mutate(date = ymd(date_char))`

Important: Always review and validate Copilot suggestions. Copilot accelerates coding but doesn’t replace your clinical programming expertise and knowledge of CDISC standards.
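One practical way to review a suggestion is to run it against cases whose answer is known. The sketch below does this for the age formula from Method 1. It shows that dividing by 365.25 can be off by one on a birthday, and compares it with a calendar-based calculation. Both function names (`age_approx`, `age_exact`) and the test dates are hypothetical.

```r
# Suggested formula: days elapsed divided by the average year length
age_approx <- function(birthdate, refdate) {
  floor(as.numeric(refdate - birthdate) / 365.25)
}

# Calendar-based alternative: year difference, minus 1 if the birthday
# has not yet occurred in the reference year
age_exact <- function(birthdate, refdate) {
  b <- as.POSIXlt(birthdate)
  r <- as.POSIXlt(refdate)
  had_birthday <- (r$mon > b$mon) | (r$mon == b$mon & r$mday >= b$mday)
  (r$year - b$year) - ifelse(had_birthday, 0L, 1L)
}

birthdate <- as.Date("2001-01-15")
refdate <- as.Date("2002-01-15")  # exactly the first birthday

age_approx(birthdate, refdate)  # 0: 365 days / 365.25 is just below 1
age_exact(birthdate, refdate)   # 1
```

A handful of boundary cases like this one (birthdays, leap years, missing dates) is usually enough to decide whether a suggested derivation can be trusted.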

📝 Module Summary

By completing this module, you should now be able to:

✅ Understand the data science process and how it applies to clinical programming (R4DS Ch. 1)
✅ Work with vectors and understand R’s fundamental data structures (R4DS Ch. 4)
✅ Navigate RStudio efficiently and organize your workflow (R4DS Ch. 6)
✅ Create reproducible scripts and manage projects effectively (R4DS Ch. 8)
✅ Load and use tidyverse packages for clinical data analysis
✅ Apply workflow best practices for clinical programming environments

🚀 Next Steps:

Practice vector operations and RStudio navigation with hands-on exercises
Set up your first clinical programming project using RStudio Projects
Prepare for Module 2: Data Manipulation with the tidyverse

🎯 Key Takeaways

Data science workflow provides a systematic approach to clinical data analysis
Vectors are fundamental - understanding them is essential for R programming
RStudio Projects ensure reproducible and organized clinical programming
Tidyverse packages provide a coherent set of tools for data science
Workflow practices from R4DS apply directly to regulatory clinical programming
AI assistance can accelerate coding, but every suggestion must be reviewed and validated to maintain regulatory compliance

🔗 R4DS Chapter Integration:

Chapter 1: Data science process applied to clinical workflows
Chapter 4: Vector fundamentals for clinical data manipulation
Chapter 6: RStudio workflow for clinical programming
Chapter 8: Script and project organization for regulatory compliance

Ready to start manipulating clinical data? Let’s continue to Module 2!