Module 1 Solution — Introduction to R and Data Science

🧪 Module 1 Solutions — R Fundamentals and Data Science

This document provides solutions for Module 1 exercises covering R fundamentals, vectors, RStudio workflow, and data science principles applied to clinical programming.

✅ Exercise 1: Working with Vectors (R4DS Ch. 4)

Understanding vectors is fundamental to R programming. Here we create the basic building blocks for clinical data.

# Load required packages
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.6.0
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

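library(haven)      # Reads SAS, SPSS, and Stata files; not used in the exercises below
library(lubridate)  # Already attached by tidyverse 2.0.0 (see the startup message above)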

# Create atomic vectors for clinical data
# Logical vectors for flags
safety_population <- c(TRUE, TRUE, FALSE, TRUE, TRUE)
elderly_flag <- c(FALSE, TRUE, FALSE, TRUE, FALSE)

# Numeric vectors for measurements  
age <- c(25, 45, 67, 52, 71)
weight <- c(65.2, 78.5, 85.1, 72.8, 90.3)
height <- c(165, 170, 160, 175, 168)

# Character vectors for IDs and categories
usubjid <- c("001-001", "001-002", "001-003", "001-004", "001-005")  
treatment <- c("Placebo", "Drug A", "Drug A", "Placebo", "Drug A")

# Vector operations
bmi <- weight / ((height/100)^2)
age_group <- ifelse(age >= 65, "Elderly", "Adult")

# Display results
cat("Ages:", age, "\n")

Ages: 25 45 67 52 71

cat("BMI:", round(bmi, 1), "\n")

BMI: 23.9 27.2 33.2 23.8 32

cat("Age Groups:", age_group, "\n")

Age Groups: Adult Adult Elderly Adult Elderly
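
The vectors in this exercise are atomic: every element of a vector shares one type, and comparisons or arithmetic apply element by element. The following is a minimal base R sketch of how that can be checked. The `mixed` vector is a made-up illustration and not part of the exercise.

```r
# Same ages as in the exercise
age <- c(25, 45, 67, 52, 71)

# typeof() reports the storage type of an atomic vector
typeof(age)          # "double"
typeof(age >= 65)    # "logical"
typeof("001-001")    # "character"

# Mixing types silently coerces every element to the most flexible type
mixed <- c(1, "a", TRUE)   # hypothetical example vector
typeof(mixed)              # "character"

# Logical values behave as 0/1 in arithmetic, so sum() counts and mean() gives a proportion
n_elderly    <- sum(age >= 65)    # number of subjects aged 65 or older
prop_elderly <- mean(age >= 65)   # proportion of subjects aged 65 or older
```

Silent coercion is worth keeping in mind when a numeric column unexpectedly turns into character after a single text value slips in.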

✅ Exercise 2: Create Clinical Dataset with Tibbles

Tibbles are data frames with more informative printing and stricter subsetting, which helps catch mistakes early in clinical programming.

# Create tibble using vectors from Exercise 1
dm <- tibble(
  USUBJID = usubjid,
  AGE = age,
  WEIGHT = weight,
  HEIGHT = height, 
  SEX = c("F", "M", "F", "M", "F"),
  TRT01A = treatment,
  RFSTDTC = c("2024-01-15", "2024-01-16", "2024-01-17", "2024-01-18", "2024-01-19"),
  SAFFL = safety_population,
  BMI = round(bmi, 1),
  AGEGR1 = age_group
)

# Display the dataset
dm

# A tibble: 5 × 10
  USUBJID   AGE WEIGHT HEIGHT SEX   TRT01A  RFSTDTC    SAFFL   BMI AGEGR1
  <chr>   <dbl>  <dbl>  <dbl> <chr> <chr>   <chr>      <lgl> <dbl> <chr>
1 001-001    25   65.2    165 F     Placebo 2024-01-15 TRUE   23.9 Adult
2 001-002    45   78.5    170 M     Drug A  2024-01-16 TRUE   27.2 Adult
3 001-003    67   85.1    160 F     Drug A  2024-01-17 FALSE  33.2 Elderly
4 001-004    52   72.8    175 M     Placebo 2024-01-18 TRUE   23.8 Adult
5 001-005    71   90.3    168 F     Drug A  2024-01-19 TRUE   32   Elderly

# Explore the dataset structure
glimpse(dm)

Rows: 5
Columns: 10
$ USUBJID <chr> "001-001", "001-002", "001-003", "001-004", "001-005"
$ AGE     <dbl> 25, 45, 67, 52, 71
$ WEIGHT  <dbl> 65.2, 78.5, 85.1, 72.8, 90.3
$ HEIGHT  <dbl> 165, 170, 160, 175, 168
$ SEX     <chr> "F", "M", "F", "M", "F"
$ TRT01A  <chr> "Placebo", "Drug A", "Drug A", "Placebo", "Drug A"
$ RFSTDTC <chr> "2024-01-15", "2024-01-16", "2024-01-17", "2024-01-18", "2024-…
$ SAFFL   <lgl> TRUE, TRUE, FALSE, TRUE, TRUE
$ BMI     <dbl> 23.9, 27.2, 33.2, 23.8, 32.0
$ AGEGR1  <chr> "Adult", "Adult", "Elderly", "Adult", "Elderly"

# Summary statistics
summary(dm)

   USUBJID               AGE         WEIGHT          HEIGHT
 Length:5           Min.   :25   Min.   :65.20   Min.   :160.0
 Class :character   1st Qu.:45   1st Qu.:72.80   1st Qu.:165.0
 Mode  :character   Median :52   Median :78.50   Median :168.0
                    Mean   :52   Mean   :78.38   Mean   :167.6
                    3rd Qu.:67   3rd Qu.:85.10   3rd Qu.:170.0
                    Max.   :71   Max.   :90.30   Max.   :175.0
     SEX               TRT01A            RFSTDTC            SAFFL
 Length:5           Length:5           Length:5           Mode :logical
 Class :character   Class :character   Class :character   FALSE:1
 Mode  :character   Mode  :character   Mode  :character   TRUE :4



      BMI           AGEGR1
 Min.   :23.80   Length:5
 1st Qu.:23.90   Class :character
 Median :27.20   Mode  :character
 Mean   :28.02
 3rd Qu.:32.00
 Max.   :33.20

Key Learning Points:

- Tibbles show data types clearly
- Better printing than data.frames
- Stricter subsetting prevents common errors
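
To make "stricter subsetting" concrete, here is a small sketch comparing a base data.frame with a tibble. The two-row dataset is made up for illustration.

```r
library(tibble)

# Small made-up dataset; the values are illustrative only
df <- data.frame(USUBJID = c("001-001", "001-002"), AGE = c(25, 45))
tb <- as_tibble(df)

# Selecting one column with [ : the data.frame drops to a plain vector,
# while the tibble stays a tibble
class(df[, "AGE"])
class(tb[, "AGE"])

# Partial name matching with $ : the data.frame matches "AG" to AGE,
# while the tibble returns NULL and warns about the unknown column
df$AG
suppressWarnings(tb$AG)
```

The tibble behaviour means a mistyped column name surfaces as a warning instead of quietly returning a different column.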

✅ Exercise 3: RStudio Workflow Practice (R4DS Ch. 6)

Proper workflow practices ensure reproducible and maintainable clinical programming code.

# Practice code organization - add comments explaining clinical context
dm <- dm %>%
  mutate(
    # Create elderly flag based on protocol definition (age >= 65)
    ELDERLY = ifelse(AGE >= 65, "Y", "N"),
    
    # Convert character date to proper Date format for calculations
    RFSTDT = ymd(RFSTDTC),
    
    # Create BMI categories following clinical guidelines
    BMICAT = case_when(
      BMI < 18.5 ~ "Underweight",
      BMI < 25 ~ "Normal", 
      BMI < 30 ~ "Overweight",
      BMI >= 30 ~ "Obese"
    )
  )

# Display the updated dataset
print(dm)

# A tibble: 5 × 13
  USUBJID   AGE WEIGHT HEIGHT SEX   TRT01A  RFSTDTC   SAFFL   BMI AGEGR1 ELDERLY
  <chr>   <dbl>  <dbl>  <dbl> <chr> <chr>   <chr>     <lgl> <dbl> <chr>  <chr>
1 001-001    25   65.2    165 F     Placebo 2024-01-… TRUE   23.9 Adult  N
2 001-002    45   78.5    170 M     Drug A  2024-01-… TRUE   27.2 Adult  N
3 001-003    67   85.1    160 F     Drug A  2024-01-… FALSE  33.2 Elder… Y
4 001-004    52   72.8    175 M     Placebo 2024-01-… TRUE   23.8 Adult  N
5 001-005    71   90.3    168 F     Drug A  2024-01-… TRUE   32   Elder… Y
# ℹ 2 more variables: RFSTDT <date>, BMICAT <chr>
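
Converting `RFSTDTC` to a Date is what makes date arithmetic possible. The sketch below shows one such calculation. The visit date and the `study_day()` helper are hypothetical, and the "no day 0" rule is stated here as an assumption about the usual study day convention, so check it against your study's specification.

```r
library(lubridate)

rfstdt <- ymd("2024-01-15")   # reference start date of the first subject
visit  <- ymd("2024-01-29")   # hypothetical visit date

class(rfstdt)                 # "Date"
as.integer(visit - rfstdt)    # days between the two dates

# Hypothetical helper: study day with no day 0
# (reference date is day 1, the day before it is day -1)
study_day <- function(date, ref) {
  diff_days <- as.integer(date - ref)
  ifelse(diff_days >= 0, diff_days + 1L, diff_days)
}

study_day(visit, rfstdt)
study_day(ymd("2024-01-14"), rfstdt)

# Impossible dates parse to NA; quiet = TRUE suppresses the parse warning
bad_date <- ymd("2024-02-30", quiet = TRUE)
```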

RStudio Best Practices Applied:

- Clear, descriptive variable names
- Comments explaining clinical context
- Consistent code formatting and indentation
- Logical grouping of related operations
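
`case_when()` checks its conditions from top to bottom and uses the first one that is true, which is why the BMI cut-offs can be written as simple upper bounds. The sketch below shows that ordering on boundary values and one way to handle a missing BMI. The input vector is made up, and the `.default` argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Made-up BMI values chosen to sit on or near the category boundaries
bmi <- c(17.0, 24.9, 25.0, 30.0, NA)

bmicat <- case_when(
  is.na(bmi) ~ "Missing",      # handle NA first, otherwise .default would label it "Obese"
  bmi < 18.5 ~ "Underweight",
  bmi < 25   ~ "Normal",       # only reached when bmi >= 18.5
  bmi < 30   ~ "Overweight",   # only reached when bmi >= 25
  .default   = "Obese"         # everything left over, i.e. bmi >= 30
)

bmicat
```

Whether a missing BMI should be labelled or left as `NA` is a study-level decision; the explicit `"Missing"` label here is only one option.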

✅ Exercise 4 & 5: Project Organization and Data Summarization

Understanding project structure and creating summaries are essential for clinical programming workflows.

# Project organization demonstration
cat("Current working directory:", getwd(), "\n")

Current working directory: C:/Users/chiar/OneDrive - Alma Mater Studiorum Università di Bologna/Desktop/Repo/beginR/training_material/module 1 - intro

cat("\nRecommended clinical project structure:\n")


Recommended clinical project structure:

cat("my_clinical_study/\n")

my_clinical_study/

cat("├── my_clinical_study.Rproj\n")

├── my_clinical_study.Rproj

cat("├── data/\n")

├── data/

cat("│   ├── raw/           # Original data files\n")

│   ├── raw/           # Original data files

cat("│   ├── sdtm/          # SDTM datasets\n")

│   ├── sdtm/          # SDTM datasets

cat("│   └── adam/          # ADaM datasets\n")

│   └── adam/          # ADaM datasets

cat("├── programs/\n")

├── programs/

cat("│   ├── sdtm/          # SDTM creation programs\n")

│   ├── sdtm/          # SDTM creation programs

cat("│   ├── adam/          # ADaM creation programs\n")

│   ├── adam/          # ADaM creation programs

cat("│   └── tlf/           # Tables, listings, figures programs\n")

│   └── tlf/           # Tables, listings, figures programs

cat("└── outputs/           # Generated outputs\n")

└── outputs/           # Generated outputs
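
The recommended layout can also be created from R instead of by hand. The following is a minimal sketch using base R only. It builds the folders under a temporary directory so that no real study folder is touched, and it does not create the `.Rproj` file, which RStudio normally generates itself.

```r
# Build the tree under a temporary directory; swap in a real path for actual use
project_root <- file.path(tempdir(), "my_clinical_study")

subdirs <- c(
  "data/raw", "data/sdtm", "data/adam",
  "programs/sdtm", "programs/adam", "programs/tlf",
  "outputs"
)

# recursive = TRUE creates the parent folders (data/, programs/) as needed
for (path in file.path(project_root, subdirs)) {
  dir.create(path, recursive = TRUE, showWarnings = FALSE)
}

# Confirm that every folder now exists
all(dir.exists(file.path(project_root, subdirs)))
```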

# Create safety population summary following good practices
safety_summary <- dm %>%
  filter(SAFFL) %>%                       # Keep only safety population (SAFFL is logical)
  select(USUBJID, AGE, SEX, TRT01A, BMICAT) %>%  # Select key variables
  arrange(AGE)                            # Sort by age

print(safety_summary)

# A tibble: 4 × 5
  USUBJID   AGE SEX   TRT01A  BMICAT
  <chr>   <dbl> <chr> <chr>   <chr>
1 001-001    25 F     Placebo Normal
2 001-002    45 M     Drug A  Overweight
3 001-004    52 M     Placebo Normal
4 001-005    71 F     Drug A  Obese

# Data summarization examples
# 1. BMI category counts
bmi_summary <- dm %>%
  count(BMICAT, name = "n_subjects")

# 2. Mean age by treatment
age_by_treatment <- dm %>%
  group_by(TRT01A) %>%
  summarise(
    n = n(),
    mean_age = mean(AGE),
    sd_age = sd(AGE)
  )

# 3. Elderly subjects count
elderly_summary <- dm %>%
  count(ELDERLY)

print("BMI Category Summary:")

[1] "BMI Category Summary:"

print(bmi_summary)

# A tibble: 3 × 2
  BMICAT     n_subjects
  <chr>           <int>
1 Normal              2
2 Obese               2
3 Overweight          1

print("Age by Treatment:")

[1] "Age by Treatment:"

print(age_by_treatment)

# A tibble: 2 × 4
  TRT01A      n mean_age sd_age
  <chr>   <int>    <dbl>  <dbl>
1 Drug A      3     61     14
2 Placebo     2     38.5   19.1

print("Elderly Summary:")

[1] "Elderly Summary:"

print(elderly_summary)

# A tibble: 2 × 2
  ELDERLY     n
  <chr>   <int>
1 N           3
2 Y           2
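
`count()` is shorthand for grouping and then counting rows with `summarise()`. The sketch below shows the equivalence on a small stand-alone copy of the treatment and age columns, and adds a note on missing values, which this module's data does not contain. The `dm_demo` name is made up for the example.

```r
library(dplyr)

# Stand-alone copy of two columns from the exercise data
dm_demo <- tibble(
  TRT01A = c("Placebo", "Drug A", "Drug A", "Placebo", "Drug A"),
  AGE    = c(25, 45, 67, 52, 71)
)

# Two ways to count subjects per treatment
by_count     <- dm_demo %>% count(TRT01A)
by_summarise <- dm_demo %>% group_by(TRT01A) %>% summarise(n = n())

# Summary functions return NA if any input is missing, unless na.rm = TRUE
mean(c(25, NA, 52))                 # NA
mean(c(25, NA, 52), na.rm = TRUE)   # mean of the two non-missing values
```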

✅ Exercises 6 & 7: Help System and String Manipulation

Getting help and manipulating strings are crucial skills for clinical programming.

# Getting help examples
cat("R Help System Commands:\n")

R Help System Commands:

cat("?mean        # Help for specific function\n")

?mean        # Help for specific function

cat("??regression # Search for functions\n")

??regression # Search for functions

cat("example(mean) # See function examples\n")

example(mean) # See function examples

cat("args(lm)     # View function arguments\n\n")

args(lm)     # View function arguments
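
Besides opening help pages interactively, a few base R functions return the same kind of information as ordinary values, which can be convenient inside scripts. A short sketch:

```r
# Argument names of sd(), without opening the help page
sd_args <- names(formals(sd))
sd_args

# Objects on the search path whose names start with "mean"
mean_like <- apropos("^mean")
head(mean_like)

# Check that a name refers to a function before relying on it
is.function(mean)
```

`formals()` works for ordinary R functions such as `sd()`; for primitive functions like `sum()` it returns `NULL`, and `args()` is the better choice.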

# String manipulation for clinical data
dm <- dm %>%
  mutate(
    # Extract site number from USUBJID (part before the dash)
    SITE = str_extract(USUBJID, "^\\d{3}"),  # First 3 digits at the start of the ID
    
    # Create formatted subject label  
    SUBJ_LABEL = paste0("Subject ", USUBJID, " (Age: ", AGE, ")")
  )

print(dm)

# A tibble: 5 × 15
  USUBJID   AGE WEIGHT HEIGHT SEX   TRT01A  RFSTDTC   SAFFL   BMI AGEGR1 ELDERLY
  <chr>   <dbl>  <dbl>  <dbl> <chr> <chr>   <chr>     <lgl> <dbl> <chr>  <chr>
1 001-001    25   65.2    165 F     Placebo 2024-01-… TRUE   23.9 Adult  N
2 001-002    45   78.5    170 M     Drug A  2024-01-… TRUE   27.2 Adult  N
3 001-003    67   85.1    160 F     Drug A  2024-01-… FALSE  33.2 Elder… Y
4 001-004    52   72.8    175 M     Placebo 2024-01-… TRUE   23.8 Adult  N
5 001-005    71   90.3    168 F     Drug A  2024-01-… TRUE   32   Elder… Y
# ℹ 4 more variables: RFSTDT <date>, BMICAT <chr>, SITE <chr>, SUBJ_LABEL <chr>
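
How the regular expression is written determines what counts as the site number. The sketch below compares an unanchored three-digit pattern with a stricter pattern that takes the digits before the first dash. The IDs are made up to show where the two differ; real `USUBJID` formats are study-specific.

```r
library(stringr)

# Hypothetical IDs in three different formats
ids <- c("001-001", "12-0345", "AB-1234")

# Unanchored: the first run of three digits found anywhere in the string
loose <- str_extract(ids, "\\d{3}")

# Anchored: one or more digits at the start, immediately followed by a dash
# (?=-) is a lookahead, so the dash itself is not part of the match
strict <- str_extract(ids, "^\\d+(?=-)")

loose
strict
```

The unanchored pattern returns something for every ID, even when the digits come from the subject part, while the anchored pattern returns `NA` for IDs that do not start with a numeric site code.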

# Example calculations for AI assistance practice
young_flag <- ifelse(dm$AGE < 30, "Y", "N")
treatment_weeks <- 12
total_days <- treatment_weeks * 7

cat("Example calculations:\n")

Example calculations:

cat("Young subjects (< 30 years):", sum(young_flag == "Y"), "\n")

Young subjects (< 30 years): 1

cat("Treatment duration:", treatment_weeks, "weeks =", total_days, "days\n")

Treatment duration: 12 weeks = 84 days

✅ Final Summary Report

Creating comprehensive summaries is essential for clinical data review.

# Create a clinical summary report
cat("=== DEMOGRAPHICS SUMMARY ===\n")

=== DEMOGRAPHICS SUMMARY ===

cat("Total subjects:", nrow(dm), "\n")

Total subjects: 5

cat("Age range:", min(dm$AGE), "to", max(dm$AGE), "years\n")

Age range: 25 to 71 years

cat("Female subjects:", sum(dm$SEX == "F"), "\n")

Female subjects: 3

cat("Male subjects:", sum(dm$SEX == "M"), "\n")

Male subjects: 2

cat("Elderly subjects (65+):", sum(dm$ELDERLY == "Y"), "\n")

Elderly subjects (65+): 2

cat("Safety population:", sum(dm$SAFFL), "\n")

Safety population: 4

# Treatment distribution
cat("\nTreatment Distribution:\n")


Treatment Distribution:

print(table(dm$TRT01A))


 Drug A Placebo
      3       2

# BMI category distribution
cat("\nBMI Category Distribution:\n")


BMI Category Distribution:

print(table(dm$BMICAT))


    Normal      Obese Overweight
         2          2          1
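
The individual `cat()` lines are convenient for reading on screen, but the same figures can be collected into a one-row tibble that is easier to check or export. This is a sketch on a stand-alone copy of the relevant columns; `dm_demo` and the output column names are made up for the example.

```r
library(dplyr)

# Stand-alone copy of the columns needed for the report
dm_demo <- tibble(
  AGE   = c(25, 45, 67, 52, 71),
  SEX   = c("F", "M", "F", "M", "F"),
  SAFFL = c(TRUE, TRUE, FALSE, TRUE, TRUE)
)

# One summarise() call gathers all headline numbers in a single row
report <- dm_demo %>%
  summarise(
    n_subjects = n(),
    age_min    = min(AGE),
    age_max    = max(AGE),
    n_female   = sum(SEX == "F"),
    n_male     = sum(SEX == "M"),
    n_elderly  = sum(AGE >= 65),
    n_safety   = sum(SAFFL)
  )

report
```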

🎯 Module 1 Learning Summary

You have completed exercises tied to the following R4DS chapters:

✅ Chapter 1: Understanding the data science process in clinical context
✅ Chapter 4: Working with vectors as R’s fundamental data structures
✅ Chapter 6: RStudio workflow best practices for clinical programming
✅ Chapter 8: Script organization and project management for reproducibility

Key Skills Developed:

Vector creation and manipulation for clinical data
Tibble creation and data structure understanding
RStudio interface navigation and productivity features
Code organization and documentation best practices
Project structure for regulatory compliance
Data summarization and reporting techniques

Clinical Programming Applications:

CDISC-compliant variable naming and structure
Regulatory-ready code documentation
Reproducible analysis workflows
Efficient data manipulation patterns

🚀 Ready for Module 2: Data Manipulation with dplyr’s five key verbs!