Module 1 Solution: Introduction to R and Data Science

🧪 Module 1 Solutions: R Fundamentals and Data Science

This document provides solutions for Module 1 exercises covering R fundamentals, vectors, RStudio workflow, and data science principles applied to clinical programming.


✅ Exercise 1: Working with Vectors (R4DS Ch. 4)

Understanding vectors is fundamental to R programming. Here we create the basic building blocks for clinical data.

# Load required packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.6.0
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
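library(haven)      # Reads SAS/XPT files; not used in the exercises below
library(lubridate)  # Already attached by tidyverse 2.0.0; loading it again is harmless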

# Create atomic vectors for clinical data
# Logical vectors for flags
safety_population <- c(TRUE, TRUE, FALSE, TRUE, TRUE)
elderly_flag <- c(FALSE, TRUE, FALSE, TRUE, FALSE)

# Numeric vectors for measurements  
age <- c(25, 45, 67, 52, 71)
weight <- c(65.2, 78.5, 85.1, 72.8, 90.3)
height <- c(165, 170, 160, 175, 168)

# Character vectors for IDs and categories
usubjid <- c("001-001", "001-002", "001-003", "001-004", "001-005")  
treatment <- c("Placebo", "Drug A", "Drug A", "Placebo", "Drug A")

# Vector operations
bmi <- weight / ((height/100)^2)
age_group <- ifelse(age >= 65, "Elderly", "Adult")

# Display results
cat("Ages:", age, "\n")
Ages: 25 45 67 52 71 
cat("BMI:", round(bmi, 1), "\n") 
BMI: 23.9 27.2 33.2 23.8 32 
cat("Age Groups:", age_group, "\n")
Age Groups: Adult Adult Elderly Adult Elderly 
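
The vectors above are all atomic, meaning each holds a single type. The following is a minimal base R sketch (not part of the original exercise) showing how to check a vector's type, what happens when types are mixed, and how logical vectors support counting and filtering. The `mixed` vector is a hypothetical example added for illustration.

```r
# Atomic vectors hold a single type; typeof() reports it
age <- c(25, 45, 67, 52, 71)
usubjid <- c("001-001", "001-002", "001-003", "001-004", "001-005")
safety_population <- c(TRUE, TRUE, FALSE, TRUE, TRUE)

vector_types <- c(
  age = typeof(age),
  usubjid = typeof(usubjid),
  safety_population = typeof(safety_population)
)
print(vector_types)

# Mixing types in c() silently coerces everything to the most general type
# (hypothetical vector, for illustration only)
mixed <- c(25, "45", TRUE)
print(typeof(mixed))

# Logical values count as 0/1 in arithmetic, so sum() counts the TRUEs
n_safety <- sum(safety_population)
print(n_safety)

# Logical indexing: keep the IDs of subjects aged 65 or older
elderly_ids <- usubjid[age >= 65]
print(elderly_ids)
```

Silent coercion is one reason to check `typeof()` after importing clinical data: a single stray text value turns a numeric column into character.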

βœ… Exercise 2: Create Clinical Dataset with Tibbles

Tibbles are data frames with clearer printing and stricter subsetting, both of which help catch mistakes early in clinical programming.

# Create tibble using vectors from Exercise 1
dm <- tibble(
  USUBJID = usubjid,
  AGE = age,
  WEIGHT = weight,
  HEIGHT = height, 
  SEX = c("F", "M", "F", "M", "F"),
  TRT01A = treatment,
  RFSTDTC = c("2024-01-15", "2024-01-16", "2024-01-17", "2024-01-18", "2024-01-19"),
  SAFFL = safety_population,
  BMI = round(bmi, 1),
  AGEGR1 = age_group
)

# Display the dataset
dm
# A tibble: 5 × 10
  USUBJID   AGE WEIGHT HEIGHT SEX   TRT01A  RFSTDTC    SAFFL   BMI AGEGR1
  <chr>   <dbl>  <dbl>  <dbl> <chr> <chr>   <chr>      <lgl> <dbl> <chr>
1 001-001    25   65.2    165 F     Placebo 2024-01-15 TRUE   23.9 Adult
2 001-002    45   78.5    170 M     Drug A  2024-01-16 TRUE   27.2 Adult
3 001-003    67   85.1    160 F     Drug A  2024-01-17 FALSE  33.2 Elderly
4 001-004    52   72.8    175 M     Placebo 2024-01-18 TRUE   23.8 Adult
5 001-005    71   90.3    168 F     Drug A  2024-01-19 TRUE   32   Elderly
# Explore the dataset structure
glimpse(dm)
Rows: 5
Columns: 10
$ USUBJID <chr> "001-001", "001-002", "001-003", "001-004", "001-005"
$ AGE     <dbl> 25, 45, 67, 52, 71
$ WEIGHT  <dbl> 65.2, 78.5, 85.1, 72.8, 90.3
$ HEIGHT  <dbl> 165, 170, 160, 175, 168
$ SEX     <chr> "F", "M", "F", "M", "F"
$ TRT01A  <chr> "Placebo", "Drug A", "Drug A", "Placebo", "Drug A"
$ RFSTDTC <chr> "2024-01-15", "2024-01-16", "2024-01-17", "2024-01-18", "2024-…
$ SAFFL   <lgl> TRUE, TRUE, FALSE, TRUE, TRUE
$ BMI     <dbl> 23.9, 27.2, 33.2, 23.8, 32.0
$ AGEGR1  <chr> "Adult", "Adult", "Elderly", "Adult", "Elderly"
# Summary statistics
summary(dm)
   USUBJID               AGE         WEIGHT          HEIGHT
 Length:5           Min.   :25   Min.   :65.20   Min.   :160.0
 Class :character   1st Qu.:45   1st Qu.:72.80   1st Qu.:165.0
 Mode  :character   Median :52   Median :78.50   Median :168.0
                    Mean   :52   Mean   :78.38   Mean   :167.6
                    3rd Qu.:67   3rd Qu.:85.10   3rd Qu.:170.0
                    Max.   :71   Max.   :90.30   Max.   :175.0
     SEX               TRT01A            RFSTDTC            SAFFL
 Length:5           Length:5           Length:5           Mode :logical
 Class :character   Class :character   Class :character   FALSE:1
 Mode  :character   Mode  :character   Mode  :character   TRUE :4



      BMI           AGEGR1
 Min.   :23.80   Length:5
 1st Qu.:23.90   Class :character
 Median :27.20   Mode  :character
 Mean   :28.02
 3rd Qu.:32.00
 Max.   :33.20                     

Key Learning Points:

  • Tibbles show data types clearly
  • Better printing than data.frames
  • Stricter subsetting prevents common errors
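
To make "stricter subsetting" concrete, here is a small sketch comparing a base `data.frame` with a tibble. It assumes the `tibble` package is installed; the two-row datasets are hypothetical and exist only for this comparison.

```r
library(tibble)

# Two hypothetical datasets with identical content
df  <- data.frame(USUBJID = c("001-001", "001-002"), AGE = c(25, 45))
tbl <- tibble(USUBJID = c("001-001", "001-002"), AGE = c(25, 45))

# Selecting one column with `[`:
# a data.frame drops to a plain vector, a tibble stays a tibble
df_col  <- df[, "AGE"]
tbl_col <- tbl[, "AGE"]
print(class(df_col))
print(is_tibble(tbl_col))

# Partial name matching with `$`:
# a data.frame silently matches "AG" to "AGE",
# a tibble returns NULL and warns about the unknown column
df_partial  <- df$AG
tbl_partial <- suppressWarnings(tbl$AG)
print(df_partial)
print(is.null(tbl_partial))
```

The tibble behavior means a misspelled variable name surfaces as a warning instead of quietly returning a different column.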


✅ Exercise 3: RStudio Workflow Practice (R4DS Ch. 6)

Proper workflow practices ensure reproducible and maintainable clinical programming code.

# Practice code organization - add comments explaining clinical context
dm <- dm %>%
  mutate(
    # Create elderly flag based on protocol definition (age >= 65)
    ELDERLY = ifelse(AGE >= 65, "Y", "N"),
    
    # Convert character date to proper Date format for calculations
    RFSTDT = ymd(RFSTDTC),
    
    # Create BMI categories following clinical guidelines
    BMICAT = case_when(
      BMI < 18.5 ~ "Underweight",
      BMI < 25 ~ "Normal", 
      BMI < 30 ~ "Overweight",
      BMI >= 30 ~ "Obese"
    )
  )

# Display the updated dataset
print(dm)
# A tibble: 5 × 13
  USUBJID   AGE WEIGHT HEIGHT SEX   TRT01A  RFSTDTC   SAFFL   BMI AGEGR1 ELDERLY
  <chr>   <dbl>  <dbl>  <dbl> <chr> <chr>   <chr>     <lgl> <dbl> <chr>  <chr>
1 001-001    25   65.2    165 F     Placebo 2024-01-… TRUE   23.9 Adult  N
2 001-002    45   78.5    170 M     Drug A  2024-01-… TRUE   27.2 Adult  N
3 001-003    67   85.1    160 F     Drug A  2024-01-… FALSE  33.2 Elder… Y
4 001-004    52   72.8    175 M     Placebo 2024-01-… TRUE   23.8 Adult  N
5 001-005    71   90.3    168 F     Drug A  2024-01-… TRUE   32   Elder… Y
# ℹ 2 more variables: RFSTDT <date>, BMICAT <chr>

RStudio Best Practices Applied:

  • Clear, descriptive variable names
  • Comments explaining clinical context
  • Consistent code formatting and indentation
  • Logical grouping of related operations
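
The `case_when()` call in this exercise depends on its conditions being checked in order: the first one that is TRUE wins, so `BMI < 25` effectively means "at least 18.5 and below 25". As a cross-check, the sketch below reproduces the same categories in base R with `cut()`, using the thresholds from the code above. It is an alternative illustration, not the method used in the exercise.

```r
# BMI values from the dm dataset
bmi <- c(23.9, 27.2, 33.2, 23.8, 32.0)

# Shared thresholds and labels
bmi_breaks <- c(-Inf, 18.5, 25, 30, Inf)
bmi_labels <- c("Underweight", "Normal", "Overweight", "Obese")

# right = FALSE makes each interval closed on the left:
# [-Inf, 18.5), [18.5, 25), [25, 30), [30, Inf)
bmicat <- as.character(
  cut(bmi, breaks = bmi_breaks, labels = bmi_labels, right = FALSE)
)
print(bmicat)

# Boundary check: a BMI of exactly 25 is "Overweight",
# matching case_when(), where `BMI < 25` is FALSE for 25
boundary <- as.character(
  cut(25, breaks = bmi_breaks, labels = bmi_labels, right = FALSE)
)
print(boundary)

# Missing BMI stays missing rather than falling into a category
missing_cat <- cut(NA_real_, breaks = bmi_breaks, labels = bmi_labels, right = FALSE)
print(is.na(missing_cat))
```

Checking a boundary value and a missing value this way is a quick test that a derived category follows the intended definition.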


✅ Exercises 4 & 5: Project Organization and Data Summarization

Understanding project structure and creating summaries are essential for clinical programming workflows.

# Project organization demonstration
cat("Current working directory:", getwd(), "\n")
Current working directory: C:/Users/chiar/OneDrive - Alma Mater Studiorum Università di Bologna/Desktop/Repo/beginR/training_material/module 1 - intro 
cat("\nRecommended clinical project structure:\n")

Recommended clinical project structure:
cat("my_clinical_study/\n")
my_clinical_study/
cat("├── my_clinical_study.Rproj\n") 
├── my_clinical_study.Rproj
cat("├── data/\n")
├── data/
cat("│   ├── raw/           # Original data files\n")
│   ├── raw/           # Original data files
cat("│   ├── sdtm/          # SDTM datasets\n")
│   ├── sdtm/          # SDTM datasets
cat("│   └── adam/          # ADaM datasets\n")
│   └── adam/          # ADaM datasets
cat("├── programs/\n")
├── programs/
cat("│   ├── sdtm/          # SDTM creation programs\n")
│   ├── sdtm/          # SDTM creation programs
cat("│   ├── adam/          # ADaM creation programs\n")
│   ├── adam/          # ADaM creation programs
cat("│   └── tlf/           # Tables, listings, figures programs\n")
│   └── tlf/           # Tables, listings, figures programs
cat("└── outputs/           # Generated outputs\n")
└── outputs/           # Generated outputs
# Create safety population summary following good practices
safety_summary <- dm %>%
  filter(SAFFL == TRUE) %>%               # Keep only safety population
  select(USUBJID, AGE, SEX, TRT01A, BMICAT) %>%  # Select key variables
  arrange(AGE)                            # Sort by age

print(safety_summary)
# A tibble: 4 × 5
  USUBJID   AGE SEX   TRT01A  BMICAT
  <chr>   <dbl> <chr> <chr>   <chr>
1 001-001    25 F     Placebo Normal
2 001-002    45 M     Drug A  Overweight
3 001-004    52 M     Placebo Normal
4 001-005    71 F     Drug A  Obese     
# Data summarization examples
# 1. BMI category counts
bmi_summary <- dm %>%
  count(BMICAT, name = "n_subjects")

# 2. Mean age by treatment
age_by_treatment <- dm %>%
  group_by(TRT01A) %>%
  summarise(
    n = n(),
    mean_age = mean(AGE),
    sd_age = sd(AGE)
  )

# 3. Elderly subjects count
elderly_summary <- dm %>%
  count(ELDERLY)

print("BMI Category Summary:")
[1] "BMI Category Summary:"
print(bmi_summary)
# A tibble: 3 × 2
  BMICAT     n_subjects
  <chr>           <int>
1 Normal              2
2 Obese               2
3 Overweight          1
print("Age by Treatment:")
[1] "Age by Treatment:"
print(age_by_treatment)
# A tibble: 2 × 4
  TRT01A      n mean_age sd_age
  <chr>   <int>    <dbl>  <dbl>
1 Drug A      3     61     14
2 Placebo     2     38.5   19.1
print("Elderly Summary:")
[1] "Elderly Summary:"
print(elderly_summary)
# A tibble: 2 × 2
  ELDERLY     n
  <chr>   <int>
1 N           3
2 Y           2
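
The recommended project structure printed above can also be created from R rather than by hand. The sketch below builds the same folders under a temporary directory so that nothing is written into a real project. The `project_root` location is an assumption made for this example, and the `.Rproj` file is left for RStudio to create.

```r
# Hypothetical location: a temporary directory, used only for this demonstration
project_root <- file.path(tempdir(), "my_clinical_study")

# Folder names taken from the recommended structure
sub_dirs <- c(
  "data/raw", "data/sdtm", "data/adam",
  "programs/sdtm", "programs/adam", "programs/tlf",
  "outputs"
)

# recursive = TRUE creates parent folders (e.g. data/) as needed;
# showWarnings = FALSE keeps the loop quiet if a folder already exists
for (d in file.path(project_root, sub_dirs)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# List what was created, relative to the project root
created <- list.dirs(project_root, full.names = FALSE, recursive = TRUE)
print(created)
```

Building paths with `file.path()` instead of pasting strings keeps the script portable across operating systems.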

✅ Exercises 6 & 7: Help System and String Manipulation

Getting help and manipulating strings are core skills for clinical programming.

# Getting help examples
cat("R Help System Commands:\n")
R Help System Commands:
cat("?mean        # Help for specific function\n")
?mean        # Help for specific function
cat("??regression # Search for functions\n") 
??regression # Search for functions
cat("example(mean) # See function examples\n")
example(mean) # See function examples
cat("args(lm)     # View function arguments\n\n")
args(lm)     # View function arguments
# String manipulation for clinical data
dm <- dm %>%
  mutate(
    # Extract site number from USUBJID (part before the dash)
    SITE = str_extract(USUBJID, "^\\d{3}"),  # First 3 digits at the start of the ID
    
    # Create formatted subject label  
    SUBJ_LABEL = paste0("Subject ", USUBJID, " (Age: ", AGE, ")")
  )

print(dm)
# A tibble: 5 × 15
  USUBJID   AGE WEIGHT HEIGHT SEX   TRT01A  RFSTDTC   SAFFL   BMI AGEGR1 ELDERLY
  <chr>   <dbl>  <dbl>  <dbl> <chr> <chr>   <chr>     <lgl> <dbl> <chr>  <chr>
1 001-001    25   65.2    165 F     Placebo 2024-01-… TRUE   23.9 Adult  N
2 001-002    45   78.5    170 M     Drug A  2024-01-… TRUE   27.2 Adult  N
3 001-003    67   85.1    160 F     Drug A  2024-01-… FALSE  33.2 Elder… Y
4 001-004    52   72.8    175 M     Placebo 2024-01-… TRUE   23.8 Adult  N
5 001-005    71   90.3    168 F     Drug A  2024-01-… TRUE   32   Elder… Y
# ℹ 4 more variables: RFSTDT <date>, BMICAT <chr>, SITE <chr>, SUBJ_LABEL <chr>
# Example calculations for AI assistance practice
young_flag <- ifelse(dm$AGE < 30, "Y", "N")
treatment_weeks <- 12
total_days <- treatment_weeks * 7

cat("Example calculations:\n")
Example calculations:
cat("Young subjects (< 30 years):", sum(young_flag == "Y"), "\n")
Young subjects (< 30 years): 1 
cat("Treatment duration:", treatment_weeks, "weeks =", total_days, "days\n")
Treatment duration: 12 weeks = 84 days
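
The site extraction above takes the first three digits of `USUBJID`. As a complementary sketch in base R, the ID can be split at the dash into both parts and checked against the expected pattern. The site-dash-subject layout with three digits each is an assumption based on this training dataset; real studies often use longer identifiers.

```r
usubjid <- c("001-001", "001-002", "001-003")

# Everything before the first dash is the site number
site <- sub("-.*$", "", usubjid)

# Everything after the first dash is the subject number
subject <- sub("^[^-]*-", "", usubjid)

print(site)
print(subject)

# Guard: confirm every ID follows the assumed three-digit, dash, three-digit layout
valid <- grepl("^[0-9]{3}-[0-9]{3}$", usubjid)
print(all(valid))

# A malformed ID (hypothetical) fails the check
bad_id_valid <- grepl("^[0-9]{3}-[0-9]{3}$", "01-0001")
print(bad_id_valid)
```

Validating the pattern before extracting pieces avoids deriving a wrong site number from an unexpected ID format.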

✅ Final Summary Report

Creating comprehensive summaries is essential for clinical data review.

# Create a clinical summary report
cat("=== DEMOGRAPHICS SUMMARY ===\n")
=== DEMOGRAPHICS SUMMARY ===
cat("Total subjects:", nrow(dm), "\n")
Total subjects: 5 
cat("Age range:", min(dm$AGE), "to", max(dm$AGE), "years\n") 
Age range: 25 to 71 years
cat("Female subjects:", sum(dm$SEX == "F"), "\n")
Female subjects: 3 
cat("Male subjects:", sum(dm$SEX == "M"), "\n")
Male subjects: 2 
cat("Elderly subjects (65+):", sum(dm$ELDERLY == "Y"), "\n")
Elderly subjects (65+): 2 
cat("Safety population:", sum(dm$SAFFL == TRUE), "\n")
Safety population: 4 
# Treatment distribution
cat("\nTreatment Distribution:\n")

Treatment Distribution:
print(table(dm$TRT01A))

 Drug A Placebo
      3       2 
# BMI category distribution
cat("\nBMI Category Distribution:\n")

BMI Category Distribution:
print(table(dm$BMICAT))

    Normal      Obese Overweight
         2          2          1 
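
Clinical summaries usually report counts together with percentages. The following base R sketch extends the treatment table above to an "n (%)" display using `prop.table()`. The one-decimal format is an assumption for illustration, not a requirement stated in this module.

```r
# Treatment assignments from the dm dataset
treatment <- c("Placebo", "Drug A", "Drug A", "Placebo", "Drug A")

# Counts per treatment arm
counts <- table(treatment)

# Percentages of the total, rounded to one decimal place
percents <- round(100 * prop.table(counts), 1)

# Combine into "n (xx.x%)" strings, one per arm
n_pct <- sprintf("%d (%.1f%%)", as.integer(counts), as.numeric(percents))
names(n_pct) <- names(counts)
print(n_pct)
```

The same pattern applies to the BMI category table by swapping in `dm$BMICAT`.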

🎯 Module 1 Learning Summary

You have worked through all of the R4DS chapters integrated into this module:

✅ Chapter 1: Understanding the data science process in clinical context
✅ Chapter 4: Working with vectors as R's fundamental data structures
✅ Chapter 6: RStudio workflow best practices for clinical programming
✅ Chapter 8: Script organization and project management for reproducibility

Key Skills Developed:

  • Vector creation and manipulation for clinical data
  • Tibble creation and data structure understanding
  • RStudio interface navigation and productivity features
  • Code organization and documentation best practices
  • Project structure for regulatory compliance
  • Data summarization and reporting techniques

Clinical Programming Applications:

  • CDISC-compliant variable naming and structure
  • Regulatory-ready code documentation
  • Reproducible analysis workflows
  • Efficient data manipulation patterns

🚀 Ready for Module 2: Data Manipulation with dplyr's five key verbs!