Module 1 Solution - Introduction to R and Data Science
🧪 Module 1 Solutions - R Fundamentals and Data Science
This document provides solutions for Module 1 exercises covering R fundamentals, vectors, RStudio workflow, and data science principles applied to clinical programming.
Exercise 1: Working with Vectors (R4DS Ch. 4)
Understanding vectors is fundamental to R programming. Here we create the basic building blocks for clinical data.
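As a minimal sketch of those building blocks, the vectors below reuse the values that appear in the dataset printed later in this document. The object names (`age`, `weight`, `height`, `usubjid`, `saffl`, `bmi`) are illustrative assumptions, and the units (kg and cm) are inferred from the fact that this formula reproduces the printed BMI column.

```r
# Numeric vectors: one element per subject
age    <- c(25, 45, 67, 52, 71)
weight <- c(65.2, 78.5, 85.1, 72.8, 90.3)  # assumed to be kg
height <- c(165, 170, 160, 175, 168)        # assumed to be cm

# Character and logical vectors
usubjid <- c("001-001", "001-002", "001-003", "001-004", "001-005")
saffl   <- c(TRUE, TRUE, FALSE, TRUE, TRUE)

# Vectorized arithmetic: the formula is applied to every element at once
bmi <- round(weight / (height / 100)^2, 1)

# Logical indexing: ages of the subjects flagged TRUE in saffl
age_safety <- age[saffl]

# Every element of an atomic vector shares one type
typeof(age)      # "double"
typeof(usubjid)  # "character"
typeof(saffl)    # "logical"
```

No loop is needed for the BMI step because arithmetic operators in R work element by element on vectors of equal length.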
# Load required packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.6.0
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
   USUBJID               AGE         WEIGHT          HEIGHT
 Length:5           Min.   :25   Min.   :65.20   Min.   :160.0
 Class :character   1st Qu.:45   1st Qu.:72.80   1st Qu.:165.0
 Mode  :character   Median :52   Median :78.50   Median :168.0
                    Mean   :52   Mean   :78.38   Mean   :167.6
                    3rd Qu.:67   3rd Qu.:85.10   3rd Qu.:170.0
                    Max.   :71   Max.   :90.30   Max.   :175.0
     SEX               TRT01A            RFSTDTC            SAFFL
 Length:5           Length:5           Length:5           Mode :logical
 Class :character   Class :character   Class :character   FALSE:1
 Mode  :character   Mode  :character   Mode  :character   TRUE :4
      BMI           AGEGR1
 Min.   :23.80   Length:5
 1st Qu.:23.90   Class :character
 Median :27.20   Mode  :character
 Mean   :28.02
 3rd Qu.:32.00
 Max.   :33.20
Key Learning Points:
- Tibbles show data types clearly
- Better printing than data.frames
- Stricter subsetting prevents common errors
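The "stricter subsetting" point can be illustrated with a small side-by-side comparison. This is a sketch on a two-row toy dataset, not the exercise's own code; `df`, `tb`, and the other object names are illustrative.

```r
library(tibble)

df <- data.frame(USUBJID = c("001-001", "001-002"), AGE = c(25, 45))
tb <- tibble(USUBJID = c("001-001", "001-002"), AGE = c(25, 45))

# Single-column subsetting with [ : a data.frame drops to a plain vector,
# while a tibble always stays a tibble
df_col <- df[, "AGE"]
tb_col <- tb[, "AGE"]
class(df_col)
class(tb_col)

# Partial name matching with $ : a data.frame silently matches "AG" to AGE,
# while a tibble returns NULL and warns about the unknown column
df_partial <- df$AG
tb_partial <- suppressWarnings(tb$AG)
```

In clinical datasets with many similarly named variables, the tibble behavior makes a mistyped column name fail visibly instead of quietly returning a different column.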
Exercise 3: RStudio Workflow Practice (R4DS Ch. 6)
Proper workflow practices ensure reproducible and maintainable clinical programming code.
# Practice code organization - add comments explaining clinical context
dm <- dm %>%
  mutate(
    # Create elderly flag based on protocol definition (age >= 65)
    ELDERLY = ifelse(AGE >= 65, "Y", "N"),
    # Convert character date to proper Date format for calculations
    RFSTDT = ymd(RFSTDTC),
    # Create BMI categories following clinical guidelines
    BMICAT = case_when(
      BMI < 18.5 ~ "Underweight",
      BMI < 25 ~ "Normal",
      BMI < 30 ~ "Overweight",
      BMI >= 30 ~ "Obese"
    )
  )

# Display the updated dataset
print(dm)
# A tibble: 5 × 13
  USUBJID   AGE WEIGHT HEIGHT SEX   TRT01A  RFSTDTC   SAFFL   BMI AGEGR1 ELDERLY
  <chr>   <dbl>  <dbl>  <dbl> <chr> <chr>   <chr>     <lgl> <dbl> <chr>  <chr>
1 001-001    25   65.2    165 F     Placebo 2024-01-… TRUE   23.9 Adult  N
2 001-002    45   78.5    170 M     Drug A  2024-01-… TRUE   27.2 Adult  N
3 001-003    67   85.1    160 F     Drug A  2024-01-… FALSE  33.2 Elder… Y
4 001-004    52   72.8    175 M     Placebo 2024-01-… TRUE   23.8 Adult  N
5 001-005    71   90.3    168 F     Drug A  2024-01-… TRUE   32   Elder… Y
# ℹ 2 more variables: RFSTDT <date>, BMICAT <chr>
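The `RFSTDT = ymd(RFSTDTC)` step converts a character column into a Date column so that date arithmetic becomes possible. The sketch below shows the idea on hypothetical dates, because the real RFSTDTC values are truncated in the printed output; `rfstdtc`, `visit_date`, and `days_since_start` are illustrative names.

```r
library(lubridate)

# Hypothetical ISO 8601 character dates, plus one deliberately malformed entry
rfstdtc <- c("2024-01-15", "2024-01-16", "not a date")

# ymd() parses year-month-day strings; unparseable input becomes NA with a warning
rfstdt <- suppressWarnings(ymd(rfstdtc))

class(rfstdt)
is.na(rfstdt)

# Once the values are Dates, subtraction gives a difference in days
visit_date <- ymd("2024-01-29")
days_since_start <- as.numeric(visit_date - rfstdt)
days_since_start
```

Checking `is.na()` on the parsed column right after conversion is a cheap way to catch malformed dates before they propagate into derived variables.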
RStudio Best Practices Applied:
- Clear, descriptive variable names
- Comments explaining clinical context
- Consistent code formatting and indentation
- Logical grouping of related operations
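The BMICAT derivation in this exercise depends on `case_when()` checking its conditions from top to bottom and keeping the first one that is TRUE. The following sketch uses the BMI values printed in this document plus one extra `NA` (an addition for illustration) to show both the ordering rule and what happens to missing values.

```r
library(dplyr)

bmi <- c(23.9, 27.2, 33.2, 23.8, 32.0, NA)

# Conditions are checked in order, so "bmi < 25" only ever applies
# to values that already failed "bmi < 18.5"
bmicat <- case_when(
  bmi < 18.5 ~ "Underweight",
  bmi < 25   ~ "Normal",
  bmi < 30   ~ "Overweight",
  bmi >= 30  ~ "Obese"
)
bmicat
# The missing BMI matches no condition, so its category stays NA

# Reordering changes the result: every value below 30 is now caught
# by the first condition and "Normal" can never be assigned
wrong_order <- case_when(
  bmi < 30  ~ "Overweight",
  bmi < 25  ~ "Normal",
  bmi >= 30 ~ "Obese"
)
wrong_order
```

Writing the cutoffs in ascending order, as the solution does, keeps each condition short while still producing non-overlapping categories.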
Exercises 4 & 5: Project Organization and Data Summarization
Understanding project structure and creating summaries are essential for clinical programming workflows.
# Project organization demonstration
cat("Current working directory:", getwd(), "\n")
Current working directory: C:/Users/chiar/OneDrive - Alma Mater Studiorum Università di Bologna/Desktop/Repo/beginR/training_material/module 1 - intro
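The working directory printed here is specific to one machine, which is why project code usually builds file paths relative to a project root rather than hard-coding an absolute path. The sketch below uses a temporary directory as a stand-in project root; the folder layout (`data`, `output`) and the file name `dm.csv` are hypothetical and not taken from the original exercise.

```r
# Stand-in project root inside a temporary directory (hypothetical layout)
project_root <- file.path(tempdir(), "module1_demo")
dir.create(file.path(project_root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_root, "output"), showWarnings = FALSE)

# Build paths from the root with file.path() instead of typing separators by hand
dm_path <- file.path(project_root, "data", "dm.csv")

# Write and re-read a small file through the constructed path
write.csv(data.frame(USUBJID = "001-001", AGE = 25), dm_path, row.names = FALSE)
dm_check <- read.csv(dm_path)

# List what the demo project now contains
list.files(project_root, recursive = TRUE)
```

When a folder is opened as an RStudio project, the working directory is set to the project folder, so a relative path such as `file.path("data", "dm.csv")` resolves the same way on every machine.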
# Create safety population summary following good practices
safety_summary <- dm %>%
  filter(SAFFL == TRUE) %>% # Keep only safety population
  select(USUBJID, AGE, SEX, TRT01A, BMICAT) %>% # Select key variables
  arrange(AGE) # Sort by age

print(safety_summary)
# A tibble: 4 × 5
  USUBJID   AGE SEX   TRT01A  BMICAT
  <chr>   <dbl> <chr> <chr>   <chr>
1 001-001    25 F     Placebo Normal
2 001-002    45 M     Drug A  Overweight
3 001-004    52 M     Placebo Normal
4 001-005    71 F     Drug A  Obese
# Data summarization examples
# 1. BMI category counts
bmi_summary <- dm %>%
  count(BMICAT, name = "n_subjects")

# 2. Mean age by treatment
age_by_treatment <- dm %>%
  group_by(TRT01A) %>%
  summarise(
    n = n(),
    mean_age = mean(AGE),
    sd_age = sd(AGE)
  )

# 3. Elderly subjects count
elderly_summary <- dm %>%
  count(ELDERLY)

print("BMI Category Summary:")
[1] "BMI Category Summary:"
print(bmi_summary)
# A tibble: 3 × 2
  BMICAT     n_subjects
  <chr>           <int>
1 Normal              2
2 Obese               2
3 Overweight          1
print("Age by Treatment:")
[1] "Age by Treatment:"
print(age_by_treatment)
# A tibble: 2 × 4
  TRT01A      n mean_age sd_age
  <chr>   <int>    <dbl>  <dbl>
1 Drug A      3     61     14
2 Placebo     2     38.5   19.1
print("Elderly Summary:")
[1] "Elderly Summary:"
print(elderly_summary)
# A tibble: 2 × 2
  ELDERLY     n
  <chr>   <int>
1 N           3
2 Y           2
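One way to sanity-check a `group_by()` plus `summarise()` result is to recompute it with base R on the raw vectors. The sketch below does this for the age-by-treatment summary, using the AGE and TRT01A values printed earlier in this document; the object names are illustrative.

```r
# Values as printed in the dataset earlier in this document
age    <- c(25, 45, 67, 52, 71)
trt01a <- c("Placebo", "Drug A", "Drug A", "Placebo", "Drug A")

# tapply() splits age by treatment and applies a function to each group
n_by_trt    <- tapply(age, trt01a, length)
mean_by_trt <- tapply(age, trt01a, mean)
sd_by_trt   <- tapply(age, trt01a, sd)

n_by_trt
round(mean_by_trt, 1)
round(sd_by_trt, 1)
```

If the two approaches disagree, the usual suspects are missing values (which `mean()` and `sd()` propagate unless `na.rm = TRUE` is set) or a grouping variable with inconsistent spelling.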
Exercises 6 & 7: Help System and String Manipulation
Knowing how to get help and how to manipulate strings are both crucial skills for clinical programming.
# Getting help examples
cat("R Help System Commands:\n")
R Help System Commands:
cat("?mean # Help for specific function\n")
?mean # Help for specific function
cat("??regression # Search for functions\n")
??regression # Search for functions
cat("example(mean) # See function examples\n")
example(mean) # See function examples
cat("args(lm) # View function arguments\n\n")
args(lm) # View function arguments
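The commands listed here are printed as text rather than run. As a small sketch of what some of them look like when actually called from a script, the block below uses only base R; `lm_args` and `mean_like` are illustrative names, and the interactive help calls are left commented out because they open a help viewer.

```r
# Show a function's arguments without opening its help page
args(lm)

# The same information as a character vector that can be used in code
lm_args <- names(formals(lm))
lm_args

# Find objects on the search path whose names match a regular expression
mean_like <- apropos("^mean")
mean_like

# Interactive equivalents, best run at the console:
# ?mean                       # help page for one function
# help.search("regression")   # what ??regression calls
# example(mean)               # run the examples from the help page
```

`apropos()` is handy when only part of a function name is remembered, while `args()` gives a quick reminder of argument names and defaults without leaving the console.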
# String manipulation for clinical data
dm <- dm %>%
  mutate(
    # Extract site number from USUBJID (part before the dash)
    SITE = str_extract(USUBJID, "\\d{3}"), # Extracts first 3 digits
    # Create formatted subject label
    SUBJ_LABEL = paste0("Subject ", USUBJID, " (Age: ", AGE, ")")
  )

print(dm)
# A tibble: 5 × 15
  USUBJID   AGE WEIGHT HEIGHT SEX   TRT01A  RFSTDTC   SAFFL   BMI AGEGR1 ELDERLY
  <chr>   <dbl>  <dbl>  <dbl> <chr> <chr>   <chr>     <lgl> <dbl> <chr>  <chr>
1 001-001    25   65.2    165 F     Placebo 2024-01-… TRUE   23.9 Adult  N
2 001-002    45   78.5    170 M     Drug A  2024-01-… TRUE   27.2 Adult  N
3 001-003    67   85.1    160 F     Drug A  2024-01-… FALSE  33.2 Elder… Y
4 001-004    52   72.8    175 M     Placebo 2024-01-… TRUE   23.8 Adult  N
5 001-005    71   90.3    168 F     Drug A  2024-01-… TRUE   32   Elder… Y
# ℹ 4 more variables: RFSTDT <date>, BMICAT <chr>, SITE <chr>, SUBJ_LABEL <chr>
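The pattern `"\\d{3}"` returns the first run of three digits found anywhere in the string, which happens to be the site number for IDs shaped like `001-001`. A slightly more defensive variant anchors the pattern, as sketched below. Reading the digits after the dash as a subject number is an assumption about the ID layout, and the prefixed ID `"X-001-002"` is a made-up example of an unexpected format.

```r
library(stringr)

usubjid <- c("001-001", "001-002", "001-003")

# Unanchored: first run of three digits anywhere in the string
str_extract(usubjid, "\\d{3}")

# Anchored with ^ : three digits at the very start (the site part)
site <- str_extract(usubjid, "^\\d{3}")

# Anchored with $ : three digits at the very end (assumed to be the subject part)
subj <- str_extract(usubjid, "\\d{3}$")

# Why anchoring matters: with an unexpected ID format the unanchored pattern
# still returns digits, while the anchored one returns NA and exposes the problem
loose  <- str_extract("X-001-002", "\\d{3}")
strict <- str_extract("X-001-002", "^\\d{3}")
```

Returning `NA` for malformed identifiers is often preferable in clinical data, since a silent but wrong site number is harder to detect downstream than a missing one.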
# Example calculations for AI assistance practice
young_flag <- ifelse(dm$AGE < 30, "Y", "N")
treatment_weeks <- 12
total_days <- treatment_weeks * 7

cat("Example calculations:\n")
You have successfully completed all R4DS Chapter integrations:
Chapter 1: Understanding the data science process in clinical context
Chapter 4: Working with vectors as R's fundamental data structures
Chapter 6: RStudio workflow best practices for clinical programming
Chapter 8: Script organization and project management for reproducibility
Key Skills Developed:
Vector creation and manipulation for clinical data
Tibble creation and data structure understanding
RStudio interface navigation and productivity features
Code organization and documentation best practices
Project structure for regulatory compliance
Data summarization and reporting techniques
Clinical Programming Applications:
CDISC-compliant variable naming and structure
Regulatory-ready code documentation
Reproducible analysis workflows
Efficient data manipulation patterns
Ready for Module 2: Data Manipulation with dplyr's five key verbs!