Programming with R
Instructor’s Guide
- R is an open source implementation of S, a programming environment for data analysis and graphics.
- It is data analysis software,
- A programming language,
- An environment for statistical analysis,
- It’s open-source,
- It’s a community!
- Why learn R?
- More analytical methods: >6,000 packages extending R’s capabilities; >1,000 packages available from the Bioconductor Project. Many new developments in statistics appear first as R packages,
- More flexible in the type of data it can analyze,
- Powerful; full matrix capabilities similar to MATLAB.
- Open
- R’s procedures (functions) are open: you can have a look under the hood and modify them,
- Cross-platform (Windows, Mac, Linux, Unix).
- Integration and Communication
- Graphics capabilities, publication-level quality,
- Connect easily with other programming languages,
- Integration with document publishing, through LaTeX or Markdown.
- And it’s free!
- To Learn More
Overall
This lesson is written as an introduction to R, but its real purpose is to introduce the single most important idea in programming: how to solve problems by building functions, each of which can fit in a programmer’s working memory. In order to teach that, we must teach people a little about the mechanics of manipulating data with lists and file I/O so that their functions can do things they actually care about. Our teaching order tries to show practical uses of every idea as soon as it is introduced; instructors should resist the temptation to explain the “other 90%” of the language as well.
The secondary goal of this lesson is to give learners a usable mental model of how programs run (what computer science educators call a notional machine) so that they can debug things when they go wrong. In particular, they must understand how function call stacks work.
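A minimal sketch of one way to make the call stack visible in class, assuming base R only. The names call_stack_names, step_one, step_two, and step_three are hypothetical and chosen for illustration; sys.calls() is the base R function that returns the currently active calls.

```r
# Hypothetical helper: returns the names of the functions currently on the
# call stack, outermost first, excluding this helper itself.
call_stack_names <- function() {
  calls <- as.list(sys.calls())
  # The first element of each call is the function being called.
  fn_names <- vapply(calls, function(cl) deparse(cl[[1]])[1], character(1))
  fn_names[-length(fn_names)]
}

# Three hypothetical functions that call each other in a chain.
step_three <- function() call_stack_names()
step_two <- function() step_three()
step_one <- function() step_two()

stack <- step_one()
# The last three entries are the chain of calls that led to the helper.
print(tail(stack, 3))
```

Each call adds one entry to the stack and each return removes it, which is the picture learners need when they read the output of traceback() after an error.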
The final example asks them to build a command-line tool that works with the Unix pipe-and-filter model. We do this because it is a useful skill and because it helps learners see that the software they use isn’t magical. Tools like grep might be more sophisticated than the programs our learners can write at this point in their careers, but it’s crucial they realize this is a difference of scale rather than kind.
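To make the "scale rather than kind" point concrete, here is a rough sketch of a much-simplified grep-like filter in R. The file name mini-grep.R and the function filter_lines are assumptions made for illustration and are not part of the lesson; the sketch only matches a regular expression against each input line.

```r
# mini-grep.R (hypothetical file name): a much-simplified filter in the
# spirit of grep. Reads lines from standard input and prints the ones that
# match a pattern.
#
# Ex. usage:
#   ls data | Rscript mini-grep.R inflammation

filter_lines <- function(lines, pattern) {
  # grepl() returns TRUE for each line that matches the regular expression.
  lines[grepl(pattern, lines)]
}

main <- function() {
  args <- commandArgs(trailingOnly = TRUE)
  if (length(args) != 1) {
    stop("usage: Rscript mini-grep.R pattern < input")
  }
  lines <- readLines(file("stdin"))
  cat(filter_lines(lines, args[1]), sep = "\n")
}

# Run main() only when arguments were given on the command line.
if (length(commandArgs(trailingOnly = TRUE)) > 0) {
  main()
}
```

Keeping the matching logic in filter_lines, separate from the argument handling in main, means the core of the tool can be tried interactively before it is used in a pipeline.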
The R novice inflammation lesson contains a lot of material to cover. Remember that this lesson does not spend a lot of time on data types, data structures, etc. It is also designed to parallel the similar lesson on Python. The objective is to explain modular programming with the concepts of functions, loops, flow control, and defensive programming (i.e. SWC best practices). Supplementary material is available for R specifics (Addressing Data, Data Types and Structure, Understanding Factors, Introduction to RStudio, Reading and Writing .csv, Loops in R, Best Practices for Using R and Designing Programs, Dynamic Reports with knitr, Making Packages in R).
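A typical half-day workshop would use the first three lessons:
- Analyzing Patient Data
- Creating Functions
- Analyzing Multiple Data Sets
An additional half-day could add the next two lessons:
- Making Choices
- Command-Line Programs
Time permitting, you can fit in one of these shorter lessons that cover bigger-picture ideas like best practices for organizing code, reproducible research, and creating packages:
- Best Practices for Using R and Designing Programs
- Dynamic Reports with knitr
- Making Packages in R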
Analyzing Patient Data
Check learners are reading files from the correct location (set working directory); remind them of the shell lesson
Provide the shortcut for the assignment operator (<-) (RStudio: Alt+- on Windows/Linux; Option+- on Mac)
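When learners hit "cannot open file" errors, the cause is usually the working directory. The following is a small sketch of a check an instructor might show; read_inflammation is a hypothetical helper name, and the temporary file stands in for the lesson data so the example runs anywhere.

```r
# Hypothetical helper: fails with a message that names the working
# directory when the file cannot be found relative to it.
read_inflammation <- function(path) {
  if (!file.exists(path)) {
    stop("Cannot find '", path, "' from working directory '", getwd(), "'")
  }
  read.csv(file = path, header = FALSE)
}

# Demonstration with a temporary file (made-up values, not lesson data).
demo_path <- tempfile(fileext = ".csv")
writeLines(c("0,1,2", "3,4,5"), demo_path)
demo <- read_inflammation(demo_path)
print(dim(demo))
```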
dat <- read.csv("data/inflammation-01.csv", header = FALSE)
animal <- c("m", "o", "n", "k", "e", "y")
# Challenge - Slicing (subsetting data)
animal[4:1] # first 4 characters in reverse order
[1] "k" "n" "o" "m"
animal[-1] # remove first character
[1] "o" "n" "k" "e" "y"
animal[-4] # remove fourth character
[1] "m" "o" "n" "e" "y"
animal[-1:-4] # remove first to fourth characters
[1] "e" "y"
animal[c(5, 2, 3)] # new character vector
[1] "e" "o" "n"
# Challenge - Subsetting data
max(dat[5, 3:7])
[1] 3
sd_day_inflammation <- apply(dat, 2, sd)
plot(sd_day_inflammation)
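The solutions above depend on the inflammation files. For a demonstration of the second argument of apply() that needs no files, a sketch along these lines may help; the data frame toy and its values are made up for illustration.

```r
# A small made-up data frame: 2 patients (rows) by 3 days (columns).
toy <- data.frame(V1 = c(0, 2), V2 = c(1, 3), V3 = c(2, 7))

# MARGIN = 1 applies the function across each row (one value per patient).
patient_means <- apply(toy, 1, mean)

# MARGIN = 2 applies the function down each column (one value per day).
day_means <- apply(toy, 2, mean)

print(patient_means)
print(day_means)
```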
Addressing Data
- Note that the data frame dat is not the same set of data as in other lessons
Data Types and Structure
- Lesson on data types and structures
Understanding Factors
Introduction to RStudio
Reading and Writing .csv
Creating Functions
# Challenge - Create a function
fence <- function(original, wrapper) {
  answer <- c(wrapper, original, wrapper)
  return(answer)
}
# Challenge - A more advanced function
analyze <- function(filename) {
  # Plots the average, min, and max inflammation over time.
  # Input is character string of a csv file.
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
}
# Challenge - rescale
rescale <- function(v) {
  # Rescales a vector, v, to lie in the range 0 to 1.
  L <- min(v)
  H <- max(v)
  result <- (v - L) / (H - L)
  return(result)
}
# Challenge - A function with default argument values
rescale <- function(v, lower = 0, upper = 1) {
  # Rescales a vector, v, to lie in the range lower to upper.
  L <- min(v)
  H <- max(v)
  result <- (v - L) / (H - L) * (upper - lower) + lower
  return(result)
}
answer <- rescale(dat[, 4], lower = 2, upper = 5)
min(answer)
[1] 2
max(answer)
[1] 5
answer <- rescale(dat[, 4], lower = -5, upper = -2)
min(answer)
[1] -5
max(answer)
[1] -2
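The checks above need the data frame dat. A file-free worked example on a three-element vector, sketched here with made-up values, lets learners verify the arithmetic by hand and also shows what happens for a constant vector, where the denominator is zero.

```r
# Same logic as the rescale() solution, repeated so the sketch is
# self-contained.
rescale <- function(v, lower = 0, upper = 1) {
  L <- min(v)
  H <- max(v)
  (v - L) / (H - L) * (upper - lower) + lower
}

small <- c(1, 2, 3)
unit_range <- rescale(small)                         # 0.0, 0.5, 1.0
custom_range <- rescale(small, lower = 2, upper = 5) # 2.0, 3.5, 5.0

# A constant vector has max == min, so every element becomes 0 / 0 = NaN.
flat <- rescale(c(5, 5, 5))

print(unit_range)
print(custom_range)
print(flat)
```

Whether to guard against the constant-vector case is a good prompt for the defensive programming discussion.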
Analyzing Multiple Data Sets
- The transition from the previous lesson to this one might be challenging for a very novice audience. Do not rush through the challenges, and consider dropping some of them.
# Challenge - Using loops
print_N <- function(N) {
  # seq_len(N) is empty when N is 0, so nothing is printed in that case.
  nseq <- seq_len(N)
  for (num in nseq) {
    print(num)
  }
}
print_N(3)
[1] 1
[1] 2
[1] 3
total <- function(vec) {
  # Calculates the sum of the values in a vector.
  vec_sum <- 0
  for (num in vec) {
    vec_sum <- vec_sum + num
  }
  return(vec_sum)
}
ex_vec <- c(4, 8, 15, 16, 23, 42)
total(ex_vec)
[1] 108
expo <- function(base, power) {
  # seq_len(power) is empty when power is 0, so expo(base, 0) returns 1.
  result <- 1
  for (i in seq_len(power)) {
    result <- result * base
  }
  return(result)
}
expo(2, 4)
[1] 16
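A pitfall worth having ready if learners try these loop functions with zero: seq(0) counts down from 1 to 0, so a loop over it runs twice, whereas seq_len(0) is empty. The helper count_iterations below is hypothetical and exists only to show the difference.

```r
# Hypothetical helper: counts how many times a for loop body runs.
count_iterations <- function(index_values) {
  n <- 0
  for (i in index_values) {
    n <- n + 1
  }
  n
}

print(seq(0))      # the two values 1 and 0
print(seq_len(0))  # an empty integer vector

with_seq <- count_iterations(seq(0))          # loop body runs 2 times
with_seq_len <- count_iterations(seq_len(0))  # loop body runs 0 times
print(c(with_seq, with_seq_len))
```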
# Challenge - Using loops to analyze multiple files
analyze_all <- function(pattern) {
  # Runs the function analyze for each file in the data directory
  # whose name matches the given pattern.
  filenames <- list.files(path = "data", pattern = pattern, full.names = TRUE)
  for (f in filenames) {
    analyze(f)
  }
}
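Learners coming from the shell lesson often expect the pattern argument of list.files() to be a wildcard; it is a regular expression. The sketch below uses a temporary directory with made-up file names to show both a regular expression and a wildcard converted with glob2rx() from the utils package.

```r
# Set up a temporary directory with three empty, made-up files.
demo_dir <- tempfile("demo-data-")
dir.create(demo_dir)
demo_files <- c("inflammation-01.csv", "inflammation-02.csv", "small-01.csv")
invisible(file.create(file.path(demo_dir, demo_files)))

# pattern is a regular expression matched against each file name.
by_regex <- list.files(path = demo_dir, pattern = "^inflammation-.*\\.csv$")

# glob2rx() converts a shell-style wildcard into an equivalent regex.
by_glob <- list.files(path = demo_dir,
                      pattern = utils::glob2rx("inflammation-*.csv"))

print(by_regex)
print(by_glob)
```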
Loops in R
Making Choices
# Challenge - Using conditions to change behaviour
plot_dist <- function(x, threshold) {
  if (length(x) > threshold) {
    boxplot(x)
  } else {
    stripchart(x)
  }
}
plot_dist <- function(x, threshold, use_boxplot = TRUE) {
  # && compares single logical values, which is what if () expects.
  if (length(x) > threshold && use_boxplot) {
    boxplot(x)
  } else if (length(x) > threshold && !use_boxplot) {
    hist(x)
  } else {
    stripchart(x)
  }
}
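If learners ask about the difference between & and && in conditions, a short sketch like the following may help. The values of x, threshold, and use_boxplot are made up for illustration.

```r
# & is element-wise: it returns one result per element.
elementwise <- c(TRUE, FALSE) & c(TRUE, TRUE)
print(elementwise)

# && works on single values and short-circuits: the right-hand side is
# not evaluated when the left-hand side is already FALSE.
short_circuit <- FALSE && stop("this is never evaluated")
print(short_circuit)

# Made-up example values for an if () condition.
x <- c(3, 1, 4, 1, 5)
threshold <- 3
use_boxplot <- TRUE
if (length(x) > threshold && use_boxplot) {
  choice <- "boxplot"
} else {
  choice <- "other"
}
print(choice)
```

Because if () needs exactly one TRUE or FALSE, && and || are the usual choice inside conditions, while & and | are meant for comparing vectors element by element.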
# Challenge - Changing behaviour of the plot command
analyze <- function(filename, output = NULL) {
  # Plots the average, min, and max inflammation over time.
  # Input:
  #    filename: character string of a csv file
  #    output: character string of pdf file for saving
  if (!is.null(output)) {
    pdf(output)
  }
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation, type = "l")
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation, type = "l")
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation, type = "l")
  if (!is.null(output)) {
    dev.off()
  }
}
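As an optional extension for the defensive programming discussion, and not the lesson's own solution, here is a sketch of a variant that closes the pdf device with on.exit(), so the file is finalized even if a plotting call fails. The name analyze_to_pdf and the demo data are assumptions made for this example.

```r
# Hypothetical variant of analyze() that always writes to a pdf file.
analyze_to_pdf <- function(filename, output) {
  dat <- read.csv(file = filename, header = FALSE)
  pdf(output)
  # Close the device when the function exits, even after an error.
  on.exit(dev.off(), add = TRUE)
  plot(apply(dat, 2, mean), type = "l")
  plot(apply(dat, 2, max), type = "l")
  plot(apply(dat, 2, min), type = "l")
  invisible(output)
}

# Demonstration with made-up values written to a temporary csv file.
demo_csv <- tempfile(fileext = ".csv")
writeLines(c("0,1,2,1", "0,2,4,2", "0,1,3,1"), demo_csv)
demo_pdf <- tempfile(fileext = ".pdf")
analyze_to_pdf(demo_csv, demo_pdf)
print(file.exists(demo_pdf))
```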
Best Practices for Using R and Designing Programs
Command-Line Programs
# Challenge - A simple command line program
cat arith.R
main <- function() {
  # Performs addition or subtraction from the command line.
  #
  # Takes three arguments:
  # The first and third are the numbers.
  # The second is either + for addition or - for subtraction.
  #
  # Ex. usage:
  #   Rscript arith.R 1 + 2
  #   Rscript arith.R 3 - 4
  #
  args <- commandArgs(trailingOnly = TRUE)
  num1 <- as.numeric(args[1])
  operation <- args[2]
  num2 <- as.numeric(args[3])
  if (operation == "+") {
    answer <- num1 + num2
    cat(answer)
  } else if (operation == "-") {
    answer <- num1 - num2
    cat(answer)
  } else {
    stop("Invalid input. Use + for addition or - for subtraction.")
  }
}
main()
cat find-pattern.R
main <- function() {
  # Finds all files in the current directory whose names contain a given pattern.
  #
  # Takes one argument: the pattern to be searched.
  #
  # Ex. usage:
  #   Rscript find-pattern.R csv
  #
  args <- commandArgs(trailingOnly = TRUE)
  pattern <- args[1]
  files <- list.files(pattern = pattern)
  cat(files, sep = "\n")
}
main()
# Challenge - A command line program with arguments
cat check.R
main <- function() {
  # Checks that all csv files have the same number of rows and columns.
  #
  # Takes multiple arguments: the names of the files to be checked.
  #
  # Ex. usage:
  #   Rscript check.R inflammation-*
  #
  args <- commandArgs(trailingOnly = TRUE)
  first_file <- read.csv(args[1], header = FALSE)
  first_dim <- dim(first_file)
  for (filename in args[-1]) {
    new_file <- read.csv(filename, header = FALSE)
    new_dim <- dim(new_file)
    # || compares single values: rows first, then columns.
    if (new_dim[1] != first_dim[1] || new_dim[2] != first_dim[2]) {
      cat("Not all the data files have the same dimensions.\n")
    }
  }
}
main()
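One possible alternative for the dimension check, sketched under the assumption that a single TRUE or FALSE answer is enough: compare the whole result of dim() with identical(). The helper same_dimensions and the temporary demo files are hypothetical.

```r
# Hypothetical helper: TRUE when every csv file has the same number of
# rows and columns as the first one.
same_dimensions <- function(filenames) {
  first_dim <- dim(read.csv(filenames[1], header = FALSE))
  for (filename in filenames[-1]) {
    new_dim <- dim(read.csv(filename, header = FALSE))
    if (!identical(new_dim, first_dim)) {
      return(FALSE)
    }
  }
  TRUE
}

# Demonstration with made-up temporary files: two 2 x 3 files, one 3 x 3.
file_a <- tempfile(fileext = ".csv")
file_b <- tempfile(fileext = ".csv")
file_c <- tempfile(fileext = ".csv")
writeLines(c("0,1,2", "3,4,5"), file_a)
writeLines(c("9,8,7", "6,5,4"), file_b)
writeLines(c("0,1,2", "3,4,5", "6,7,8"), file_c)

print(same_dimensions(c(file_a, file_b)))
print(same_dimensions(c(file_a, file_b, file_c)))
```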
# Challenge - Shorter command line arguments
cat readings-usage.R
main <- function() {
  args <- commandArgs(trailingOnly = TRUE)
  action <- args[1]
  filenames <- args[-1]
  if (!(action %in% c("--min", "--mean", "--max"))) {
    usage()
  } else if (length(filenames) == 0) {
    process(file("stdin"), action)
  } else {
    for (f in filenames) {
      process(f, action)
    }
  }
}
process <- function(filename, action) {
  dat <- read.csv(file = filename, header = FALSE)
  if (action == "--min") {
    values <- apply(dat, 1, min)
  } else if (action == "--mean") {
    values <- apply(dat, 1, mean)
  } else if (action == "--max") {
    values <- apply(dat, 1, max)
  }
  cat(values, sep = "\n")
}
usage <- function() {
  cat("usage: Rscript readings-usage.R [--min, --mean, --max] filenames", sep = "\n")
}
main()
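The row-wise summaries in process() can be demonstrated without files or standard input by passing csv content through the text argument of read.csv(). In this sketch, row_summary is a hypothetical function that returns the values instead of printing them, and the two-row input is made up.

```r
# Hypothetical helper: the same branching as process(), but it takes a
# data frame and returns the per-row values instead of printing them.
row_summary <- function(dat, action) {
  if (action == "--min") {
    values <- apply(dat, 1, min)
  } else if (action == "--mean") {
    values <- apply(dat, 1, mean)
  } else if (action == "--max") {
    values <- apply(dat, 1, max)
  } else {
    stop("action must be one of --min, --mean, --max")
  }
  values
}

# Two made-up rows of readings supplied as text rather than as a file.
dat <- read.csv(text = "0,1,2\n3,4,8", header = FALSE)
row_mins <- row_summary(dat, "--min")
row_means <- row_summary(dat, "--mean")
row_maxs <- row_summary(dat, "--max")
print(row_means)
```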
# Challenge - Implementing wc in R
cat line-count.R
main <- function() {
  args <- commandArgs(trailingOnly = TRUE)
  if (length(args) > 0) {
    for (filename in args) {
      input <- readLines(filename)
      num_lines <- length(input)
      cat(filename)
      cat(" ")
      cat(num_lines, sep = "\n")
    }
  } else {
    input <- readLines(file("stdin"))
    num_lines <- length(input)
    cat(num_lines, sep = "\n")
  }
}
main()