Starting out in R, we tend to use it interactively, running one command at a time. In this section we will look at writing R code that performs repetitive task using “for-loops”, packages up commonly needed sequences of operations in “functions”, and makes decisions using “if-statements”.
We are using a couple of packages from CRAN in this tutorial, which we can install with install.packages
. Don’t run this if you are using our biotraining server, the packages are already installed!
# Install the entire Tidyverse collection of packages with:
#
# install.packages("tidyverse")
#
# or just install packages needed for this section with:
#
# install.packages(c(
# "readr", # read tabular data
# "dplyr" # general data frame manipulation
# ))
Starting out in R mostly involves using functions that other people had written. We’re now going to see how to create our own functions. The general pattern of a function is:
FUNCTION_NAME <- function(ARGUMENT_NAME1, ARGUMENT_NAME2, ...) {
FUNCTION_BODY
}
Here is an example:
fahr_to_kelvin <- function(temp) {
(temp-32) * (5/9) + 273.15
}
We define fahr_to_kelvin
by assigning it a function
. The list of argument names are containted within parentheses. Next, the body of the function–the statements that are executed when it runs–is contained within curly braces ({}
). The statements in the body have been indented by four spaces, which makes the code easier to read but does not affect how the code operates.
When we call the function, the values we pass to it are assigned to those variables so that we can use them inside the function. The expression inside the function is evaluated and the result is “returned” back to whoever asked for it.
Let’s try running our function. Calling our own function is no different from calling any other function:
# freezing point of water
fahr_to_kelvin(32)
## [1] 273.15
# boiling point of water
fahr_to_kelvin(212)
## [1] 373.15
Here are some other equivalent ways to write this function.
The body of the function within the {}
can contain multiple lines of code, “statements”. Only the last line is returned.
fahr_to_kelvin <- function(temp) {
kelvin <- (temp-32) * (5/9) + 273.15
kelvin
}
The braces are not actually needed if the function only contains one statement.
fahr_to_kelvin <- function(temp) (temp-32) * (5/9) + 273.15
return
can be used to return from a function immediately. Further code will not be run.
fahr_to_kelvin <- function(temp) {
kelvin <- (temp-32) * (5/9) + 273.15
return(kelvin)
plot(1:10)
}
Statements in the function may be split over several lines, so long as it is clear that the statement is incomplete due to an unclosed bracket or an operator in need of a right hand argument.
fahr_to_kelvin <- function(temp) {
kelvin <-
(temp-32) * (5/9) +
273.15
return(kelvin)
}
Spacing and layout is largely for our own sanity.
fahr_to_kelvin<-
function(
temp){(temp
-32 )*( 5
/9 )+
273.15}
Beware! however of accidentally finishing a statement early.
fahr_to_kelvin_broken <- function(temp) {
(temp-32) * (5/9)
+ 273.15
}
fahr_to_kelvin_broken(212)
## [1] 273.15
The first line in the body of the function is computed then discarded. The second line, + 273.15
, is valid R so we don’t even get an error.
Finally, the computer doesn’t understand or care what names we give to functions, arguments, or variables.
charmander <- function(bulbasaur) {
mew <- (bulbasaur-32) * (5/9) + 273.15
mew
}
Now that we’ve seen how to turn Fahrenheit into Kelvin, it’s easy to turn Kelvin into Celsius:
kelvin_to_celsius <- function(temp) {
temp - 273.15
}
#absolute zero in Celsius
kelvin_to_celsius(0)
## [1] -273.15
What about converting Fahrenheit to Celsius? We could write out the formula, but we don’t need to. Instead, we can compose the two functions we have already created:
fahr_to_celsius <- function(temp) {
temp_k <- fahr_to_kelvin(temp)
temp_c <- kelvin_to_celsius(temp_k)
temp_c
}
# freezing point of water in Celsius
fahr_to_celsius(32.0)
## [1] 0
This is our first taste of how larger programs are built: we define basic operations, then combine them in ever-larger chunks to get the effect we want. Real-life functions will usually be larger than the ones shown here–typically half a dozen to a few dozen lines–but they shouldn’t ever be much longer than that, or the next person who reads it won’t be able to understand what’s going on.
You might have noticed that all of these functions have an argument called temp
. Why hasn’t this caused confusion and chaos? The answer is that arguments to a function and variables defined in the body of a function are local to a call to that function. Within the body of a function, R code can see variables from its own local environment, and variables from the global environment, but not variables local to other function calls. While running code R maintains a stack of local environments. When a call to a function returns, all of its local arguments and variables disappear.
Tip: There are several ways to see this in action.
Scatter cat
s through your functions showing the values of arguments and variables at various points.
Insert a call to browser()
in one of your functions. This will pause execution and let you interact with the local environment in a function. Once in the browser, type “help” for a list of commands. Besides examining variables, you can step through the function from the point it paused with “n” or, at a finer grain, step into functions it calls with “s”. When you are done type “f”.
You can debug an existing function using debugonce
. This is like temporarily adding a call to browser()
at the top of the function.
# debugging a function
debugonce(fahr_to_celsius)
fahr_to_celsius(212)
Write a function to calculate the length of the hypotenuse of a right angled triangle using Pythagorus’s rule, given the lengths of the other sides.
Hint: sqrt
calculates the square root of a number.
Testing your code is important. Invent a test case for your code consisting of:
Confirm that your function works as expected.
There are three common approaches to repetitive tasks in R:
Method 1 is error prone, and also hard to fix if you find a mistake in your code.
We will now look at method 2, using for-loops. The general pattern of a loop is:
for(VARIABLE_NAME in VECTOR) {
FOR_LOOP_BODY
}
Let’s look at an example. Suppose we have a calculation we need to perform on a series of numbers.
i <- 10
cat("i is",i,"\n")
cat("i squared is",i*i,"\n")
i <- 20
cat("i is",i,"\n")
cat("i squared is",i*i,"\n")
i <- 30
cat("i is",i,"\n")
cat("i squared is",i*i,"\n")
i <- 40
cat("i is",i,"\n")
cat("i squared is",i*i,"\n")
i <- 50
cat("i is",i,"\n")
cat("i squared is",i*i,"\n")
## i is 10
## i squared is 100
## i is 20
## i squared is 400
## i is 30
## i squared is 900
## i is 40
## i squared is 1600
## i is 50
## i squared is 2500
Apart from the i <- ...
lines, the code is the same each time. This can be re-written as a for-loop:
for(i in c(10,20,30,40,50)) { # 1
cat("i is",i,"\n") # 2 4 6 8 10
cat("i squared is",i*i,"\n") # 3 5 7 9 11
} #
#
cat("done\n") # 12
# --order-of-execution-->
#
# i= - a a b b c c d d e e e ...
## i is 10
## i squared is 100
## i is 20
## i squared is 400
## i is 30
## i squared is 900
## i is 40
## i squared is 1600
## i is 50
## i squared is 2500
## done
We can name the loop variable anything we like. in
is part of the for
syntax. The body of the loop is enclosed in curly braces { }
. For a single-line loop body the braces aren’t needed, but it is good practice to always include them.
The loop variable i
is assigned each element in the vector in turn, and then the loop body runs. There’s no possibility here or making a mistake when copying and pasting thhe code, and if we later discover we need to change this code, the for-loop version will be much easier to change. It will also be possible to run this code a different number of times by changing the vector used.
cat
s or print
s to the loop body to check what is going on.myvec <- c(10,20,30,40)
total <- 0
for(item in myvec) {
total <- total + item
}
total
myvec
?myvec <- c(10,20,30,40)
for(index in 1:4) {
myvec[index] <- myvec[index] * 2
}
1*2*3*4* ... *10
.numbers <- 1:10
... your code here ...
Let’s say we want to examine the quality of some FASTQ files, which contain reads from a DNA sequencing machine. An experiment has been performed over several days, and we want to run a program called “fastqc” on each of the FASTQ files.
FASTQ is a text format containg a series of DNA sequences and associated quality information. Examine “Day0.fastq” in the “r-more-files” folder.
(This data is a very small portion of the reads from an experiment inducing pluripotency in a mouse cell line, GSE70022.)
Software outside of R is normally run from a “terminal” window, which is rather like the console in RStudio. You type in a command and the operating system runs it for you. The commands you type in are interpreted by something called a “shell”, because it acts like a shell around the operating system. For more information, see the Software Carpentry course on the Unix shell.
In RStudio, you can open a terminal using “Tools/Terminal/New Terminal”. Try this and type in:
uptime
The “uptime” command tells you how long the computer has been running for. R can give a command to the shell using the system
function.
system("uptime")
Here is how we want to run FastQC for day 0. To R the command has no meaning beyond being a character string. It will be interpreted by the shell, which is external to R. The shell will then run FastQC for us.
system("fastqc --extract --outdir . r-more-files/Day0.fastq")
In the “Files” tab of the bottom right pane in RStudio, you should see that some new files and a directory have been created. Click on “Day0_fastqc.html” to examine it. Also examine the file in the “Day0_fastqc” folder called “summary.txt”.
If you don’t have FastQC, or it failed to run for some reason: The expected output files can be found in the folder “r-more-files/fastqc-output”.
Typing out the fastqc command for each day will get repetitive. Let’s use a for-loop to automate this task. The paste0
function lets us “paste” together character strings, so we can use that to construct the commands to run, like this:
# construct a command to run
day <- 0
command <- paste0("fastqc --extract --outdir . r-more-files/Day", day, ".fastq")
command
## [1] "fastqc --extract --outdir . r-more-files/Day0.fastq"
We want to do this for each day, and then use system
to run the resulting command:
# run fastqc on each of the files
days <- c(0,4,7,10,15,20)
for(day in days) {
command <- paste0("fastqc --extract --outdir . r-more-files/Day", day, ".fastq")
cat("Running the command:", command, "\n")
system(command)
}
## Running the command: fastqc --extract --outdir . r-more-files/Day0.fastq
## Running the command: fastqc --extract --outdir . r-more-files/Day4.fastq
## Running the command: fastqc --extract --outdir . r-more-files/Day7.fastq
## Running the command: fastqc --extract --outdir . r-more-files/Day10.fastq
## Running the command: fastqc --extract --outdir . r-more-files/Day15.fastq
## Running the command: fastqc --extract --outdir . r-more-files/Day20.fastq
Note: If you weren’t able to run FastQC earlier, the expected output files can be found in the folder “r-more-files/fastqc-output”. You will need to adjust the filenames appropriately in the R code below.
The summary.txt files are in “tab separated value” format. This is similar to comma separated value format we’ve seen in the introductory R day, but instead of using a comma it uses a special character called a tab to delimit columns. Tabs show up as variable amounts of space in a text editor. In R, they can be written "\t"
.
In base R we could use the function read.delim
to read this file. It’s quite similar to read.csv
. However, let’s use the Tidyverse package readr
to load the file. To use readr
we first need to load it with library
.
library(readr)
We can now read the file like this:
read_tsv("Day0_fastqc/summary.txt")
## Parsed with column specification:
## cols(
## PASS = col_character(),
## `Basic Statistics` = col_character(),
## Day0.fastq = col_character()
## )
## # A tibble: 10 x 3
## PASS `Basic Statistics` Day0.fastq
## <chr> <chr> <chr>
## 1 PASS Per base sequence quality Day0.fastq
## 2 PASS Per tile sequence quality Day0.fastq
## 3 PASS Per sequence quality scores Day0.fastq
## 4 WARN Per base sequence content Day0.fastq
## 5 WARN Per sequence GC content Day0.fastq
## 6 PASS Per base N content Day0.fastq
## 7 PASS Sequence Length Distribution Day0.fastq
## 8 PASS Sequence Duplication Levels Day0.fastq
## 9 WARN Overrepresented sequences Day0.fastq
## 10 PASS Adapter Content Day0.fastq
Oh! There aren’t column headings in this file. We need to tell read_tsv
this.
read_tsv("Day0_fastqc/summary.txt", col_names=FALSE)
## Parsed with column specification:
## cols(
## X1 = col_character(),
## X2 = col_character(),
## X3 = col_character()
## )
## # A tibble: 11 x 3
## X1 X2 X3
## <chr> <chr> <chr>
## 1 PASS Basic Statistics Day0.fastq
## 2 PASS Per base sequence quality Day0.fastq
## 3 PASS Per tile sequence quality Day0.fastq
## 4 PASS Per sequence quality scores Day0.fastq
## 5 WARN Per base sequence content Day0.fastq
## 6 WARN Per sequence GC content Day0.fastq
## 7 PASS Per base N content Day0.fastq
## 8 PASS Sequence Length Distribution Day0.fastq
## 9 PASS Sequence Duplication Levels Day0.fastq
## 10 WARN Overrepresented sequences Day0.fastq
## 11 PASS Adapter Content Day0.fastq
We should tidy this up a bit before using it. Columns should have meaningful names, and the PASS/WARN/FAIL grading looks like it should be a factor:
filename <- "Day0_fastqc/summary.txt"
sumtab <- read_tsv(filename, col_names=FALSE)
## Parsed with column specification:
## cols(
## X1 = col_character(),
## X2 = col_character(),
## X3 = col_character()
## )
colnames(sumtab) <- c("grade", "test", "file")
sumtab$grade <- factor(sumtab$grade, c("FAIL","WARN","PASS"))
sumtab
## # A tibble: 11 x 3
## grade test file
## <fct> <chr> <chr>
## 1 PASS Basic Statistics Day0.fastq
## 2 PASS Per base sequence quality Day0.fastq
## 3 PASS Per tile sequence quality Day0.fastq
## 4 PASS Per sequence quality scores Day0.fastq
## 5 WARN Per base sequence content Day0.fastq
## 6 WARN Per sequence GC content Day0.fastq
## 7 PASS Per base N content Day0.fastq
## 8 PASS Sequence Length Distribution Day0.fastq
## 9 PASS Sequence Duplication Levels Day0.fastq
## 10 WARN Overrepresented sequences Day0.fastq
## 11 PASS Adapter Content Day0.fastq
We expect to have to examine many FastQC reports in future. It will be convenient to have this as a function!
filename
is going to be different each time we use this code, so it wants to be an argument to the function. The rest of the code goes in the body of the function. The returned value will be sumtab
.
load_fastqc <- function(filename) {
sumtab <- read_tsv(filename, col_names=FALSE)
colnames(sumtab) <- c("grade", "test", "file")
sumtab$grade <- factor(sumtab$grade, c("FAIL","WARN","PASS"))
sumtab
}
load_fastqc("Day0_fastqc/summary.txt")
## Parsed with column specification:
## cols(
## X1 = col_character(),
## X2 = col_character(),
## X3 = col_character()
## )
## # A tibble: 11 x 3
## grade test file
## <fct> <chr> <chr>
## 1 PASS Basic Statistics Day0.fastq
## 2 PASS Per base sequence quality Day0.fastq
## 3 PASS Per tile sequence quality Day0.fastq
## 4 PASS Per sequence quality scores Day0.fastq
## 5 WARN Per base sequence content Day0.fastq
## 6 WARN Per sequence GC content Day0.fastq
## 7 PASS Per base N content Day0.fastq
## 8 PASS Sequence Length Distribution Day0.fastq
## 9 PASS Sequence Duplication Levels Day0.fastq
## 10 WARN Overrepresented sequences Day0.fastq
## 11 PASS Adapter Content Day0.fastq
We now want to load each of the summary.txt
files we created earlier. One perfectly valid way would be to use a for-loop. However here we’re going to use an “apply”-style function. “apply”-style functions are functions that call some other function repeatedly on subsets of a data set. R has a variety of functions with names such as apply
and tapply
, and several other functions where the idea is similar such as aggregate
. apply
itself is for use with matrices and produces a vector as output. Other functions in this family have different input and output types.
The flavour of apply we need here is lapply
, which takes a vector and calls a function on each element in turn of that vector (i.e. the “subsets” here are simply individual elements). The result is a list containing the returned values, hence the name lapply
.
First let’s work out the names of the files we want to load.
days <- c(0,4,7,10,15,20)
filenames <- paste0("Day", days, "_fastqc/summary.txt")
filenames
## [1] "Day0_fastqc/summary.txt" "Day4_fastqc/summary.txt"
## [3] "Day7_fastqc/summary.txt" "Day10_fastqc/summary.txt"
## [5] "Day15_fastqc/summary.txt" "Day20_fastqc/summary.txt"
Now we can use lapply
to apply our function to each of the filenames.
sumtabs <- lapply(filenames, load_fastqc)
We now have a list of data frames. Lists are a little different from the vectors we usually work with. See section 20.5 in the “R for data science” book for a comprehensive explanation. Here are some ways to examine this object:
sumtabs
class(sumtabs)
length(sumtabs)
str(sumtabs)
When we print sumtabs
on the console, it gives us a hint how to access individual elements: using the [[ ]]
syntax:
sumtabs[[1]]
sumtabs[[2]]
This is an ungainly data structure, we would like to turn it into a single a single big data frame. The dplyr
package provides a function to do this:
library(dplyr)
bigtab <- bind_rows(sumtabs)
bigtab
## # A tibble: 66 x 3
## grade test file
## <fct> <chr> <chr>
## 1 PASS Basic Statistics Day0.fastq
## 2 PASS Per base sequence quality Day0.fastq
## 3 PASS Per tile sequence quality Day0.fastq
## 4 PASS Per sequence quality scores Day0.fastq
## 5 WARN Per base sequence content Day0.fastq
## 6 WARN Per sequence GC content Day0.fastq
## 7 PASS Per base N content Day0.fastq
## 8 PASS Sequence Length Distribution Day0.fastq
## 9 PASS Sequence Duplication Levels Day0.fastq
## 10 WARN Overrepresented sequences Day0.fastq
## # ... with 56 more rows
table(bigtab$test, bigtab$grade)
##
## FAIL WARN PASS
## Adapter Content 0 0 6
## Basic Statistics 0 0 6
## Overrepresented sequences 0 4 2
## Per base N content 0 0 6
## Per base sequence content 5 1 0
## Per base sequence quality 0 0 6
## Per sequence GC content 2 4 0
## Per sequence quality scores 0 0 6
## Per tile sequence quality 0 0 6
## Sequence Duplication Levels 0 0 6
## Sequence Length Distribution 0 0 6
One final programming concept you need to know about is the “if-statement”. An if-statement lets us do something only if some logical value is TRUE
, or do one thing if it is TRUE
and another if it is FALSE
.
In the introductory R day, we used logical vectors to query a data frame. Everything we learned about making and manipulating logical vectors is applicable here, but with the restriction that we need a single logical value, i.e. a logical vector of length 1.
The general pattern of an if statement is:
if (LOGICAL_VALUE) {
THING_TO_DO_IF_TRUE
} else {
THING_TO_DO_IF_FALSE
}
Here is an example:
num <- 37 # 1
if (num > 100) { # 2
cat("greater\n") #
} else { #
cat("not greater\n") # 3
} #
cat("done\n") # 4
# --time-->
## not greater
## done
The second line of this code uses an if-statement to tell R that we want to make a choice. If the following test is TRUE
, the body of the if
(i.e., the lines in the curly braces underneath it) are executed. If the test is FALSE
, the body of the else
is executed instead. Only one or the other is ever executed.
In the example above, the test num > 100
returns the value FALSE
, which is why the code inside the if
block was skipped and the code inside the else
block was run instead.
If-statements don’t have to include an else
. If there isn’t one, R simply does nothing if the test is FALSE:
num <- 53 # 1
if (num > 100) { # 2
cat("num is greater than 100\n") #
} #
cat("done\n") # 3
# --time-->
## done
We can also chain several tests together when there are more than two options. As an example, here is a function that returns the sign of a number:
sign <- function(num) {
if (num > 0) { # line 1
return(1) # line 2
} else if (num == 0) { # line 3
return(0) # line 4
} else {
return(-1) # line 5
}
}
sign(-3)
## [1] -1
sign(0)
## [1] 0
sign(2/3)
## [1] 1
Which lines of the function sign
executed when it was called above, and in what order?
Our load_fastqc
function will currently fail with an error if the file it is passed does not exist.
load_fastqc("nosuchfile.txt")
## Error: 'nosuchfile.txt' does not exist in current working directory ('/Users/pfh/git/r-more/topics').
Depending on how we intend to use it, we might instead want the function to issue a warning and return NULL.
load_fastqc <- function(filename) {
# Check arguments are sane
if (!file.exists(filename)) {
warning("No such file: ", filename)
return(NULL)
}
# Load and tidy data
sumtab <- read_tsv(filename, col_names=FALSE)
colnames(sumtab) <- c("grade", "test", "file")
sumtab$grade <- factor(sumtab$grade, c("FAIL","WARN","PASS"))
sumtab
}
load_fastqc("nosuchfile.txt")
## Warning in load_fastqc("nosuchfile.txt"): No such file: nosuchfile.txt
## NULL
Checking input arguments before proceeding is a common pattern when writing R functions.
Having developed some useful functions, we might want to re-use them in a future project or share them with a friend. We can store R code in a file, usually with extension “.R”, and load it with the source
function.
Put your load_fastqc
function in a file called “fastqc.R”. It uses the readr
library, so be sure to load the library in the file as well.
# fastqc.R file should contain:
library(readr)
load_fastqc <- function(filename) {
sumtab <- read_tsv(filename, col_names=FALSE)
colnames(sumtab) <- c("grade", "test", "file")
sumtab$grade <- factor(sumtab$grade, c("FAIL","WARN","PASS"))
sumtab
}
# From the console:
source("fastqc.R")
Everything in the file runs as though it were typed on the console.
Similarly, if you are going to be doing several different things with a data set, you could write a .R file to load the data into a tidy form, and several other .R scripts to do various different things with it.
What other R code from this lesson could we put in a .R file? Or a .Rmd file?
How should we break up a large project (paper/thesis/software package) into files?
What about managing data files?
How should we share a project with others?
Packages are the next step up from sourcing .R files. They let you write code that other people can install and then load with library
. Packages generally contain documentation for all functions, and one or more vignettes describing how to use the package.
Hadley Wickham has a great package called devtools
that takes a lot of the pain out of package writing.
# Create an empty package template
library(devtools)
create("mypack")
# ... Edit mylibrary/DESCRIPTION file
# ... Write .R files in mylibrary/R folder
# Load package. Use this during development.
load_all("mypack")
# Build package, including converting inline documentation to .Rd files using roxygen2.
# Check for common problems and missing documentation.
# A CRAN package must pass all checks.
check("mypack")
Packages are most easily distributed by placing them on GitHub or GitLab. They can then be installed by others using the devtools
functions install_github
or install_gitlab
. Once a package is mature and well documented, it can be submitted to CRAN or Bioconductor.
# To install from GitHub:
devtools::install_github("myusername/mypack")
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_0.7.6 readr_1.1.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.18 bindr_0.1.1 knitr_1.20 magrittr_1.5
## [5] hms_0.4.2 tidyselect_0.2.4 R6_2.2.2 rlang_0.2.2
## [9] fansi_0.3.0 stringr_1.3.1 tools_3.5.0 utf8_1.1.4
## [13] cli_1.0.1 htmltools_0.3.6 yaml_2.2.0 rprojroot_1.3-2
## [17] digest_0.6.17 assertthat_0.2.0 tibble_1.4.2 crayon_1.3.4
## [21] bindrcpp_0.2.2 purrr_0.2.5 codetools_0.2-15 glue_1.3.0
## [25] evaluate_0.11 rmarkdown_1.10 stringi_1.2.4 compiler_3.5.0
## [29] pillar_1.3.0 backports_1.1.2 pkgconfig_2.0.2