This lesson covers a collection of packages, primarily by Hadley Wickham, for tidying data and then working with it in tidy form. These packages are collectively known as the “tidyverse”.

The packages we are using in this lesson are all from CRAN, so we can install them with install.packages. Don’t run this if you are using our biotraining server, as the packages are already installed!

# install.packages(c(
#     "tidyverse",
#     "viridis",
#     "broom"
# ))
library(tidyverse) # Load all "tidyverse" libraries.
# OR
# library(readr)   # Read tabular data.
# library(tidyr)   # Data frame tidying functions.
# library(dplyr)   # General data frame manipulation.
# library(ggplot2) # Flexible plotting.

library(viridis)   # Viridis color scale.

These packages usually have useful documentation in the form of “vignettes”. These are readable on the CRAN website, or within R:

vignette()
vignette(package="dplyr")
vignette("dplyr", package="dplyr")

Let’s continue our examination of the FastQC output. If you’re starting fresh for this lesson, you can load the necessary data frame with:

bigtab <- read_csv("r-more-files/fastqc.csv")
## Parsed with column specification:
## cols(
##   grade = col_character(),
##   test = col_character(),
##   file = col_character()
## )

ggplot2 revisited

We saw ggplot2 in the introductory R day. Recall that we could assign columns of a data frame to aesthetics (x and y position, color, and so on), and then add “geom”s to draw the data.

With ggplot2 we can easily view the whole data set.

ggplot(bigtab, aes(x=file,y=test,color=grade)) + 
    geom_point()

With categorical data on the x and y axes, a better geom to use is geom_tile.

ggplot(bigtab, aes(x=file,y=test,fill=grade)) + 
    geom_tile()

Publication quality images

ggplot2 offers a very wide variety of ways to adjust a plot. For categorical aesthetics, usually the first step is ensuring the relevant column is a factor with a meaningful level order.

y_order <- sort(unique(bigtab$test), decreasing=TRUE)  # y axis plots from bottom to top, so reverse
bigtab$test <- factor(bigtab$test, levels=y_order)

x_order <- unique(bigtab$file)
bigtab$file <- factor(bigtab$file, levels=x_order)

# Only necessary if not continuing from previous lesson on programming!
color_order <- c("FAIL", "WARN", "PASS")
bigtab$grade <- factor(bigtab$grade, levels=color_order)

myplot <- ggplot(bigtab, aes(x=file, y=test, fill=grade)) + 
    geom_tile(color="black", size=0.5) +           # Black border on tiles
    scale_fill_manual(                             # Colors, as color hex codes
        values=c("#ee0000","#ffee00","#00aa00")) +
    labs(x="", y="", fill="") +                    # Remove axis labels
    coord_fixed() +                                # Square tiles
    theme_minimal() +                              # Minimal theme, no grey background
    theme(panel.grid=element_blank(),              # No underlying grid lines
          axis.text.x=element_text(                # Vertical text on x axis
              angle=90,vjust=0.5,hjust=0))              
myplot

Plots can be saved with ggsave. Width and height are given in inches, and an image resolution in Dots Per Inch should also be given. The width and height will determine the relative size of text and graphics, so increasing the resolution is best done by adjusting the DPI. Compare the files produced by:

ggsave("plot1.png", myplot, width=5,  height=5,  dpi=600)
ggsave("plot2.png", myplot, width=10, height=10, dpi=300)

It may be necessary to edit a plot further in a program such as Inkscape or Adobe Illustrator. To allow this, the image needs to be saved in a “vector” format such as SVG, EPS, or PDF. In this case, the DPI argument isn’t needed.
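
For example, a PDF version of the plot could be saved like this (a minimal sketch; SVG output may additionally require the svglite package):

ggsave("plot.pdf", myplot, width=5, height=5)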

dplyr

dplyr gives us a collection of convenient “verbs” for manipulating data frames. The name is a contraction of “Data frame apPLYeR”, as some of these verbs have a similar flavour to apply/lapply/tapply/etc.

Each verb takes a data frame as input and returns a modified version of it. The idea is that complex operations can be performed by stringing together a series of simpler operations in a pipeline.

# input         +--------+        +--------+        +--------+      result
#  data   %>%   |  verb  |  %>%   |  verb  |  %>%   |  verb  |  ->   data
#   frame       +--------+        +--------+        +--------+        frame

tibbles

bigtab
## # A tibble: 72 x 3
##    grade test                         file      
##    <fct> <fct>                        <fct>     
##  1 PASS  Basic Statistics             Day0.fastq
##  2 PASS  Per base sequence quality    Day0.fastq
##  3 PASS  Per tile sequence quality    Day0.fastq
##  4 PASS  Per sequence quality scores  Day0.fastq
##  5 FAIL  Per base sequence content    Day0.fastq
##  6 WARN  Per sequence GC content      Day0.fastq
##  7 PASS  Per base N content           Day0.fastq
##  8 PASS  Sequence Length Distribution Day0.fastq
##  9 PASS  Sequence Duplication Levels  Day0.fastq
## 10 WARN  Overrepresented sequences    Day0.fastq
## # ... with 62 more rows

As we used readr to read the data, bigtab is a “tibble”, which is the Tidyverse’s improved data frame. The dplyr verbs also all return their results as tibbles. You can create tibbles explicitly with the tibble function (or data_frame). One convenient feature is that a tibble only shows the first few rows when printed. If you do want to see the whole table, you can use as.data.frame, or use the View function to view it in a tab.

as.data.frame(bigtab)
View(bigtab)
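
For example, a small tibble can be constructed column by column; later columns can even refer to earlier ones:

tibble(x=1:3, y=x*2)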

The n and width arguments to print can also be used to print more rows or columns respectively.

print(bigtab, n=100, width=1000)

filter

Say we want to know all the tests that failed. filter provides a convenient way to get these.

filter(bigtab, grade == "FAIL")
## # A tibble: 8 x 3
##   grade test                      file       
##   <fct> <fct>                     <fct>      
## 1 FAIL  Per base sequence content Day0.fastq 
## 2 FAIL  Per base sequence content Day4.fastq 
## 3 FAIL  Per base sequence content Day7.fastq 
## 4 FAIL  Per sequence GC content   Day7.fastq 
## 5 FAIL  Per base sequence content Day10.fastq
## 6 FAIL  Per base sequence content Day15.fastq
## 7 FAIL  Per base sequence content Day20.fastq
## 8 FAIL  Per sequence GC content   Day20.fastq

Something is magic here: we do not have grade in our environment, only within the data frame. Arguments to filter can use any column of the data frame as a variable. dplyr uses this idea heavily; arguments to dplyr verbs often behave in a special way.
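
For example, filter accepts multiple conditions, which are combined with AND:

filter(bigtab, grade == "FAIL", file == "Day0.fastq")   # the single FAIL in Day0.fastq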

arrange

Rather than filtering, we might instead want to sort the data so the most important rows are at the top. arrange sorts a data frame by one or more columns.

arrange(bigtab, grade)
## # A tibble: 72 x 3
##    grade test                      file       
##    <fct> <fct>                     <fct>      
##  1 FAIL  Per base sequence content Day0.fastq 
##  2 FAIL  Per base sequence content Day4.fastq 
##  3 FAIL  Per base sequence content Day7.fastq 
##  4 FAIL  Per sequence GC content   Day7.fastq 
##  5 FAIL  Per base sequence content Day10.fastq
##  6 FAIL  Per base sequence content Day15.fastq
##  7 FAIL  Per base sequence content Day20.fastq
##  8 FAIL  Per sequence GC content   Day20.fastq
##  9 WARN  Per sequence GC content   Day0.fastq 
## 10 WARN  Overrepresented sequences Day0.fastq 
## # ... with 62 more rows
# desc( ) can be used to reverse the sort order
arrange(bigtab, desc(grade))
## # A tibble: 72 x 3
##    grade test                         file      
##    <fct> <fct>                        <fct>     
##  1 PASS  Basic Statistics             Day0.fastq
##  2 PASS  Per base sequence quality    Day0.fastq
##  3 PASS  Per tile sequence quality    Day0.fastq
##  4 PASS  Per sequence quality scores  Day0.fastq
##  5 PASS  Per base N content           Day0.fastq
##  6 PASS  Sequence Length Distribution Day0.fastq
##  7 PASS  Sequence Duplication Levels  Day0.fastq
##  8 PASS  Adapter Content              Day0.fastq
##  9 PASS  Kmer Content                 Day0.fastq
## 10 PASS  Basic Statistics             Day4.fastq
## # ... with 62 more rows

select

dplyr’s select function is for subsetting, renaming, and reordering columns. Old columns can be referred to by name or by number.

select(bigtab, test,grade) 
## # A tibble: 72 x 2
##    test                         grade
##    <fct>                        <fct>
##  1 Basic Statistics             PASS 
##  2 Per base sequence quality    PASS 
##  3 Per tile sequence quality    PASS 
##  4 Per sequence quality scores  PASS 
##  5 Per base sequence content    FAIL 
##  6 Per sequence GC content      WARN 
##  7 Per base N content           PASS 
##  8 Sequence Length Distribution PASS 
##  9 Sequence Duplication Levels  PASS 
## 10 Overrepresented sequences    WARN 
## # ... with 62 more rows
select(bigtab, 2,1)
## # A tibble: 72 x 2
##    test                         grade
##    <fct>                        <fct>
##  1 Basic Statistics             PASS 
##  2 Per base sequence quality    PASS 
##  3 Per tile sequence quality    PASS 
##  4 Per sequence quality scores  PASS 
##  5 Per base sequence content    FAIL 
##  6 Per sequence GC content      WARN 
##  7 Per base N content           PASS 
##  8 Sequence Length Distribution PASS 
##  9 Sequence Duplication Levels  PASS 
## 10 Overrepresented sequences    WARN 
## # ... with 62 more rows
select(bigtab, foo=file, bar=test, baz=grade)
## # A tibble: 72 x 3
##    foo        bar                          baz  
##    <fct>      <fct>                        <fct>
##  1 Day0.fastq Basic Statistics             PASS 
##  2 Day0.fastq Per base sequence quality    PASS 
##  3 Day0.fastq Per tile sequence quality    PASS 
##  4 Day0.fastq Per sequence quality scores  PASS 
##  5 Day0.fastq Per base sequence content    FAIL 
##  6 Day0.fastq Per sequence GC content      WARN 
##  7 Day0.fastq Per base N content           PASS 
##  8 Day0.fastq Sequence Length Distribution PASS 
##  9 Day0.fastq Sequence Duplication Levels  PASS 
## 10 Day0.fastq Overrepresented sequences    WARN 
## # ... with 62 more rows

You can also remove specific columns like this:

select(bigtab, -file)
## # A tibble: 72 x 2
##    grade test                        
##    <fct> <fct>                       
##  1 PASS  Basic Statistics            
##  2 PASS  Per base sequence quality   
##  3 PASS  Per tile sequence quality   
##  4 PASS  Per sequence quality scores 
##  5 FAIL  Per base sequence content   
##  6 WARN  Per sequence GC content     
##  7 PASS  Per base N content          
##  8 PASS  Sequence Length Distribution
##  9 PASS  Sequence Duplication Levels 
## 10 WARN  Overrepresented sequences   
## # ... with 62 more rows

The magic here is that column names act like column numbers. select subsets columns rather like subsetting a vector with [ ]. So -file becomes -1, and negative numbers remove items. We can also refer to ranges of columns, such as grade:file, which becomes 1:3, i.e. c(1,2,3).
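
For example, with the columns of bigtab:

select(bigtab, grade:test)   # a range of columns by name, equivalent to select(bigtab, 1:2)
select(bigtab, -1)           # negative numbers remove columns, like -grade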

Joins

Say we want to convert PASS/WARN/FAIL into a numeric score, so we can produce some numerical summaries. The scoring scheme will be:

fwp <- c("FAIL","WARN","PASS")
scoring <- tibble(grade=factor(fwp,levels=fwp), score=c(0,0.5,1))

# Or: 
# scoring <- data.frame(grade=factor(fwp,levels=fwp), score=c(0,0.5,1))

scoring
## # A tibble: 3 x 2
##   grade score
##   <fct> <dbl>
## 1 FAIL    0  
## 2 WARN    0.5
## 3 PASS    1

In the introductory R day, we saw the merge function. dplyr has its own version of this in the form of several functions: left_join, right_join, inner_join, full_join, and anti_join.

The difference between these functions is what happens when a row in one data frame has no corresponding row in the other. inner_join discards such rows. full_join always keeps them, filling in missing data with NA. left_join always keeps rows from the first data frame, and right_join always keeps rows from the second. anti_join is a bit different: it gives you the rows from the first data frame that aren’t in the second data frame.
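
A toy example, using two made-up tibbles, makes the differences concrete:

a <- tibble(id=c(1,2), a_val=c("A1","A2"))
b <- tibble(id=c(2,3), b_val=c("B2","B3"))
inner_join(a, b, by="id")   # id 2 only: a row must match in both
full_join(a, b, by="id")    # ids 1, 2 and 3, with NAs filling the gaps
left_join(a, b, by="id")    # ids 1 and 2: all rows of a are kept
anti_join(a, b, by="id")    # id 1 only: rows of a with no match in b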

Often we use a join to augment a data frame with some extra information. left_join is a good default choice for this as it will never delete rows from the data frame that is being augmented.

scoretab <- left_join(bigtab, scoring, by="grade")
scoretab
## # A tibble: 72 x 4
##    grade test                         file       score
##    <fct> <fct>                        <fct>      <dbl>
##  1 PASS  Basic Statistics             Day0.fastq   1  
##  2 PASS  Per base sequence quality    Day0.fastq   1  
##  3 PASS  Per tile sequence quality    Day0.fastq   1  
##  4 PASS  Per sequence quality scores  Day0.fastq   1  
##  5 FAIL  Per base sequence content    Day0.fastq   0  
##  6 WARN  Per sequence GC content      Day0.fastq   0.5
##  7 PASS  Per base N content           Day0.fastq   1  
##  8 PASS  Sequence Length Distribution Day0.fastq   1  
##  9 PASS  Sequence Duplication Levels  Day0.fastq   1  
## 10 WARN  Overrepresented sequences    Day0.fastq   0.5
## # ... with 62 more rows

“grade” is here acting as the key, allowing corresponding rows in the data frames to be joined together.

One important thing that all the join functions do: if multiple rows have the same key in either data frame, all ways of combining the two sets of rows will be included in the result. So, here, rows from the scoring data frame have been copied many times in the output.
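
A small sketch with made-up tibbles shows this row multiplication:

x <- tibble(key=c("k","k"), left_val=1:2)
y <- tibble(key=c("k","k"), right_val=3:4)
left_join(x, y, by="key")   # 4 rows: every pairing of the two "k" rows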

Joins are the key tool for integrating different types of data, based on some common key such as gene id.

See also: match and merge in base R.
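
For comparison, the left join above could be written in base R as below, although merge does not preserve the order of rows:

merge(bigtab, scoring, by="grade", all.x=TRUE)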

Challenge

Using the newly created score column:

  1. Filter scoretab to get only “WARN” or “FAIL” grades.

  2. Sort the result so that “FAIL”s are at the top.

mutate

mutate lets us add or overwrite columns by computing a new value for them.

mutate(scoretab, doublescore = score*2)
## # A tibble: 72 x 5
##    grade test                         file       score doublescore
##    <fct> <fct>                        <fct>      <dbl>       <dbl>
##  1 PASS  Basic Statistics             Day0.fastq   1             2
##  2 PASS  Per base sequence quality    Day0.fastq   1             2
##  3 PASS  Per tile sequence quality    Day0.fastq   1             2
##  4 PASS  Per sequence quality scores  Day0.fastq   1             2
##  5 FAIL  Per base sequence content    Day0.fastq   0             0
##  6 WARN  Per sequence GC content      Day0.fastq   0.5           1
##  7 PASS  Per base N content           Day0.fastq   1             2
##  8 PASS  Sequence Length Distribution Day0.fastq   1             2
##  9 PASS  Sequence Duplication Levels  Day0.fastq   1             2
## 10 WARN  Overrepresented sequences    Day0.fastq   0.5           1
## # ... with 62 more rows

Even though this is called “mutate”, it does not literally modify the input. Rather, it produces a copy of the data frame that has the modification.

This is equivalent to:

scoretab2 <- scoretab
scoretab2$doublescore <- scoretab2$score * 2

summarize

summarize lets us compute summaries of data.

summarize(scoretab, total=sum(score))
## # A tibble: 1 x 1
##   total
##   <dbl>
## 1    60

We really want to summarize the data grouped by file, so we can see if there are any particularly bad files. This is achieved using the group_by “adjective”. group_by gives the data frame a grouping using one or more columns, which modifies the subsequent call to summarize.

group_by(scoretab, file)
## # A tibble: 72 x 4
## # Groups:   file [6]
##    grade test                         file       score
##    <fct> <fct>                        <fct>      <dbl>
##  1 PASS  Basic Statistics             Day0.fastq   1  
##  2 PASS  Per base sequence quality    Day0.fastq   1  
##  3 PASS  Per tile sequence quality    Day0.fastq   1  
##  4 PASS  Per sequence quality scores  Day0.fastq   1  
##  5 FAIL  Per base sequence content    Day0.fastq   0  
##  6 WARN  Per sequence GC content      Day0.fastq   0.5
##  7 PASS  Per base N content           Day0.fastq   1  
##  8 PASS  Sequence Length Distribution Day0.fastq   1  
##  9 PASS  Sequence Duplication Levels  Day0.fastq   1  
## 10 WARN  Overrepresented sequences    Day0.fastq   0.5
## # ... with 62 more rows
summarize(group_by(scoretab, file), average_score=mean(score))
## # A tibble: 6 x 2
##   file        average_score
##   <fct>               <dbl>
## 1 Day0.fastq          0.833
## 2 Day4.fastq          0.875
## 3 Day7.fastq          0.833
## 4 Day10.fastq         0.833
## 5 Day15.fastq         0.833
## 6 Day20.fastq         0.792

The special function n() can be used within summarize to get the number of rows. (This also works in mutate, but is most useful in summarize.)

summarize(group_by(scoretab, grade), count=n())
## # A tibble: 3 x 2
##   grade count
##   <fct> <int>
## 1 FAIL      8
## 2 WARN      8
## 3 PASS     56

So summarize lets us do the things we do in base R with table and tapply, while staying tidy by working with data frames.
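
For comparison, a sketch of base R equivalents of the two summaries above:

table(scoretab$grade)                         # counts per grade
tapply(scoretab$score, scoretab$file, mean)   # mean score per file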

Tip: group_by also affects other verbs, such as mutate. This is more advanced than we will be going into today. Grouping can be removed with ungroup.
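
As a minimal sketch of a grouped mutate, here each file’s scores are centered on that file’s mean:

scoretab %>%
    group_by(file) %>%
    mutate(centered_score = score - mean(score)) %>%   # mean(score) is computed per file
    ungroup()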

See also functions for common uses of summarize: count, distinct.
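
For example, count collapses the group_by/summarize/n() pattern into a single call (the count column is named n):

count(scoretab, grade)
distinct(scoretab, grade)   # just the unique values of grade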

The pipe %>%

We often want to string together a series of dplyr functions. This is achieved using the pipe operator, %>% (defined in the magrittr package and re-exported by dplyr). It takes the value on the left and passes it as the first argument to the function call on the right. So the previous summarization could be written:

scoretab %>% group_by(grade) %>% summarize(count=n())

%>% isn’t limited to dplyr functions. It’s an alternative way of writing any R code.

rep(paste("hello", "world"), 5)
## [1] "hello world" "hello world" "hello world" "hello world" "hello world"
"hello" %>% paste("world") %>% rep(5)
## [1] "hello world" "hello world" "hello world" "hello world" "hello world"

The idea of adding geoms in ggplot2 is rather like the dplyr pipe. ggplot2 predates dplyr, and Hadley Wickham’s ideas have progressed over time. It will probably be possible to use %>% instead of + in some successor to ggplot2 in the not too distant future.

tidyr

tidyr is the Tidyverse package for getting data frames into tidy form. Recall that in a tidy data frame, each row is a single unit of observation and each column is a single piece of information.

I’m not convinced tidyr is a great package, but the ideas behind it are very important. dplyr uses magic arguments to produce very beautiful R code, but in tidyr it gets a bit confusing. Anyway, let’s work through a toy example:

untidy <- read_csv(
    "country,     male-young, male-old, female-young, female-old
     Australia,            1,        2,            3,          4
     New Zealand,          5,        6,            7,          8")
untidy
## # A tibble: 2 x 5
##   country     `male-young` `male-old` `female-young` `female-old`
##   <chr>              <int>      <int>          <int>        <int>
## 1 Australia              1          2              3            4
## 2 New Zealand            5          6              7            8

The first problem is that rows are not distinct units of observation; there are actually four observations per row. This is fixed using gather.

gathered <- gather(untidy, key=group, value=cases, -country)
gathered
## # A tibble: 8 x 3
##   country     group        cases
##   <chr>       <chr>        <int>
## 1 Australia   male-young       1
## 2 New Zealand male-young       5
## 3 Australia   male-old         2
## 4 New Zealand male-old         6
## 5 Australia   female-young     3
## 6 New Zealand female-young     7
## 7 Australia   female-old       4
## 8 New Zealand female-old       8

We give a data frame, and names for the key and value columns to be created. The remaining arguments behave like select; here we are saying that the country column shouldn’t be gathered.
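
The same gathering could also be written by naming the columns to gather explicitly (backticks are needed because of the dashes):

gather(untidy, key=group, value=cases, `male-young`:`female-old`)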

In the introductory R day we saw melt from the reshape2 package, which did a similar thing to matrices.

spread is the opposite operation: it spreads a column out into many columns. This generally makes data less tidy, but it is sometimes necessary. A common application is producing a scatter plot, where the x and y axes need to be two different columns even though they measure the same type of thing. Data may also be easier to look at in a table in spread form.

spread(bigtab, key=file, value=grade)
## # A tibble: 12 x 7
##    test  Day0.fastq Day4.fastq Day7.fastq Day10.fastq Day15.fastq
##    <fct> <fct>      <fct>      <fct>      <fct>       <fct>      
##  1 Sequ… PASS       PASS       PASS       PASS        PASS       
##  2 Sequ… PASS       PASS       PASS       PASS        PASS       
##  3 Per … PASS       PASS       PASS       PASS        PASS       
##  4 Per … PASS       PASS       PASS       PASS        PASS       
##  5 Per … WARN       WARN       FAIL       WARN        WARN       
##  6 Per … PASS       PASS       PASS       PASS        PASS       
##  7 Per … FAIL       FAIL       FAIL       FAIL        FAIL       
##  8 Per … PASS       PASS       PASS       PASS        PASS       
##  9 Over… WARN       PASS       PASS       WARN        WARN       
## 10 Kmer… PASS       PASS       PASS       PASS        PASS       
## 11 Basi… PASS       PASS       PASS       PASS        PASS       
## 12 Adap… PASS       PASS       PASS       PASS        PASS       
## # ... with 1 more variable: Day20.fastq <fct>

In our toy example, we have a further problem that the “group” column contains two pieces of data. This can be fixed with separate. By default separate splits on any non-alphanumeric characters, but different separator characters can be specified.

separate(gathered, col=group, into=c("gender","age"))
## # A tibble: 8 x 4
##   country     gender age   cases
##   <chr>       <chr>  <chr> <int>
## 1 Australia   male   young     1
## 2 New Zealand male   young     5
## 3 Australia   male   old       2
## 4 New Zealand male   old       6
## 5 Australia   female young     3
## 6 New Zealand female young     7
## 7 Australia   female old       4
## 8 New Zealand female old       8
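
The separator can also be given explicitly, which produces the same result here:

separate(gathered, col=group, into=c("gender","age"), sep="-")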

All of this would typically be written as a single pipeline:

tidied <- untidy %>%
    gather(key=group, value=cases, -country) %>%
    separate(group, into=c("gender","age"))

Advanced: tidyr has a number of other useful data frame manipulations. For completeness, we mention nest. This creates a list column in a tibble, with the list column containing one tibble per row. Yo, Hadley put tibbles in your tibble so you can dplyr while you dplyr. The inverse operation is unnest, which is rather like the bind_rows function we saw earlier. nest and unnest go well with creating or modifying list columns in a mutate with lapply (or the purrr package).

# Advanced
nested <- nest(gathered, -country)
nested
nested$data
unnest(nested)

Challenge

You receive data on a set of points. The points are in two dimensions (dim), and each point has x and y coordinates. Unfortunately it looks like this:

df <- read_csv(
    "dim, A_1, A_2, B_1, B_2, B_3, B_4, B_5
     x,   2,   4,   1,   2,   3,   4,   5
     y,   4,   4,   2,   1,   1,   1,   2")

  1. Tidy the data by gathering all of the columns except dim. What does each row now represent?

  2. We want to plot the points as a scatter-plot, using either plot or ggplot. Spread the gathered data so that this is possible. Now what do the rows represent?

  3. What other tidying operation could be applied to this data?

An RNA-Seq example

We will now put this all together by looking at a small-scale gene expression data set. To simplify somewhat: the data was generated from a multiplex PCR targeting 44 genes. The PCR product is quantified by counting reads from high-throughput sequencing. The experiment is of yeast released synchronously into cell cycling, so that we can see which genes are active at different points in the cycle. Several strains of yeast have been used: a wildtype and several gene knock-downs. Each has nine time points, covering two cell cycles (two hours).

This is much smaller than a typical RNA-Seq dataset. We are only looking at 44 genes, rather than all of the thousands of genes in the yeast genome.

Recall the three activities involved in data analysis. This dataset requires all of them.

Tidying

Hint: You can select from the top of a pipeline to partway through and press Ctrl-Enter or Command-Enter to see the intermediate value.

# Use readr's read_csv function to load the file
counts_untidy <- read_csv("r-more-files/read-counts.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Feature = col_character()
## )
## See spec(...) for full column specifications.
counts <- counts_untidy %>%
    gather(key=sample, value=count, -Feature, factor_key=TRUE) %>%
    separate(sample, sep=":", into=c("strain","time"), convert=TRUE, remove=FALSE) %>%
    mutate(
        strain = factor(strain, levels=c("WT","SET1","RRP6","SET1-RRP6"))
    ) %>%
    filter(time >= 2) %>%
    select(sample, strain, time, gene=Feature, count)

summary(counts)
##      sample           strain         time        gene          
##  WT:2   :  45   WT       :405   Min.   : 2   Length:1620       
##  WT:3   :  45   SET1     :405   1st Qu.: 4   Class :character  
##  WT:4   :  45   RRP6     :405   Median : 6   Mode  :character  
##  WT:5   :  45   SET1-RRP6:405   Mean   : 6                     
##  WT:6   :  45                   3rd Qu.: 8                     
##  WT:7   :  45                   Max.   :10                     
##  (Other):1350                                                  
##      count       
##  Min.   :     0  
##  1st Qu.:   172  
##  Median :   666  
##  Mean   :  3612  
##  3rd Qu.:  2548  
##  Max.   :143592  
## 

Time point 1 was omitted because it isn’t actually part of the time series; it’s from cells in an unsynchronized state. If we wanted to be even tidier, the remaining time points could be recoded with the actual time within the experiment, as they are at 15-minute intervals.
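
A sketch of such a recoding, assuming time point 2 is the first sample at 0 minutes (this offset is an assumption, not something stated in the data):

mutate(counts, minutes=(time-2)*15)   # hypothetical recoding: samples are 15 minutes apart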

Transformation and normalization

ggplot(counts, aes(x=sample, y=count)) + 
    geom_boxplot() + 
    coord_flip()

We immediately see from a Tukey box plot that the data is far from normal: there are outliers many box-widths to the right of the box. Perhaps a log transformation is in order.

ggplot(counts, aes(x=sample, y=log2(count))) + 
    geom_boxplot() + 
    coord_flip()
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).

ggplot(counts, aes(x=log2(count), group=sample)) + 
    geom_line(stat="density")
## Warning: Removed 14 rows containing non-finite values (stat_density).

Log transformation has greatly improved things, although now there are more outliers left of the whiskers, and some zeros we will need to be careful of. We also see that one of the samples has fewer reads than the others.

The gene SRP68 was included as a control housekeeping gene. Its expression level should be the same in all samples. We will divide counts in each sample by the count for SRP68 to correct for library size, then log-transform the data. We add a small constant when log-transforming so that zeros do not become negative infinity.

normalizer <- counts %>%
    filter(gene == "SRP68") %>%
    select(sample, norm=count)

moderation <- 0.5/mean(normalizer$norm)

counts_norm <- counts %>%
    left_join(normalizer, by="sample") %>%
    mutate(
        norm_count = count/norm, 
        log_norm_count = log2(norm_count+moderation)
    )

ggplot(counts_norm, aes(x=sample, y=log_norm_count)) + 
    geom_boxplot() + 
    coord_flip()

Note: Here all of the genes are potentially differentially expressed. For a full RNA-Seq data set we would probably assume most genes aren’t differentially expressed, so rather than nominating a normalizing gene we would normalize by total library size, typically with a slight adjustment using the “TMM” method as calculated by calcNormFactors in edgeR:

# For a full sized RNA-Seq dataset:
library(edgeR)
mat <- counts_untidy %>% select(-Feature) %>% as.matrix
adjusted_lib_sizes <- colSums(mat) * calcNormFactors(mat)
normalizer_by_tmm <- tibble(sample=names(adjusted_lib_sizes), norm=adjusted_lib_sizes)

Visualization

The data is now suitably transformed and normalized for visualization. ggplot2 gives us a lot of freedom to juxtapose information from different columns.

Challenge

  1. Get all the rows in counts_norm relating to the histone gene “HHT1”.

  2. Plot this data with ggplot2. Use time as the x-axis, log_norm_count as the y-axis, and color the data by strain. Try using geoms: geom_point(), geom_line().

ggplot( ... , aes(x= ... , y= ... , color= ... )) + ...

Extensions:

Compare plots of log_norm_count, norm_count, and count.

Experiment with other geoms and other ways to assign columns to aesthetics.

Whole dataset visualization

ggplot(counts_norm, aes(x=sample, y=gene, fill=log_norm_count)) + 
    geom_tile() + 
    scale_fill_viridis() + 
    theme_minimal() +
    theme(axis.text.x=element_text(           # Vertical text on x axis
              angle=90,vjust=0.5,hjust=1))              

ggplot(counts_norm, aes(x=time, y=gene, fill=log_norm_count)) + 
    geom_tile() + 
    facet_grid(~ strain) + 
    scale_fill_viridis() + 
    theme_minimal()

ggplot(counts_norm, aes(x=time, y=strain, fill=log_norm_count)) + 
    geom_tile() + 
    facet_wrap(~ gene) + 
    scale_fill_viridis() + 
    theme_minimal()

ggplot(counts_norm, aes(x=time, y=log_norm_count, color=strain, group=strain)) + 
    geom_line() + 
    facet_wrap(~ gene, scale="free")

Exercises

  1. Which are the three most variable genes?

Hint: intToUtf8(utf8ToInt("xvh#jurxsbe|/#vxppdul}h/#vg/#dqg#duudqjh")-3)

  2. Different genes have different average expression levels, but what we are interested in is how they change over time. Further normalize the data by subtracting the average for each gene from log_norm_count.

Appendix: Tidy linear modelling

The idea of this section is to give a taste of what is possible, so that you know what topics to read up on or ask an expert for help with if this is the sort of analysis you need. It will also provide a little context for advanced forms of RNA-Seq differential expression analysis.

sut476 <- counts_norm %>% filter(gene=="SUT476") 
sut476_wt <- sut476 %>% filter(strain=="WT")

ggplot(sut476_wt,aes(x=time,y=log_norm_count)) +
    geom_point() +
    geom_smooth(method="lm")

ggplot2 includes a handy geom for fitting a curve or line to data, but what exactly is it doing? What if we want to work with the fitted curve ourselves? ggplot2 can use various curve-fitting methods, the simplest of which is a linear model, with “lm”.

model <- lm(log_norm_count ~ time, data=sut476_wt)
model
## 
## Call:
## lm(formula = log_norm_count ~ time, data = sut476_wt)
## 
## Coefficients:
## (Intercept)         time  
##    -5.21805      0.08813
# Note: The p-values listed by summary aren't always meaningful.
# It makes no sense to remove the intercept term but retain the time term.
# Significance testing can better be done with anova() and the multcomp package.
summary(model)
## 
## Call:
## lm(formula = log_norm_count ~ time, data = sut476_wt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9458 -0.4717 -0.1152  0.2420  1.5235 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.21805    0.66749  -7.817 0.000106 ***
## time         0.08813    0.10219   0.862 0.417017    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7915 on 7 degrees of freedom
## Multiple R-squared:  0.09605,    Adjusted R-squared:  -0.03309 
## F-statistic: 0.7438 on 1 and 7 DF,  p-value: 0.417
model.matrix(log_norm_count ~ time, data=sut476_wt)
##   (Intercept) time
## 1           1    2
## 2           1    3
## 3           1    4
## 4           1    5
## 5           1    6
## 6           1    7
## 7           1    8
## 8           1    9
## 9           1   10
## attr(,"assign")
## [1] 0 1

A model can be tested against a simpler null model using anova. Is the fit sufficiently better to justify the extra complexity?

null_model <- lm(log_norm_count ~ 1, data=sut476_wt)
anova(null_model, model)
## Analysis of Variance Table
## 
## Model 1: log_norm_count ~ 1
## Model 2: log_norm_count ~ time
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1      8 4.8518                           
## 2      7 4.3858  1   0.46601 0.7438  0.417

Compared to the null hypothesis that expression is constant, a linear trend is not supported by this data. Also useful for hypothesis testing, the multcomp package can be used to test comparisons of interest, such as a set of pairwise comparisons. This is similar to the testing of contrasts in packages for differential expression analysis such as limma, edgeR, or DESeq2.

augment from the broom package gives various diagnostic statistics.

library(broom)

augment(model, sut476_wt) %>% head
## # A tibble: 6 x 15
##   sample strain  time gene  count  norm norm_count log_norm_count .fitted
##   <fct>  <fct>  <int> <chr> <dbl> <dbl>      <dbl>          <dbl>   <dbl>
## 1 WT:2   WT         2 SUT4…     8    92     0.0870          -3.52   -5.04
## 2 WT:3   WT         3 SUT4…    10   503     0.0199          -5.63   -4.95
## 3 WT:4   WT         4 SUT4…    28   989     0.0283          -5.13   -4.87
## 4 WT:5   WT         5 SUT4…    23  1236     0.0186          -5.72   -4.78
## 5 WT:6   WT         6 SUT4…    31  1122     0.0276          -5.16   -4.69
## 6 WT:7   WT         7 SUT4…    47  1054     0.0446          -4.48   -4.60
## # ... with 6 more variables: .se.fit <dbl>, .resid <dbl>, .hat <dbl>,
## #   .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
# augment() can also be used to produce predictions for new inputs
augment(model, newdata=sut476_wt[1:5,])
## # A tibble: 5 x 10
##   sample strain  time gene  count  norm norm_count log_norm_count .fitted
## * <fct>  <fct>  <int> <chr> <dbl> <dbl>      <dbl>          <dbl>   <dbl>
## 1 WT:2   WT         2 SUT4…     8    92     0.0870          -3.52   -5.04
## 2 WT:3   WT         3 SUT4…    10   503     0.0199          -5.63   -4.95
## 3 WT:4   WT         4 SUT4…    28   989     0.0283          -5.13   -4.87
## 4 WT:5   WT         5 SUT4…    23  1236     0.0186          -5.72   -4.78
## 5 WT:6   WT         6 SUT4…    31  1122     0.0276          -5.16   -4.69
## # ... with 1 more variable: .se.fit <dbl>
# Let's look at the model fit
augment(model, sut476_wt) %>%
ggplot(aes(x=time)) + 
    geom_point(aes(y=log_norm_count)) +
    geom_line(aes(y=.fitted))

We see the initial timepoint in the wildtype strain is wildly off the linear trend. Possibly early time points need to be removed, or (more likely!) a linear trend is not a good model of this data.

Linear models can be much more complicated than this. They can include multiple explanatory variables, including categorical variables, and interactions between them, and even polynomials.

model2 <- lm(log_norm_count ~ strain*poly(time,3), data=sut476)
model2
## 
## Call:
## lm(formula = log_norm_count ~ strain * poly(time, 3), data = sut476)
## 
## Coefficients:
##                    (Intercept)                      strainSET1  
##                        -4.6893                          1.4081  
##                     strainRRP6                 strainSET1-RRP6  
##                         4.2133                          3.5053  
##                 poly(time, 3)1                  poly(time, 3)2  
##                         1.3653                          2.8553  
##                 poly(time, 3)3       strainSET1:poly(time, 3)1  
##                        -2.4314                          2.0542  
##      strainRRP6:poly(time, 3)1  strainSET1-RRP6:poly(time, 3)1  
##                        -2.0608                         -0.5208  
##      strainSET1:poly(time, 3)2       strainRRP6:poly(time, 3)2  
##                        -3.9125                         -2.9297  
## strainSET1-RRP6:poly(time, 3)2       strainSET1:poly(time, 3)3  
##                        -2.1218                          1.9517  
##      strainRRP6:poly(time, 3)3  strainSET1-RRP6:poly(time, 3)3  
##                         2.1368                          2.2349
augment(model2, sut476) %>%
ggplot(aes(x=time, color=strain, group=strain)) + 
    geom_point(aes(y=log_norm_count), alpha=0.5) +
    geom_line(aes(y=.fitted))

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] broom_0.5.0       bindrcpp_0.2.2    viridis_0.5.1    
##  [4] viridisLite_0.3.0 forcats_0.3.0     stringr_1.3.1    
##  [7] dplyr_0.7.6       purrr_0.2.5       readr_1.1.1      
## [10] tidyr_0.8.1       tibble_1.4.2      ggplot2_3.0.0    
## [13] tidyverse_1.2.1  
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.4 reshape2_1.4.3   haven_1.1.2      lattice_0.20-35 
##  [5] colorspace_1.3-2 htmltools_0.3.6  yaml_2.2.0       utf8_1.1.4      
##  [9] rlang_0.2.2      pillar_1.3.0     glue_1.3.0       withr_2.1.2     
## [13] modelr_0.1.2     readxl_1.1.0     bindr_0.1.1      plyr_1.8.4      
## [17] munsell_0.5.0    gtable_0.2.0     cellranger_1.1.0 rvest_0.3.2     
## [21] evaluate_0.11    labeling_0.3     knitr_1.20       fansi_0.3.0     
## [25] Rcpp_0.12.18     scales_1.0.0     backports_1.1.2  jsonlite_1.5    
## [29] gridExtra_2.3    hms_0.4.2        digest_0.6.17    stringi_1.2.4   
## [33] grid_3.5.0       rprojroot_1.3-2  cli_1.0.1        tools_3.5.0     
## [37] magrittr_1.5     lazyeval_0.2.1   crayon_1.3.4     pkgconfig_2.0.2 
## [41] xml2_1.2.0       lubridate_1.7.4  assertthat_0.2.0 rmarkdown_1.10  
## [45] httr_1.3.1       rstudioapi_0.7   R6_2.2.2         nlme_3.1-137    
## [49] compiler_3.5.0