This lesson covers packages primarily by Hadley Wickham for tidying data and then working with it in tidy form, collectively known as the “tidyverse”.
The packages we are using in this lesson are all from CRAN, so we can install them with install.packages. Don’t run this if you are using our biotraining server; the packages are already installed!
# install.packages(c(
# "tidyverse",
# "viridis",
# "broom"
# ))
library(tidyverse) # Load all "tidyverse" libraries.
# OR
# library(readr) # Read tabular data.
# library(tidyr) # Data frame tidying functions.
# library(dplyr) # General data frame manipulation.
# library(ggplot2) # Flexible plotting.
library(viridis) # Viridis color scale.
These packages usually have useful documentation in the form of “vignettes”. These are readable on the CRAN website, or within R:
vignette()
vignette(package="dplyr")
vignette("dplyr", package="dplyr")
Let’s continue our examination of the FastQC output. If you’re starting fresh for this lesson, you can load the necessary data frame with:
bigtab <- read_csv("r-more-files/fastqc.csv")
## Parsed with column specification:
## cols(
## grade = col_character(),
## test = col_character(),
## file = col_character()
## )
We saw ggplot2 in the introductory R day. Recall that we could assign columns of a data frame to aesthetics (x and y position, color, etc.) and then add “geoms” to draw the data.
With ggplot2 we can easily view the whole data set.
ggplot(bigtab, aes(x=file,y=test,color=grade)) +
geom_point()
With categorical data on the x and y axes, a better geom to use is geom_tile.
ggplot(bigtab, aes(x=file,y=test,fill=grade)) +
geom_tile()
ggplot2 offers a very wide variety of ways to adjust a plot. For categorical aesthetics, usually the first step is ensuring the relevant column is a factor with a meaningful level order.
y_order <- sort(unique(bigtab$test), decreasing=TRUE) # y axis plots from bottom to top, so reverse
bigtab$test <- factor(bigtab$test, levels=y_order)
x_order <- unique(bigtab$file)
bigtab$file <- factor(bigtab$file, levels=x_order)
# Only necessary if not continuing from previous lesson on programming!
color_order <- c("FAIL", "WARN", "PASS")
bigtab$grade <- factor(bigtab$grade, levels=color_order)
myplot <- ggplot(bigtab, aes(x=file, y=test, fill=grade)) +
geom_tile(color="black", size=0.5) + # Black border on tiles
scale_fill_manual( # Colors, as color hex codes
values=c("#ee0000","#ffee00","#00aa00")) +
labs(x="", y="", fill="") + # Remove axis labels
coord_fixed() + # Square tiles
theme_minimal() + # Minimal theme, no grey background
theme(panel.grid=element_blank(), # No underlying grid lines
axis.text.x=element_text( # Vertical text on x axis
angle=90,vjust=0.5,hjust=0))
myplot
Plots can be saved with ggsave. Width and height are given in inches, and an image resolution in dots per inch (DPI) should also be given. The width and height determine the relative size of text and graphics, so increasing the resolution is best done by adjusting the DPI. Compare the files produced by:
ggsave("plot1.png", myplot, width=5, height=5, dpi=600)
ggsave("plot2.png", myplot, width=10, height=10, dpi=300)
It may be necessary to edit a plot further in a program such as Inkscape or Adobe Illustrator. To allow this, the image needs to be saved in a “vector” format such as SVG, EPS, or PDF. In this case, the DPI argument isn’t needed.
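For example, a minimal sketch (the file name is arbitrary; saving SVG requires the svglite package to be installed):
ggsave("plot3.pdf", myplot, width=5, height=5) # no dpi argument needed for vector formats
ggsave("plot3.svg", myplot, width=5, height=5)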
dplyr gives us a collection of convenient “verbs” for manipulating data frames. The name is a contraction of “Data frame apPLYeR”, as some of these verbs have a similar flavour to apply/lapply/tapply/etc.
Each verb takes a data frame as input and returns a modified version of it. The idea is that complex operations can be performed by stringing together a series of simpler operations in a pipeline.
# input +--------+ +--------+ +--------+ result
# data %>% | verb | %>% | verb | %>% | verb | -> data
# frame +--------+ +--------+ +--------+ frame
bigtab
## # A tibble: 72 x 3
## grade test file
## <fct> <fct> <fct>
## 1 PASS Basic Statistics Day0.fastq
## 2 PASS Per base sequence quality Day0.fastq
## 3 PASS Per tile sequence quality Day0.fastq
## 4 PASS Per sequence quality scores Day0.fastq
## 5 FAIL Per base sequence content Day0.fastq
## 6 WARN Per sequence GC content Day0.fastq
## 7 PASS Per base N content Day0.fastq
## 8 PASS Sequence Length Distribution Day0.fastq
## 9 PASS Sequence Duplication Levels Day0.fastq
## 10 WARN Overrepresented sequences Day0.fastq
## # ... with 62 more rows
As we used readr to read the data, bigtab is a “tibble”, which is the Tidyverse’s improved data frame. The dplyr verbs also all return their results as tibbles. You can create tibbles explicitly with the tibble function (or data_frame). One convenient feature of tibbles is that only a few rows are shown when you print them. If you do want to see the whole table, you can use as.data.frame, or use the View function to view it in a tab.
as.data.frame(bigtab)
View(bigtab)
The n and width arguments to print can also be used to print more rows or columns respectively.
print(bigtab, n=100, width=1000)
Say we want to know all the tests that failed. filter provides a convenient way to get these.
filter(bigtab, grade == "FAIL")
## # A tibble: 8 x 3
## grade test file
## <fct> <fct> <fct>
## 1 FAIL Per base sequence content Day0.fastq
## 2 FAIL Per base sequence content Day4.fastq
## 3 FAIL Per base sequence content Day7.fastq
## 4 FAIL Per sequence GC content Day7.fastq
## 5 FAIL Per base sequence content Day10.fastq
## 6 FAIL Per base sequence content Day15.fastq
## 7 FAIL Per base sequence content Day20.fastq
## 8 FAIL Per sequence GC content Day20.fastq
Something is magic here: we do not have grade in our environment, only within the data frame. Arguments to filter can use any column of the data frame as a variable. dplyr uses this idea heavily; arguments to dplyr verbs often behave in a special way.
Rather than filtering, we might instead want to sort the data so the most important rows are at the top. arrange sorts a data frame by one or more columns.
arrange(bigtab, grade)
## # A tibble: 72 x 3
## grade test file
## <fct> <fct> <fct>
## 1 FAIL Per base sequence content Day0.fastq
## 2 FAIL Per base sequence content Day4.fastq
## 3 FAIL Per base sequence content Day7.fastq
## 4 FAIL Per sequence GC content Day7.fastq
## 5 FAIL Per base sequence content Day10.fastq
## 6 FAIL Per base sequence content Day15.fastq
## 7 FAIL Per base sequence content Day20.fastq
## 8 FAIL Per sequence GC content Day20.fastq
## 9 WARN Per sequence GC content Day0.fastq
## 10 WARN Overrepresented sequences Day0.fastq
## # ... with 62 more rows
# desc( ) can be used to reverse the sort order
arrange(bigtab, desc(grade))
## # A tibble: 72 x 3
## grade test file
## <fct> <fct> <fct>
## 1 PASS Basic Statistics Day0.fastq
## 2 PASS Per base sequence quality Day0.fastq
## 3 PASS Per tile sequence quality Day0.fastq
## 4 PASS Per sequence quality scores Day0.fastq
## 5 PASS Per base N content Day0.fastq
## 6 PASS Sequence Length Distribution Day0.fastq
## 7 PASS Sequence Duplication Levels Day0.fastq
## 8 PASS Adapter Content Day0.fastq
## 9 PASS Kmer Content Day0.fastq
## 10 PASS Basic Statistics Day4.fastq
## # ... with 62 more rows
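arrange can also sort by several columns at once. For example, sorting by grade and then by test within each grade:
arrange(bigtab, grade, test)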
dplyr’s select function is for subsetting, renaming, and reordering columns. Old columns can be referred to by name or by number.
select(bigtab, test,grade)
## # A tibble: 72 x 2
## test grade
## <fct> <fct>
## 1 Basic Statistics PASS
## 2 Per base sequence quality PASS
## 3 Per tile sequence quality PASS
## 4 Per sequence quality scores PASS
## 5 Per base sequence content FAIL
## 6 Per sequence GC content WARN
## 7 Per base N content PASS
## 8 Sequence Length Distribution PASS
## 9 Sequence Duplication Levels PASS
## 10 Overrepresented sequences WARN
## # ... with 62 more rows
select(bigtab, 2,1)
## # A tibble: 72 x 2
## test grade
## <fct> <fct>
## 1 Basic Statistics PASS
## 2 Per base sequence quality PASS
## 3 Per tile sequence quality PASS
## 4 Per sequence quality scores PASS
## 5 Per base sequence content FAIL
## 6 Per sequence GC content WARN
## 7 Per base N content PASS
## 8 Sequence Length Distribution PASS
## 9 Sequence Duplication Levels PASS
## 10 Overrepresented sequences WARN
## # ... with 62 more rows
select(bigtab, foo=file, bar=test, baz=grade)
## # A tibble: 72 x 3
## foo bar baz
## <fct> <fct> <fct>
## 1 Day0.fastq Basic Statistics PASS
## 2 Day0.fastq Per base sequence quality PASS
## 3 Day0.fastq Per tile sequence quality PASS
## 4 Day0.fastq Per sequence quality scores PASS
## 5 Day0.fastq Per base sequence content FAIL
## 6 Day0.fastq Per sequence GC content WARN
## 7 Day0.fastq Per base N content PASS
## 8 Day0.fastq Sequence Length Distribution PASS
## 9 Day0.fastq Sequence Duplication Levels PASS
## 10 Day0.fastq Overrepresented sequences WARN
## # ... with 62 more rows
You can also remove specific columns like this:
select(bigtab, -file)
## # A tibble: 72 x 2
## grade test
## <fct> <fct>
## 1 PASS Basic Statistics
## 2 PASS Per base sequence quality
## 3 PASS Per tile sequence quality
## 4 PASS Per sequence quality scores
## 5 FAIL Per base sequence content
## 6 WARN Per sequence GC content
## 7 PASS Per base N content
## 8 PASS Sequence Length Distribution
## 9 PASS Sequence Duplication Levels
## 10 WARN Overrepresented sequences
## # ... with 62 more rows
The magic here is that column variables refer to their column number. select subsets columns rather like subsetting vectors with [ ]. So -file becomes -1, and negative numbers remove items. We can also refer to ranges of columns, such as grade:file, which would become 1:3, which is c(1,2,3).
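For example, the following are equivalent to the numeric forms just described:
select(bigtab, grade:file) # a range of columns, the same as select(bigtab, 1:3)
select(bigtab, -(grade:test)) # remove a range of columns, keeping only file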
Say we want to convert PASS/WARN/FAIL into a numeric score, so we can produce some numerical summaries. The scoring scheme will be:
fwp <- c("FAIL","WARN","PASS")
scoring <- tibble(grade=factor(fwp,levels=fwp), score=c(0,0.5,1))
# Or:
# scoring <- data.frame(grade=factor(fwp,levels=fwp), score=c(0,0.5,1))
scoring
## # A tibble: 3 x 2
## grade score
## <fct> <dbl>
## 1 FAIL 0
## 2 WARN 0.5
## 3 PASS 1
In the introduction to R day, we saw the merge function. dplyr has its own version of this in the form of several functions: left_join, right_join, inner_join, full_join, and anti_join.
The difference between these functions is what happens when there is a row in one data frame without a corresponding row in the other data frame. inner_join discards such rows. full_join always keeps them, filling in missing data with NA. left_join always keeps rows from the first data frame. right_join always keeps rows from the second data frame. anti_join is a bit different: it gives you the rows from the first data frame that aren’t in the second data frame.
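As a small illustration of these differences, here is a toy example (data made up purely for demonstration):
a <- tibble(id=c(1,2), a_val=c("x","y"))
b <- tibble(id=c(2,3), b_val=c("p","q"))
inner_join(a, b, by="id") # 1 row: only id 2 appears in both
full_join(a, b, by="id") # 3 rows: ids 1, 2, 3, with NAs filled in
left_join(a, b, by="id") # 2 rows: ids 1 and 2
anti_join(a, b, by="id") # 1 row: id 1, which has no match in b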
Often we use a join to augment a data frame with some extra information. left_join is a good default choice for this as it will never delete rows from the data frame that is being augmented.
scoretab <- left_join(bigtab, scoring, by="grade")
scoretab
## # A tibble: 72 x 4
## grade test file score
## <fct> <fct> <fct> <dbl>
## 1 PASS Basic Statistics Day0.fastq 1
## 2 PASS Per base sequence quality Day0.fastq 1
## 3 PASS Per tile sequence quality Day0.fastq 1
## 4 PASS Per sequence quality scores Day0.fastq 1
## 5 FAIL Per base sequence content Day0.fastq 0
## 6 WARN Per sequence GC content Day0.fastq 0.5
## 7 PASS Per base N content Day0.fastq 1
## 8 PASS Sequence Length Distribution Day0.fastq 1
## 9 PASS Sequence Duplication Levels Day0.fastq 1
## 10 WARN Overrepresented sequences Day0.fastq 0.5
## # ... with 62 more rows
“grade” is here acting as the key, allowing corresponding rows in the data frames to be joined together.
One important thing that all the join functions do: if multiple rows have the same key in either data frame, all ways of combining the two sets of rows will be included in the result. So, here, rows from the scoring data frame have been copied many times in the output.
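A toy demonstration of this duplication, again with made-up data:
many_a <- tibble(key=c("k","k"), a_val=c(1,2))
many_b <- tibble(key=c("k","k"), b_val=c(3,4))
left_join(many_a, many_b, by="key") # 2 x 2 = 4 rows for key "k"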
Joins are the key tool for integrating different types of data, based on some common key such as gene id.
See also: match and merge in base R.
Using the newly created score column:

1. Filter scoretab to get only “WARN” or “FAIL” grades.
2. Sort the result so that “FAIL”s are at the top.
mutate lets us add or overwrite columns by computing a new value for them.
mutate(scoretab, doublescore = score*2)
## # A tibble: 72 x 5
## grade test file score doublescore
## <fct> <fct> <fct> <dbl> <dbl>
## 1 PASS Basic Statistics Day0.fastq 1 2
## 2 PASS Per base sequence quality Day0.fastq 1 2
## 3 PASS Per tile sequence quality Day0.fastq 1 2
## 4 PASS Per sequence quality scores Day0.fastq 1 2
## 5 FAIL Per base sequence content Day0.fastq 0 0
## 6 WARN Per sequence GC content Day0.fastq 0.5 1
## 7 PASS Per base N content Day0.fastq 1 2
## 8 PASS Sequence Length Distribution Day0.fastq 1 2
## 9 PASS Sequence Duplication Levels Day0.fastq 1 2
## 10 WARN Overrepresented sequences Day0.fastq 0.5 1
## # ... with 62 more rows
Even though this is called “mutate”, it is not literally modifying the input. Rather, it produces a copy of the data frame that has the modification.
This is equivalent to:
scoretab2 <- scoretab
scoretab2$doublescore <- scoretab2$score * 2
summarize lets us compute summaries of data.
summarize(scoretab, total=sum(score))
## # A tibble: 1 x 1
## total
## <dbl>
## 1 60
We really want to summarize the data grouped by file, so we can see if there are any particularly bad files. This is achieved using the group_by “adjective”. group_by gives the data frame a grouping using one or more columns, which modifies the subsequent call to summarize.
group_by(scoretab, file)
## # A tibble: 72 x 4
## # Groups: file [6]
## grade test file score
## <fct> <fct> <fct> <dbl>
## 1 PASS Basic Statistics Day0.fastq 1
## 2 PASS Per base sequence quality Day0.fastq 1
## 3 PASS Per tile sequence quality Day0.fastq 1
## 4 PASS Per sequence quality scores Day0.fastq 1
## 5 FAIL Per base sequence content Day0.fastq 0
## 6 WARN Per sequence GC content Day0.fastq 0.5
## 7 PASS Per base N content Day0.fastq 1
## 8 PASS Sequence Length Distribution Day0.fastq 1
## 9 PASS Sequence Duplication Levels Day0.fastq 1
## 10 WARN Overrepresented sequences Day0.fastq 0.5
## # ... with 62 more rows
summarize(group_by(scoretab, file), average_score=mean(score))
## # A tibble: 6 x 2
## file average_score
## <fct> <dbl>
## 1 Day0.fastq 0.833
## 2 Day4.fastq 0.875
## 3 Day7.fastq 0.833
## 4 Day10.fastq 0.833
## 5 Day15.fastq 0.833
## 6 Day20.fastq 0.792
The special function n() can be used within summarize to get the number of rows. (This also works in mutate, but is most useful in summarize.)
summarize(group_by(scoretab, grade), count=n())
## # A tibble: 3 x 2
## grade count
## <fct> <int>
## 1 FAIL 8
## 2 WARN 8
## 3 PASS 56
So summarize lets us do the things we do with base R using table and tapply, but staying tidy by using data frames.
Tip: group_by also affects other verbs, such as mutate. This is more advanced than we will be going into today. Grouping can be removed with ungroup.
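As a sketch of what a grouped mutate looks like (not needed for today's material):
grouped <- group_by(scoretab, file)
with_means <- mutate(grouped, file_mean=mean(score)) # each row gains the mean score of its file
ungroup(with_means) # back to an ordinary ungrouped tibble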
See also these functions for common uses of summarize: count, distinct.
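For example:
count(scoretab, grade) # rows per grade, like group_by plus summarize with n()
distinct(scoretab, grade) # the distinct values of grade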
We often want to string together a series of dplyr functions. This is achieved using dplyr’s pipe operator, %>%. This takes the value on the left and passes it as the first argument to the function call on the right. So the previous summarization could be written:
scoretab %>% group_by(grade) %>% summarize(count=n())
%>% isn’t limited to dplyr functions. It’s an alternative way of writing any R code.
rep(paste("hello", "world"), 5)
## [1] "hello world" "hello world" "hello world" "hello world" "hello world"
"hello" %>% paste("world") %>% rep(5)
## [1] "hello world" "hello world" "hello world" "hello world" "hello world"
The idea of adding geoms in ggplot2 is rather like the dplyr pipe. ggplot2 predates dplyr, and Hadley Wickham has had a progression of ideas. It will probably be possible to use %>% instead of + in some successor to ggplot2 in the not too distant future.
tidyr is the Tidyverse package for getting data frames to tidy. Recall that in a tidy data frame:

- each row is a single unit of observation
- each column is a single piece of information
I’m not convinced tidyr is a great package, but the ideas behind it are very important. dplyr uses magic arguments to produce very beautiful R code, but in tidyr it gets a bit confusing. Anyway, let’s work through a toy example:
untidy <- read_csv(
"country, male-young, male-old, female-young, female-old
Australia, 1, 2, 3, 4
New Zealand, 5, 6, 7, 8")
untidy
## # A tibble: 2 x 5
## country `male-young` `male-old` `female-young` `female-old`
## <chr> <int> <int> <int> <int>
## 1 Australia 1 2 3 4
## 2 New Zealand 5 6 7 8
The first problem is that rows are not distinct units of observation; there are actually four observations per row. This is fixed using gather.
gathered <- gather(untidy, key=group, value=cases, -country)
gathered
## # A tibble: 8 x 3
## country group cases
## <chr> <chr> <int>
## 1 Australia male-young 1
## 2 New Zealand male-young 5
## 3 Australia male-old 2
## 4 New Zealand male-old 6
## 5 Australia female-young 3
## 6 New Zealand female-young 7
## 7 Australia female-old 4
## 8 New Zealand female-old 8
We give a data frame, and names for the key and value columns to be created. The remaining arguments behave like select; here we are saying that the country column shouldn’t be gathered.
In the introductory R day we saw melt from the reshape2 package, which did a similar thing to matrices.
spread is the opposite operation: spreading a column into many columns. This generally makes data less tidy, but is sometimes necessary. A common application is producing a scatter plot, where the x and y axes need to be two different columns even though they measure the same type of thing. Data may also be easier to look at in a table in spread form.
spread(bigtab, key=file, value=grade)
## # A tibble: 12 x 7
## test Day0.fastq Day4.fastq Day7.fastq Day10.fastq Day15.fastq
## <fct> <fct> <fct> <fct> <fct> <fct>
## 1 Sequ… PASS PASS PASS PASS PASS
## 2 Sequ… PASS PASS PASS PASS PASS
## 3 Per … PASS PASS PASS PASS PASS
## 4 Per … PASS PASS PASS PASS PASS
## 5 Per … WARN WARN FAIL WARN WARN
## 6 Per … PASS PASS PASS PASS PASS
## 7 Per … FAIL FAIL FAIL FAIL FAIL
## 8 Per … PASS PASS PASS PASS PASS
## 9 Over… WARN PASS PASS WARN WARN
## 10 Kmer… PASS PASS PASS PASS PASS
## 11 Basi… PASS PASS PASS PASS PASS
## 12 Adap… PASS PASS PASS PASS PASS
## # ... with 1 more variable: Day20.fastq <fct>
In our toy example, we have a further problem: the “group” column contains two pieces of data. This can be fixed with separate. By default separate splits on any non-alphanumeric characters, but different separator characters can be specified.
separate(gathered, col=group, into=c("gender","age"))
## # A tibble: 8 x 4
## country gender age cases
## <chr> <chr> <chr> <int>
## 1 Australia male young 1
## 2 New Zealand male young 5
## 3 Australia male old 2
## 4 New Zealand male old 6
## 5 Australia female young 3
## 6 New Zealand female young 7
## 7 Australia female old 4
## 8 New Zealand female old 8
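If the separator did need to be given explicitly, it could be specified with the sep argument. This produces the same result as above:
separate(gathered, col=group, into=c("gender","age"), sep="-")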
All of this would typically be written as a single pipeline:
tidied <- untidy %>%
gather(key=group, value=cases, -country) %>%
separate(group, into=c("gender","age"))
Advanced: tidyr has a number of other useful data frame manipulations. For completeness, we mention nest. This creates a list column in a tibble, with the list column containing one tibble per row. (Yo, Hadley put tibbles in your tibble so you can dplyr while you dplyr.) The inverse operation is unnest. unnest is rather like the bind_rows function we saw earlier. nest and unnest go well with creating or modifying list columns in a mutate with lapply (or the purrr package).
# Advanced
nested <- nest(gathered, -country)
nested
nested$data
unnest(nested)
You receive data on a set of points. The points are in two dimensions (dim), and each point has x and y coordinates. Unfortunately it looks like this:
df <- read_csv(
"dim, A_1, A_2, B_1, B_2, B_3, B_4, B_5
x, 2, 4, 1, 2, 3, 4, 5
y, 4, 4, 2, 1, 1, 1, 2")
1. Tidy the data by gathering all of the columns except dim. What does each row now represent?
2. We want to plot the points as a scatter plot, using either plot or ggplot. Spread the gathered data so that this is possible. Now what do the rows represent?
3. What other tidying operation could be applied to this data?
We will now put this all together by looking at a small-scale gene expression data set. To simplify somewhat: the data was generated from a multiplex PCR targeting 44 genes. The PCR product is quantified by counting reads from high-throughput sequencing. The experiment is of yeast released synchronously into cell cycling, so that we can see which genes are active at different points in the cycle. Several strains of yeast have been used: a wildtype and several gene knock-downs. Each has nine time points, covering two cell cycles (two hours).
This is much smaller than a typical RNA-Seq dataset. We are only looking at 44 genes, rather than all of the thousands of genes in the yeast genome.
Recall the three activities involved in data analysis. This dataset requires all of them.
Hint: You can select from the top of a pipeline to partway through and press Ctrl-Enter or Command-Enter to see the value part-way through the pipeline.
# Use readr's read_csv function to load the file
counts_untidy <- read_csv("r-more-files/read-counts.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## Feature = col_character()
## )
## See spec(...) for full column specifications.
counts <- counts_untidy %>%
gather(key=sample, value=count, -Feature, factor_key=TRUE) %>%
separate(sample, sep=":", into=c("strain","time"), convert=TRUE, remove=FALSE) %>%
mutate(
strain = factor(strain, levels=c("WT","SET1","RRP6","SET1-RRP6"))
) %>%
filter(time >= 2) %>%
select(sample, strain, time, gene=Feature, count)
summary(counts)
## sample strain time gene
## WT:2 : 45 WT :405 Min. : 2 Length:1620
## WT:3 : 45 SET1 :405 1st Qu.: 4 Class :character
## WT:4 : 45 RRP6 :405 Median : 6 Mode :character
## WT:5 : 45 SET1-RRP6:405 Mean : 6
## WT:6 : 45 3rd Qu.: 8
## WT:7 : 45 Max. :10
## (Other):1350
## count
## Min. : 0
## 1st Qu.: 172
## Median : 666
## Mean : 3612
## 3rd Qu.: 2548
## Max. :143592
##
Time point 1 was omitted because it isn’t actually part of the time series; it’s from cells in an unsynchronized state. If we wanted to be even tidier, the remaining time points could be recoded with the actual time within the experiment – they are at 15 minute intervals.
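As a sketch, that recoding might look like this (assuming time point 2 corresponds to minute 0 of the experiment):
counts_timed <- mutate(counts, minutes=(time-2)*15)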
ggplot(counts, aes(x=sample, y=count)) +
geom_boxplot() +
coord_flip()
We immediately see from a Tukey box plot that the data is far from normal – there are outliers many box-widths to the right of the box. Perhaps a log transformation is in order.
ggplot(counts, aes(x=sample, y=log2(count))) +
geom_boxplot() +
coord_flip()
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).
ggplot(counts, aes(x=log2(count), group=sample)) +
geom_line(stat="density")
## Warning: Removed 14 rows containing non-finite values (stat_density).
Log transformation has greatly improved things, although now there are more outliers left of the whiskers, and some zeros we will need to be careful of. We also see one of the samples has fewer reads than the others.
The gene SRP68 was included as a control housekeeping gene. Its expression level should be the same in all samples. We will divide counts in each sample by the count for SRP68 to correct for library size, then log-transform the data. We add a small constant when log-transforming so that zeros do not become negative infinity.
normalizer <- counts %>%
filter(gene == "SRP68") %>%
select(sample, norm=count)
moderation <- 0.5/mean(normalizer$norm)
counts_norm <- counts %>%
left_join(normalizer, by="sample") %>%
mutate(
norm_count = count/norm,
log_norm_count = log2(norm_count+moderation)
)
ggplot(counts_norm, aes(x=sample, y=log_norm_count)) +
geom_boxplot() +
coord_flip()
Note: Here all of the genes are potentially differentially expressed. For a full RNA-Seq data set we would probably assume most genes aren’t differentially expressed, so rather than nominating a normalizing gene we would normalize by total library size, typically with a slight adjustment using the “TMM” method as calculated by calcNormFactors in edgeR:
# For a full sized RNA-Seq dataset:
library(edgeR)
mat <- counts_untidy %>% select(-Feature) %>% as.matrix
adjusted_lib_sizes <- colSums(mat) * calcNormFactors(mat)
normalizer_by_tmm <- tibble(sample=names(adjusted_lib_sizes), norm=adjusted_lib_sizes)
The data is now suitably transformed and normalized for visualization. ggplot2
gives us a lot of freedom to juxtapose information from different columns.
1. Get all the rows in counts_norm relating to the histone gene “HHT1”.
2. Plot this data with ggplot2. Use time as the x-axis, log_norm_count as the y-axis, and color the data by strain. Try using the geoms geom_point() and geom_line().

ggplot( ... , aes(x= ... , y= ... , color= ... )) + ...

Extensions:

- Compare plots of log_norm_count, norm_count, and count.
- Experiment with other geoms and other ways to assign columns to aesthetics.
ggplot(counts_norm, aes(x=sample, y=gene, fill=log_norm_count)) +
geom_tile() +
scale_fill_viridis() +
theme_minimal() +
theme(axis.text.x=element_text( # Vertical text on x axis
angle=90,vjust=0.5,hjust=1))
ggplot(counts_norm, aes(x=time, y=gene, fill=log_norm_count)) +
geom_tile() +
facet_grid(~ strain) +
scale_fill_viridis() +
theme_minimal()
ggplot(counts_norm, aes(x=time, y=strain, fill=log_norm_count)) +
geom_tile() +
facet_wrap(~ gene) +
scale_fill_viridis() +
theme_minimal()
ggplot(counts_norm, aes(x=time, y=log_norm_count, color=strain, group=strain)) +
geom_line() +
facet_wrap(~ gene, scale="free")
Find the genes whose expression varies most over the experiment, in terms of the standard deviation of log_norm_count.

Hint: intToUtf8(utf8ToInt("xvh#jurxsbe|/#vxppdul}h/#vg/#dqg#duudqjh")-3)

The idea of this section is to give a taste of what is possible, so that you know what topics to read up on or ask an expert for help with if this is the sort of analysis you need. It will also provide a little context for advanced forms of RNA-Seq differential expression analysis.
sut476 <- counts_norm %>% filter(gene=="SUT476")
sut476_wt <- sut476 %>% filter(strain=="WT")
ggplot(sut476_wt,aes(x=time,y=log_norm_count)) +
geom_point() +
geom_smooth(method="lm")
ggplot2 includes a handy geom for fitting a curve or line to data, but what exactly is it doing? What if we want to work with that fitted curve ourselves? ggplot2 can use various curve fitting methods, the simplest of which is a linear model with “lm”.
model <- lm(log_norm_count ~ time, data=sut476_wt)
model
##
## Call:
## lm(formula = log_norm_count ~ time, data = sut476_wt)
##
## Coefficients:
## (Intercept) time
## -5.21805 0.08813
# Note: The p-values listed by summary aren't always meaningful.
# It makes no sense to remove the intercept term but retain the time term.
# Significance testing can better be done with anova() and the multcomp package.
summary(model)
##
## Call:
## lm(formula = log_norm_count ~ time, data = sut476_wt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9458 -0.4717 -0.1152 0.2420 1.5235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.21805 0.66749 -7.817 0.000106 ***
## time 0.08813 0.10219 0.862 0.417017
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7915 on 7 degrees of freedom
## Multiple R-squared: 0.09605, Adjusted R-squared: -0.03309
## F-statistic: 0.7438 on 1 and 7 DF, p-value: 0.417
model.matrix(log_norm_count ~ time, data=sut476_wt)
## (Intercept) time
## 1 1 2
## 2 1 3
## 3 1 4
## 4 1 5
## 5 1 6
## 6 1 7
## 7 1 8
## 8 1 9
## 9 1 10
## attr(,"assign")
## [1] 0 1
A model can be tested against a simpler null model using anova. Is the fit sufficiently better to justify the extra complexity?
null_model <- lm(log_norm_count ~ 1, data=sut476_wt)
anova(null_model, model)
## Analysis of Variance Table
##
## Model 1: log_norm_count ~ 1
## Model 2: log_norm_count ~ time
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8 4.8518
## 2 7 4.3858 1 0.46601 0.7438 0.417
A linear trend is not supported by this data, in comparison to the null hypothesis that expression is constant. Also useful for hypothesis testing, the multcomp package can be used to test comparisons of interest, such as a set of pairwise comparisons. This is similar to the testing of contrasts in packages for differential expression analysis such as limma, edgeR, or DESeq2.
augment from the broom package gives various diagnostic statistics.
library(broom)
augment(model, sut476_wt) %>% head
## # A tibble: 6 x 15
## sample strain time gene count norm norm_count log_norm_count .fitted
## <fct> <fct> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 WT:2 WT 2 SUT4… 8 92 0.0870 -3.52 -5.04
## 2 WT:3 WT 3 SUT4… 10 503 0.0199 -5.63 -4.95
## 3 WT:4 WT 4 SUT4… 28 989 0.0283 -5.13 -4.87
## 4 WT:5 WT 5 SUT4… 23 1236 0.0186 -5.72 -4.78
## 5 WT:6 WT 6 SUT4… 31 1122 0.0276 -5.16 -4.69
## 6 WT:7 WT 7 SUT4… 47 1054 0.0446 -4.48 -4.60
## # ... with 6 more variables: .se.fit <dbl>, .resid <dbl>, .hat <dbl>,
## # .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
# augment() can also be used to produce predictions for new inputs
augment(model, newdata=sut476_wt[1:5,])
## # A tibble: 5 x 10
## sample strain time gene count norm norm_count log_norm_count .fitted
## * <fct> <fct> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 WT:2 WT 2 SUT4… 8 92 0.0870 -3.52 -5.04
## 2 WT:3 WT 3 SUT4… 10 503 0.0199 -5.63 -4.95
## 3 WT:4 WT 4 SUT4… 28 989 0.0283 -5.13 -4.87
## 4 WT:5 WT 5 SUT4… 23 1236 0.0186 -5.72 -4.78
## 5 WT:6 WT 6 SUT4… 31 1122 0.0276 -5.16 -4.69
## # ... with 1 more variable: .se.fit <dbl>
# Let's look at the model fit
augment(model, sut476_wt) %>%
ggplot(aes(x=time)) +
geom_point(aes(y=log_norm_count)) +
geom_line(aes(y=.fitted))
We see the initial timepoint in the wildtype strain is wildly off the linear trend. Possibly early time points need to be removed, or (more likely!) a linear trend is not a good model of this data.
Linear models can be much more complicated than this. They can include multiple explanatory variables, including categorical variables, and interactions between them, and even polynomials.
model2 <- lm(log_norm_count ~ strain*poly(time,3), data=sut476)
model2
##
## Call:
## lm(formula = log_norm_count ~ strain * poly(time, 3), data = sut476)
##
## Coefficients:
## (Intercept) strainSET1
## -4.6893 1.4081
## strainRRP6 strainSET1-RRP6
## 4.2133 3.5053
## poly(time, 3)1 poly(time, 3)2
## 1.3653 2.8553
## poly(time, 3)3 strainSET1:poly(time, 3)1
## -2.4314 2.0542
## strainRRP6:poly(time, 3)1 strainSET1-RRP6:poly(time, 3)1
## -2.0608 -0.5208
## strainSET1:poly(time, 3)2 strainRRP6:poly(time, 3)2
## -3.9125 -2.9297
## strainSET1-RRP6:poly(time, 3)2 strainSET1:poly(time, 3)3
## -2.1218 1.9517
## strainRRP6:poly(time, 3)3 strainSET1-RRP6:poly(time, 3)3
## 2.1368 2.2349
augment(model2, sut476) %>%
ggplot(aes(x=time, color=strain, group=strain)) +
geom_point(aes(y=log_norm_count), alpha=0.5) +
geom_line(aes(y=.fitted))
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] broom_0.5.0 bindrcpp_0.2.2 viridis_0.5.1
## [4] viridisLite_0.3.0 forcats_0.3.0 stringr_1.3.1
## [7] dplyr_0.7.6 purrr_0.2.5 readr_1.1.1
## [10] tidyr_0.8.1 tibble_1.4.2 ggplot2_3.0.0
## [13] tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.4 reshape2_1.4.3 haven_1.1.2 lattice_0.20-35
## [5] colorspace_1.3-2 htmltools_0.3.6 yaml_2.2.0 utf8_1.1.4
## [9] rlang_0.2.2 pillar_1.3.0 glue_1.3.0 withr_2.1.2
## [13] modelr_0.1.2 readxl_1.1.0 bindr_0.1.1 plyr_1.8.4
## [17] munsell_0.5.0 gtable_0.2.0 cellranger_1.1.0 rvest_0.3.2
## [21] evaluate_0.11 labeling_0.3 knitr_1.20 fansi_0.3.0
## [25] Rcpp_0.12.18 scales_1.0.0 backports_1.1.2 jsonlite_1.5
## [29] gridExtra_2.3 hms_0.4.2 digest_0.6.17 stringi_1.2.4
## [33] grid_3.5.0 rprojroot_1.3-2 cli_1.0.1 tools_3.5.0
## [37] magrittr_1.5 lazyeval_0.2.1 crayon_1.3.4 pkgconfig_2.0.2
## [41] xml2_1.2.0 lubridate_1.7.4 assertthat_0.2.0 rmarkdown_1.10
## [45] httr_1.3.1 rstudioapi_0.7 R6_2.2.2 nlme_3.1-137
## [49] compiler_3.5.0