6 Next steps

6.1 Deepen your understanding

Our number one recommendation is to read the book “R for Data Science” by Garrett Grolemund and Hadley Wickham.

The R Manuals are the place to look if you need a precise definition of how R behaves.

6.2 Expand your vocabulary

Have a look at these cheat sheets to see what is possible with R.

Posit’s collection of cheat sheets covers some important newer packages in R.
An old-school cheat sheet for dinosaurs and people wishing to go deeper.
A Bioconductor cheat sheet for biological data.
The R Graph Gallery for visual inspiration.
The R Graphics Cookbook

6.3 Find packages for specific data types

CRAN contains over 20,000 R packages.
The CRAN task views provide recommendations for specific topics.
Bioconductor is another package repository, specifically for working with high-throughput biological data.

6.4 Some pointers on models and statistics

Statistical tasks such as model fitting, hypothesis testing, confidence interval calculation, and prediction are a large part of R, and one we haven’t demonstrated fully today.

For any standard statistical test, there will usually be an R function to perform it. Examples include t.test from the previous sections, and also wilcox.test, fisher.test, chisq.test, and cor.test. Before applying these functions, you may need to use the methods we’ve learned today to subset and transform your data, or perform some preliminary summarization such as averaging technical replicates. To make sure there are no problems that might invalidate the results from these tests, always visualize your data. If you are performing many tests, adjust for multiple testing with p.adjust.
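As a rough sketch of that workflow, the example below runs one t.test per row of a small simulated data set, looks at the data first, and then adjusts the resulting p-values with p.adjust. The data, the number of "genes", and the choice of the Benjamini-Hochberg method are all assumptions made for illustration; they are not part of the workshop material.

```r
# Simulated data: 20 hypothetical "genes", 6 control and 6 treatment samples each.
# All values here are made up for illustration.
set.seed(1)
n_genes <- 20
control   <- matrix(rnorm(n_genes * 6, mean = 10), nrow = n_genes)
treatment <- matrix(rnorm(n_genes * 6, mean = 10), nrow = n_genes)
treatment[1:3, ] <- treatment[1:3, ] + 3   # only the first three genes truly differ

# Always look at the data before testing (here, just the first gene).
boxplot(list(control = control[1, ], treatment = treatment[1, ]),
        main = "Gene 1")

# One two-sample t-test per gene, keeping only the p-value.
p_values <- sapply(seq_len(n_genes), function(i) {
  t.test(control[i, ], treatment[i, ])$p.value
})

# Adjust for multiple testing. "BH" controls the false discovery rate;
# other methods are listed in ?p.adjust.
p_adjusted <- p.adjust(p_values, method = "BH")

result <- data.frame(gene = seq_len(n_genes),
                     p_value = p_values,
                     p_adjusted = p_adjusted)
head(result[order(result$p_value), ])
```

The adjusted p-values are never smaller than the raw ones, so fewer results pass a given threshold once multiple testing is accounted for.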

Going beyond this, linear models and the linear model formula syntax ~ are core to much of what R has to offer statistically. Many statistical techniques take linear models as their starting point, including limma for differential gene expression, glm for logistic regression and generalized linear models, survival analysis with coxph, and mixed models to characterize variation within populations. I have developed some workshop material on linear models, available here.
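To give a flavour of the formula syntax, here is a minimal sketch using the mtcars data set that ships with R. The choice of data set and of predictors is an assumption made purely for illustration. The response goes on the left of ~ and the predictors on the right.

```r
# mtcars is a built-in data set; it is used here only as a convenient example.

# Linear model: fuel efficiency explained by weight and number of cylinders.
# factor() tells R to treat cyl as a categorical predictor.
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)
summary(fit)    # coefficients, standard errors, p-values
confint(fit)    # confidence intervals for the coefficients

# Compare against a simpler nested model with an F test.
fit0 <- lm(mpg ~ wt, data = mtcars)
anova(fit0, fit)

# Predict for new (hypothetical) cars, with confidence intervals.
new_cars <- data.frame(wt = c(2.5, 3.5), cyl = c(4, 6))
pred <- predict(fit, newdata = new_cars, interval = "confidence")
pred

# The same formula syntax carries over to generalized linear models,
# for example logistic regression on a 0/1 response.
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial)
coef(fit_logit)
```

Functions such as coxph in the survival package and the model-fitting functions in mixed model packages accept formulas written in much the same way.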

“Statistical Models in S” by J.M. Chambers and T.J. Hastie is the primary reference for this, although there are some small differences between R and its predecessor S.
- The emmeans package helps you sensibly interpret the models you fit. Directly interpreting model coefficients is sometimes misleading. This package supplies the interpretation step, a piece missing from the original framework.
“An Introduction to Statistical Learning” by G. James, D. Witten, T. Hastie and R. Tibshirani can be seen as further development of the ideas in “Statistical Models in S”, and is available online. It has more of a machine learning than a statistics flavour to it. (The distinction is fuzzy!)
“Modern Applied Statistics with S” by W.N. Venables and B.D. Ripley is a well-respected reference covering R and S.
“Linear Models with R” and “Extending the Linear Model with R” by J. Faraway cover linear models, with many practical examples.
Machine learning is a whole further world of packages…

The Carpentries run workshops on scientific computing and data science topics worldwide. The style of this present workshop is very much based on theirs. Their material is all available on their website.
Many further resources and tutorials exist online.