RNAsik for RNAseq
N.B. This workflow assumes that your organism has a reference genome. If a reference genome isn't available, a different workflow might be required.
The RNAsik pipeline was built in-house for processing RNA-seq(uencing) data. It is written in BigDataScript (bds), a domain-specific language (DSL) that makes writing pipelines easy and makes them robust. To get a bit more technical, bds runs on the Java virtual machine (JVM) and therefore requires Java.
In simple terms, any pipeline is a wrapper around several tools that makes it easier, and arguably faster, to get to the end goal. The three core parts of any RNA-seq analysis are:
- mapping to the reference genome
- counting reads mapped into features, e.g. genes
- doing differential expression (DE) statistics
The pipeline does the first two parts and Degust does the third. Degust itself is, in simple terms, a wrapper around the limma and edgeR R packages. In theory and in practice, one can take the output of the RNAsik pipeline, which is a table of counts where every row is a gene and every column is a sample, and use it with any other R package that does DE analysis.
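To make the shape of that table concrete, here is a minimal Python sketch that reads a small gene-by-sample counts table and rescales it to counts per million (CPM). It is not part of RNAsik or Degust; the column names (`Geneid`, `ctrl_1`, ...) and the counts are made up for illustration, and the real output file may carry extra annotation columns.

```python
import csv
import io

# Hypothetical counts table: tab-separated, one row per gene, one column per
# sample. Column names and values are assumptions for illustration only.
COUNTS_TSV = (
    "Geneid\tctrl_1\tctrl_2\ttreat_1\n"
    "geneA\t10\t20\t30\n"
    "geneB\t90\t180\t270\n"
    "geneC\t0\t0\t700\n"
)


def read_counts(handle):
    """Return (sample_names, {gene_id: [count per sample]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # first column holds the gene identifier
    counts = {row[0]: [int(v) for v in row[1:]] for row in reader if row}
    return samples, counts


def counts_per_million(counts):
    """Scale every sample column so that it sums to one million."""
    # zip(*rows) turns the gene rows into sample columns
    lib_sizes = [sum(col) for col in zip(*counts.values())]
    return {
        gene: [1e6 * c / n if n else 0.0 for c, n in zip(row, lib_sizes)]
        for gene, row in counts.items()
    }


samples, counts = read_counts(io.StringIO(COUNTS_TSV))
cpm = counts_per_million(counts)
for gene, row in cpm.items():
    print(gene, dict(zip(samples, (round(v) for v in row))))
```

This only corrects for library size. It does not apply any composition normalisation, so treat it as a way to inspect the table rather than as a replacement for a proper DE workflow.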
In practice, RNAsik and Degust together provide a complete experience: not only will you get your list of DE genes and QC metrics, but you will also gain full insight into your experimental design and its outcome.
RNAsik does the read alignment and read counting, and it also cleans up and improves your table of counts, which makes Degust analysis one upload away.
RNAsik wraps these tools, making your RNA-seq analysis more streamlined. It also has "sanity checks" built in: it checks command line options, checks whether options point to valid files/directories, and it will talk to you, so don't sweat :) but do read the error messages. Degust is exceptionally good for exploratory data visualisation and analysis. Both tools can also serve as a nice proxy for learning bioinformatics, as they provide the command line and R code for doing the analysis. Last but not least, thanks to MultiQC, RNAsik provides an aggregate of different metrics in one place: the MultiQC report. This is a good place to start understanding your data.
The central questions are:
- Are there differences in library sizes?
- Are there any issues with mapping rates?
- Are there any issues with read assignment rates?
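As a rough illustration of these three checks, the Python sketch below computes mapping and assignment rates from per-sample read totals and flags samples that stand out. Everything in it is an assumption made for the example: the sample names and numbers are invented, the assignment rate is taken here as assigned reads over mapped reads, and the thresholds (`min_rate`, `max_fold`) are arbitrary rather than recommendations. In practice you would read these numbers off the MultiQC report.

```python
from statistics import median

# Hypothetical per-sample read totals (assumed non-zero), made up for illustration.
READ_TOTALS = {
    "ctrl_1": {"total": 20_000_000, "mapped": 18_000_000, "assigned": 14_400_000},
    "ctrl_2": {"total": 22_000_000, "mapped": 19_800_000, "assigned": 15_840_000},
    "treat_1": {"total": 5_000_000, "mapped": 3_000_000, "assigned": 1_500_000},
}


def qc_summary(read_totals, min_rate=0.7, max_fold=2.0):
    """Compute per-sample rates and flag samples that look unusual.

    min_rate and max_fold are arbitrary illustrative thresholds.
    """
    typical = median(s["total"] for s in read_totals.values())
    report = {}
    for name, s in read_totals.items():
        mapping_rate = s["mapped"] / s["total"]
        # Assumption: assignment rate = reads assigned to features / mapped reads
        assignment_rate = s["assigned"] / s["mapped"]
        # Fold difference between this library and the median library size
        fold = max(s["total"], typical) / min(s["total"], typical)
        flags = []
        if fold > max_fold:
            flags.append("library size")
        if mapping_rate < min_rate:
            flags.append("mapping rate")
        if assignment_rate < min_rate:
            flags.append("assignment rate")
        report[name] = {
            "mapping_rate": mapping_rate,
            "assignment_rate": assignment_rate,
            "flags": flags,
        }
    return report


report = qc_summary(READ_TOTALS)
for name, row in report.items():
    print(
        f"{name}: mapped {row['mapping_rate']:.0%}, "
        f"assigned {row['assignment_rate']:.0%}, flags: {row['flags']}"
    )
```

A flagged sample is not necessarily a bad sample; it is simply one worth a closer look before moving on to downstream analysis.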
However, there are many other questions you can ask, including:
- What is the duplication rate?
- What is the multi-mapping rate?
- What are the intragenic and intergenic rates?
As mentioned above, the MultiQC report is a great first step in attempting to answer those questions. A lot of the time everything looks fairly good and consistent, allowing downstream analysis to proceed. Sometimes the user can tweak individual parameters to improve the results; other times it comes down to experimental design and/or library preparation and sequencing issues. Either way, one needs to make this "first iteration" in order to see room for improvement.
How to cite
Tsyganov, Kirill, Andrew James Perry, Stuart Kenneth Archer, and David Powell. 2018. “RNAsik: A Pipeline for Complete and Reproducible RNA-Seq Analysis That Runs Anywhere with Speed and Ease.” Journal of Open Source Software 3: 583.
It is hard to give full acknowledgement to all contributors. The nature of open source projects is such that contributors come and go; however, they leave behind valuable contributions and deserve full credit for them. Please look at the RNAsik GitHub repository to get a full sense of who is contributing. In particular, one can look at the number of commits, issue triaging and handling, and pull requests (PRs). Please also remember that every contribution matters; nothing is too small!
Raw fastq files were analysed with the RNAsik pipeline (Tsyganov et al. 2018) to produce a raw gene count matrix and various quality control metrics. For this analysis the RNAsik pipeline (Tsyganov et al. 2018) was run with the STAR aligner option (Dobin et al. 2013), and reads were quantified with featureCounts (Liao, Smyth, and Shi 2014). The reference GTF and FASTA files were downloaded from the Ensembl database. Raw counts were then analysed with the Degust (Powell 2015) web tool to perform differential expression analysis, producing a list of differentially expressed genes and several quality plots, including classical multidimensional scaling (MDS) and MA plots. In this analysis limma voom (Law et al. 2014) was used for differential expression analysis. Degust (Powell 2015) largely follows the limma voom workflow, with typical counts per million (CPM) library size normalisation and trimmed mean of M values (TMM) normalisation (Robinson and Oshlack 2010) for RNA composition normalisation.
Dobin, Alexander, Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R Gingeras. 2013. “STAR: Ultrafast Universal RNA-seq Aligner.” Bioinformatics 29 (1): 15–21. http://dx.doi.org/10.1093/bioinformatics/bts635.
Law, Charity W, Yunshun Chen, Wei Shi, and Gordon K Smyth. 2014. “Voom: Precision Weights Unlock Linear Model Analysis Tools for RNA-seq Read Counts.” Genome Biol. 15 (2): R29. http://dx.doi.org/10.1186/gb-2014-15-2-r29.
Liao, Yang, Gordon K Smyth, and Wei Shi. 2014. “FeatureCounts: An Efficient General Purpose Program for Assigning Sequence Reads to Genomic Features.” Bioinformatics 30 (7): 923–30. http://dx.doi.org/10.1093/bioinformatics/btt656.
Powell, David. 2015. “Degust: Powerfull and User Friendly Front-End Data Analsysis, Visualisation and Exploratory Tool for Rna-Sequencing.” github. http://degust.erc.monash.edu.
Robinson, Mark D, and Alicia Oshlack. 2010. “A Scaling Normalization Method for Differential Expression Analysis of RNA-seq Data.” Genome Biol. 11 (3): R25. http://dx.doi.org/10.1186/gb-2010-11-3-r25.
Tsyganov, Kirill, Andrew James Perry, Stuart Kenneth Archer, and David Powell. 2018. “RNAsik: A Pipeline for Complete and Reproducible RNA-seq Analysis That Runs Anywhere with Speed and Ease.” Journal of Open Source Software 3: 583. https://www.theoj.org/joss-papers/joss.00583/10.21105.joss.00583.pdf.