Documentation
Quick start
Install
RNAsik has been packaged with conda. The conda package currently lives at this anaconda channel:
conda create --name rnasik-1.5.5 --channel serine --yes rnasik=1.5.5=1
conda activate rnasik-1.5.5
The build number zero (rnasik=1.5.5=0) does not include external tools such as the STAR aligner and subread; its main purpose is to install RNAsik quickly. A full install with all of the dependencies can take up to 30 minutes, and the slowest part appears to be resolving Python dependencies while installing the MultiQC package.
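If you just want RNAsik itself, without the bundled external tools, the build-zero package installs the same way; only the build string at the end changes:
conda create --name rnasik-1.5.5 --channel serine --yes rnasik=1.5.5=0
conda activate rnasik-1.5.5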
Align raw reads
RNAsik -align star \
-fastaRef /path/to/reference.fasta \
-fqDir /path/to/raw-data/directory
Count gene features
RNAsik -counts \
-gtfFile path/to/annotation.gtf
The lot
RNAsik -fqDir /path/to/raw-data/directory \
-align star \
-refFiles /path/to/refDir \
-counts \
-all
Data set for testing
N.B. The RNAsik pipeline is somewhat resource hungry. This isn't RNAsik's fault per se, because it "simply" wraps other tools. The STAR aligner requires a fair amount of RAM and CPUs: for a large genome like mouse it needs around 30 Gb of RAM, and the more CPUs you have the quicker you'll map. I would advise not running the pipeline with fewer than 4 cores, which is the default. This testing data set is yeast and requires about 14 Gb of RAM. Also read this comment; in summary, you might have RNAsik failing on systems that are below the minimum system resources. This is work in progress and should be fixed in the future.
For testing you need a smallish data set as well as a species with a small genome, since genome indexing takes a while for larger genomes, e.g. mouse. I found the study GSE103004, which looks to be open access. If you follow that link you should hit the front GEO page for that study. From there you can find your way to the actual data (SRA files), but I always find that a bit convoluted, so here is a link to the data files.
I've already prepared raw-data (fastq) files for you. I also reduced the number of samples and sub-sampled the reads to speed up your test run. First though, let me explain how to get the full data set.
- download sratoolkit, a set of tools from NCBI that you'll need in order to download SRA files and then extract/convert them to fastq files.
- fastq-dump --gzip --split-files SRR3407195
  This is the command you'll want to run to get one particular SRA file. Note the --split-files option: you need to use it if your data is paired end. If you don't use that flag, you'll end up with a single fastq file that has reads interleaved or truncated/merged in a funny way (I've had issues like that in the past).
However you don't want to run that command by hand several times, so use a loop:
while read s; do fastq-dump --gzip --split-files $s > $s.log 2>&1 & done < SRR_Acc_List.txt
You can download the SRR_Acc_List.txt file from that same page (mentioned above). The list has 9 SRA files corresponding to 9 samples; each sample was paired end, so the total number of fastq files is double that - 18.
Also note that the default marking for R1 and R2 when extracting from SRA is _1 and _2 respectively, so if you are running RNAsik on that full data set you'll need to pass the -pairIds "_1,_2" flag; the default is -pairIds "_R1,_R2".
If you want nicely labelled bam files and counts you can pass -samplesSheet samplesSheet.txt. I haven't implemented URL based samples sheets, so you'll need to download one beforehand from here. I'll include handling of URL based samples sheets in the roadmap, so watch that space!
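Putting those flags together, a run over the full data set would look something like this (all paths are placeholders for wherever you keep the raw data, reference and samples sheet):
RNAsik -fqDir /path/to/full-data-set \
    -align star \
    -fastaRef /path/to/reference.fasta \
    -gtfFile /path/to/annotation.gtf \
    -counts \
    -paired \
    -pairIds "_1,_2" \
    -samplesSheet samplesSheet.txt \
    -all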
If you run RNAsik on the full data set and then use Degust for DGE analysis, you should get these results.
Note that the report was generated using pandoc with a custom template, and the igv_links were created using the mk_igv_links script located in the scripts/ directory.
Try it out
RNAsik -fqDir https://bioinformatics.erc.monash.edu/home/kirill/sikTestData/rawData/fqFiles.txt \
-align star \
-fastaRef ftp://ftp.ensembl.org/pub/release-91/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.fa.gz \
-gtfFile ftp://ftp.ensembl.org/pub/release-91/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.91.gtf.gz \
-counts \
-paired \
-all
Introduction
As mentioned previously in the about section, the very first step in RNA-seq analysis is to map your raw reads (FASTQ) to the reference genome, followed by counting the reads that map onto a feature. But there is always more you could do with your data; in fact, it is almost always only by doing more that you get deeper insight into your biological experiment and the system you are studying. And so RNAsik uses these tools to get as much out of your data as possible in one streamlined run:
- STAR aligner for mapping
- featureCounts from subread package for read counting
- samtools for coverage calculation and general BAM file filtering
- picard tools also for general BAM file filtering
- QualiMap for intragenic and intergenic rates
- FastQC for QC metrics on your fastq files
- MultiQC for wrapping everything into a nice, single-page report
As one can imagine, every one of those tools has a number of options, and by running RNAsik-pipeline you get a predefined, admittedly subjective, run. It comes from years of experience and continuous development and improvement. Users can always pass their own options through the -extraOptions flag for finer tuning.
Alternatively, as hinted above, users can leverage RNAsik to run everything separately, with fine control over each individual run. RNAsik produces an .html report with all command options specified.
Installation
Using conda
download the miniconda .sh installer
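For example, it can be fetched with wget (URL correct at the time of writing; check the Anaconda site if it has moved):
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh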
bash Miniconda3-latest-Linux-x86_64.sh -b -p miniconda
source miniconda/etc/profile.d/conda.sh
wget https://raw.githubusercontent.com/MonashBioinformaticsPlatform/RNAsik-pipe/master/supplementary/conda_envs/1.5.4.yaml
conda env create --name rnasik-1.5.4 --file 1.5.4.yaml
conda activate rnasik-1.5.4
Alternative installation methods
User input
Reference files
Input File | Description |
---|---|
FASTA file | Most often this is your genomic reference sequence. It is a FASTA file holding raw DNA sequences where different features e.g chromosomes are labeled uniquely with a header line starting with '>'. [FASTA Format Description](https://en.wikipedia.org/wiki/FASTA_format) |
GTF/GFF/SAF file | This is your gene annotation file (i.e. coordinates of your genes, exons and other genomic features). It should be linked to and associated with your genomic reference file. SAF (simple annotation format) is the format featureCounts uses, and it is supported by the pipeline |
Raw data
Input File | Description |
---|---|
FASTQ file | These are the raw files provided to you by the sequencing facility; they can be gzipped (.fq, .fastq, .fq.gz, .fastq.gz) |
User input explained
Annotation files
The annotation file is central for differential expression (DE) analysis; without one you won't be able to do it. You could have a very well assembled genome with a very good mapping rate, but unless you know where your genes are on that genome, i.e. the start and end coordinates for your features, e.g. genes, you won't be able to deduce any information about those features and therefore compare between conditions. Below is an example of the bare minimum information you need for feature counting.
GeneID Chr Start End Strand
497097 chr1 3204563 3207049 -
497097 chr1 3411783 3411982 -
497097 chr1 3660633 3661579 -
There are a few organisations that provide genome annotations; some cover more species than others. There are of course also groups that simply provide annotation for one particular species, perhaps for rarer model organisms.
There are also different annotation file formats out there, which makes it a little hard for RNAsik to support all of them. Currently RNAsik can only work with GFF, GTF or SAF file formats. There are many compatibility issues between the formats, but more importantly certain bits of information are only found in some of the files. The example above shows the SAF file format, and as you can see it includes neither human readable gene names nor biotype. GFF also often lacks biotype information, but on the other hand has a product tag holding a short description of the resulting protein product (for protein coding genes at least); GTF lacks that information. Because of all these little nuances it can be hard to capture all of the desirable information.
Most tools in the pipeline prefer GTF, and some can only work with GTF. I guess the main reason for this is that every line is self contained and the format has been fairly predictable/stable.
If for whatever reason you can't get hold of GFF/GTF files and your annotation comes as GenBank (very common for bacterial genomes) or BED files, don't panic: try to parse those files into SAF format. There are plans to include a gb_parse.py script that should help most people with GenBank files.
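As a minimal sketch, a standard 6-column BED file can be reshaped into SAF with a one-liner like the one below. It assumes the BED name column can stand in for GeneID, and remember that BED coordinates are 0-based half-open while SAF is 1-based inclusive, hence the +1 on the start:
awk 'BEGIN{OFS="\t"; print "GeneID","Chr","Start","End","Strand"} {print $4, $1, $2+1, $3, $6}' annotation.bed > annotation.saf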
Irrespective of which distributor your reference comes from and which annotation file you are going to use, it is highly recommended that both files come from the same distributor. The most common distributors are Ensembl, UCSC and NCBI.
Raw data files
Raw data is something you should take good care of. You can regenerate all other data files, but you can't really regenerate your raw data, not unless you have lots of money and time. So be sure to back up your fastq files and never modify your original fastq files. If you want to try something out, make a copy and do whatever you are doing on the copy. Also, there should never be a need to unzip your fastq files. All of your fastq files should be gzipped and have the file extension .fastq.gz or .fq.gz or something similar.
RNAsik will search your -fqDir recursively and find all fastq files. RNAsik can handle nested directories as long as your data is homogeneous, i.e. all data belongs to the same library type, single-end or paired-end. If the data is paired end, RNAsik uses the -pairIds value to figure out read pairs. You can check that all of your fastq files have been found by looking into sikRun/logs/samples/fqFiles.
After obtaining the list of all fastq files, RNAsik tries to be smart and attempts to group fastq files into samples: R1 and R2 reads are grouped, and any fastq files that were split across lanes are grouped as well. After the run you should end up with the same number of bam files as you have samples. Again, you can check the grouping in sikRun/logs/samples/fqMap.
RNAsik fastq grouping works in two modes:
- smart guessing: this is a little involved, but essentially it uses regular expressions to check whether fastq files share a common suffix and therefore belong to the same sample. It heavily relies on clear labelling of R1 and R2 reads for paired-end data.
- a more straightforward mode is to simply use a samples sheet file, which is any text file with two columns separated by a tab character, old_prefix\tnew_prefix. The prefix in this case is your sample name, the unique bit of the file name (see the example below).
The samples sheet in a bit more detail: if you have four samples, two wild-type and two controls, you should have four bam files after the analysis. However, your number of fastq files is rather variable, depending on your sequencing. For paired-end sequencing you are going to end up with 2 fastq files per sample and 8 fastq files all up. If your sequencing was also split across lanes, say two lanes, then you are going to have 4 fastq files per sample and 16 fastq files in total. RNAsik tries to simplify this for you.
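As an illustration, a samplesSheet.txt for those four samples could look like the following, with a literal tab between the two columns. The old prefixes here are made up; use whatever unique bit your fastq file names actually start with:
WT_rep1_S1	wild-type-1
WT_rep2_S2	wild-type-2
CTRL_rep1_S3	control-1
CTRL_rep2_S4	control-2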
RNAsik output
Directories breakdown
Directories | Description |
---|---|
refFiles/ | Contains the reference files (FASTA and GTF) and indices (aligner index) used in the analysis run |
alignerFiles/ | Contains all other additional files from the alignment run, except the bam files, e.g. aligner-specific log files, splice junction files |
bamFiles/ | Contains pre-processed BAM files, i.e sorted and duplicates marked as well as indexed, all using samtools and picard tools. These BAMs can be used in [IGV](http://software.broadinstitute.org/software/igv/) to view read alignments |
countFiles/ | Contains read count files: "raw" counts from `featureCounts`, Degust-ready counts, and counts filtered for protein_coding features only |
coverageFiles/ | Contains bigWig files for every bam (sample) file - from `bedtools genomecov` and the `bedGraphToBigWig` UCSC binary. These can be loaded into IGV |
fastqReport/ | Contains FastQC HTML reports for individual FASTQ files |
qualiMapResults/ | Contains int(ra|er)genic rates for each BAM file. Each BAM has its own directory with metric files. These results are generated using the `QualiMap rnaseq` command |
fastqDir/ | If you pull your FASTQ files over http as a tarball, the tarball will be unarchived here |
multiqc_data/ | Directory created by MultiQC holding parsed text files; the standalone HTML report does not depend on it |
logs/ | Directory that holds self-explanatory subdirectories with log files |
Files breakdown
Files | Description |
---|---|
geneIds.txt | Holds the four additional columns that get added to the read counts files with the "-withNames" postfix: Gene.Id, Chrom, Gene.Name, Biotype |
strandInfo.txt | Contains the strand information guess, based on `featureCounts` `.summary` files |
multiqc_report.html | This is the report file produced by the MultiQC tool, a standalone HTML file that can be viewed in any browser |
RNAsik output explained
I hope the directory and file naming is somewhat self explanatory, but here is a more detailed explanation.
Bam files
The first thing you are most certainly going to get out of the pipeline is your bam files; those will be placed into the bamFiles/ directory. I don't really understand why, but featureCounts works best (fastest) with name sorted BAM files, a.k.a. unsorted. There are really two types of sorting: sorted by coordinate, often preferred because you can index those bam files and then have quick access to random parts of the file, and sorted by name, which ensures that in a paired-end experiment R1 and R2 pairs are interleaved, one after another (but you can't index those). The STAR aligner can output either of those files. I'm however outputting an "unsorted" bam file and then, in a second step, sorting it with the picard SortSam tool. There are a couple of reasons for that:
- other aligners, e.g. bwa, don't sort, and therefore assuming a sorted bam file won't work well
- even though STAR is pretty amazing (honestly), I still prefer one tool for one job, which is also why I don't count reads with STAR
The bam files from bamFiles/ are only used with featureCounts; the picard suite then converts them into sorted, duplicate-marked bam files, which are placed into the mdupsFiles/ directory. The rest of the analysis is based on these bam files. I'm still deciding what to do with the "raw" bam files in the bamFiles/ directory. They should be removed after the run has finished, but if you have to re-run the pipeline to get additional things (which you can - it will resolve all dependencies and only run new tasks) those bams would be gone and would get regenerated, which would trigger the rest of the pipeline to re-run, which is an unwanted result. This is why I'm not auto-removing those bam files, and instead remove them manually once I'm sure.
Count files
Probably the second most important thing in the pipeline is getting read counts: given some genome annotation, count how many of the mapped reads actually ended up mapping onto known annotation. For classical differential expression analysis we are interested in protein coding genes only, which the pipeline attempts to filter for, but there are other biotypes that we can differentially compare.
The pipeline attempts to guess the strand (directionality) of your library. In theory the sequencing provider that made your libraries should be able to tell you that, but sometimes they get it wrong, or that information simply never reaches us (bioinformaticians), hence the guessing.
The pipeline runs featureCounts three times: forcing reads to the forward strand only, forcing them to the reverse strand only, and allowing counting on both strands (non stranded library). featureCounts is very nice and provides a summary table with the number of reads assigned to features. One can simply compare the forward and reverse stranded counts and deduce the strand of the library. In essence the formula (forward - reverse) / (forward + reverse) is used to obtain the ratio. If the ratio is about 0.9 then the library is stranded and the sign indicates the strand type; if however the ratio is about 0.1 then the library is non stranded. Anything else indicates undetermined, and the strandInfo.txt file will default to NonStranded,1 - note the number one after the comma, indicating the status code, meaning exit with error. If you see that in your strandInfo.txt file you'll need to manually inspect your *.summary files from featureCounts and make a decision about which library type to go with. The actual implementation of the strand guessing can be found in the script scripts/strand_guessing.py.
For any given run featureCounts outputs two files by default: counts (e.g. NonStrandedCounts.txt) and a summary (e.g. NonStrandedCounts.txt.summary). RNAsik attempts to "clean up" the counts file, which includes removing and adding certain columns to make the counts files more informative. The columns that are added can be found in geneIds.txt. If for whatever reason your geneIds.txt is empty then all the other files with the -withNames postfix are going to be empty too. You could try to regenerate the geneIds.txt file using the scripts/get_gene_ids.py script and then scripts/mk_counts_file.py to obtain a "clean" table of counts. Doing so isn't strictly needed; however, additional information such as human readable gene names and biotype often comes in very handy in understanding differential expression. Having a biotype in the counts file also allows you to filter for a specific biotype, e.g. protein_coding or snRNA, provided your annotation file has that information.
Command line options
Read alignment
Options | Usage |
---|---|
-align | specify your aligner of choice [star|starWithAnn|hisat|bwa] |
-fqDir | specify path to your raw data directory. `RNAsik` will search that path recursively, so don't worry about nested directories |
-fastaRef | specify path to your reference FASTA file, i.e. the file that holds your reference genome |
-paired | specify if data is paired end (`RNAsik` looks for R1 and R2 in the FASTQ filename, representing Read 1 and Read 2) |
Read counting
Options | Usage |
---|---|
-counts | flag if you'd like to get read counts |
-gtfFile | specify path to your reference annotation file [GTF|GFF|SAF] |
Reads metrics
Options | Usage |
---|---|
-all | This is an aggregate flag, a shorthand to get everything the pipeline has to offer |
-qc | Flag to collect several different QC metrics on your BAM and counts data |
-exonicRate | flag if you'd like to get Int(ra|er)genic rates for your reads, using QualiMap tool |
-multiqc | flag if you'd like to get general report that summarises different log files including `STAR`, `featureCounts`, `FastQC` and `QualiMap` |
Extra options
Options | Usage |
---|---|
-refFiles | path to refFiles/ directory previously generated by RNAsik run |
-mdups | flag to mark duplicates |
-trim | perform FASTQ adapter and quality trimming |
-cov | get coverage plots, bigWig |
-umi | flag to deduplicate using UMI information |
-samplesSheet | specify the name of a tab separated text file with two columns, the first with old prefixes to be replaced by the new prefixes in the second column |
-genomeIdx | specify path to pre-existing alignment index |
-outDir | give a name to your analysis output directory [sikRun] |
-extn | provide your fastq files extension. [".fastq.gz"] |
-pairIds | provide the read pair identifiers, default is [`_R1,_R2`] |
-extraOpts | provide key=value pairs, one per line, with key being tool name and value is a string of options e.g `star="--outWigType bedGraph"` |
-configFile | specify your own config file with key=value pairs, one per line, for all tools |
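To give a feel for -extraOpts: assuming that, like -configFile, it points to a small text file of key=value pairs (one per line), such a file might contain just the example from the table above (the extraOpts.txt file name is arbitrary):
star="--outWigType bedGraph"
and would then be passed to the run with -extraOpts extraOpts.txt.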
Unusual use cases
-bamsDir | specify path to BAMs directory. Use if bams were generated outside of the pipeline |
RNAsik config file
Below is a list of all options supported by RNAsik so far, with their default values.
Users can pass an additional config file through the -configFile option, which takes precedence.
The user config doesn't have to hold all key=value pairs, just the ones the user wants to change.
For example, if I wanted to use a different STAR executable, I would create a new file (the name of the file isn't relevant) and inside I would just put:
starExe = new/path/to/STAR
RNAsik will inherit the rest of the options from the global config.
Two things to note:
- you can't have a comment on the same line as an option
- don't quote values
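So, for instance, a user config that only raises STAR's alignment memory and thread count (keys taken from the defaults listed below; the values are purely illustrative) would contain:
starAlignMem = 60000000000
starAlignCpu = 16
Everything else would be inherited from the global config, and the file would be passed with -configFile as usual.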
Executable paths
starExe = STAR
hisat2Exe = hisat2
bwaExe = bwa
samtoolsExe = samtools
bedtoolsExe = bedtools
countsExe = featureCounts
fastqcExe = fastqc
pythonExe = python
picardExe = picard
jeExe = je
qualimapExe = qualimap
multiqcExe = multiqc
skewerExe = skewer
cpFileExe = cp
Memory settings
starIdxMem = 40000000000
starAlignMem = 40000000000
bwaIdxMem = -1
bwaAlignMem = -1
hisat2IdxMem = -1
hisat2AlignMem = -1
samtoolsSortMem = 3000000000
samtoolsQcMem = -1
picardMarkDupsMem = 6000000000
picardQcMem = 6000000000
jeMem = picardMarkDupsMem
picardQcMem = -1
picardCreateDictMem = 4000000000
countsMem = -1
bedtoolsMem = 4000000000
fastqcMem = -1
multiqcMem = -1
skewerMem = -1
cpFileMem = 4000000000
Threads settings
starIdxCpu = 24
starAlignCpu = 10
bwaIdxCpu = 30
bwaAlignCpu = 10
hisat2IdxCpu = 30
hisat2AlignCpu = 10
samtoolsSortCpu = 4
samtoolsQcCpu = 2
picardMarkDupsCpu = 2
jeCpu = picardMarkDupsCpu
picardQcCpu = 2
picardCreateDictCpu = 2
bedtoolsCpu = 2
countsCpu = 5
fastqcCpu = 8
multiqcCpu = 2
skewerCpu = 5
cpFileCpu = 2
Python scripts
There are a few Python scripts used in the pipeline to do various things, mostly to do with read counts post processing.
Hopefully the script names are self explanatory or give a good hint at what they do. All scripts have a help menu; simply run a script with -h|--help to get more information. All scripts also output to stdout (print to the screen), so in order to save the output you'll need to redirect it to a file with the > symbol.
All scripts are set to be executable, so you should be able to just run them. If for some reason you cannot, just use a python prefix in front of the script name. Here are some examples of how to run the Python scripts:
./get_geneids.py
OR
/full_path/to/get_geneids.py
OR
python get_geneids.py
get_geneids.py
Converts GTF|GFF files into a four-column, tab delimited file: Gene.Id\tChrom\tGene.Name\tBiotype.
Note that Gene.Name and Biotype are subject to availability, i.e. if that information isn't present in your annotation file then the script (obviously) can't parse it out.
We need this information to augment our counts file. This isn't essential for DGE but makes our exploratory analysis easier.
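For illustration, the first couple of lines of a geneIds.txt made from an Ensembl mouse annotation might look something like this (tab separated; the IDs, names and biotypes depend entirely on your annotation file):
Gene.Id	Chrom	Gene.Name	Biotype
ENSMUSG00000051951	1	Xkr4	protein_coding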
mk_cnts_file.py
Once we have our counts files from featureCounts and we've created the geneIds.txt file with the previous command, we can add that extra meta information to our counts file.
Note that this script filters on biotype; the default is [protein_coding]. Also note that if biotype information is absent then your -withNames-proteinCoding.txt file will be empty, since no such biotype was found.
You can use --biotype all to get unfiltered counts for all genes. These are usually located in the -withNames.txt file.
strand_guessing.py
This script takes three different counts files, i.e. Forward, Reverse and Non stranded counts, and attempts to guess which strand the data is. We can get this information from the sequencing facility, and most Illumina library preparation kits are reverse stranded these days, but sometimes things happen, and this approach gives us the answer from the data itself.
The output of this script is three comma separated values: StrandType,StrandValue,ExitCode, e.g. ReverseStrandedCounts,-0.924,0.
The first value is obvious: it tells you the type of strand it guessed. The second value is calculated like this
strnd_val = float(forward - reverse) / float(forward + reverse)
which in essence allows us to estimate the strand type. It is useful to look at that value as well: the magnitude is the confidence and the sign is the direction, negative being reverse and positive being forward stranded.
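For example, if the forward stranded run assigned 50,000 reads and the reverse stranded run assigned 1,200,000 reads, then strnd_val = (50000 - 1200000) / (50000 + 1200000) ≈ -0.92, a confident reverse stranded call (the numbers here are made up, purely to show the arithmetic).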
The third value is what's called an exit code: zero == all good, meaning the script is pretty confident that it guessed right. The other possible value is 1, 1 == not good, meaning the script couldn't guess what the strand was and simply defaulted to NonStrandedCounts. You should never see exit code 1 with anything other than non stranded counts, i.e. NonStrandedCounts,1. If you do, please report this as a bug.
mk_igv_links.py
This script isn't actually used in the pipeline, but it comes in rather handy if you want to visualise your coverage plots, bigWig (.bw) files, or bam (.bam) files.
Case study
Once your RNAsik run has finished, have a look at the strandInfo.txt file
cat sikRun/countFiles/strandInfo.txt
If you are seeing NonStrandedCounts,1 you will need to manually check and figure out what strand your data actually is. You will need to have a look at the .summary files for each of the three counts files.
These are the standard summary output files from featureCounts. For simplicity's sake just look at the first sample's raw counts (second column), and at the "Assigned" row, which is the second row. So we are focusing on the cell in the second column, second row. We want to compare the number in that cell across the three different strand counts.
If the data is stranded then that strand is going to get the dominant number of reads. If the reverse stranded counts get vastly more reads than the forward stranded counts, then the data is reverse stranded. It is possible that the non stranded counts in this case get essentially the same number of reads as the reverse stranded counts; this is okay, the data is still reverse stranded. Even if your non stranded counts have slightly more counts than your most dominant strand, (in this case) reverse stranded, this is still reverse stranded data.
On the other hand, if you are seeing a roughly 50/50 split between forward and reverse reads, then this is non stranded data.
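A quick way to eyeball those numbers, assuming the three .summary files sit next to the counts in sikRun/countFiles/ (adjust the glob to match your run if needed):
grep '^Assigned' sikRun/countFiles/*.summary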
Now that you know what you are looking for, let's say your data actually appears forward stranded and you would like to get the -withNames.txt and -withNames-proteinCoding.txt files. You will need to use the mk_cnts_file.py script:
./mk_cnt_file.py -h
usage: mk_cnts_file.py --counts_dir <path/to/coutns_dir> --gene_ids <path/to/geneIds.txt
This script summarises log files information into html table
optional arguments:
-h, --help show this help message and exit
--counts_file COUNTS_FILE
path to directory with featureCounts files
--gene_ids GENE_IDS path to geneIds.txt file, format EnsmblId Chrm
GeneName Biotype
--samples_sheet SAMPLES_SHEET
path to samplesSheet.txt file, format old_prefix
new_prefix
--biotype BIOTYPE specify biotype of interest [protein_coding], 'all' is
special to include everything
--counts_col COUNTS_COL
Indicate which column begins read counts. default [7],
which is featureCounts default, all columns before 7th
are metadata cols
To get all feature types
./mk_cnt_file.py --counts_file sikRun/countFiles/ForwardStrandedCounts.txt \
--gene_ids sikRun/countFiles/geneIds.txt \
--samples_sheet sikRun/samplesSheet.txt \
--biotype all > ForwardStrandedCounts-withNames.txt
To get just protein_coding features
./mk_cnt_file.py --counts_file sikRun/countFiles/ForwardStrandedCounts.txt \
--gene_ids sikRun/countFiles/geneIds.txt \
--samples_sheet sikRun/samplesSheet.txt \
--biotype protein_coding > ForwardStrandedCounts-withNames-proteinCoding.txt
Metric files explained
A lot of the tools output some sort of metrics, either to stdout/stderr or to a file. Those metrics are important for analysis and QC checks going forward with the data. All of those metrics files are summarised in the MultiQC report. This section attempts to explain each of those metrics in detail.
STAR
Log.final.out
This is straight out of the STAR docs
Log.final.out: summary mapping statistics after mapping job is complete, very useful for quality control. The statistics are calculated for each read (single- or paired-end) and then summed or averaged over all reads. Note that STAR counts a paired-end read as one read, (unlike the samtools flagstat/idxstats, which count each mate separately). Most of the information is collected about the UNIQUE mappers (unlike samtools flagstat/idxstats which does not separate unique or multi-mappers). Each splicing is counted in the numbers of splices, which would correspond to summing the counts in SJ.out.tab. The mismatch/indel error rates are calculated on a per base basis, i.e. as total number of mismatches/indels in all unique mappers divided by the total number of mapped bases.
Picard
These are all picard tools that are used in the pipeline
CreateSequenceDictionary
SortSam
MarkDuplicates
BuildBamIndex
CollectAlignmentSummaryMetrics
CollectInsertSizeMetrics
CollectGcBiasMetrics
EstimateLibraryComplexity