Counting transcripts
6.14 featureCounts
featureCounts
is a popular tool often used within RNA-seq pipelines (and as part of the Rsubread
package in R).
It provides summarization and quantification of mapped read.
aligned reads (BAM) ==> counts matrix (genes as rows, samples as columns)
When used for RNA-seq read counting, featureCounts
calls a ‘hit’ if a read overlaps an exon in the gene by of 1 bp or more.
For paired end reads, typically the ‘fragment’ is only counted if both R1
and R2
reads map to the gene (-B
option).
Counts are summarized at the gene level.
Sometimes a read will map to two or more genomic locations. These ‘multi-mappers’ are ambiguous are not counted, since we cannot be certain of their origin.
Here’s an example of how some reads from an unstranded library might align with a set of genes in the genome. Green reads are those that would be counted - the overlap an exon or UTR with no ambiguity. Red reads are those that would not be counted - the either do no overlap any exon or UTR, or are ambiguous in their assignment.
6.14.1 Strandedness
Sequencing libraries can be ‘stranded’ or ‘unstranded’, depending on the protocol used.
In an ‘unstranded’ library both strands of the cDNA for a transcript are sequenced such that we will have reads for both the coding and non-coding strand, and can’t differentiate between them.
For an ‘unstranded’ library, when two genes overlap on opposite strands, a read in that region cannot be unambigiously assigned.
In a ‘stranded’ library, the reads always come from the coding strand, resolving this ambiguity.
See Figure 6, Griffith et al, 2015 for a nice example of read assignment for strand specific libraries.
6.14.2 Read counting challenge
Challenge: Count reads for gene X
Use this IGV-Web session for sample SRR2155413, find one of the genes below and count the reads that align to regions of the mature transcript(s). (You can also use this IGV desktop app session, but you’ll need the IGV Desktop app installed)
How do your manual counts compare with the value found in the featureCounts
counts matrix ?
Use the search box at the top of the IGV viewer to find one of:
-
GNRH1
(ENSG00000147437
) - this is a short gene with only one intron - Answer: (SRR2155413 = 6 reads) -
GHET1
(ENSG00000281189
) - a lncRNA - Answer: (SRR2155413 = 7 reads) -
INSL3
(ENSG00000248099
) - Answer: (SRR2155413 = 5 reads) - most reads are mapped to the second intron and not counted. The counted reads are in exon3 and the 3’ UTR.
The locus search box will work with gene names and gene IDs as well as chromosomal coordinates.
You can also search for your favorite gene - chances are there will be many mapped reads and many introns, making manual counting impractical and error prone
6.15 References genomes and annotations
A ‘reference’ genome is an polished, high quality genome assembly agreed upon by cooperating international genome sequencing consortia. For model organisms, these typically represent the best source of truth about the nucleotide sequence of a the genome, at the time of release.
The reference genome is provided as a set of nucleotide sequences, usually as (gzip compressed) FASTA format - files with extensions like .fasta.gz
, .fa.gz
or .fna.gz
.
6.15.1 Reference genome identifiers
Assemblies usually have an ‘assembly name’ (eg GRCh38.p14
) and an accession (eg GCA_000001405.29
).
Each database/organization that releases a copy of the reference genome use their own file naming convention,
but on their website they will cross-reference to the corresponding GenBank/RefSeq accession (eg GCA_000001405.29
)
- the accession is the unambigious identifier that should be used in publications.
For human and mouse, there are two (or three) major releases in use:
Human:
-
GRCh37
(akahg19
, released 2009) -
GRCh38
(GCA_000001405
orGCF_000001405
, akahg38
, released 2013) -
T2T-CHM13
(akahs1
, released 2022)
Mouse:
-
GRCm38
(akamm10
, released 2012) -
GRCm39
(akamm39
, released 2020)
eg, The nucleotide sequence of the Ensembl human genome release GRCh38.p14
is equivalent to the NCBI GenBank GCA_000001405.29
release.
The chromosome names (22
vs chr22
) and sometimes the presence of organelle sequences (mitochondria, MT
or chrMT
) can vary between providers.
‘Patch’ releases within a major release GRCh38
, like GRCh38.p14
, don’t change the genomic coordinates within the primary chromosomes, but they can add additional sequences.
These extra sequences may be unplaced contigs (chromosome unknown) or unlocalized contigs (chromosome known, but not exact location),
alternative loci sequences that represent population variation (haplotypes, alternative alleles), common human viruses (eg EBV, Epstein-Barr virus)
Reference genome FASTA files often also come in ‘soft-masked’ (repeat/low complexity regions in lowercase), ‘hard-masked’ (repeat/low complexity regions as N
s) and ‘unmasked’ (all uppercase) flavors.
You should use the ‘soft-masked’ or ‘unmasked’ version for RNA-seq mapping (aligners typically ignore the soft-masking). This allows noisy reads from repeat regions to align there rather than somewhere they shouldn’t.
6.16 Annotations
While the nucleotide sequence of the primary chromosomes for a given release is the same between each database/organization (eg NCBI, Ensembl, ENCODE, UCSF), the annotations can vary (sometimes considerably). This is where each group gets to be creative with their own methodology for curating manual and computational gene annotations.
The structure of Ensembl annotation files tend to be more consistent (uniform use of gene_id, gene_name attributes etc), especially for non-model organisms.
How should I identify the reference genome and annotation in my publication
Use the Genbank or RefSeq database identifier (including .patch
level). Use the
eg. Reads were aligned to the soft-masked human reference genome
GCA_000001405.29
from Ensembl release 112. Gene annotations for quantification were from Ensembl release 112.
eg. We used the human genome assembly
GCF_000001405.40
and corresponding annotations from NCBI for short read alignment and transcript quantification at the gene level.
Salmon and Kallisto
The way featureCounts
assigns reads to transcripts is relatively straightforward; ambiguous data is ignored (eg multi-mappers)
Other more complex methods, such as Salmon and Kallisto have been developed to try and better use all the available data and account for bias, while dealing with this ambiguity.
Salmon uses a Bayesian statistical model that takes into account many parameters associated with a read and transcriptome, to estimate transcript abundance. This includes:
- mapping locations (with probabilistic handling of multiple mapping locations)
- fragment length distribution (for paired end data)
- positional bias (coverage along the transcript)
- sequence-specific bias: Salmon corrects for biases related to the nucleotide sequences surrounding read start sites.
- GC bias
If you’d like to better understand Salmon, take a look at this talk by one of the authors, Carl Kingsford. YouTube, Slides