Most R packages are available on the CRAN package repository, and installed using install.packages(). Bioconductor is another specialized repository devoted to bioinformatics R packages. Here is how to install packages from Bioconductor:
## Don't run this if you are using our server, the packages are already installed!
## Loads the Bioconductor installer:
# source("https://bioconductor.org/biocLite.R")
#
## Install a basic set of packages
# biocLite()
#
## Install further packages used in this tutorial
# biocLite(c(
# "Biostrings",
# "GenomicRanges",
# "GenomicFeatures",
# "rtracklayer",
# "BSgenome",
# "plyranges"
#))
The packages are loaded using library, as we loaded tidyverse earlier:
library(tidyverse)
library(Biostrings) # provides DNAString, DNAStringSet, etc
library(GenomicRanges) # provides GRanges
library(GenomicFeatures) # provides TxDb
library(rtracklayer) # provides import
library(BSgenome) # provides getSeq
library(plyranges) # allow dplyr functions to work with GRanges
Bioconductor packages usually have useful documentation in the form of “vignettes”. These are readable on the Bioconductor website, or within R:
vignette()
vignette(package="GenomicRanges")
vignette("GenomicRangesHOWTOs")
Package Biostrings offers classes for storing DNA strings, DNAString, amino acid sequences, AAString, or anything else in a BString. These are like character strings, but a variety of biologically meaningful functions can be applied to them.
myseq <- DNAString("ACCATTGATTAT")
myseq
## 12-letter "DNAString" instance
## seq: ACCATTGATTAT
class(myseq)
## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"
reverseComplement(myseq)
## 12-letter "DNAString" instance
## seq: ATAATCAATGGT
translate(myseq)
## 4-letter "AAString" instance
## seq: TIDY
subseq(myseq, 3,5)
## 3-letter "DNAString" instance
## seq: CAT
myseq[3:5]
## 3-letter "DNAString" instance
## seq: CAT
as.character(myseq)
## [1] "ACCATTGATTAT"
You can see a complete set of functions that work with DNAString with:
methods(class="DNAString")
You can see get help on the DNAString class with:
?"DNAString-class"
Reverse complement the following DNA sequence and then translate to an amino acid sequence:
TTCCATTTCCAT
Often we want to work with a list of sequences, such as chromosomes.
myset <- DNAStringSet( list(chrI=myseq, chrII=DNAString("ACGTACGT")) )
myset
## A DNAStringSet instance of length 2
## width seq names
## [1] 12 ACCATTGATTAT chrI
## [2] 8 ACGTACGT chrII
# A DNAStringSet is list-like
myset$chrII
## 8-letter "DNAString" instance
## seq: ACGTACGT
# or myset[["chrII"]]
# or myset[[2]]
We may then wish to refer to regions of these sequences, often with an associated strand. This is done with the GRanges type. GRanges builds on IRanges, “integer ranges”. An IRanges has starts and ends. A GRanges additionally has sequence names and strand information.
A GRanges can contain multiple ranges, but we initially demonstrate it with a single range.
myrange <- GRanges("chrI", IRanges(3,5), "+")
myrange
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrI 3-5 +
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
getSeq(myset, myrange)
## A DNAStringSet instance of length 1
## width seq
## [1] 3 CAT
Note that if the range were on the reverse strand, getSeq would automatically take the reverse complement.
Accessing GRanges data:
seqnames(myrange)
## factor-Rle of length 1 with 1 run
## Lengths: 1
## Values : chrI
## Levels(1): chrI
start(myrange)
## [1] 3
end(myrange)
## [1] 5
strand(myrange)
## factor-Rle of length 1 with 1 run
## Lengths: 1
## Values : +
## Levels(3): + - *
as.data.frame(myrange)
## seqnames start end width strand
## 1 chrI 3 5 3 +
Based on what we saw for DNAString, how can we learn more about using GRanges and IRanges objects?
Files can be loaded using the import function from rtracklater.
seq <- import("BRM5012-files/ecoli.fasta")
gr <- import("BRM5012-files/ecoli.gtf")
DNA sequences are usually given in the FASTA file format. Genome annotations are available in a variety of text formats such as GFF3 and GTF. This GTF file is also from Ensembl, and gives the locations of the genes in the genome, and features within them.
The plyranges package makes tidyverse functions such as filter and arrange usable with GRanges, as well as providing operations that specifically make sense for GRanges such as finding overlapping ranges. (Many of these are also provided by the GenomicRanges package, but with a less “tidy” interface.)
Let’s look at the features associated with the gene “lacY”.
filter(gr, gene_name == "lacY")
## GRanges object with 6 ranges and 19 metadata columns:
## seqnames ranges strand | source type score
## <Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
## [1] Chromosome 363824-365077 - | ena gene <NA>
## [2] Chromosome 363824-365077 - | ena transcript <NA>
## [3] Chromosome 363824-365077 - | ena exon <NA>
## [4] Chromosome 363827-365077 - | ena CDS <NA>
## [5] Chromosome 365075-365077 - | ena start_codon <NA>
## [6] Chromosome 363824-363826 - | ena stop_codon <NA>
## phase gene_id gene_version gene_name gene_source
## <integer> <character> <character> <character> <character>
## [1] <NA> ER3413_351 1 lacY ena
## [2] <NA> ER3413_351 1 lacY ena
## [3] <NA> ER3413_351 1 lacY ena
## [4] 0 ER3413_351 1 lacY ena
## [5] 0 ER3413_351 1 lacY ena
## [6] 0 ER3413_351 1 lacY ena
## gene_biotype transcript_id transcript_version transcript_name
## <character> <character> <character> <character>
## [1] protein_coding <NA> <NA> <NA>
## [2] protein_coding AIZ53215 1 lacY-1
## [3] protein_coding AIZ53215 1 lacY-1
## [4] protein_coding AIZ53215 1 lacY-1
## [5] protein_coding AIZ53215 1 lacY-1
## [6] protein_coding AIZ53215 1 lacY-1
## transcript_source transcript_biotype exon_number exon_id
## <character> <character> <character> <character>
## [1] <NA> <NA> <NA> <NA>
## [2] ena protein_coding <NA> <NA>
## [3] ena protein_coding 1 AIZ53215-1
## [4] ena protein_coding 1 <NA>
## [5] ena protein_coding 1 <NA>
## [6] ena protein_coding 1 <NA>
## exon_version protein_id protein_version
## <character> <character> <character>
## [1] <NA> <NA> <NA>
## [2] <NA> <NA> <NA>
## [3] 1 <NA> <NA>
## [4] <NA> AIZ53215 1
## [5] <NA> <NA> <NA>
## [6] <NA> <NA> <NA>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Look at the different types in the “type” column. Each “gene” may have multiple “transcript” features (isoforms). Each transcript in turn has a set of “exon” features, and if it is a protein coding gene, a set of “CDS” (coding sequence) features. The CDS features cover a subset of the bases covered by the exon features.
--------------------------------------------------> gene
--------------------------------------------> transcript
----------> ---> ----------------> exon
----> ---> ----------> CDS
-----------------------------------> transcript
--------> ----> ----------> exon
---> ----> --> CDS
Bacteria are simple in that each gene has one transcript, and each transcript has one exon and one coding sequence.
Suppose we want to extract all the coding sequences. We will use a method to do this that works even if genes have multiple transcripts and transcripts have multiple exons, as this is necessary when working with eukaryotes.
db <- makeTxDbFromGRanges(gr)
## Warning in .get_cds_IDX(type, phase): The "phase" metadata column contains non-NA values for features of
## type stop_codon. This information was ignored.
cds_list <- cdsBy(db, by="tx", use.names=TRUE)
cds_seqs <- extractTranscriptSeqs(seq, cds_list)
cds_lengths <- lengths(cds_seqs)
translate(cds_seqs)
## A AAStringSet instance of length 4051
## width seq names
## [1] 22 MKRISTTITTTITITTGNGAG* AIZ54182
## [2] 821 MRVLKFGGTSVANAERFLR...AAGVFADLLRTLSWKLGV* AIZ50783
## [3] 311 MVKVYAPASSANMSVGFDV...GFVHICRLDTAGARVLEN* AIZ51761
## [4] 429 MKLYNLKDHNEQVSFAQAV...HNLPADFAALRKLMMNHQ* AIZ52710
## [5] 99 VKKMQSIVLALSLVLVAPM...KKAPHDHHGGHGPGKHHR* AIZ53685
## ... ... ...
## [4047] 215 MARTKLKFRLHRAVIVLFC...LLTASKPVPEEEESEKKE* AIZ54159
## [4048] 556 VAQFVYTMHRVGKVVPPKR...LGADALEPKRIKYKRIAK* AIZ54165
## [4049] 171 MHQVVCATTNPAKIQAILQ...VYHQAVILALSPFHNAVY* AIZ54168
## [4050] 290 MDQAGIIRDLLIWLEGHLD...AGDRPINLRCELLIPIRR* AIZ54169
## [4051] 239 MQTPHILIVEDELVTRNTL...IIATIHGEGYRFCGDLED* AIZ54176
stops <- subseq(cds_seqs, cds_lengths-2, cds_lengths)
table(stops)
## stops
## TAA TAG TGA
## 2621 279 1151
This is a lot to take in. If you wish to use this in earnest you should read over the vignettes and reference documentation of the GenomicRanges, GenomicFeatures and plyranges packages.
table is similar to count that we saw earlier, but operates on vectors directly rather than data frames. If we wanted to do this the tidy way we need to first put things in a data frame. It can be a bit of a chore moving between Bioconductor and Tidyverse!
stops_df <- data_frame(stop=as.character(stops))
count(stops_df, stop)
## # A tibble: 3 x 2
## stop n
## <chr> <int>
## 1 TAA 2621
## 2 TAG 279
## 3 TGA 1151
The start codon is the first three bases of a coding sequence. Get the start codons of each coding sequence. What is the most common start codon?
The GenomicRanges and plyranges packages contain many useful operations on GRanges objects, such as obtaining flanking regions or overlapping or nearby ranges.
Suppose we wanted to look at the sequence immediately upstrand of each transcript. We can get the range for each transcript with:
tx <- transcripts(db) # or filter(gr, type == "transcript")
The flank function can then be used to get regions directly upstrand (the default) or downstrand of each range.
upstrand <- flank(tx, 20)
getSeq(seq, upstrand)
## A DNAStringSet instance of length 4257
## width seq
## [1] 20 TTACAGAGTACACAACATCC
## [2] 20 AAGGTAACGAGGTAACAACC
## [3] 20 ATGGAAGTTAGGAGTCTGAC
## [4] 20 CACGAGTACTGGAAAACTAA
## [5] 20 AATGATAAAAGGAGTAACCT
## ... ... ...
## [4253] 20 AAATTTTCAAAGGGTGGAAG
## [4254] 20 ACATAGAGGCGAAGTCCAAC
## [4255] 20 TTTTCTGAGTAATGCTGATT
## [4256] 20 TGAAAGGATGAGGATATTTT
## [4257] 20 TGGCAATTTAGGTAGCAAAC
By eyeball, there seems to be a fairly common “AGG” motif. De novo motif discovery could be applied to these sequences, either using external software such as meme or a package within Bioconductor such as motifRG.
We mention Seqinfo objects mostly because they are something you may trip over if they contain wrong or inconsistent information.
GRanges and DNAStringSet have a “seqinfo” attribute. This mostly acts as a safety check. One thing this specifies is which chromosomes are circular, and this alters the behaviour of some operations. Bacterial DNA is circular, so we can amend the seqinfo associated with our various objects to reflect this.
seqinfo(seq)
## Seqinfo object with 1 sequence from an unspecified genome:
## seqnames seqlengths isCircular genome
## Chromosome 4558660 NA <NA>
seqinfo(gr)
## Seqinfo object with 1 sequence from an unspecified genome; no seqlengths:
## seqnames seqlengths isCircular genome
## Chromosome NA NA <NA>
si <- seqinfo(seq)
isCircular(si) <- TRUE
si
## Seqinfo object with 1 sequence (1 circular) from an unspecified genome:
## seqnames seqlengths isCircular genome
## Chromosome 4558660 TRUE <NA>
seqinfo(gr) <- si
It’s not uncommon for chromosomes to be given slightly different names in different files, eg with or without a “chr” prefix. The renameSeqlevels function can be used to fix this.
It’s also important to make sure the same genome assembly has been used for your data. For example the current human reference genome is “hg38” but some data still uses the older “hg19” genome. Genomic ranges based on hg19 would refer to the wrong locations within the hg38 assembly, and would need to be “lifted over” to hg38 if this is the genome assembly you are using.
This has been a whirlwind tour of one set of core classes in Bioconductor. The main thing you should take away from this is how to examine Bioconductor objects and where to look for documentation. There are many many Bioconductor packages for working with specific types of data, which we hope will be accessible now that you have a general sense of how Bioconductor packages work.