Installing and loading Bioconctor packages

Most R packages are available on the CRAN package repository, and installed using install.packages(). Bioconductor is another specialized repository devoted to bioinformatics R packages. Here is how to install packages from Bioconductor:

## Don't run this if you are using our server, the packages are already installed!
## Loads the Bioconductor installer:
# source("https://bioconductor.org/biocLite.R")
#
## Install a basic set of packages
# biocLite()
#
## Install further packages used in this tutorial
# biocLite(c(
#    "Biostrings",
#    "GenomicRanges",
#    "GenomicFeatures",
#    "rtracklayer",
#    "BSgenome", 
#    "plyranges"
#))

The packages are loaded using library, as we loaded tidyverse earlier:

library(tidyverse)

library(Biostrings)      # provides DNAString, DNAStringSet, etc
library(GenomicRanges)   # provides GRanges
library(GenomicFeatures) # provides TxDb
library(rtracklayer)     # provides import
library(BSgenome)        # provides getSeq
library(plyranges)       # allow dplyr functions to work with GRanges

Bioconductor packages usually have useful documentation in the form of “vignettes”. These are readable on the Bioconductor website, or within R:

vignette()
vignette(package="GenomicRanges")
vignette("GenomicRangesHOWTOs")

DNAString

Package Biostrings offers classes for storing DNA strings, DNAString, amino acid sequences, AAString, or anything else in a BString. These are like character strings, but a variety of biologically meaningful functions can be applied to them.

myseq <- DNAString("ACCATTGATTAT")
myseq
##   12-letter "DNAString" instance
## seq: ACCATTGATTAT
class(myseq)
## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"
reverseComplement(myseq)
##   12-letter "DNAString" instance
## seq: ATAATCAATGGT
translate(myseq)
##   4-letter "AAString" instance
## seq: TIDY
subseq(myseq, 3,5)
##   3-letter "DNAString" instance
## seq: CAT
myseq[3:5]
##   3-letter "DNAString" instance
## seq: CAT
as.character(myseq)
## [1] "ACCATTGATTAT"

You can see a complete set of functions that work with DNAString with:

methods(class="DNAString")

You can see get help on the DNAString class with:

?"DNAString-class"

Challenge

Reverse complement the following DNA sequence and then translate to an amino acid sequence:

TTCCATTTCCAT

DNAStringSet

Often we want to work with a list of sequences, such as chromosomes.

myset <- DNAStringSet( list(chrI=myseq, chrII=DNAString("ACGTACGT")) )
myset
##   A DNAStringSet instance of length 2
##     width seq                                          names               
## [1]    12 ACCATTGATTAT                                 chrI
## [2]     8 ACGTACGT                                     chrII
# A DNAStringSet is list-like
myset$chrII
##   8-letter "DNAString" instance
## seq: ACGTACGT
# or myset[["chrII"]]
# or myset[[2]]

GRanges

We may then wish to refer to regions of these sequences, often with an associated strand. This is done with the GRanges type. GRanges builds on IRanges, “integer ranges”. An IRanges has starts and ends. A GRanges additionally has sequence names and strand information.

A GRanges can contain multiple ranges, but we initially demonstrate it with a single range.

myrange <- GRanges("chrI", IRanges(3,5), "+")
myrange
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chrI       3-5      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
getSeq(myset, myrange)
##   A DNAStringSet instance of length 1
##     width seq
## [1]     3 CAT

Note that if the range were on the reverse strand, getSeq would automatically take the reverse complement.

Accessing GRanges data:

seqnames(myrange)
## factor-Rle of length 1 with 1 run
##   Lengths:    1
##   Values : chrI
## Levels(1): chrI
start(myrange)
## [1] 3
end(myrange)
## [1] 5
strand(myrange)
## factor-Rle of length 1 with 1 run
##   Lengths: 1
##   Values : +
## Levels(3): + - *
as.data.frame(myrange)
##   seqnames start end width strand
## 1     chrI     3   5     3      +

Question

Based on what we saw for DNAString, how can we learn more about using GRanges and IRanges objects?

Loading files

Files can be loaded using the import function from rtracklater.

seq <- import("BRM5012-files/ecoli.fasta")
gr <- import("BRM5012-files/ecoli.gtf")

DNA sequences are usually given in the FASTA file format. Genome annotations are available in a variety of text formats such as GFF3 and GTF. This GTF file is also from Ensembl, and gives the locations of the genes in the genome, and features within them.

Gene annotation: gene, transcript, exon, and CDS ranges

The plyranges package makes tidyverse functions such as filter and arrange usable with GRanges, as well as providing operations that specifically make sense for GRanges such as finding overlapping ranges. (Many of these are also provided by the GenomicRanges package, but with a less “tidy” interface.)

Let’s look at the features associated with the gene “lacY”.

filter(gr, gene_name == "lacY")
## GRanges object with 6 ranges and 19 metadata columns:
##         seqnames        ranges strand |   source        type     score
##            <Rle>     <IRanges>  <Rle> | <factor>    <factor> <numeric>
##   [1] Chromosome 363824-365077      - |      ena        gene      <NA>
##   [2] Chromosome 363824-365077      - |      ena  transcript      <NA>
##   [3] Chromosome 363824-365077      - |      ena        exon      <NA>
##   [4] Chromosome 363827-365077      - |      ena         CDS      <NA>
##   [5] Chromosome 365075-365077      - |      ena start_codon      <NA>
##   [6] Chromosome 363824-363826      - |      ena  stop_codon      <NA>
##           phase     gene_id gene_version   gene_name gene_source
##       <integer> <character>  <character> <character> <character>
##   [1]      <NA>  ER3413_351            1        lacY         ena
##   [2]      <NA>  ER3413_351            1        lacY         ena
##   [3]      <NA>  ER3413_351            1        lacY         ena
##   [4]         0  ER3413_351            1        lacY         ena
##   [5]         0  ER3413_351            1        lacY         ena
##   [6]         0  ER3413_351            1        lacY         ena
##         gene_biotype transcript_id transcript_version transcript_name
##          <character>   <character>        <character>     <character>
##   [1] protein_coding          <NA>               <NA>            <NA>
##   [2] protein_coding      AIZ53215                  1          lacY-1
##   [3] protein_coding      AIZ53215                  1          lacY-1
##   [4] protein_coding      AIZ53215                  1          lacY-1
##   [5] protein_coding      AIZ53215                  1          lacY-1
##   [6] protein_coding      AIZ53215                  1          lacY-1
##       transcript_source transcript_biotype exon_number     exon_id
##             <character>        <character> <character> <character>
##   [1]              <NA>               <NA>        <NA>        <NA>
##   [2]               ena     protein_coding        <NA>        <NA>
##   [3]               ena     protein_coding           1  AIZ53215-1
##   [4]               ena     protein_coding           1        <NA>
##   [5]               ena     protein_coding           1        <NA>
##   [6]               ena     protein_coding           1        <NA>
##       exon_version  protein_id protein_version
##        <character> <character>     <character>
##   [1]         <NA>        <NA>            <NA>
##   [2]         <NA>        <NA>            <NA>
##   [3]            1        <NA>            <NA>
##   [4]         <NA>    AIZ53215               1
##   [5]         <NA>        <NA>            <NA>
##   [6]         <NA>        <NA>            <NA>
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Look at the different types in the “type” column. Each “gene” may have multiple “transcript” features (isoforms). Each transcript in turn has a set of “exon” features, and if it is a protein coding gene, a set of “CDS” (coding sequence) features. The CDS features cover a subset of the bases covered by the exon features.

--------------------------------------------------> gene

-------------------------------------------->       transcript
---------->         --->    ---------------->       exon
      ---->         --->    ---------->             CDS


               -----------------------------------> transcript
               -------->       ---->    ----------> exon
                    --->       ---->    -->         CDS

Bacteria are simple in that each gene has one transcript, and each transcript has one exon and one coding sequence.

Suppose we want to extract all the coding sequences. We will use a method to do this that works even if genes have multiple transcripts and transcripts have multiple exons, as this is necessary when working with eukaryotes.

db <- makeTxDbFromGRanges(gr)
## Warning in .get_cds_IDX(type, phase): The "phase" metadata column contains non-NA values for features of
##   type stop_codon. This information was ignored.
cds_list <- cdsBy(db, by="tx", use.names=TRUE)
cds_seqs <- extractTranscriptSeqs(seq, cds_list)
cds_lengths <- lengths(cds_seqs)

translate(cds_seqs)
##   A AAStringSet instance of length 4051
##        width seq                                       names               
##    [1]    22 MKRISTTITTTITITTGNGAG*                    AIZ54182
##    [2]   821 MRVLKFGGTSVANAERFLR...AAGVFADLLRTLSWKLGV* AIZ50783
##    [3]   311 MVKVYAPASSANMSVGFDV...GFVHICRLDTAGARVLEN* AIZ51761
##    [4]   429 MKLYNLKDHNEQVSFAQAV...HNLPADFAALRKLMMNHQ* AIZ52710
##    [5]    99 VKKMQSIVLALSLVLVAPM...KKAPHDHHGGHGPGKHHR* AIZ53685
##    ...   ... ...
## [4047]   215 MARTKLKFRLHRAVIVLFC...LLTASKPVPEEEESEKKE* AIZ54159
## [4048]   556 VAQFVYTMHRVGKVVPPKR...LGADALEPKRIKYKRIAK* AIZ54165
## [4049]   171 MHQVVCATTNPAKIQAILQ...VYHQAVILALSPFHNAVY* AIZ54168
## [4050]   290 MDQAGIIRDLLIWLEGHLD...AGDRPINLRCELLIPIRR* AIZ54169
## [4051]   239 MQTPHILIVEDELVTRNTL...IIATIHGEGYRFCGDLED* AIZ54176
stops <- subseq(cds_seqs, cds_lengths-2, cds_lengths)
table(stops)
## stops
##  TAA  TAG  TGA 
## 2621  279 1151

This is a lot to take in. If you wish to use this in earnest you should read over the vignettes and reference documentation of the GenomicRanges, GenomicFeatures and plyranges packages.

table is similar to count that we saw earlier, but operates on vectors directly rather than data frames. If we wanted to do this the tidy way we need to first put things in a data frame. It can be a bit of a chore moving between Bioconductor and Tidyverse!

stops_df <- data_frame(stop=as.character(stops))
count(stops_df, stop)
## # A tibble: 3 x 2
##   stop      n
##   <chr> <int>
## 1 TAA    2621
## 2 TAG     279
## 3 TGA    1151

Challenge

The start codon is the first three bases of a coding sequence. Get the start codons of each coding sequence. What is the most common start codon?

Operations on GRanges

The GenomicRanges and plyranges packages contain many useful operations on GRanges objects, such as obtaining flanking regions or overlapping or nearby ranges.

Suppose we wanted to look at the sequence immediately upstrand of each transcript. We can get the range for each transcript with:

tx <- transcripts(db)    # or filter(gr, type == "transcript")

The flank function can then be used to get regions directly upstrand (the default) or downstrand of each range.

upstrand <- flank(tx, 20)
getSeq(seq, upstrand)
##   A DNAStringSet instance of length 4257
##        width seq
##    [1]    20 TTACAGAGTACACAACATCC
##    [2]    20 AAGGTAACGAGGTAACAACC
##    [3]    20 ATGGAAGTTAGGAGTCTGAC
##    [4]    20 CACGAGTACTGGAAAACTAA
##    [5]    20 AATGATAAAAGGAGTAACCT
##    ...   ... ...
## [4253]    20 AAATTTTCAAAGGGTGGAAG
## [4254]    20 ACATAGAGGCGAAGTCCAAC
## [4255]    20 TTTTCTGAGTAATGCTGATT
## [4256]    20 TGAAAGGATGAGGATATTTT
## [4257]    20 TGGCAATTTAGGTAGCAAAC

By eyeball, there seems to be a fairly common “AGG” motif. De novo motif discovery could be applied to these sequences, either using external software such as meme or a package within Bioconductor such as motifRG.

Seqinfo (optional section)

We mention Seqinfo objects mostly because they are something you may trip over if they contain wrong or inconsistent information.

GRanges and DNAStringSet have a “seqinfo” attribute. This mostly acts as a safety check. One thing this specifies is which chromosomes are circular, and this alters the behaviour of some operations. Bacterial DNA is circular, so we can amend the seqinfo associated with our various objects to reflect this.

seqinfo(seq)
## Seqinfo object with 1 sequence from an unspecified genome:
##   seqnames   seqlengths isCircular genome
##   Chromosome    4558660         NA   <NA>
seqinfo(gr)
## Seqinfo object with 1 sequence from an unspecified genome; no seqlengths:
##   seqnames   seqlengths isCircular genome
##   Chromosome         NA         NA   <NA>
si <- seqinfo(seq)
isCircular(si) <- TRUE
si
## Seqinfo object with 1 sequence (1 circular) from an unspecified genome:
##   seqnames   seqlengths isCircular genome
##   Chromosome    4558660       TRUE   <NA>
seqinfo(gr) <- si

It’s not uncommon for chromosomes to be given slightly different names in different files, eg with or without a “chr” prefix. The renameSeqlevels function can be used to fix this.

It’s also important to make sure the same genome assembly has been used for your data. For example the current human reference genome is “hg38” but some data still uses the older “hg19” genome. Genomic ranges based on hg19 would refer to the wrong locations within the hg38 assembly, and would need to be “lifted over” to hg38 if this is the genome assembly you are using.

Wrap up

This has been a whirlwind tour of one set of core classes in Bioconductor. The main thing you should take away from this is how to examine Bioconductor objects and where to look for documentation. There are many many Bioconductor packages for working with specific types of data, which we hope will be accessible now that you have a general sense of how Bioconductor packages work.