This section looks at working with sequences, primarily DNA sequences, and genomic features.

We will be using Bioconductor packages for this. Recall that most R packages are available on the CRAN package repository, and installed using install.packages(). Bioconductor is another specialized repository devoted to bioinformatics R packages. Here is how to install packages from Bioconductor:

# Don't run this if you are using our biotraining server, the packages are already installed!

# This loads the Bioconductor installer
source("https://bioconductor.org/biocLite.R")

# If this says you have a version of Bioconductor prior to 3.3
# some things in this tutorial may fail. To upgrade:
# - Make sure you have R 3.3 or higher
# - remove.packages("BiocInstaller")
# - Restart R try again from the top

# Install a basic set of packages
biocLite()

# Install further packages used in this tutorial
biocLite(c(
    "Biostrings",
    "GenomicRanges",
    "BSgenome", 
    "rtracklayer",
    "motifRG",
    "AnnotationHub",
    "ensembldb"
))

# If you want to install further packages in future, you can use
#   library(BiocInstaller)
#   biocLite( ... )

Bioconductor represents a different strand of current development in R, separate from the Hadley Wickham tidyverse. Where Hadley emphasizes the data frame above all else, Bioconductor uses a great variety of data types. It’s the very opposite of tidy!

Nevertheless, Bioconductor is overwhelmingly comprehensive, and represents the most complete environment available for working with bioinformatic data currently available.

Bioconductor packages usually have useful documentation in the form of “vignettes”. These are readable on the Bioconductor website, or within R:

vignette()
vignette(package="Biostrings")
vignette("BiostringsQuickOverview", package="Biostrings")

DNA sequences and genomic ranges

Load packages we will be using:

library(Biostrings)     # Provides DNAString, DNAStringSet, etc
library(BSgenome)       # Provides getSeq()
library(GenomicRanges)  # Provides GRanges, etc
library(rtracklayer)    # Provides import() and export()

DNAString

Package Biostrings offers classes for storing DNA strings, DNAString, amino acid sequences, AAString, or anything else in a BString. These are very like character strings, but a variety of biologically meaningful functions can be applied to them.

myseq <- DNAString("ACCATTGATTAT")
myseq

##   12-letter "DNAString" instance
## seq: ACCATTGATTAT

class(myseq)

## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"

reverseComplement(myseq)

##   12-letter "DNAString" instance
## seq: ATAATCAATGGT

translate(myseq)

##   4-letter "AAString" instance
## seq: TIDY

subseq(myseq, 3,5)

##   3-letter "DNAString" instance
## seq: CAT

myseq[3:5]

##   3-letter "DNAString" instance
## seq: CAT

as.character(myseq)

## [1] "ACCATTGATTAT"

You can see a complete set of functions that work with DNAString with:

methods(class="DNAString")

You can see get help on the DNAString class with:

?"DNAString-class"

DNAStringSet

Often we want to work with a list of sequences, such as chromosomes.

myset <- DNAStringSet( list(chrI=myseq, chrII=DNAString("ACGTACGT")) )
myset

##   A DNAStringSet instance of length 2
##     width seq                                                         names               
## [1]    12 ACCATTGATTAT                                                chrI
## [2]     8 ACGTACGT                                                    chrII

# A DNAStringSet is list-like
myset$chrII

##   8-letter "DNAString" instance
## seq: ACGTACGT

# or myset[["chrII"]]
# or myset[[2]]

GRanges

We may then wish to refer to regions of these sequences, often with an associated strand. This is done with the GRanges type. GRanges builds on IRanges, “integer ranges”. An IRanges has a starts and ends. A GRanges additionally has sequence names and strand information.

range1 <- GRanges("chrI", IRanges(start=3,end=5), strand="+")
range1

## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chrI       3-5      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

getSeq(myset, range1)

##   A DNAStringSet instance of length 1
##     width seq
## [1]     3 CAT

range2 <- GRanges("chrI", IRanges(start=3,end=5), strand="-")
getSeq(myset, range2)

##   A DNAStringSet instance of length 1
##     width seq
## [1]     3 ATG

Accessing GRanges data:

seqnames(range1)

## factor-Rle of length 1 with 1 run
##   Lengths:    1
##   Values : chrI
## Levels(1): chrI

start(range1)

## [1] 3

end(range1)

## [1] 5

strand(range1)

## factor-Rle of length 1 with 1 run
##   Lengths: 1
##   Values : +
## Levels(3): + - *

as.data.frame(range1)

##   seqnames start end width strand
## 1     chrI     3   5     3      +

Further manipulations:

# GRanges are sometimes like vectors:
c(range1, range2)

## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chrI       3-5      +
##   [2]     chrI       3-5      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

# GRanges can have metadata columns, so they are also like data frames:
mcols(range1)$wobble <- 10
range1

## GRanges object with 1 range and 1 metadata column:
##       seqnames    ranges strand |    wobble
##          <Rle> <IRanges>  <Rle> | <numeric>
##   [1]     chrI       3-5      + |        10
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

mcols(range1)

## DataFrame with 1 row and 1 column
##      wobble
##   <numeric>
## 1        10

range1$wobble

## [1] 10

# A handy way to create a GRanges
as("chrI:3-5:+", "GRanges")

## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chrI       3-5      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Question

Based on what we saw for DNAString, how can we learn more about using GRanges and IRanges objects?

Challenge

Reverse complement the following DNA sequence and then translate to an amino acid sequence:

TTCCATTTCCAT

Loading files

Loading sequences

DNA sequences are generally stored in FASTA format, a simple text format. These can be loaded with readDNAStringSet from Biostrings. Let’s load the genome of E. coli strain K-12, obtained from the Ensembl FTP site.

### The start of the .fa file looks like this:
# >Chromosome dna:chromosome chromosome:GCA_000800765.1:Chromosome:1:4558660:1
# AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC
# TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG
# TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC
# ACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGT
# AACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGG
# CTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGT
# ...

seqs <- readDNAStringSet("r-more-files/Escherichia_coli_k_12.GCA_000800765.1.29.dna.genome.fa")
seqs

##   A DNAStringSet instance of length 1
##       width seq                                                       names               
## [1] 4558660 AGCTTTTCATTCTGACTGCAACGGGCA...TAAAAAACGCCTTAGTAAGTATTTTTC Chromosome dna:ch...

# Our chromosome name is too verbose.
# Remove everything from the name after the first space.
names(seqs)

## [1] "Chromosome dna:chromosome chromosome:GCA_000800765.1:Chromosome:1:4558660:1"

names(seqs) <- sub(" .*","",names(seqs))
names(seqs)

## [1] "Chromosome"

Conversely, a DNAStringSet can also be written to a file with writeXStringSet.

Loading features

Genome annotations are available in a variety of text formats such as GFF3 and GTF. They can be loaded with the import function from rtracklayer. This GTF file is also from Ensembl, and gives the locations of the genes in the genome, and features within them.

### The start of the .gtf file looks like this:
# #!genome-build ASM80076v1
# #!genome-version GCA_000800765.1
# #!genome-date 2014-12
# #!genome-build-accession GCA_000800765.1
# #!genebuild-last-updated 2014-12
# Chromosome      ena     gene    190     255     .       +       .       gene_id "ER3413_4519"; gene_version "1"; gene_name "thrL"; gene_source "ena"; gene_biotype "protein_coding";
# Chromosome      ena     transcript      190     255     .       +       .       gene_id "ER3413_4519"; gene_version "1"; transcript_id "AIZ54182"; transcript_version "1"; gene_name "thrL"; gene_source "ena"; gene_biotype "protein_coding"; transcript_name "thrL-1"; transcript_source "ena"; transcript_biotype "protein_coding";
# Chromosome      ena     exon    190     255     .       +       .       gene_id "ER3413_4519"; gene_version "1"; transcript_id "AIZ54182"; transcript_version "1"; exon_number "1"; gene_name "thrL"; gene_source "ena"; gene_biotype "protein_coding"; transcript_name "thrL-1"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AIZ54182-1"; exon_version "1";
# Chromosome      ena     CDS     190     252     .       +       0       gene_id "ER3413_4519"; gene_version "1"; transcript_id "AIZ54182"; transcript_version "1"; exon_number "1"; gene_name "thrL"; gene_source "ena"; gene_biotype "protein_coding"; transcript_name "thrL-1"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AIZ54182"; protein_version "1";
# ...

features <- import("r-more-files/Escherichia_coli_k_12.GCA_000800765.1.29.gtf")

# Optional: just retain the columns of metadata we need
mcols(features) <- mcols(features)[,c("type","gene_name","gene_id")]

features

## GRanges object with 24926 ranges and 3 metadata columns:
##             seqnames          ranges strand |        type   gene_name     gene_id
##                <Rle>       <IRanges>  <Rle> |    <factor> <character> <character>
##       [1] Chromosome         190-255      + |        gene        thrL ER3413_4519
##       [2] Chromosome         190-255      + |  transcript        thrL ER3413_4519
##       [3] Chromosome         190-255      + |        exon        thrL ER3413_4519
##       [4] Chromosome         190-252      + |         CDS        thrL ER3413_4519
##       [5] Chromosome         190-192      + | start_codon        thrL ER3413_4519
##       ...        ...             ...    ... .         ...         ...         ...
##   [24922] Chromosome 4557950-4558636      + |  transcript        yjtD ER3413_4514
##   [24923] Chromosome 4557950-4558636      + |        exon        yjtD ER3413_4514
##   [24924] Chromosome 4557950-4558633      + |         CDS        yjtD ER3413_4514
##   [24925] Chromosome 4557950-4557952      + | start_codon        yjtD ER3413_4514
##   [24926] Chromosome 4558634-4558636      + |  stop_codon        yjtD ER3413_4514
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Conversely, a GRanges can be written to a file with export.

We can use these annotations to grab sequences from the genome.

feat <- features[4,]
feat

## GRanges object with 1 range and 3 metadata columns:
##         seqnames    ranges strand |     type   gene_name     gene_id
##            <Rle> <IRanges>  <Rle> | <factor> <character> <character>
##   [1] Chromosome   190-252      + |      CDS        thrL ER3413_4519
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

feat_seq <- getSeq(seqs, feat)
feat_seq

##   A DNAStringSet instance of length 1
##     width seq
## [1]    63 ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGC

translate(feat_seq)

##   A AAStringSet instance of length 1
##     width seq
## [1]    21 MKRISTTITTTITITTGNGAG

The metadata columns let us query the GRanges, for example for a particular gene.

subset(features, gene_name == "lacA")
# Equivalently:
#   features[features$gene_name == "lacA" & !is.na(features$gene_name),]

## GRanges object with 6 ranges and 3 metadata columns:
##         seqnames        ranges strand |        type   gene_name     gene_id
##            <Rle>     <IRanges>  <Rle> |    <factor> <character> <character>
##   [1] Chromosome 363147-363758      - |        gene        lacA  ER3413_350
##   [2] Chromosome 363147-363758      - |  transcript        lacA  ER3413_350
##   [3] Chromosome 363147-363758      - |        exon        lacA  ER3413_350
##   [4] Chromosome 363150-363758      - |         CDS        lacA  ER3413_350
##   [5] Chromosome 363756-363758      - | start_codon        lacA  ER3413_350
##   [6] Chromosome 363147-363149      - |  stop_codon        lacA  ER3413_350
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Note: subset is a generic R function. It is also similar to dplyr’s filter. The second argument is special, in it you can refer to columns of the GRanges directly.

We could also get all features of a particular type.

cds <- subset(features, type == "CDS")
cds
# Equivalently:
#   features[features$type == "CDS",]

## GRanges object with 4052 ranges and 3 metadata columns:
##            seqnames          ranges strand |     type   gene_name     gene_id
##               <Rle>       <IRanges>  <Rle> | <factor> <character> <character>
##      [1] Chromosome         190-252      + |      CDS        thrL ER3413_4519
##      [2] Chromosome        337-2796      + |      CDS        thrA    ER3413_1
##      [3] Chromosome       2801-3730      + |      CDS        thrB    ER3413_2
##      [4] Chromosome       3734-5017      + |      CDS        thrC    ER3413_3
##      [5] Chromosome       5234-5527      + |      CDS        yaaX    ER3413_4
##      ...        ...             ...    ... .      ...         ...         ...
##   [4048] Chromosome 4553704-4555125      + |      CDS        creC ER3413_4511
##   [4049] Chromosome 4555186-4556535      + |      CDS        creD ER3413_4512
##   [4050] Chromosome 4556601-4557314      - |      CDS        arcA ER3413_4513
##   [4051] Chromosome 4557410-4557547      + |      CDS        yjjY ER3413_4541
##   [4052] Chromosome 4557950-4558633      + |      CDS        yjtD ER3413_4514
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Further operations on GRanges

Intra-range

Various useful manipulations of individual ranges are defined.

?"intra-range-methods"

Note: How these make use of the strand is a little haphazard. For example flank() and resize() respect strand but shift() does not.

Earlier we translated a coding sequence. Coding sequences are terminated by a stop codon. Let’s extend the CDS feature to include this.

feat <- features[4,]
feat_stop <- resize(feat, width(feat)+3)
seq_stop <- getSeq(seqs, feat_stop)
translate(seq_stop)

##   A AAStringSet instance of length 1
##     width seq
## [1]    22 MKRISTTITTTITITTGNGAG*

resize can fix either the fix="start" or fix="end" of the sequence.

flank can be either flank the start (start=TRUE) or end (start=FALSE).

# input
#                         (----)
#                         .    .
# resize                  (--------)
# resize(fix="end")   (--------)
#                         .    .
# flank              (---).    .
# flank(start=F)          .    .(---)
#                         .    .
# promoters          (------)  .
#                         .    .
# narrow                  .(--).
#                         .    .
# shift (ignores strand!) .  (----)

Inter-range

?"inter-range-methods"
?"setops-methods"

One compelling feature of GenomicRanges is that it is able to find overlapping ranges very quickly.

query <- as("Chromosome:9500-10000:+", "GRanges")
hits <- findOverlaps(query, features, ignore.strand=TRUE)
hits

## Hits object with 10 hits and 0 metadata columns:
##        queryHits subjectHits
##        <integer>   <integer>
##    [1]         1          49
##    [2]         1          50
##    [3]         1          51
##    [4]         1          52
##    [5]         1          54
##    [6]         1          55
##    [7]         1          56
##    [8]         1          57
##    [9]         1          58
##   [10]         1          60
##   -------
##   queryLength: 1 / subjectLength: 24926

subjectHits(hits)

##  [1] 49 50 51 52 54 55 56 57 58 60

features[subjectHits(hits),]

## GRanges object with 10 ranges and 3 metadata columns:
##          seqnames     ranges strand |       type   gene_name     gene_id
##             <Rle>  <IRanges>  <Rle> |   <factor> <character> <character>
##    [1] Chromosome  9306-9893      + |       gene         mog    ER3413_8
##    [2] Chromosome  9306-9893      + | transcript         mog    ER3413_8
##    [3] Chromosome  9306-9893      + |       exon         mog    ER3413_8
##    [4] Chromosome  9306-9890      + |        CDS         mog    ER3413_8
##    [5] Chromosome  9891-9893      + | stop_codon         mog    ER3413_8
##    [6] Chromosome 9928-10494      - |       gene        yaaH    ER3413_9
##    [7] Chromosome 9928-10494      - | transcript        yaaH    ER3413_9
##    [8] Chromosome 9928-10494      - |       exon        yaaH    ER3413_9
##    [9] Chromosome 9931-10494      - |        CDS        yaaH    ER3413_9
##   [10] Chromosome  9928-9930      - | stop_codon        yaaH    ER3413_9
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

findOverlaps(query, features, ignore.strand=FALSE)

## Hits object with 5 hits and 0 metadata columns:
##       queryHits subjectHits
##       <integer>   <integer>
##   [1]         1          49
##   [2]         1          50
##   [3]         1          51
##   [4]         1          52
##   [5]         1          54
##   -------
##   queryLength: 1 / subjectLength: 24926

With findOverlaps, we can use genomic location as the key when joining disparate types of data, so this is an important tool for integrative analysis. See also the related functions nearest, precede, follow, and distance.

GenomicRanges also provides:

range - get a feature that spans from the start to the end of all features in a GRanges.
reduce - merge overlapping features, so that the same bases are covered by a reduced collection of features.
disjoin - as with reduce, but broken at each start and end of the input features.
setdiff - subtracts one set of features from another, could be used with range on a set of exons to get introns. Might need to use GenomicRanges::setdiff if also using dplyr.

# input
#         (--------)
#              (----------------)
#                      (-----)
#                                     (---)
#                                     
# range   (-------------------------------)
# 
# reduce  (----------------------)    (---)
# 
# disjoin (---)(---)(-)(-----)(--)    (---)
# 
# GenomicRanges::setdiff(range(input),input)
#                                 (--)

Challenge

What are E. coli’s most common start and stop codons?

The start codon is the first three bases of the CDS, and the stop codon is the three bases following the end of the CDS.

Hint: Recall that we could get all CDS ranges with:

cds <- subset(features, type == "CDS")

Hint: Use flank() and resize() to manipulate these ranges.

Further data types to explore

GRangesList, etc: Many Bioconductor types have a List version – GRangesList, DNAStringSetList, etc. For example the exons of a collection of genes could be naturally stored in a GRangesList. Most functions that work with GRanges will also worked with GRangesList, and operate on each list element separately.

TxDb: TxDb objects represent the hierarchy of genes which contain transcripts which contain exons and CDS (CoDing Sequence) ranges. TxDb objects are provided by the GenomicFeatures package.

Seqinfo: GRanges (and various other types) may have associated sequence information accessed with seqinfo(). This contains the names and lengths of the sequences the ranges may refer to, and whether they are circular. It allows for some error checking if present.

Finding a known motif

AGGAGGU is the Shine-Dalgarno sequence, which assists binding of the ribosome to a transcript.

vmatchPattern("AGGAGGT", seqs)

## MIndex object of length 1
## $Chromosome
## IRanges object with 63 ranges and 0 metadata columns:
##            start       end     width
##        <integer> <integer> <integer>
##    [1]     56593     56599         7
##    [2]     67347     67353         7
##    [3]    226876    226882         7
##    [4]    229408    229414         7
##    [5]    241665    241671         7
##    ...       ...       ...       ...
##   [59]   4312631   4312637         7
##   [60]   4371930   4371936         7
##   [61]   4410503   4410509         7
##   [62]   4420666   4420672         7
##   [63]   4484025   4484031         7

vmatchPattern is strand specific. If we want matches on the reverse strand we need to also:

vmatchPattern(reverseComplement(DNAString("AGGAGGT")), seqs)

## MIndex object of length 1
## $Chromosome
## IRanges object with 76 ranges and 0 metadata columns:
##            start       end     width
##        <integer> <integer> <integer>
##    [1]     59133     59139         7
##    [2]    125294    125300         7
##    [3]    136473    136479         7
##    [4]    226640    226646         7
##    [5]    266770    266776         7
##    ...       ...       ...       ...
##   [72]   4139844   4139850         7
##   [73]   4181244   4181250         7
##   [74]   4241083   4241089         7
##   [75]   4397026   4397032         7
##   [76]   4473495   4473501         7

Demanding an exact match here is overly strict. vmatchPattern has arguments allowing inexact matches. Alternatively, there is a similar function for searching for a Position Weight Matrix pattern, matchPWM.

The following will search both strands, allowing one mismatch, and produce the result in convenient GRanges form:

query <- DNAString("AGGAGGT")
max.mismatch <- 1

fwd <- vmatchPattern(query, seqs, max.mismatch=max.mismatch)
fwd <- as(fwd, "GRanges")
strand(fwd) <- "+"
rev <- vmatchPattern(reverseComplement(query), seqs, max.mismatch=max.mismatch)
rev <- as(rev, "GRanges")
strand(rev) <- "-"

complete <- c(fwd, rev)
complete

## GRanges object with 7534 ranges and 0 metadata columns:
##            seqnames          ranges strand
##               <Rle>       <IRanges>  <Rle>
##      [1] Chromosome         323-329      +
##      [2] Chromosome       3540-3546      +
##      [3] Chromosome       3765-3771      +
##      [4] Chromosome       5374-5380      +
##      [5] Chromosome       7641-7647      +
##      ...        ...             ...    ...
##   [7530] Chromosome 4550281-4550287      -
##   [7531] Chromosome 4551603-4551609      -
##   [7532] Chromosome 4551732-4551738      -
##   [7533] Chromosome 4552223-4552229      -
##   [7534] Chromosome 4552751-4552757      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

# Write to GFF file
export(complete, "motif-matches.gff")

We might then view this in the IGV genome browser:

De novo motif finding

Let’s try to “discover” the Shine-Dalgarno sequence for ourselves.

# Note: bacteria do not have introns
# In a eukaryote, you would need to merge CDS by transcript

size <- 20

initiation_regions <- flank(cds, size, start=TRUE)
initiation_seqs <- getSeq(seqs, initiation_regions)
names(initiation_seqs) <- initiation_regions$gene_id

# Look for any composition bias
library(seqLogo)

## Loading required package: grid

letter_counts <- consensusMatrix(initiation_seqs)
probs <- prop.table(letter_counts[1:4,], 2)
seqLogo(probs, ic.scale=FALSE)

seqLogo(probs)

# Generate a background set of sequences by shuffling 
shuffle <- function(dna) {
    # Convert to a vector of single bases
    charvec <- strsplit(as.character(dna),"")[[1]]
    # Shuffle the vector 
    shuffled_charvec <- sample(charvec)
    # Convert back to a DNA string
    DNAString( paste(shuffled_charvec, collapse="") )
}

background_seqs <- DNAStringSet( lapply(initiation_seqs, shuffle) )
names(background_seqs) <- paste0(names(background_seqs), "-shuffled")

library(motifRG)

results <- findMotifFgBg(
    initiation_seqs, background_seqs, 
    both.strand=FALSE, start.width=4)

## AGGA 5.283892e-07 
##  Refine  AGGA 23.05403 : 22.89242 23.10145 22.64465 24.89169 23.78465 TRUE 1590 572 1520 539 
## New motif:  AGGA 
## match range  2074 
## [1] "Rescore"
## [1] "Finished Rescore"
## GGAG 1.758678e-05 
##  Refine  GGAG 12.90943 : 13.6547 11.78928 12.35035 12.47426 14.80007 TRUE 629 252 625 249 
## New motif:  GGAG 
## match range  874 
## [1] "Rescore"
## [1] "Finished Rescore"
## TAGA 5.804435e-07 
##  Refine  TAGA 12.57232 : -12.36378 -13.43672 -12.17826 -12.61662 -12.9688 FALSE 138 463 136 447 
## New motif:  TAGA 
## match range  591 
## [1] "Rescore"
## [1] "Finished Rescore"
## TAGT 4.11006e-06 
##  Refine  TAGT 9.961795 : -10.23105 -9.603613 -8.588459 -9.432462 -9.878337 FALSE 98 318 97 300 
## New motif:  TAGT 
## match range  406 
## [1] "Rescore"
## [1] "Finished Rescore"
## GAGG 0.000101929 
##  Refine  GAGG 8.106548 : 8.759259 7.410004 6.875806 7.349752 9.659215 TRUE 370 181 368 179 
## New motif:  GAGG

summaryMotif(results$motif, results$category)

##            scores signs fg.hits bg.hits fg.seq bg.seq     ratio    fg.frac    bg.frac
## NNAGGANN 23.05403  TRUE    1590     572   1520    539 2.7797203 0.37512340 0.13302073
## NNGGAGNN 23.93018  TRUE    1384     404   1348    388 3.4257426 0.33267522 0.09575518
## NNTAGANN 12.43437 FALSE     148     474    146    456 0.3122363 0.03603159 0.11253702
## NNTAGTNN 10.09031 FALSE      98     322     97    304 0.3043478 0.02393880 0.07502468
## NNGAGGNN 15.99299  TRUE    1001     414    977    392 2.4178744 0.24111550 0.09674235

motifHtmlTable(results)

## [1] "cp  /Library/Frameworks/R.framework/Versions/3.5/Resources/library/motifRG/css/header.html html/motif.html"

refined <- refinePWMMotif(results$motifs[[1]]@match$pattern, initiation_seqs)

## 1 4428.402 
## 2 21011.55 
## 3 21321.49 
## 4 21413.76 
## 5 21440.74 
## 6 21447.47 
## 7 21450.74 
## 8 21454 
## 9 21456.82 
## 10 21459.03 
## 11 21461.5 
## 12 21463.57

seqLogo(refined$model$prob)

There are many other motif finding programs available, outside of R. An alternative approach would be to construct the foreground and background sequences as above, then write them to FASTA files for use with an external program.

writeXStringSet(initiation_seqs, "fg.fa")
writeXStringSet(background_seqs, "bg.fa")

For example MEME (which doesn’t actually need background sequences):

system("meme -dna -maxsize 1000000 fg.fa")

Next steps

We’ve seen just the smallest part of what Bioconductor has to offer in this space.

Besides software, Bioconductor includes packages with data for model organisms, for example. The data is generally from these central repositories:

NCBI’s Entrez Gene gene database and Refseq reference sequences
The EBI’s Ensembl genome browser
The UCSC genome browser

These organizations will generally obtain genome assemblies from the same ultimate sources. For example, all of the above use the Genome Reference Consortium’s GRCh38 DNA sequence for homo sapiens. UCSC likes to call this “hg38” but it is the same DNA sequence. These DNA sequences serve as a common frame of reference. However the three organizations above will differ on their exact set of gene and transcript annotations, and all use a different gene and transcript ID system. These annotations are also revised more often than the underlying DNA sequences.

This mess is partly due to American/European rivalry, and partly due to differing goals. The UCSC genome browser has always been about practicality and showing many lines of evidence. The others are more concerned with careful curation and standardization.

Some example packages:

BSgenome.Hsapiens.UCSC.hg38

Biostrings genome, Homo sapiens, from the UCSC browser, version hg38.

DNA for chromosomes, usable in the same way as the DNAStringSet used above.

TxDb.Hsapiens.UCSC.hg38.knownGene

Transcript database, Homo sapiens, from UCSC browser, genome verison hg38, “knownGene” gene annotations.

GRanges information for genes and transcripts, much as we loaded from a GTF file above.

org.Hs.eg.db

Organism Homo sapiens, primary key is Entrez Gene, database.

Translation of gene ids from various databases, assignment to GO terms, KEGG pathways, etc. Entrez Gene ids are used as the primary key.

biomaRt

Access to BioMart data, on the internet – translation of gene ids, gene sets, gene information, etc.

AnnotationHub

AnnotationHub is a way to retrieve data from a more comprehensive set of organisms and data providers than the above styles of package. The retrieved data is returned in an appropriate Bioconductor type. If data is being updated over time (eg improved annotation of a genome), each version receives a unique ID in AnnotationHub, making it much easier to write reproducable analyses.

AnnotationHub also provides access to experimental data which maps to locations on a genome, similar to the sorts of tracks you would load in the UCSC browser.

Files are cached, so they will only be downloaded once.

In the example below, the yeast genome and annotations are retrieved:

library(AnnotationHub)
ah <- AnnotationHub()

# ah contains a large collection of records that can be retrieved
ah
length(ah)
colnames( mcols(ah) )
table( ah$rdataclass )

# query() searches for terms in an unstructured way
records <- query(ah, c("Ensembl", "85", "Saccharomyces cerevisiae"))
records

mcols(records)
mcols(records)[,c("title","rdataclass")]

# Having located records of interest,
# your R script can refer to the specific AH... record,
# so it always uses the same version of the data.
ah[["AH51399"]]
sc_genome <- import( ah[["AH51399"]] )
sc_granges <- ah[["AH51088"]]

# More recent versions of Bioconductor also allow you to
# retrieve TxDb (and similar EnsDb) objects.


query(ah, c("OrgDb", "Saccharomyces cerevisiae"))
sc_orgdb <- ah[["AH49589"]]

# An OrgDb contains information about genes in an organism
columns(sc_orgdb)
head( keys(sc_orgdb, "ORF") )
select(sc_orgdb, "YFL039C", c("GENENAME", "DESCRIPTION"), keytype="ORF")

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets 
##  [9] methods   base     
## 
## other attached packages:
##  [1] motifRG_1.24.0                    BSgenome.Hsapiens.UCSC.hg19_1.4.0
##  [3] seqLogo_1.46.0                    BSgenome_1.48.0                  
##  [5] rtracklayer_1.40.6                GenomicRanges_1.32.7             
##  [7] GenomeInfoDb_1.16.0               Biostrings_2.48.0                
##  [9] XVector_0.20.0                    IRanges_2.14.12                  
## [11] S4Vectors_0.18.3                  BiocGenerics_0.26.0              
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.18                knitr_1.20                  magrittr_1.5               
##  [4] GenomicAlignments_1.16.0    zlibbioc_1.26.0             BiocParallel_1.14.2        
##  [7] lattice_0.20-35             stringr_1.3.1               tools_3.5.0                
## [10] SummarizedExperiment_1.10.1 Biobase_2.40.0              matrixStats_0.54.0         
## [13] htmltools_0.3.6             yaml_2.2.0                  rprojroot_1.3-2            
## [16] digest_0.6.17               Matrix_1.2-14               GenomeInfoDbData_1.1.0     
## [19] codetools_0.2-15            bitops_1.0-6                RCurl_1.95-4.11            
## [22] evaluate_0.11               rmarkdown_1.10              DelayedArray_0.6.6         
## [25] stringi_1.2.4               compiler_3.5.0              Rsamtools_1.32.3           
## [28] backports_1.1.2             XML_3.98-1.16

Working with sequences and genomic features

DNA sequences and genomic ranges

DNAString

DNAStringSet

GRanges

Question

Challenge

Loading files

Loading sequences

Loading features

Further operations on GRanges

Intra-range

Inter-range

Challenge

Further data types to explore

Finding a known motif

De novo motif finding

Next steps

BSgenome.Hsapiens.UCSC.hg38

TxDb.Hsapiens.UCSC.hg38.knownGene

org.Hs.eg.db

biomaRt

AnnotationHub