This section looks at working with sequences, primarily DNA sequences, and genomic features.
We will be using Bioconductor packages for this. Recall that most R packages are available on the CRAN package repository, and installed using install.packages()
. Bioconductor is another specialized repository devoted to bioinformatics R packages. Here is how to install packages from Bioconductor:
# Don't run this if you are using our biotraining server, the packages are already installed!
# This loads the Bioconductor installer
source("https://bioconductor.org/biocLite.R")
# If this says you have a version of Bioconductor prior to 3.3
# some things in this tutorial may fail. To upgrade:
# - Make sure you have R 3.3 or higher
# - remove.packages("BiocInstaller")
# - Restart R try again from the top
# Install a basic set of packages
biocLite()
# Install further packages used in this tutorial
biocLite(c(
"Biostrings",
"GenomicRanges",
"BSgenome",
"rtracklayer",
"motifRG",
"AnnotationHub",
"ensembldb"
))
# If you want to install further packages in future, you can use
# library(BiocInstaller)
# biocLite( ... )
Bioconductor represents a different strand of current development in R, separate from the Hadley Wickham tidyverse. Where Hadley emphasizes the data frame above all else, Bioconductor uses a great variety of data types. It’s the very opposite of tidy!
Nevertheless, Bioconductor is overwhelmingly comprehensive, and represents the most complete environment available for working with bioinformatic data currently available.
Bioconductor packages usually have useful documentation in the form of “vignettes”. These are readable on the Bioconductor website, or within R:
vignette()
vignette(package="Biostrings")
vignette("BiostringsQuickOverview", package="Biostrings")
Load packages we will be using:
library(Biostrings) # Provides DNAString, DNAStringSet, etc
library(BSgenome) # Provides getSeq()
library(GenomicRanges) # Provides GRanges, etc
library(rtracklayer) # Provides import() and export()
Package Biostrings
offers classes for storing DNA strings, DNAString
, amino acid sequences, AAString
, or anything else in a BString
. These are very like character strings, but a variety of biologically meaningful functions can be applied to them.
myseq <- DNAString("ACCATTGATTAT")
myseq
## 12-letter "DNAString" instance
## seq: ACCATTGATTAT
class(myseq)
## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"
reverseComplement(myseq)
## 12-letter "DNAString" instance
## seq: ATAATCAATGGT
translate(myseq)
## 4-letter "AAString" instance
## seq: TIDY
subseq(myseq, 3,5)
## 3-letter "DNAString" instance
## seq: CAT
myseq[3:5]
## 3-letter "DNAString" instance
## seq: CAT
as.character(myseq)
## [1] "ACCATTGATTAT"
You can see a complete set of functions that work with DNAString with:
methods(class="DNAString")
You can see get help on the DNAString
class with:
?"DNAString-class"
Often we want to work with a list of sequences, such as chromosomes.
myset <- DNAStringSet( list(chrI=myseq, chrII=DNAString("ACGTACGT")) )
myset
## A DNAStringSet instance of length 2
## width seq names
## [1] 12 ACCATTGATTAT chrI
## [2] 8 ACGTACGT chrII
# A DNAStringSet is list-like
myset$chrII
## 8-letter "DNAString" instance
## seq: ACGTACGT
# or myset[["chrII"]]
# or myset[[2]]
We may then wish to refer to regions of these sequences, often with an associated strand. This is done with the GRanges
type. GRanges
builds on IRanges
, “integer ranges”. An IRanges has a starts and ends. A GRanges additionally has sequence names and strand information.
range1 <- GRanges("chrI", IRanges(start=3,end=5), strand="+")
range1
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrI 3-5 +
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
getSeq(myset, range1)
## A DNAStringSet instance of length 1
## width seq
## [1] 3 CAT
range2 <- GRanges("chrI", IRanges(start=3,end=5), strand="-")
getSeq(myset, range2)
## A DNAStringSet instance of length 1
## width seq
## [1] 3 ATG
Accessing GRanges data:
seqnames(range1)
## factor-Rle of length 1 with 1 run
## Lengths: 1
## Values : chrI
## Levels(1): chrI
start(range1)
## [1] 3
end(range1)
## [1] 5
strand(range1)
## factor-Rle of length 1 with 1 run
## Lengths: 1
## Values : +
## Levels(3): + - *
as.data.frame(range1)
## seqnames start end width strand
## 1 chrI 3 5 3 +
Further manipulations:
# GRanges are sometimes like vectors:
c(range1, range2)
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrI 3-5 +
## [2] chrI 3-5 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
# GRanges can have metadata columns, so they are also like data frames:
mcols(range1)$wobble <- 10
range1
## GRanges object with 1 range and 1 metadata column:
## seqnames ranges strand | wobble
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chrI 3-5 + | 10
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
mcols(range1)
## DataFrame with 1 row and 1 column
## wobble
## <numeric>
## 1 10
range1$wobble
## [1] 10
# A handy way to create a GRanges
as("chrI:3-5:+", "GRanges")
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrI 3-5 +
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Based on what we saw for DNAString
, how can we learn more about using GRanges
and IRanges
objects?
Reverse complement the following DNA sequence and then translate to an amino acid sequence:
TTCCATTTCCAT
DNA sequences are generally stored in FASTA format, a simple text format. These can be loaded with readDNAStringSet
from Biostrings
. Let’s load the genome of E. coli strain K-12, obtained from the Ensembl FTP site.
### The start of the .fa file looks like this:
# >Chromosome dna:chromosome chromosome:GCA_000800765.1:Chromosome:1:4558660:1
# AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC
# TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG
# TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC
# ACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGT
# AACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGG
# CTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGT
# ...
seqs <- readDNAStringSet("r-more-files/Escherichia_coli_k_12.GCA_000800765.1.29.dna.genome.fa")
seqs
## A DNAStringSet instance of length 1
## width seq names
## [1] 4558660 AGCTTTTCATTCTGACTGCAACGGGCA...TAAAAAACGCCTTAGTAAGTATTTTTC Chromosome dna:ch...
# Our chromosome name is too verbose.
# Remove everything from the name after the first space.
names(seqs)
## [1] "Chromosome dna:chromosome chromosome:GCA_000800765.1:Chromosome:1:4558660:1"
names(seqs) <- sub(" .*","",names(seqs))
names(seqs)
## [1] "Chromosome"
Conversely, a DNAStringSet can also be written to a file with writeXStringSet
.
Genome annotations are available in a variety of text formats such as GFF3 and GTF. They can be loaded with the import
function from rtracklayer
. This GTF file is also from Ensembl, and gives the locations of the genes in the genome, and features within them.
### The start of the .gtf file looks like this:
# #!genome-build ASM80076v1
# #!genome-version GCA_000800765.1
# #!genome-date 2014-12
# #!genome-build-accession GCA_000800765.1
# #!genebuild-last-updated 2014-12
# Chromosome ena gene 190 255 . + . gene_id "ER3413_4519"; gene_version "1"; gene_name "thrL"; gene_source "ena"; gene_biotype "protein_coding";
# Chromosome ena transcript 190 255 . + . gene_id "ER3413_4519"; gene_version "1"; transcript_id "AIZ54182"; transcript_version "1"; gene_name "thrL"; gene_source "ena"; gene_biotype "protein_coding"; transcript_name "thrL-1"; transcript_source "ena"; transcript_biotype "protein_coding";
# Chromosome ena exon 190 255 . + . gene_id "ER3413_4519"; gene_version "1"; transcript_id "AIZ54182"; transcript_version "1"; exon_number "1"; gene_name "thrL"; gene_source "ena"; gene_biotype "protein_coding"; transcript_name "thrL-1"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AIZ54182-1"; exon_version "1";
# Chromosome ena CDS 190 252 . + 0 gene_id "ER3413_4519"; gene_version "1"; transcript_id "AIZ54182"; transcript_version "1"; exon_number "1"; gene_name "thrL"; gene_source "ena"; gene_biotype "protein_coding"; transcript_name "thrL-1"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AIZ54182"; protein_version "1";
# ...
features <- import("r-more-files/Escherichia_coli_k_12.GCA_000800765.1.29.gtf")
# Optional: just retain the columns of metadata we need
mcols(features) <- mcols(features)[,c("type","gene_name","gene_id")]
features
## GRanges object with 24926 ranges and 3 metadata columns:
## seqnames ranges strand | type gene_name gene_id
## <Rle> <IRanges> <Rle> | <factor> <character> <character>
## [1] Chromosome 190-255 + | gene thrL ER3413_4519
## [2] Chromosome 190-255 + | transcript thrL ER3413_4519
## [3] Chromosome 190-255 + | exon thrL ER3413_4519
## [4] Chromosome 190-252 + | CDS thrL ER3413_4519
## [5] Chromosome 190-192 + | start_codon thrL ER3413_4519
## ... ... ... ... . ... ... ...
## [24922] Chromosome 4557950-4558636 + | transcript yjtD ER3413_4514
## [24923] Chromosome 4557950-4558636 + | exon yjtD ER3413_4514
## [24924] Chromosome 4557950-4558633 + | CDS yjtD ER3413_4514
## [24925] Chromosome 4557950-4557952 + | start_codon yjtD ER3413_4514
## [24926] Chromosome 4558634-4558636 + | stop_codon yjtD ER3413_4514
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Conversely, a GRanges can be written to a file with export
.
We can use these annotations to grab sequences from the genome.
feat <- features[4,]
feat
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | type gene_name gene_id
## <Rle> <IRanges> <Rle> | <factor> <character> <character>
## [1] Chromosome 190-252 + | CDS thrL ER3413_4519
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
feat_seq <- getSeq(seqs, feat)
feat_seq
## A DNAStringSet instance of length 1
## width seq
## [1] 63 ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGC
translate(feat_seq)
## A AAStringSet instance of length 1
## width seq
## [1] 21 MKRISTTITTTITITTGNGAG
The metadata columns let us query the GRanges, for example for a particular gene.
subset(features, gene_name == "lacA")
# Equivalently:
# features[features$gene_name == "lacA" & !is.na(features$gene_name),]
## GRanges object with 6 ranges and 3 metadata columns:
## seqnames ranges strand | type gene_name gene_id
## <Rle> <IRanges> <Rle> | <factor> <character> <character>
## [1] Chromosome 363147-363758 - | gene lacA ER3413_350
## [2] Chromosome 363147-363758 - | transcript lacA ER3413_350
## [3] Chromosome 363147-363758 - | exon lacA ER3413_350
## [4] Chromosome 363150-363758 - | CDS lacA ER3413_350
## [5] Chromosome 363756-363758 - | start_codon lacA ER3413_350
## [6] Chromosome 363147-363149 - | stop_codon lacA ER3413_350
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Note: subset
is a generic R function. It is also similar to dplyr’s filter
. The second argument is special, in it you can refer to columns of the GRanges directly.
We could also get all features of a particular type.
cds <- subset(features, type == "CDS")
cds
# Equivalently:
# features[features$type == "CDS",]
## GRanges object with 4052 ranges and 3 metadata columns:
## seqnames ranges strand | type gene_name gene_id
## <Rle> <IRanges> <Rle> | <factor> <character> <character>
## [1] Chromosome 190-252 + | CDS thrL ER3413_4519
## [2] Chromosome 337-2796 + | CDS thrA ER3413_1
## [3] Chromosome 2801-3730 + | CDS thrB ER3413_2
## [4] Chromosome 3734-5017 + | CDS thrC ER3413_3
## [5] Chromosome 5234-5527 + | CDS yaaX ER3413_4
## ... ... ... ... . ... ... ...
## [4048] Chromosome 4553704-4555125 + | CDS creC ER3413_4511
## [4049] Chromosome 4555186-4556535 + | CDS creD ER3413_4512
## [4050] Chromosome 4556601-4557314 - | CDS arcA ER3413_4513
## [4051] Chromosome 4557410-4557547 + | CDS yjjY ER3413_4541
## [4052] Chromosome 4557950-4558633 + | CDS yjtD ER3413_4514
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Various useful manipulations of individual ranges are defined.
?"intra-range-methods"
Note: How these make use of the strand is a little haphazard. For example flank() and resize() respect strand but shift() does not.
Earlier we translated a coding sequence. Coding sequences are terminated by a stop codon. Let’s extend the CDS feature to include this.
feat <- features[4,]
feat_stop <- resize(feat, width(feat)+3)
seq_stop <- getSeq(seqs, feat_stop)
translate(seq_stop)
## A AAStringSet instance of length 1
## width seq
## [1] 22 MKRISTTITTTITITTGNGAG*
resize
can fix either the fix="start"
or fix="end"
of the sequence.
flank
can be either flank the start (start=TRUE
) or end (start=FALSE
).
# input
# (----)
# . .
# resize (--------)
# resize(fix="end") (--------)
# . .
# flank (---). .
# flank(start=F) . .(---)
# . .
# promoters (------) .
# . .
# narrow .(--).
# . .
# shift (ignores strand!) . (----)
?"inter-range-methods"
?"setops-methods"
One compelling feature of GenomicRanges is that it is able to find overlapping ranges very quickly.
query <- as("Chromosome:9500-10000:+", "GRanges")
hits <- findOverlaps(query, features, ignore.strand=TRUE)
hits
## Hits object with 10 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 1 49
## [2] 1 50
## [3] 1 51
## [4] 1 52
## [5] 1 54
## [6] 1 55
## [7] 1 56
## [8] 1 57
## [9] 1 58
## [10] 1 60
## -------
## queryLength: 1 / subjectLength: 24926
subjectHits(hits)
## [1] 49 50 51 52 54 55 56 57 58 60
features[subjectHits(hits),]
## GRanges object with 10 ranges and 3 metadata columns:
## seqnames ranges strand | type gene_name gene_id
## <Rle> <IRanges> <Rle> | <factor> <character> <character>
## [1] Chromosome 9306-9893 + | gene mog ER3413_8
## [2] Chromosome 9306-9893 + | transcript mog ER3413_8
## [3] Chromosome 9306-9893 + | exon mog ER3413_8
## [4] Chromosome 9306-9890 + | CDS mog ER3413_8
## [5] Chromosome 9891-9893 + | stop_codon mog ER3413_8
## [6] Chromosome 9928-10494 - | gene yaaH ER3413_9
## [7] Chromosome 9928-10494 - | transcript yaaH ER3413_9
## [8] Chromosome 9928-10494 - | exon yaaH ER3413_9
## [9] Chromosome 9931-10494 - | CDS yaaH ER3413_9
## [10] Chromosome 9928-9930 - | stop_codon yaaH ER3413_9
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
findOverlaps(query, features, ignore.strand=FALSE)
## Hits object with 5 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 1 49
## [2] 1 50
## [3] 1 51
## [4] 1 52
## [5] 1 54
## -------
## queryLength: 1 / subjectLength: 24926
With findOverlaps
, we can use genomic location as the key when joining disparate types of data, so this is an important tool for integrative analysis. See also the related functions nearest
, precede
, follow
, and distance
.
GenomicRanges
also provides:
range
- get a feature that spans from the start to the end of all features in a GRanges
.
reduce
- merge overlapping features, so that the same bases are covered by a reduced collection of features.
disjoin
- as with reduce, but broken at each start and end of the input features.
setdiff
- subtracts one set of features from another, could be used with range on a set of exons to get introns. Might need to use GenomicRanges::setdiff
if also using dplyr
.
# input
# (--------)
# (----------------)
# (-----)
# (---)
#
# range (-------------------------------)
#
# reduce (----------------------) (---)
#
# disjoin (---)(---)(-)(-----)(--) (---)
#
# GenomicRanges::setdiff(range(input),input)
# (--)
What are E. coli’s most common start and stop codons?
The start codon is the first three bases of the CDS, and the stop codon is the three bases following the end of the CDS.
Hint: Recall that we could get all CDS ranges with:
cds <- subset(features, type == "CDS")
Hint: Use flank()
and resize()
to manipulate these ranges.
GRangesList, etc: Many Bioconductor types have a List version – GRangesList
, DNAStringSetList
, etc. For example the exons of a collection of genes could be naturally stored in a GRangesList
. Most functions that work with GRanges
will also worked with GRangesList
, and operate on each list element separately.
TxDb: TxDb
objects represent the hierarchy of genes which contain transcripts which contain exons and CDS (CoDing Sequence) ranges. TxDb
objects are provided by the GenomicFeatures
package.
Seqinfo: GRanges
(and various other types) may have associated sequence information accessed with seqinfo()
. This contains the names and lengths of the sequences the ranges may refer to, and whether they are circular. It allows for some error checking if present.
AGGAGGU is the Shine-Dalgarno sequence, which assists binding of the ribosome to a transcript.
vmatchPattern("AGGAGGT", seqs)
## MIndex object of length 1
## $Chromosome
## IRanges object with 63 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 56593 56599 7
## [2] 67347 67353 7
## [3] 226876 226882 7
## [4] 229408 229414 7
## [5] 241665 241671 7
## ... ... ... ...
## [59] 4312631 4312637 7
## [60] 4371930 4371936 7
## [61] 4410503 4410509 7
## [62] 4420666 4420672 7
## [63] 4484025 4484031 7
vmatchPattern
is strand specific. If we want matches on the reverse strand we need to also:
vmatchPattern(reverseComplement(DNAString("AGGAGGT")), seqs)
## MIndex object of length 1
## $Chromosome
## IRanges object with 76 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 59133 59139 7
## [2] 125294 125300 7
## [3] 136473 136479 7
## [4] 226640 226646 7
## [5] 266770 266776 7
## ... ... ... ...
## [72] 4139844 4139850 7
## [73] 4181244 4181250 7
## [74] 4241083 4241089 7
## [75] 4397026 4397032 7
## [76] 4473495 4473501 7
Demanding an exact match here is overly strict. vmatchPattern
has arguments allowing inexact matches. Alternatively, there is a similar function for searching for a Position Weight Matrix pattern, matchPWM
.
The following will search both strands, allowing one mismatch, and produce the result in convenient GRanges form:
query <- DNAString("AGGAGGT")
max.mismatch <- 1
fwd <- vmatchPattern(query, seqs, max.mismatch=max.mismatch)
fwd <- as(fwd, "GRanges")
strand(fwd) <- "+"
rev <- vmatchPattern(reverseComplement(query), seqs, max.mismatch=max.mismatch)
rev <- as(rev, "GRanges")
strand(rev) <- "-"
complete <- c(fwd, rev)
complete
## GRanges object with 7534 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] Chromosome 323-329 +
## [2] Chromosome 3540-3546 +
## [3] Chromosome 3765-3771 +
## [4] Chromosome 5374-5380 +
## [5] Chromosome 7641-7647 +
## ... ... ... ...
## [7530] Chromosome 4550281-4550287 -
## [7531] Chromosome 4551603-4551609 -
## [7532] Chromosome 4551732-4551738 -
## [7533] Chromosome 4552223-4552229 -
## [7534] Chromosome 4552751-4552757 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
# Write to GFF file
export(complete, "motif-matches.gff")
We might then view this in the IGV genome browser:
Let’s try to “discover” the Shine-Dalgarno sequence for ourselves.
# Note: bacteria do not have introns
# In a eukaryote, you would need to merge CDS by transcript
size <- 20
initiation_regions <- flank(cds, size, start=TRUE)
initiation_seqs <- getSeq(seqs, initiation_regions)
names(initiation_seqs) <- initiation_regions$gene_id
# Look for any composition bias
library(seqLogo)
## Loading required package: grid
letter_counts <- consensusMatrix(initiation_seqs)
probs <- prop.table(letter_counts[1:4,], 2)
seqLogo(probs, ic.scale=FALSE)
seqLogo(probs)
# Generate a background set of sequences by shuffling
shuffle <- function(dna) {
# Convert to a vector of single bases
charvec <- strsplit(as.character(dna),"")[[1]]
# Shuffle the vector
shuffled_charvec <- sample(charvec)
# Convert back to a DNA string
DNAString( paste(shuffled_charvec, collapse="") )
}
background_seqs <- DNAStringSet( lapply(initiation_seqs, shuffle) )
names(background_seqs) <- paste0(names(background_seqs), "-shuffled")
library(motifRG)
results <- findMotifFgBg(
initiation_seqs, background_seqs,
both.strand=FALSE, start.width=4)
## AGGA 5.283892e-07
## Refine AGGA 23.05403 : 22.89242 23.10145 22.64465 24.89169 23.78465 TRUE 1590 572 1520 539
## New motif: AGGA
## match range 2074
## [1] "Rescore"
## [1] "Finished Rescore"
## GGAG 1.758678e-05
## Refine GGAG 12.90943 : 13.6547 11.78928 12.35035 12.47426 14.80007 TRUE 629 252 625 249
## New motif: GGAG
## match range 874
## [1] "Rescore"
## [1] "Finished Rescore"
## TAGA 5.804435e-07
## Refine TAGA 12.57232 : -12.36378 -13.43672 -12.17826 -12.61662 -12.9688 FALSE 138 463 136 447
## New motif: TAGA
## match range 591
## [1] "Rescore"
## [1] "Finished Rescore"
## TAGT 4.11006e-06
## Refine TAGT 9.961795 : -10.23105 -9.603613 -8.588459 -9.432462 -9.878337 FALSE 98 318 97 300
## New motif: TAGT
## match range 406
## [1] "Rescore"
## [1] "Finished Rescore"
## GAGG 0.000101929
## Refine GAGG 8.106548 : 8.759259 7.410004 6.875806 7.349752 9.659215 TRUE 370 181 368 179
## New motif: GAGG
summaryMotif(results$motif, results$category)
## scores signs fg.hits bg.hits fg.seq bg.seq ratio fg.frac bg.frac
## NNAGGANN 23.05403 TRUE 1590 572 1520 539 2.7797203 0.37512340 0.13302073
## NNGGAGNN 23.93018 TRUE 1384 404 1348 388 3.4257426 0.33267522 0.09575518
## NNTAGANN 12.43437 FALSE 148 474 146 456 0.3122363 0.03603159 0.11253702
## NNTAGTNN 10.09031 FALSE 98 322 97 304 0.3043478 0.02393880 0.07502468
## NNGAGGNN 15.99299 TRUE 1001 414 977 392 2.4178744 0.24111550 0.09674235
motifHtmlTable(results)
## [1] "cp /Library/Frameworks/R.framework/Versions/3.5/Resources/library/motifRG/css/header.html html/motif.html"
refined <- refinePWMMotif(results$motifs[[1]]@match$pattern, initiation_seqs)
## 1 4428.402
## 2 21011.55
## 3 21321.49
## 4 21413.76
## 5 21440.74
## 6 21447.47
## 7 21450.74
## 8 21454
## 9 21456.82
## 10 21459.03
## 11 21461.5
## 12 21463.57
seqLogo(refined$model$prob)
There are many other motif finding programs available, outside of R. An alternative approach would be to construct the foreground and background sequences as above, then write them to FASTA files for use with an external program.
writeXStringSet(initiation_seqs, "fg.fa")
writeXStringSet(background_seqs, "bg.fa")
For example MEME (which doesn’t actually need background sequences):
system("meme -dna -maxsize 1000000 fg.fa")
We’ve seen just the smallest part of what Bioconductor has to offer in this space.
Besides software, Bioconductor includes packages with data for model organisms, for example. The data is generally from these central repositories:
These organizations will generally obtain genome assemblies from the same ultimate sources. For example, all of the above use the Genome Reference Consortium’s GRCh38 DNA sequence for homo sapiens. UCSC likes to call this “hg38” but it is the same DNA sequence. These DNA sequences serve as a common frame of reference. However the three organizations above will differ on their exact set of gene and transcript annotations, and all use a different gene and transcript ID system. These annotations are also revised more often than the underlying DNA sequences.
This mess is partly due to American/European rivalry, and partly due to differing goals. The UCSC genome browser has always been about practicality and showing many lines of evidence. The others are more concerned with careful curation and standardization.
Some example packages:
Biostrings genome, Homo sapiens, from the UCSC browser, version hg38.
DNA for chromosomes, usable in the same way as the DNAStringSet used above.
Transcript database, Homo sapiens, from UCSC browser, genome verison hg38, “knownGene” gene annotations.
GRanges information for genes and transcripts, much as we loaded from a GTF file above.
Organism Homo sapiens, primary key is Entrez Gene, database.
Translation of gene ids from various databases, assignment to GO terms, KEGG pathways, etc. Entrez Gene ids are used as the primary key.
Access to BioMart data, on the internet – translation of gene ids, gene sets, gene information, etc.
AnnotationHub is a way to retrieve data from a more comprehensive set of organisms and data providers than the above styles of package. The retrieved data is returned in an appropriate Bioconductor type. If data is being updated over time (eg improved annotation of a genome), each version receives a unique ID in AnnotationHub, making it much easier to write reproducable analyses.
AnnotationHub also provides access to experimental data which maps to locations on a genome, similar to the sorts of tracks you would load in the UCSC browser.
Files are cached, so they will only be downloaded once.
In the example below, the yeast genome and annotations are retrieved:
library(AnnotationHub)
ah <- AnnotationHub()
# ah contains a large collection of records that can be retrieved
ah
length(ah)
colnames( mcols(ah) )
table( ah$rdataclass )
# query() searches for terms in an unstructured way
records <- query(ah, c("Ensembl", "85", "Saccharomyces cerevisiae"))
records
mcols(records)
mcols(records)[,c("title","rdataclass")]
# Having located records of interest,
# your R script can refer to the specific AH... record,
# so it always uses the same version of the data.
ah[["AH51399"]]
sc_genome <- import( ah[["AH51399"]] )
sc_granges <- ah[["AH51088"]]
# More recent versions of Bioconductor also allow you to
# retrieve TxDb (and similar EnsDb) objects.
query(ah, c("OrgDb", "Saccharomyces cerevisiae"))
sc_orgdb <- ah[["AH49589"]]
# An OrgDb contains information about genes in an organism
columns(sc_orgdb)
head( keys(sc_orgdb, "ORF") )
select(sc_orgdb, "YFL039C", c("GENENAME", "DESCRIPTION"), keytype="ORF")
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] grid stats4 parallel stats graphics grDevices utils datasets
## [9] methods base
##
## other attached packages:
## [1] motifRG_1.24.0 BSgenome.Hsapiens.UCSC.hg19_1.4.0
## [3] seqLogo_1.46.0 BSgenome_1.48.0
## [5] rtracklayer_1.40.6 GenomicRanges_1.32.7
## [7] GenomeInfoDb_1.16.0 Biostrings_2.48.0
## [9] XVector_0.20.0 IRanges_2.14.12
## [11] S4Vectors_0.18.3 BiocGenerics_0.26.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.18 knitr_1.20 magrittr_1.5
## [4] GenomicAlignments_1.16.0 zlibbioc_1.26.0 BiocParallel_1.14.2
## [7] lattice_0.20-35 stringr_1.3.1 tools_3.5.0
## [10] SummarizedExperiment_1.10.1 Biobase_2.40.0 matrixStats_0.54.0
## [13] htmltools_0.3.6 yaml_2.2.0 rprojroot_1.3-2
## [16] digest_0.6.17 Matrix_1.2-14 GenomeInfoDbData_1.1.0
## [19] codetools_0.2-15 bitops_1.0-6 RCurl_1.95-4.11
## [22] evaluate_0.11 rmarkdown_1.10 DelayedArray_0.6.6
## [25] stringi_1.2.4 compiler_3.5.0 Rsamtools_1.32.3
## [28] backports_1.1.2 XML_3.98-1.16