polyApipe is a pipeline for examining Alternative PolyAdenylation (APA) from 10X Genomics single-cell RNA Sequencing. It consists of a Python 3 part, polyApipe.py
which performs peak calling and UMI counting, and and R part, polyApiper
, which analyses the resultant UMI counts.
This poster on polyApipe was presented at Oz Single Cell 2019.
polyApipe requires various tools:
The quickest way to install these is to use conda. After installing conda, the following will create a suitable environment, which can be loaded with conda activate
. Ensure it is python3.
conda create -n polyApipe_env --override-channels -c bioconda -c conda-forge -c anaconda umi_tools=1.0.0-0 pysam samtools subread
conda activate polyApipe_env
The R part of polyApipe requires Bioconductor. In R:
install.packages("BiocManager")
Download polyApipe:
git clone https://github.com/MonashBioinformaticsPlatform/polyApipe
Test that you can run the Python part:
polyApipe/polyApipe.py
Install the R part in R. weitrix development happens in step with polyApipe, so install it from git as well.
BiocManager::install("pfh/weitrix")
BiocManager::install("MonashBioinformaticsPlatform/polyApipe/polyApiper")
Count reads in poly(A) regions with polyApipe.py script on a directory or bam file:
polyApipe/polyApipe.py -i demo/ -o demo
polyApipe/polyApipe.py -i demo/SRR5259422_demo.bam -o SRR5259422_demo
Process the output of polyApipe.py
in R.
library(polyApiper)
# - Get appropriate ENSEMBL annotations and DNA sequence from AnnotationHub
# - Classify genomic regions into exon,intron,3'UTR,extension
do_ensembl_organism(
out_path="mouse_ens100",
species="Mus musculus",
version="100")
# You will need to supply a set of cells to use.
# - Choose these using conventional scRNA-Seq cell filtering methods.
# - Also provide a cell naming function that combines the batch name and barcode
# consistent with your existing cell naming.
cells_to_use <- c("SRR6129051_AAACCTGAGAGGGCTT", "SRR6129051_AAACCTGCATACGCCG", ...)
cell_name_func <- function(batch,barcode) paste0(batch,"_",barcode)
# Run the R part of the pipeline
# - Load output from polyApipe.py
# - Produce HDF5Array SingleCellExperiment objects containing counts
# - Perform further analysis steps
do_pipeline(
out_path="demo_banquet",
counts_file_dir="demo_counts",
peak_info_file="demo_polyA_peaks.gff",
organism="mouse_ens100",
cells_to_use=cells_to_use,
cell_name_func=cell_name_func)
# Load results (individual objects are lazy-loaded when accessed)
organism <- load_banquet("mouse_ens100")
banq <- load_banquet("demo_banquet")
names(banq)
To work out the available ENSEMBL versions, use this R code, substituting in the appropriate species:
library(AnnotationHub)
ah <- AnnotationHub()
subset(ah, rdataclass == "EnsDb" & species == "Homo sapiens")$title
subset(ah, rdataclass == "EnsDb" & species == "Mus musculus")$title
Peak counts can be loaded into a SingleCellExperiment object without any further interpretation using:
sce <- load_peaks_counts_dir_into_sce("demo_counts/", "demo_polyA_peaks.gff", "./demo_sce")
polyApiper uses uses BiocParallel's default parallel processing as its own default, as returned by BiocParallel::bpparam()
.
If polyApiper hangs or produces strange errors, you can try running it with serial processing:
BiocParallel::register( BiocParallel::SerialParam() )