Overview

1. Inputs

The input to polyApipe is one or more indexed BAM files. The most usual input is BAM files produced from the Cell Ranger pipeline, based on 10X scRNA-Seq data. If you have something different, the requirements are:

  • Alignments must be soft-masked so the unaligned 'A's can be found (e.g. from an aligner such as STAR (Dobin et al. 2013)).
  • Alignments must contain corrected cell barcode tags and uncorrected UMI tags.

2. polyApipe.py: get poly(A) sites and counts

Next, use the Python 3 script polyApipe.py to process the BAM files and produce poly(A) sites and UMI counts:

  • Find polyA-containing reads in bam files. These will have a soft-clipped stretch of polyA (or polyT depending on strand) at the 3' end of the read. This is done for every BAM file, and written to _polyA.bam files.
  • Look for sites where there are a large number of poly(A) reads. The poly(A) site is defined 250bp (--region_size) upstream of where the unaligned As begin (i.e. end of transcription). This is done across a pool of all samples' polyA reads in a stranded manner, resulting in a single .gff file defining peaks.
  • Count reads per APA site per cell. Create a counts matrix of total UMIs falling in each APA site (using all reads, not just polyA reads). This step uses featureCounts (Liao, Smyth, and Shi 2014) and UMI-tools (Smith, Heger, and Sudbery 2017) to count cells per feature using the uncorrected UMI tags (and corrected cell barcodes). There will be one file per sample / input bam, saved as *_counts.tab.gz.

3. polyApiper: ready for downstream analysis in R

Next, use the polyApiper R package to perform further filtering of poly(A) sites, associates sites with genes, and put the data into a convenient form for working with in R.

We recommend using our weitrix package to further analyse these results. Code to get you started is given in the example below, and further details can be found in the weitrix package vignettes.

Example analysis - pbmc8k

The 'pbmc8k' dataset contains around 8000 PBMCs (peripheral blood mononuclear cell) from a healthy individual. It is a publicly available dataset provided by 10X genomics: (https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/pbmc8k)

Get input files

First, need the aligned bam files. The output of the 10X Cell Ranger pipeline will work with default parameters.

To obtain the data files for this analysis.

# Bam file and index. NB: large file (68GB).
wget http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_possorted_genome_bam.bam
wget http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_possorted_genome_bam_index.bam.bai
mv pbmc8k_possorted_genome_bam_index.bam.bai pbmc8k_possorted_genome_bam.bam.bai

# Gzipped tar file with the precomputed tsne and clusters information
wget http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_analysis.tar.gz
tar -xzf pbmc8k_analysis.tar.gz

Find and quantify poly(A) sites

This command will produces the input required for polyApiper. It takes some time (hours) to run.

./polyApipe.py  -i pbmc8k_possorted_genome_bam.bam -o pbmc8k -t 8

There is only one sample in this experiment, so specifing the bam file with -i. And -t means that multithreaded tools will use 8 threads.

Once finished, the following files are output:

pbmc8k_polyA.bam: A BAM file of all the polyA-containing reads seen in the input bam.

pbmc8k_polyA.bam.bai: BAM index.

pbmc8k_polyA_peaks.gff: GFF file defining the poly(A) site regions, defined from peaks of poly(A) reads in the input.

The sites are typically 250bp (--region_size) upstream of the polyA site (unless larger regions were merged). The misprime status (depending on --misprime_A_count and --misprime_in) flags sites near A-rich genomic regions. E.g.

1   polyAends   polyAends   16442   16691   .   -   .   peak="1_16442_r"; peakdepth="10"; misprime="False";
1   polyAends   polyAends   184920  185169  .   -   .   peak="1_184920_r"; peakdepth="179"; misprime="False";
1   polyAends   polyAends   632454  632703  .   +   .   peak="1_632703_f"; peakdepth="3468"; misprime="True";

pbmc8k_counts.tab.gz: A gzipped tab-delimited text file of the number of reads aligned per cell in each polyA site peak. All reads are counted, whether or not they include a polyA. Format <peakname> <cell> <number of reads> .

10_44994717_f   AACTGGTAGTACGCCC    1
10_44994717_f   AACTGGTCAGTCCTTC    2
10_44994717_f   AACTTTCCAGTCAGAG    1

Further processing in R

The R part of the pipeline is now used to further filter APA sites, associate them with genes, and provide files ready for downstream analysis.

library(polyApiper)

# - Get appropriate ENSEMBL annotations and DNA sequence from AnnotationHub
# - Classify genomic regions into exon, intron, 3'UTR, extension
polyApiper:::do_ensembl_organism(
    out_path="human_ens100", 
    species="Homo sapiens", 
    version="100")

# Which cell names to use.
tsne <- read.csv("analysis/tsne/2_components/projection.csv")
cells_to_use <- tsne$Barcode

# How cell names are constructed from batch name and cell barcode.
# Should match naming in cells_to_use.
cell_name_func <- function(batch,cell) paste0(batch,cell,"-1")

# - Load output from polyApipe.py
# - Produce HDF5Array SingleCellExperiment objects containing counts
# - Perform further analysis steps
polyApiper:::do_pipeline(
    out_path      ="pbmc8k_banquet", 
    counts_files  =c("pbmc8k_counts.tab.gz"),
    batch_names   =c(""),
    peak_info_file="pbmc8k_polyA_peaks.gff", 
    organism      ="human_ens100",
    cells_to_use  =cells_to_use,
    cell_name_func=cell_name_func
    )

# To load all outputs for further analysis.
banq <- load_banquet("pbmc8k_banquet")

Now the pbmc8k_banquet directory has been set up with APA information ready for downstream analysis. We recommend using our weitrix package. The methods demonstrated in the APA vignette of the weitrix package can be used: after a calibration step weitrix_confects() can be used to find genes with differential APA and weitrix_sd_confects() can be used to find genes with excess APA in an exploratory manner.

library(weitrix)

# Note: If parallel processing fails, turn it off with
BiocParallel::register( BiocParallel::SerialParam() )

# Lazy-load the outputs of the pipeline
banq <- load_banquet("pbmc8k_banquet")

# Shift scores summarizing the APA state in each gene x cell.
weitrix_x(banq$shift) 

# Total UMI counts, which serve as initial "uncalibrated" precision weights.
weitrix_weights(banq$shift) 

# We now adjust the weights, obtaining "calibrated" weights.
# After this step the weights represent 1/residual variance.
cal <- weitrix_calibrate_all(
    banq$shift, 
    design = ~1,
    trend_formula = ~log2(per_read_var)+well_knotted_spline(log2(weight),3))

# Note: For greater sensitivity use clusters and batches in the design.
#       eg specify design = ~cluster

# A table of genes of interest is produced using:
interesting <- weitrix_sd_confects(cal)

interesting$table[1:8, c("confect","symbol","regions")]
  confect symbol                                      regions
1   0.183   IL7R    intron:intron:intron:intron:3'UTR1:3'UTR1
2   0.171  PARP1   3'UTR1:intron:CDS:CDS:3'UTR2:3'UTR2:3'UTR2
3   0.169   CD69           intron:intron:intron:intron:3'UTR1
4   0.168  SP110 CDS:CDS:CDS:CDS:CDS:CDS:3'UTR1:intron:intron
5   0.166  IFI16      intron:CDS:CDS:CDS:intron:intron:3'UTR1
6   0.165  H2AC6                  3'UTR1:intron:3'UTR1:3'UTR1
7   0.163  CALM3                         3'UTR1:3'UTR2:3'UTR2
8   0.159 MALAT1                exon:exon:exon:exon:exon:exon

polyApipe.py details

Processing multiple samples

If a given experiment has multiple samples in separate BAM files, these can be put in a directory and processed together. Just proveide the directory to the -i paramter of polyApipe.py.

./polyApipe.py  -i multi_bam_experiment_dir -o this_experiment_polyA -t 8

When multiple files are processed, the output will be a directory, with individual *_polyA.bam and *counts.tab.gz files for each sample. There will still be a single output_polyA_peaks.gff file, made considering all polyA reads from all samples pooled together. All samples are counted against this same set of regions.

These can then be loaded together into into R with polyApipeR.

Processing large datasets

When processing multiple files at the polyApipe.py step, individual samples are processed one at a time. The -t / --threads option is simply passed through to tools that support multithreading. This is probably sufficent for most experiments.

However, it might not be ideal for very large experiments (e.g. 100+ samples). In those cases, its possible to use snakemake to run parts of the pipeline in parallel on a cluster. Refer to snakemake docs for how to set this up, but there is a template to show the processing logic on github here. This will need to be adjusted to suit your local compute.

# Add snakemake and graphviz into your conda env.
conda activate polyApipe_snakemake  
snakemake --jobs 30 --cores 8 --cluster-config cluster.json --cluster "sbatch -A {cluster.account} -c {cluster.c}  -t {cluster.time}"

Parameters of polyApipe.py

Running polyApipe.py --help shows many options - described below. Most are rarely needed.


Main options : Input and output. Essential. --output should not exist.

Main:
  -i INPUT [INPUT ...], --input INPUT [INPUT ...]
                        A bam file, bam files, or single directory of bam
                        files. (default: None)
  -o OUT_ROOT, --output OUT_ROOT
                        Name root to use for output files. (default: None)

PolyA peak options : For changing how the polyA sites are defined. Increase --depth_threshold for large datasets.

PolyA peak options:
  Set peak-level configs

  --depth_threshold DEPTH_THRESHOLD
                        Need at least this many reads in a peak to call an APA
                        site. (default: 1)
  --region_size REGION_SIZE
                        Size upstream of APA site to consider polyA reads.
                        (default: 250)

PolyA read options : For changing what reads are considered polyA reads. Defaults should usually be sufficient.

PolyA read options:
  Set read-level configs for defining polyA reads

  --minMAPQ MINMAPQ     Minimum MAPQ mapping quality to consider. (default:
                        10)
  --minpolyA MINPOLYA   Number of A to consider polyA. (default: 5)
  --non_A_allowed NONA_ALLOWED
                        Number of non A bases permitted in polyA region (while
                        still having --minpolyA As) (default: 0)
  --misprime_A_count MISPRIME_A_COUNT
                        Number of As seen in the last --misprime_in of aligned
                        reads to label potential mispriming (default: 8)
  --misprime_in MISPRIME_IN
                        Look for --misprime_A_count As in this many
                        nucleotides at the end of a reaad alignment when
                        labelling potential misprime (default: 10)

Bam Tag options : Not needed for typical 10X pipeline output, but if something other than the CB and UR tags are used for describing barcodes, specify here.

Bam tags:
  For specifiing umi, cell e.t.c

  --cell_barcode_tag CORRECTED_CELL_BARCODE_TAG
                        Corrected (exact-match) cell barcode bam tag used in
                        bam input. (default: CB)
  --umi_tag UMI_TAG     Uncorrected UMI / molecular barcode bam tag used in
                        bam input. May contain mismatches. (default: UR)

Running options : The -t/--threads option should usually be used to tell multithreaded tasks how many threads they can use. The rest of these options are just for running via snakemake or debugging.

Running:
  For changing how this script runs. Stop/start on polyA step e.t.c

  -t THREADS, --threads THREADS
                        Num threads for multithreaded steps. (default: 1)
  --no_peaks            Stop after making polyA bams. Do not try to find peaks
                        in polyA files (implies --no_anno --no_count)
                        (default: False)
  --no_count            Stop after making merged polyA peaks gff file. Do not
                        try to count them, or annotate bams with them.
                        (default: False)
  --polyA_bams          Skip polyA filtering step, the bams specified with
                        '-i' are already filtered to polyA-containing reads
                        only. (default: False)
  --peaks_gff PEAKS_GFF
                        If provided, use this gff file of peaks instead of
                        making one from polyA reads. Skips polyA filtering
                        steps, incompatable with --polyA_bams This is a gtf
                        format specifically as output by this script. See
                        example data. (default: None)
  --keep_interim_files  Don't delete the intermediate files (merged polyA,
                        annotated input bams).For debugging or piecemeal runs.
                        (default: False)
                        

References

Dobin, Alexander, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras. 2013. “STAR: Ultrafast universal RNA-seq aligner.” Bioinformatics 29 (1): 15–21. doi:10.1093/bioinformatics/bts635.

Liao, Yang, Gordon K. Smyth, and Wei Shi. 2014. “FeatureCounts: An efficient general purpose program for assigning sequence reads to genomic features.” Bioinformatics 30 (7): 923–30. doi:10.1093/bioinformatics/btt656.

Smith, Tom, Andreas Heger, and Ian Sudbery. 2017. “UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy.” Genome Research 27 (3): 491–99. doi:10.1101/gr.209601.116.