This content is the Seurat PBMC tutorial with an additional sections on SingleR for cell type classification and Harmony for dataset integration.
Please go to the installation page for instructions on how to install the libraries used for this workshop. There are also instructions for downloading the raw data there as well.
We suggest working from the pbmc3k_tutorial.R
script
that you would have downloaded as part of the workshop setup instructions. This script
will have all the code that we will be using for the duration of the
workshop.
For this tutorial, we will be analyzing the a dataset of Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. There are 2,700 single cells that were sequenced on the Illumina NextSeq 500. The raw data can be found here. In this example, the raw data has already been processed through the cellranger pipeline and we are working with the pipeline output. If you’re interested, you can see the preliminary cellranger QC report here.
Cellranger produces three files:
genes.tsv
as it has been produced by an older version of
cellranger. It contains the list of gene annotations the data was
counted against. Now since cellranger can be used with different types
of experiment (e.g ATAC, hashtag oligos, cell surface proteins) - the
word ‘feature’ is used instead as you might have the hashtag oligo names
listed here or peak namesfeatures.tsv
and the
barcodes.tsv
files. The first three rows in the file
contains information about the file type - the type of file it is and
the dimensions of data ie the number of features, the number of cells
and then the total number of rows containing data in the matrix file
(excluding the first rows).The first column contains the row number of the feature in the
features.tsv
file whereas the second column contains the
row number for the cell barcode in the barcodes.tsv
file.
It’s the third column that contains the UMI count.
32709 1 4
32707 1 1
32706 1 10
32704 1 1
32703 1 5
The count data is stored in this sparse format with the column and row information stored in separate files and only the non-zero counts kept. This representation of the data is an efficient way to store the data and most single cell analysis packages will have a way to read such data in and represent it as a matrix.
We start by reading in the data. The Read10X()
function
reads in the output of the cellranger
pipeline from 10X, returning a unique molecular identified (UMI) count
matrix. The values in this matrix represent the number of molecules for
each feature (i.e. gene; row) that are detected in each cell
(column).
We next use the count matrix to create a Seurat
object.
The object serves as a container that contains both data (like the count
matrix) and analysis (like PCA, or clustering results) for a single-cell
dataset. For a technical discussion of the Seurat
object
structure, check out the GitHub Wiki.
For example, the count matrix is stored in
pbmc@assays$RNA@counts
.
library(dplyr)
library(ggplot2)
library(Seurat)
library(patchwork)
library(clustree)
library(RColorBrewer)
# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/") # This creates a sparse matrix
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc
## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
# Lets examine a few genes in the first thirty cells
pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
## 3 x 30 sparse Matrix of class "dgCMatrix"
##
## CD3D 4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2 3 . . . . . 3 4 1 5
## TCL1A . . . . . . . . 1 . . . . . . . . . . . . 1 . . . . . . . .
## MS4A1 . 6 . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . .
The .
values in the matrix represent 0s (no molecules
detected). Since most values in an scRNA-seq matrix are 0, Seurat uses a
sparse-matrix representation whenever possible. This results in
significant memory and speed savings for Drop-seq/inDrop/10x data.
dense.size <- object.size(as.matrix(pbmc.data))
dense.size
## 709591472 bytes
sparse.size <- object.size(pbmc.data)
sparse.size
## 29905192 bytes
dense.size/sparse.size
## 23.7 bytes
Lets take a look at the seurat object we have just created in R,
pbmc
.
To accomodate the complexity of data arising from a single cell RNA seq experiment, the seurat object keeps this as a container of multiple data tables that are linked.
The functions in Seurat can access parts of the data object for analysis and visualisation, we will cover this later on.
There are a couple of concepts to discuss here.These are essentially data containers in R as a class, and can accessed as a variable in the R environment.
Classes are pre-defined and can contain multiple data tables and metadata. For Seurat, there are several types.
Slots are parts within a class that contain specific data. These can be lists, data tables and vectors and can be accessed with conventional R methods.
Many of the functions in Seurat operate on the data class and slots
within them seamlessly. There maybe occasion to access these separately
to hack
them, however this is an advanced analysis
method.
Examples of accessing a Seurat object
The assays
slot in pbmc
can be accessed
with GetAssay(pbmc)
or pbmc@assays
.
The RNA
assay can be accessed from this with
GetAssay(pbmc, assay = "RNA)
or
pbmc@assays$RNA
.
We can also use the GetAssayData
function and specify
which slot we’d like to extract data.
GetAssayData(object = pbmc, slot = "data")[1:5, 1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1
## AL627309.1 . . .
## AP006222.2 . . .
## RP11-206L10.2 . . .
## RP11-206L10.9 . . .
## LINC00115 . . .
## AAACCGTGCTTCCG-1 AAACCGTGTATGCG-1
## AL627309.1 . .
## AP006222.2 . .
## RP11-206L10.2 . .
## RP11-206L10.9 . .
## LINC00115 . .
We often want to access assays, so Seurat gives us a shortcut
pbmc$RNA
. You may sometimes see an alternative notation
pbmc[["RNA"]]
.
In general, slots that are always in an object are accessed with
@
and things that may be different in different data sets
are accessed with $
. However, it generally is safer to
access data using the provided functions - if the way the Seurat object
is structured changes in a future update, the functions should remain
useable whereas code that directly references the Seurat structure with
@
and $
may no longer run.
Have a go
Use str
to look at the structure of the Seurat object
pbmc
.
What is in the meta.data
slot within your Seurat object
currently? What type of data is contained here?
Where is our count data within the Seurat object?
The PBMC dataset is a gene-expression dataset and is stored in an
assay called RNA
. What other types of assays could we have
stored in a Seurat object if we had a different type of dataset?
The steps below encompass the standard pre-processing workflow for scRNA-seq data in Seurat. These represent the selection and filtration of cells based on QC metrics, data normalization and scaling, and the detection of highly variable features.
Low quality cells can add noise to your results leading you to the wrong biological conclusions. Using only good quality cells helps you to avoid this. Reduce noise in the data by filtering out low quality cells such as dying or stressed cells (high mitochondrial expression) and cells with few features that can reflect empty droplets.
Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. A few QC metrics commonly used by the community include
PercentageFeatureSet()
function, which calculates the
percentage of counts originating from a set of featuresMT-
as a set
of mitochondrial genes# The $ operator can add columns to object metadata. This is a great place to stash QC stats
pbmc$percent.mt <- PercentageFeatureSet(pbmc, pattern = "^MT-")
Where are QC metrics stored in Seurat?
CreateSeuratObject()
What do you notice has changed within the meta.data
table now that we have calculated mitochondrial gene proportion?
Could we add more data into the meta.data
table?
In the example below, we visualize QC metrics, and use these to filter cells.
# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
# FeatureScatter is typically used to visualize feature-feature relationships, but can be used
# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
Lets look at the number of features (genes) to the percent mitochondrial genes plot.
plot3 <- FeatureScatter(pbmc, feature1 = "nFeature_RNA", feature2 = "percent.mt")
plot3
Ribosomal gene expression could be another factor to look into your cells within your experiment.
Create more columns of metadata using
PercentageFeatureSet
function, this time search for
ribosomal genes. We can calculate the percentage for the large subunit
(RPL) and small subunit (RPS) ribosomal genes.
Use FeatureScatter
to plot combinations of metrics
available in metadata. How is the mitochondrial gene percentage related
to the ribosomal gene percentage? What can you see? Discuss in break
out.
Create new meta.data columns to contain percentages of the large and small ribosomal genes.
Then plot a scatter plot with this new data. You should find that the large and small ribosomal subunit genes are correlated within cell.
What about with mitochondria and gene, feature counts?
These are the cells you may want to exclude.FeatureScatter
make a plot of the RNA count and mitochondrial percentage with the cells
with very low ribosomal gene perentage.
Okay we are happy with our thresholds for mitochondrial percentage in cells, lets apply them and subset our data. This will remove the cells we think are of poor quality.
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & percent.mt < 5)
Lets replot the feature scatters and see what they look like.
plot5 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot6 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot5 + plot6
The sequencing depth can be different per cell. This can bias the counts of expression showing higher numbers for more sequenced cells leading to the wrong biological conclusions. To correct this the feature counts are normalized.
After removing unwanted cells from the dataset, the next step is to
normalize the data. By default, we employ a global-scaling normalization
method “LogNormalize” that normalizes the feature expression
measurements for each cell by the total expression, multiplies this by a
scale factor (10,000 by default), and log-transforms the result.
Normalized values are stored in pbmc$RNA@data
.
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
For clarity, in this previous line of code (and in future commands), we provide the default values for certain parameters in the function call. However, this isn’t required and the same behavior can be achieved with:
pbmc <- NormalizeData(pbmc)
Identifying the most variable features allows retaining the real biological variability of the data and reduce noise in the data.
We next calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others). The Seurat developers and others have found that focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.
The procedure in Seurat is described in detail here, and improves
on previous versions by directly modeling the mean-variance relationship
inherent in single-cell data, and is implemented in the
FindVariableFeatures()
function. By default, we return
2,000 features per dataset. These will be used in downstream analysis,
like PCA.
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
What if we wanted to look at genes we are specifically interested in? We can create a character vector of gene names and apply that to this plot.
Make a plot with labels for the genes IL8, IDH2 and CXCL3.
Next, we apply a linear transformation (‘scaling’) that is a standard
pre-processing step prior to dimensional reduction techniques like PCA.
The ScaleData()
function:
pbmc$RNA@scale.data
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
Scaling is an essential step in the Seurat workflow, but only on
genes that will be used as input to PCA. Therefore, the default in
ScaleData()
is only to perform scaling on the previously
identified variable features (2,000 by default). To do this, omit the
features
argument in the previous function call, i.e.
# pbmc <- ScaleData(pbmc)
Your PCA and clustering results will be unaffected. However, Seurat
heatmaps (produced as shown below with DoHeatmap()
) require
genes in the heatmap to be scaled, to make sure highly-expressed genes
don’t dominate the heatmap. To make sure we don’t leave any genes out of
the heatmap later, we are scaling all genes in this tutorial.
In Seurat v2
we can also use the
ScaleData()
function to remove unwanted sources of
variation from a single-cell dataset. For example, we could ‘regress
out’ heterogeneity associated with (for example) cell cycle stage, or
mitochondrial contamination. These features are still supported in
ScaleData()
in Seurat v3
, i.e.:
# pbmc <- ScaleData(pbmc, vars.to.regress = 'percent.mt')
However, particularly for advanced users who would like to use this
functionality, the Seurat developers recommend their new normalization
workflow, SCTransform()
. The method is described in their
paper,
with a separate vignette using Seurat v3 here. As with
ScaleData()
, the function SCTransform()
also
includes a vars.to.regress
parameter.
Imagine each gene represents a dimension - or an axis on a plot. We could plot the expression of two genes with a simple scatterplot. But a genome has thousands of genes - how do you collate all the information from each of those genes in a way that allows you to visualise it in a 2 dimensional image. This is where dimensionality reduction comes in, we calculate meta-features that contains combinations of the variation of different genes. From thousands of genes, we end up with 10s of meta-features
Next we perform PCA on the scaled data. By default, only the
previously determined variable features are used as input, but can be
defined using features
argument if you wish to choose a
different subset.
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
Seurat provides several useful ways of visualizing both cells and
features that define the PCA, including VizDimReduction()
,
DimPlot()
, and DimHeatmap()
# Examine and visualize PCA results a few different ways
print(pbmc$pca, dims = 1:5, nfeatures = 5)
## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL
## Negative: MALAT1, LTB, IL32, IL7R, CD2
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DRA
## Negative: NKG7, PRF1, CST7, GZMA, GZMB
## PC_ 3
## Positive: HLA-DQA1, CD79A, HLA-DQB1, CD79B, HLA-DPB1
## Negative: PPBP, PF4, SDPR, SPARC, GNG11
## PC_ 4
## Positive: VIM, IL7R, S100A6, S100A8, S100A4
## Negative: HLA-DQA1, CD79A, CD79B, HLA-DQB1, MS4A1
## PC_ 5
## Positive: LTB, VIM, IL7R, AQP3, PPA1
## Negative: GZMB, FGFBP2, NKG7, GNLY, CST7
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
DimPlot(pbmc, reduction = "pca")
In particular DimHeatmap()
allows for easy exploration
of the primary sources of heterogeneity in a dataset, and can be useful
when trying to decide which PCs to include for further downstream
analyses. Both cells and features are ordered according to their PCA
scores. Setting cells
to a number plots the ‘extreme’ cells
on both ends of the spectrum, which dramatically speeds plotting for
large datasets. Though clearly a supervised analysis, we find this to be
a valuable tool for exploring correlated feature sets.
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
To overcome the extensive technical noise in any single feature for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a ‘metafeature’ that combines information across a correlated feature set. The top principal components therefore represent a robust compression of the dataset. However, how many components should we choose to include? 10? 20? 100?
One heuristic method that can be used is an ‘Elbow plot’: a ranking
of principle components based on the percentage of variance explained by
each one (ElbowPlot()
function). In this example, we can
observe an ‘elbow’ around PC9-10, suggesting that the majority of true
signal is captured in the first 10 PCs.
ElbowPlot(pbmc)
Identifying the true dimensionality of a dataset – can be challenging/uncertain for the user. The Seurat developers suggest these three approaches to consider. The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example. The second implements a statistical test based on a random null model, but is time-consuming for large datasets, and may not return a clear PC cutoff. The third is a heuristic that is commonly used, and can be calculated instantly. In this example, all three approaches yielded similar results, but we might have been justified in choosing anything between PC 7-12 as a cutoff.
We use 10 here, but encourage users to consider the following:
Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore these datasets. The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots. As input to the UMAP and tSNE, we suggest using the same PCs as input to the clustering analysis.
# If you haven't installed UMAP, you can do so via reticulate::py_install(packages =
# 'umap-learn')
pbmc <- RunUMAP(pbmc, dims = 1:10)
# note that you can set `label = TRUE` or use the LabelClusters function to help label
# individual clusters
DimPlot(pbmc, reduction = "umap")
Run FindNeighbours
and FindClusters
again,
with a different number of dimensions or with a different resolution.
Examine the resulting clusters using DimPlot
.
To maintain the flow of this tutorial, please put the output of this
exploration in a different variable, such as pbmc2
!
Clustering the cells will allow you to visualise the variability of your data, can help to segregate cells into cell types.
Seurat v3 applies a graph-based clustering approach, building upon initial strategies in (Macosko et al). Importantly, the distance metric which drives the clustering analysis (based on previously identified PCs) remains the same. However, the Seurat approach to partitioning the cellular distance matrix into clusters has dramatically improved. The Seurat approach was heavily inspired by recent manuscripts which applied graph-based clustering approaches to scRNA-seq data [SNN-Cliq, Xu and Su, Bioinformatics, 2015] and CyTOF data [PhenoGraph, Levine et al., Cell, 2015]. Briefly, these methods embed cells in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’.
As in PhenoGraph, Seurat first constructs a KNN graph based on the
euclidean distance in PCA space, and then refines the edge weights
between any two cells based on the shared overlap in their local
neighborhoods (Jaccard similarity). This step is performed using the
FindNeighbors()
function, and takes as input the previously
defined dimensionality of the dataset (first 10 PCs).
To cluster the cells, Seurat next applies modularity optimization
techniques such as the Louvain algorithm (default) or SLM [SLM, Blondel
et al., Journal of Statistical Mechanics], to iteratively
group cells together, with the goal of optimizing the standard
modularity function. The FindClusters()
function implements
this procedure, and contains a resolution parameter that sets the
‘granularity’ of the downstream clustering, with increased values
leading to a greater number of clusters. We find that setting this
parameter between 0.4-1.2 typically returns good results for single-cell
datasets of around 3K cells. Optimal resolution often increases for
larger datasets. The clusters can be found using the
Idents()
function.
pbmc <- FindNeighbors(pbmc, dims = 1:10)
Try different resolutions when clustering to identify the varibility of your data. The function FindClusters is used to cluster the data
resolution = 2
pbmc <- FindClusters(object = pbmc, reduction.type = "umap", resolution = seq(0.1, resolution, 0.1),
dims.use = 1:10, save.SNN = TRUE)
# the different clustering created
names(pbmc@meta.data)
## [1] "orig.ident" "nCount_RNA" "nFeature_RNA" "percent.mt"
## [5] "RNA_snn_res.0.1" "RNA_snn_res.0.2" "RNA_snn_res.0.3" "RNA_snn_res.0.4"
## [9] "RNA_snn_res.0.5" "RNA_snn_res.0.6" "RNA_snn_res.0.7" "RNA_snn_res.0.8"
## [13] "RNA_snn_res.0.9" "RNA_snn_res.1" "RNA_snn_res.1.1" "RNA_snn_res.1.2"
## [17] "RNA_snn_res.1.3" "RNA_snn_res.1.4" "RNA_snn_res.1.5" "RNA_snn_res.1.6"
## [21] "RNA_snn_res.1.7" "RNA_snn_res.1.8" "RNA_snn_res.1.9" "RNA_snn_res.2"
## [25] "seurat_clusters"
# Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5)
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## 9 10 0 7
## AAACCGTGTATGCG-1
## 8
## Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Plot a clustree to decide how many clusters you have and what resolution capture them.
clustree(pbmc, prefix = "RNA_snn_res.") + theme(legend.key.size = unit(0.05, "cm"))
Name cells with the corresponding cluster name at the resolution you pick. This case we are happy with 0.5.
# The name of the cluster is prefixed with 'RNA_snn_res' and the number of the resolution
Idents(pbmc) <- pbmc$RNA_snn_res.0.5
Plot the UMAP with colored clusters with Dimplot
DimPlot(pbmc, label = TRUE, repel = TRUE, label.box = TRUE) + NoLegend()
You can save the object at this point so that it can easily be loaded back in without having to rerun the computationally intensive steps performed above, or easily shared with collaborators.
saveRDS(pbmc, file = "pbmc_tutorial.rds")
Single cell data helps to segragate cell types. Use markers to identify cell types. warning: In this example the cell types/markers are well known and making this step easy, but in reality this step needs the experts curation.
Seurat can help you find markers that define clusters via
differential expression. By default, it identifies positive and negative
markers of a single cluster (specified in ident.1
),
compared to all other cells. FindAllMarkers()
automates
this process for all clusters, but you can also test groups of clusters
vs. each other, or against all cells.
The min.pct
argument requires a feature to be detected
at a minimum percentage in either of the two groups of cells, and the
thresh.test argument requires a feature to be differentially expressed
(on average) by some amount between the two groups. You can set both of
these to 0, but with a dramatic increase in time - since this will test
a large number of features that are unlikely to be highly
discriminatory. As another option to speed up these computations,
max.cells.per.ident
can be set. This will downsample each
identity class to have no more cells than whatever this is set to. While
there is generally going to be a loss in power, the speed increases can
be significant and the most highly differentially expressed features
will likely still rise to the top.
# find all markers of cluster 2
cluster2.markers <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25)
head(cluster2.markers, n = 5)
p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | |
---|---|---|---|---|---|
S100A9 | 0 | 5.522104 | 0.996 | 0.217 | 0 |
S100A8 | 0 | 5.436306 | 0.971 | 0.123 | 0 |
FCN1 | 0 | 3.414967 | 0.954 | 0.151 | 0 |
LGALS2 | 0 | 3.803685 | 0.906 | 0.060 | 0 |
CD14 | 0 | 2.803099 | 0.664 | 0.029 | 0 |
# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | |
---|---|---|---|---|---|
FCGR3A | 0 | 4.285724 | 0.975 | 0.039 | 0 |
IFITM3 | 0 | 3.874732 | 0.975 | 0.048 | 0 |
CFD | 0 | 3.432638 | 0.938 | 0.037 | 0 |
RP11-290F20.3 | 0 | 2.729214 | 0.850 | 0.017 | 0 |
CD68 | 0 | 3.002809 | 0.912 | 0.037 | 0 |
# find markers for every cluster compared to all remaining cells, report only the positive
# ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>%
group_by(cluster) %>%
slice_max(n = 2, order_by = avg_log2FC)
p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | cluster | gene |
---|---|---|---|---|---|---|
0 | 1.424944 | 0.472 | 0.110 | 0 | 0 | CCR7 |
0 | 1.179221 | 0.266 | 0.084 | 0 | 0 | LDLRAP1 |
0 | 1.312595 | 0.982 | 0.639 | 0 | 1 | LTB |
0 | 1.249746 | 0.421 | 0.109 | 0 | 1 | AQP3 |
0 | 5.522104 | 0.996 | 0.217 | 0 | 2 | S100A9 |
0 | 5.436306 | 0.971 | 0.123 | 0 | 2 | S100A8 |
0 | 4.293263 | 0.936 | 0.043 | 0 | 3 | CD79A |
0 | 3.591918 | 0.622 | 0.022 | 0 | 3 | TCL1A |
0 | 3.227624 | 0.592 | 0.045 | 0 | 4 | GZMK |
0 | 3.103339 | 0.955 | 0.226 | 0 | 4 | CCL5 |
0 | 3.328145 | 0.975 | 0.136 | 0 | 5 | FCGR3A |
0 | 3.003894 | 1.000 | 0.315 | 0 | 5 | LST1 |
0 | 4.822027 | 0.966 | 0.071 | 0 | 6 | GZMB |
0 | 4.814796 | 0.933 | 0.136 | 0 | 6 | GNLY |
0 | 3.885927 | 0.829 | 0.010 | 0 | 7 | FCER1A |
0 | 2.950422 | 0.971 | 0.209 | 0 | 7 | HLA-DQA1 |
0 | 8.575788 | 1.000 | 0.024 | 0 | 8 | PPBP |
0 | 7.243522 | 1.000 | 0.010 | 0 | 8 | PF4 |
Seurat has several tests for differential expression which can be set
with the test.use parameter (see the Seurat differential
expression vignette for details). For example, the ROC test returns
the ‘classification power’ abs(AUC-0.5)*2
for any
individual marker, ranging from 0 = random to 1 = perfect.
cluster0.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
Seurat includes several tools for visualizing marker expression.
VlnPlot()
(shows expression probability distributions
across clusters), and FeaturePlot()
(visualizes feature
expression on a tSNE or PCA plot) are some of the most commonly used
visualizations. Additional ways to explore your dataset include
RidgePlot()
, CellScatter()
, and
DotPlot()
.
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
# you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
"CD8A"))
These are ridgeplots, cell scatter plots and dotplots. Replace
FeaturePlot
with the other functions.
RidgePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
"CD8A"))
For CellScatter plots, will need the cell id of the cells you want to
look at. You can get this from the cell metadata
(
pbmc@meta.data
).
head(pbmc@meta.data)
orig.ident | nCount_RNA | nFeature_RNA | percent.mt | RNA_snn_res.0.1 | RNA_snn_res.0.2 | RNA_snn_res.0.3 | RNA_snn_res.0.4 | RNA_snn_res.0.5 | RNA_snn_res.0.6 | RNA_snn_res.0.7 | RNA_snn_res.0.8 | RNA_snn_res.0.9 | RNA_snn_res.1 | RNA_snn_res.1.1 | RNA_snn_res.1.2 | RNA_snn_res.1.3 | RNA_snn_res.1.4 | RNA_snn_res.1.5 | RNA_snn_res.1.6 | RNA_snn_res.1.7 | RNA_snn_res.1.8 | RNA_snn_res.1.9 | RNA_snn_res.2 | seurat_clusters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AAACATACAACCAC-1 | pbmc3k | 2419 | 779 | 3.0177759 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 5 | 7 | 7 | 7 | 7 | 7 | 7 | 10 | 10 | 7 | 9 | 9 |
AAACATTGAGCTAC-1 | pbmc3k | 4903 | 1352 | 3.7935958 | 3 | 3 | 2 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 10 | 10 |
AAACATTGATCAGC-1 | pbmc3k | 3147 | 1129 | 0.8897363 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
AAACCGTGCTTCCG-1 | pbmc3k | 2639 | 960 | 1.7430845 | 1 | 4 | 4 | 5 | 5 | 7 | 6 | 7 | 7 | 7 | 8 | 8 | 8 | 8 | 8 | 8 | 7 | 7 | 8 | 7 | 7 |
AAACCGTGTATGCG-1 | pbmc3k | 980 | 521 | 1.2244898 | 2 | 2 | 5 | 6 | 6 | 8 | 7 | 8 | 8 | 8 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 8 | 8 |
AAACGCACTGGTAC-1 | pbmc3k | 2163 | 781 | 1.6643551 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
CellScatter(pbmc, cell1 = "AAACATACAACCAC-1", cell2 = "AAACATTGAGCTAC-1")
DotPlots
DotPlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
Which plots do you prefer? Discuss.
DoHeatmap()
generates an expression heatmap for given
cells and features. In this case, we are plotting the top 20 markers (or
all markers if less than 20) for each cluster.
top10 <- pbmc.markers %>%
group_by(cluster) %>%
top_n(n = 10, wt = avg_log2FC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend() + theme(axis.text.x = element_text(size = 5),
text = element_text(size = 5), axis.text.y = element_text(size = 5))
If you know markers for your cell types, use AddModuleScore to label them.
genes_markers <- list(Naive_CD4_T = c("IL7R", "CCR7"))
pbmc <- AddModuleScore(object = pbmc, features = genes_markers, ctrl = 5, name = "Naive_CD4_T",
search = TRUE)
# color scale for better visualization
plotCol = rev(brewer.pal(n = 7, name = "RdYlBu"))
# notice the name of the cluster has a 1 at the end
names(pbmc@meta.data)
## [1] "orig.ident" "nCount_RNA" "nFeature_RNA" "percent.mt"
## [5] "RNA_snn_res.0.1" "RNA_snn_res.0.2" "RNA_snn_res.0.3" "RNA_snn_res.0.4"
## [9] "RNA_snn_res.0.5" "RNA_snn_res.0.6" "RNA_snn_res.0.7" "RNA_snn_res.0.8"
## [13] "RNA_snn_res.0.9" "RNA_snn_res.1" "RNA_snn_res.1.1" "RNA_snn_res.1.2"
## [17] "RNA_snn_res.1.3" "RNA_snn_res.1.4" "RNA_snn_res.1.5" "RNA_snn_res.1.6"
## [21] "RNA_snn_res.1.7" "RNA_snn_res.1.8" "RNA_snn_res.1.9" "RNA_snn_res.2"
## [25] "seurat_clusters" "Naive_CD4_T1"
# label that cell type
pbmc$cell_label = NA
pbmc$cell_label[pbmc$Naive_CD4_T1 > 1] = "Naive_CD4_T"
Idents(pbmc) = pbmc$cell_label
# plot
FeaturePlot(pbmc, features = "Naive_CD4_T1", label = TRUE, repel = TRUE, ) + scale_color_gradientn(colors = plotCol)
Fortunately in the case of this dataset, we can use canonical markers to easily match the unbiased clustering to known cell types:
Cluster ID | Markers | Cell Type |
---|---|---|
0 | IL7R, CCR7 | Naive CD4+ T |
1 | CD14, LYZ | CD14+ Mono |
2 | IL7R, S100A4 | Memory CD4+ |
3 | MS4A1 | B |
4 | CD8A | CD8+ T |
5 | FCGR3A, MS4A7 | FCGR3A+ Mono |
6 | GNLY, NKG7 | NK |
7 | FCER1A, CST3 | DC |
8 | PPBP | Platelet |
Idents(pbmc) <- pbmc$RNA_snn_res.0.5
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono",
"NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
saveRDS(pbmc, file = "pbmc3k_final.rds")
# install.packages('BiocManager')
# BiocManager::install(c('SingleCellExperiment','SingleR','celldex'),ask=F)
library(SingleCellExperiment)
library(SingleR)
library(celldex)
In this workshop we have focused on the Seurat package. However, there is another whole ecosystem of R packages for single cell analysis within Bioconductor. We won’t go into any detail on these packages in this workshop, but there is good material describing the object type online : OSCA.
For now, we’ll just convert our Seurat object into an object called SingleCellExperiment. Some popular packages from Bioconductor that work with this type are Slingshot, Scran, Scater.
sce <- as.SingleCellExperiment(pbmc)
We will now use a package called SingleR to label each cell. SingleR
uses a reference data set of cell types with expression data to infer
the best label for each cell. A convenient collection of cell type
reference is in the celldex
package which currently
contains the follow sets:
ls("package:celldex")
## [1] "BlueprintEncodeData" "DatabaseImmuneCellExpressionData"
## [3] "HumanPrimaryCellAtlasData" "ImmGenData"
## [5] "MonacoImmuneData" "MouseRNAseqData"
## [7] "NovershternHematopoieticData"
In this example, we’ll use the HumanPrimaryCellAtlasData
set, which contains high-level, and fine-grained label types. Lets
download the reference dataset
ref.set <- celldex::HumanPrimaryCellAtlasData()
head(unique(ref.set$label.main))
## [1] "DC" "Smooth_muscle_cells" "Epithelial_cells"
## [4] "B_cell" "Neutrophils" "T_cells"
An example of the types of “fine” labels.
head(unique(ref.set$label.fine))
## [1] "DC:monocyte-derived:immature" "DC:monocyte-derived:Galectin-1"
## [3] "DC:monocyte-derived:LPS" "DC:monocyte-derived"
## [5] "Smooth_muscle_cells:bronchial:vit_D" "Smooth_muscle_cells:bronchial"
Now we’ll label our cells using the SingleCellExperiment object, with the above reference set.
pred.cnts <- SingleR::SingleR(test = sce, ref = ref.set, labels = ref.set$label.main)
Keep any types that have more than 10 cells to the label, and put those labels back on our Seurat object and plot our on our umap.
lbls.keep <- table(pred.cnts$labels) > 10
pbmc$SingleR.labels <- ifelse(lbls.keep[pred.cnts$labels], pred.cnts$labels, "Other")
DimPlot(pbmc, reduction = "umap", group.by = "SingleR.labels")
It is nice to see that SingleR does not use the clusters we computed earlier, but the labels do seem to match those clusters reasonably well.
See if you can annotate the data with the fine labels from the Monoco reference dataset and whether it improves the cell type annotation resolution. Do you lose any groups?
Remember you can view the list of references with
ls('package:celldex')
You can have data coming from different samples, batches or experiments and you will need to combine them.
When data is collected from multiple samples, multiple runs of the single cell sequencing library preparation, or multiple conditions, cells of the same type may become separated in the UMAP and be put into several different clusters.
For the purpose of clustering and cell identification, we would like to remove such effects.
We will now look at GSE96583, another PBMC dataset. For speed, we will be looking at a subset of 5000 cells from this data. The cells in this dataset were pooled from eight individual donors. A nice feature is that genetic differences allow some of the cell doublets to be identified. This data contains two batches of single cell sequencing. One of the batches was stimulated with IFN-beta.
The data has already been processed as we have done with the first
PBMC dataset, and can be loaded from kang2018.rds
.
kang <- readRDS("kang2018.rds")
head(kang@meta.data)
orig.ident | nCount_RNA | nFeature_RNA | ind | stim | cell | multiplets | |
---|---|---|---|---|---|---|---|
AGGGCGCTATTTCC-1 | SeuratProject | 2020 | 523 | 1256 | stim | CD14+ Monocytes | singlet |
GGAGACGATTCGTT-1 | SeuratProject | 840 | 381 | 1256 | stim | CD4 T cells | singlet |
CACCGTTGTCGTAG-1 | SeuratProject | 3097 | 995 | 1016 | ctrl | FCGR3A+ Monocytes | singlet |
TATCGTACACGCAT-1 | SeuratProject | 1011 | 540 | 1488 | stim | B cells | singlet |
TGACGCCTTGCTTT-1 | SeuratProject | 570 | 367 | 101 | ctrl | CD4 T cells | ambs |
TACGAGACCTATTC-1 | SeuratProject | 2399 | 770 | 1244 | stim | CD4 T cells | singlet |
ind
identifies a cell as coming from one of 8
individuals.stim
identifies a cell as control or stimulated with
IFN-beta.cell
contains the cell types identified by the creators
of this data set.multiplets
classifies cells as singlet or doublet.DimPlot(kang, reduction = "umap", group.by = "ind")
DimPlot(kang, reduction = "umap", group.by = "stim")
DimPlot(kang, reduction = "pca", group.by = "stim")
kang <- FindNeighbors(kang, reduction = "pca", dims = 1:10)
kang <- FindClusters(kang, resolution = 0.25)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 5000
## Number of edges: 175130
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9501
## Number of communities: 12
## Elapsed time: 0 seconds
kang$pca_clusters <- kang$seurat_clusters
DimPlot(kang, reduction = "umap", group.by = "pca_clusters")
There is a big difference between unstimulated and stimulated cells.
This has split cells of the same type into pairs of clusters. If the
difference was simply uniform, we could regress it out (e.g. using
ScaleData(..., vars.to.regress="stim")
). However, as can be
seen in the PCA plot, the difference is not uniform and we need to do
something cleverer.
We will use Harmony, which can remove non-uniform effects. We will try to remove both the small differences between individuals and the large difference between the unstimulated and stimulated cells.
Harmony operates only on the PCA scores. The original gene expression levels remain unaltered.
library(harmony)
kang <- RunHarmony(kang, c("stim", "ind"), reduction = "pca")
This has added a new set of reduced dimensions to the Seurat object,
kang$harmony
which is a modified version of the existing
kang$pca
reduced dimensions.
DimPlot(kang, reduction = "harmony", group.by = "stim")
We can use harmony
the same way we used the
pca
reduction to compute a UMAP layout or to find
clusters.
kang <- RunUMAP(kang, reduction = "harmony", dims = 1:10, reduction.name = "umap_harmony")
DimPlot(kang, reduction = "umap_harmony", group.by = "stim")
kang <- FindNeighbors(kang, reduction = "harmony", dims = 1:10)
kang <- FindClusters(kang, resolution = 0.25)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 5000
## Number of edges: 171131
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9332
## Number of communities: 9
## Elapsed time: 0 seconds
kang$harmony_clusters <- kang$seurat_clusters
DimPlot(kang, reduction = "umap_harmony", group.by = "harmony_clusters")
DimPlot(kang, reduction = "umap", group.by = "harmony_clusters")
Having found a good set of clusters, we would usually perform differential expression analysis on the original data and include batches/runs/individuals as predictors in the linear model. In this example we could now compare un-stimulated and stimulated cells within each cluster. A particularly nice statistical approach that is possible here would be to convert the counts to pseudo-bulk data for the eight individuals, and then apply a bulk RNA-Seq differential expression analysis method. However there is still the problem that unstimulated and stimulated cells were processed in separate batches.
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /persistent/software/apps/R/4.2.1/lib/R/lib/libRblas.so
## LAPACK: /persistent/software/apps/R/4.2.1/lib/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
## [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
## [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] harmony_0.1.0 Rcpp_1.0.8.3
## [3] celldex_1.6.0 SingleR_1.10.0
## [5] SingleCellExperiment_1.18.1 SummarizedExperiment_1.26.1
## [7] Biobase_2.56.0 GenomicRanges_1.48.0
## [9] GenomeInfoDb_1.32.2 IRanges_2.30.0
## [11] S4Vectors_0.34.0 BiocGenerics_0.42.0
## [13] MatrixGenerics_1.8.0 matrixStats_0.62.0
## [15] RColorBrewer_1.1-3 clustree_0.5.0
## [17] ggraph_2.1.0 patchwork_1.1.2
## [19] SeuratObject_4.1.3 Seurat_4.2.1
## [21] ggplot2_3.3.6 dplyr_1.0.10
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.2 spatstat.explore_3.0-5
## [3] reticulate_1.26 tidyselect_1.1.2
## [5] RSQLite_2.2.14 AnnotationDbi_1.58.0
## [7] htmlwidgets_1.5.4 grid_4.2.1
## [9] BiocParallel_1.30.3 Rtsne_0.16
## [11] munsell_0.5.0 ScaledMatrix_1.4.1
## [13] codetools_0.2-18 ica_1.0-3
## [15] future_1.27.0 miniUI_0.1.1.1
## [17] withr_2.5.0 spatstat.random_3.0-1
## [19] colorspace_2.0-3 progressr_0.10.1
## [21] filelock_1.0.2 highr_0.9
## [23] knitr_1.39 rstudioapi_0.13
## [25] ROCR_1.0-11 tensor_1.5
## [27] listenv_0.8.0 labeling_0.4.2
## [29] GenomeInfoDbData_1.2.8 polyclip_1.10-0
## [31] bit64_4.0.5 farver_2.1.0
## [33] parallelly_1.32.1 vctrs_0.4.1
## [35] generics_0.1.2 xfun_0.31
## [37] BiocFileCache_2.4.0 R6_2.5.1
## [39] graphlayouts_0.8.4 rsvd_1.0.5
## [41] bitops_1.0-7 spatstat.utils_3.0-1
## [43] cachem_1.0.6 DelayedArray_0.22.0
## [45] assertthat_0.2.1 promises_1.2.0.1
## [47] scales_1.2.0 gtable_0.3.0
## [49] beachmat_2.12.0 globals_0.16.1
## [51] goftest_1.2-3 tidygraph_1.2.2
## [53] rlang_1.0.5 splines_4.2.1
## [55] lazyeval_0.2.2 spatstat.geom_3.0-3
## [57] checkmate_2.1.0 BiocManager_1.30.19
## [59] yaml_2.3.5 reshape2_1.4.4
## [61] abind_1.4-5 backports_1.4.1
## [63] httpuv_1.6.5 tools_4.2.1
## [65] ellipsis_0.3.2 jquerylib_0.1.4
## [67] ggridges_0.5.3 plyr_1.8.7
## [69] sparseMatrixStats_1.8.0 zlibbioc_1.42.0
## [71] purrr_0.3.4 RCurl_1.98-1.7
## [73] deldir_1.0-6 pbapply_1.5-0
## [75] viridis_0.6.2 cowplot_1.1.1
## [77] zoo_1.8-10 ggrepel_0.9.1
## [79] cluster_2.1.3 magrittr_2.0.3
## [81] data.table_1.14.2 scattermore_0.8
## [83] lmtest_0.9-40 RANN_2.6.1
## [85] fitdistrplus_1.1-8 mime_0.12
## [87] evaluate_0.15 xtable_1.8-4
## [89] gridExtra_2.3 compiler_4.2.1
## [91] tibble_3.1.7 KernSmooth_2.23-20
## [93] crayon_1.5.1 htmltools_0.5.2
## [95] later_1.3.0 tidyr_1.2.0
## [97] DBI_1.1.2 ExperimentHub_2.4.0
## [99] tweenr_2.0.2 formatR_1.12
## [101] dbplyr_2.2.1 rappdirs_0.3.3
## [103] MASS_7.3-57 Matrix_1.5-3
## [105] cli_3.3.0 parallel_4.2.1
## [107] igraph_1.3.4 pkgconfig_2.0.3
## [109] sp_1.5-0 plotly_4.10.0
## [111] spatstat.sparse_3.0-0 bslib_0.3.1
## [113] XVector_0.36.0 stringr_1.4.0
## [115] digest_0.6.29 sctransform_0.3.5
## [117] RcppAnnoy_0.0.19 spatstat.data_3.0-0
## [119] Biostrings_2.64.0 rmarkdown_2.14
## [121] leiden_0.4.2 uwot_0.1.14
## [123] DelayedMatrixStats_1.18.2 curl_4.3.2
## [125] shiny_1.7.2 lifecycle_1.0.1
## [127] nlme_3.1-157 jsonlite_1.8.0
## [129] BiocNeighbors_1.14.0 viridisLite_0.4.0
## [131] limma_3.52.4 fansi_1.0.3
## [133] pillar_1.7.0 lattice_0.20-45
## [135] KEGGREST_1.36.2 fastmap_1.1.0
## [137] httr_1.4.3 survival_3.3-1
## [139] interactiveDisplayBase_1.34.0 glue_1.6.2
## [141] png_0.1-7 BiocVersion_3.15.2
## [143] bit_4.0.4 ggforce_0.4.1
## [145] stringi_1.7.6 sass_0.4.1
## [147] blob_1.2.3 AnnotationHub_3.4.0
## [149] BiocSingular_1.12.0 memoise_2.0.1
## [151] irlba_2.3.5 future.apply_1.9.0