8 PCAs and UMAPs
8.1 Identification of highly variable features (feature selection)
Why do we need to do this?
Identifying the most variable features retains the real biological variability in the data while reducing noise.
We next calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e., they are highly expressed in some cells, and lowly expressed in others). We and others have found that focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.
Our procedure in Seurat is described in detail here. It improves on previous versions by directly modeling the mean-variance relationship inherent in single-cell data, and is implemented in the FindVariableFeatures() function. By default, we return 2,000 features per dataset. These will be used in downstream analysis, like PCA.
pbmc <- FindVariableFeatures(pbmc, selection.method = 'vst', nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
#> When using repel, set xnudge and ynudge to 0 for optimal results
plot1 + plot2
#> Warning: Transformation introduced infinite values in continuous
#> x-axis
#> Transformation introduced infinite values in continuous
#> x-axis
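To make the idea of "high cell-to-cell variation" concrete, here is a minimal base-R sketch on simulated counts. It ranks genes by their variance-to-mean ratio, which is a deliberately simplified stand-in and not the 'vst' method used by FindVariableFeatures(). The matrix toy_counts, the gene names and all parameter values are made up for illustration.

```r
# Simulated counts: 200 genes (rows) x 50 cells (columns); not real PBMC data
set.seed(1)
n_genes <- 200
n_cells <- 50
mu <- rexp(n_genes, rate = 0.2) + 0.5          # hypothetical per-gene mean expression
toy_counts <- matrix(rpois(n_genes * n_cells, lambda = rep(mu, n_cells)),
                     nrow = n_genes,
                     dimnames = list(paste0("gene", seq_len(n_genes)), NULL))

# Make the first 5 genes highly expressed in half of the cells only
toy_counts[1:5, 1:25] <- toy_counts[1:5, 1:25] + rpois(5 * 25, lambda = 30)

# Variance-to-mean ratio per gene: close to 1 for pure Poisson noise,
# much larger for genes that differ between groups of cells
gene_mean  <- rowMeans(toy_counts)
gene_var   <- apply(toy_counts, 1, var)
dispersion <- gene_var / gene_mean

# The 5 genes with the largest ratio
top5 <- names(sort(dispersion, decreasing = TRUE))[1:5]
print(top5)
```

The genes that are high in some cells and low in others stand out because their variance is far larger than their mean would predict. That is the intuition behind selecting variable features, even though the actual Seurat procedure models the mean-variance trend more carefully.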
8.2 Scaling the data
Why do we need to do this?
Highly expressed genes can overpower the signal of other, less expressed genes of equal importance. Within the same cell, the assumption is that the underlying RNA content is constant. Additionally, if variables are provided in vars.to.regress, they are individually regressed against each feature, and the resulting residuals are then scaled and centered. This step allows controlling for cell cycle and other factors that may bias your clustering.
Next, we apply a linear transformation (‘scaling’) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData() function:
- Shifts the expression of each gene, so that the mean expression across cells is 0
- Scales the expression of each gene, so that the variance across cells is 1
- This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
- The results of this are stored in pbmc[["RNA"]]@scale.data
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
#> Centering and scaling data matrix
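The centering and scaling described above can be illustrated on a tiny matrix with base R. This is only a sketch of the arithmetic, not how ScaleData() is implemented; the matrix expr and its values are made up.

```r
# Toy log-normalized expression: 3 genes (rows) x 6 cells (columns); values are made up
expr <- rbind(
  geneA = c(5.0, 5.5, 6.0, 5.2, 5.8, 6.1),   # highly expressed
  geneB = c(0.1, 0.0, 0.3, 0.2, 0.0, 0.4),   # lowly expressed
  geneC = c(2.0, 0.0, 2.5, 0.0, 2.2, 0.0)    # on in some cells, off in others
)

# scale() works on columns, so transpose to scale each gene, then transpose back
scaled <- t(scale(t(expr), center = TRUE, scale = TRUE))

print(round(rowMeans(scaled), 10))   # mean of every gene across cells
print(apply(scaled, 1, var))         # variance of every gene across cells
```

After scaling, geneA and geneB contribute on the same scale even though their raw expression levels differ by more than an order of magnitude.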
This step takes too long! Can I make it faster?
Scaling is an essential step in the Seurat workflow, but only on genes that will be used as input to PCA. Therefore, the default in ScaleData() is only to perform scaling on the previously identified variable features (2,000 by default). To do this, omit the features argument in the previous function call, i.e.
# pbmc <- ScaleData(pbmc)
Your PCA and clustering results will be unaffected. However, Seurat heatmaps (produced with DoHeatmap()) require genes in the heatmap to be scaled, to make sure highly-expressed genes don’t dominate the heatmap. To make sure we don’t leave any genes out of the heatmap later, we are scaling all genes in this tutorial.
How can I remove unwanted sources of variation, as in Seurat v2?
In Seurat v2 we also use the ScaleData() function to remove unwanted sources of variation from a single-cell dataset. For example, we could ‘regress out’ heterogeneity associated with cell cycle stage or mitochondrial contamination. These features are still supported in ScaleData() in Seurat v3, i.e.:
# pbmc <- ScaleData(pbmc, vars.to.regress = 'percent.mt')
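However, particularly for advanced users who would like to use this functionality, we strongly recommend the use of our new normalization workflow, SCTransform(). The method is described in our paper, with a separate vignette using Seurat v3 here. As with ScaleData(), the function SCTransform() also includes a vars.to.regress parameter.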
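The idea behind ‘regressing out’ a variable can be sketched in base R: each gene is regressed on the covariate, and the residuals are then centered and scaled. The sketch below uses a plain linear model via lm() on simulated data as a stand-in. The names percent_mt and expr and all values are hypothetical, and the internals of ScaleData() and SCTransform() may differ.

```r
set.seed(42)
n_cells <- 100
percent_mt <- runif(n_cells, min = 0, max = 10)   # hypothetical per-cell covariate

# Two toy genes: gene1 depends on percent_mt, gene2 does not
expr <- rbind(
  gene1 = 1 + 0.5 * percent_mt + rnorm(n_cells, sd = 0.3),
  gene2 = 2 + rnorm(n_cells, sd = 0.3)
)

# Regress each gene on the covariate and keep the residuals
resid_mat <- t(apply(expr, 1, function(y) residuals(lm(y ~ percent_mt))))

# Then center and scale the residuals per gene
scaled <- t(scale(t(resid_mat)))

print(cor(expr["gene1", ], percent_mt))     # correlation before regression
print(cor(scaled["gene1", ], percent_mt))   # correlation after regression
```

Because least-squares residuals are uncorrelated with the predictor, the covariate can no longer drive the scaled values of gene1, which is why this step helps keep such factors from biasing clustering.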