Typical dimensionality reduction steps:

1000s of genes → 10s of Principal Components → 2 UMAP dimensions


Thousands of dimensions are hard to draw, so this simulated example will use 3 genes.

Assume the genes are already log transformed, centered, and scaled as usual.
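For reference, a minimal sketch of what that preprocessing typically looks like (the counts matrix here is a stand-in for illustration, not the simulation used below):

# Stand-in counts matrix: 100 cells x 3 genes (illustration only)
counts <- matrix(rpois(300, lambda=5), ncol=3)

log_expr <- log1p(counts)   # log transform, log(count + 1)
scaled <- scale(log_expr)   # centre each gene at 0 and scale to unit variance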


Principal Components Analysis finds a new basis for the data. We go from genes-as-axes to PCs-as-axes. Each PC axis is defined by a vector of “gene loadings” (or “feature loadings”). The projection of each cell onto each PC is called the “cell score” or “cell embedding”.
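As a minimal sketch of this terminology (using a stand-in matrix, not the simulation below): prcomp() returns the gene loadings as the columns of $rotation and the cell scores as the rows of $x, and the scores are just the centered data projected onto the loadings.

# Stand-in data: 5 cells x 3 genes
toy <- matrix(rnorm(15), nrow=5,
    dimnames=list(NULL, c("geneA","geneB","geneC")))

pca <- prcomp(toy)

pca$rotation   # gene loadings: one column per PC, one row per gene
pca$x          # cell scores: one row per cell, one column per PC

# Scores are the centered data multiplied by the loadings matrix
all.equal(pca$x, scale(toy, center=TRUE, scale=FALSE) %*% pca$rotation)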

This is a linear transformation. In fact it is a rotation. Straight lines in gene space remain straight lines in PC space.
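Continuing the sketch above, the loadings matrix is orthonormal, which is why the transformation preserves distances between cells (and keeps straight lines straight):

# The loadings matrix times its transpose is the identity matrix
round(crossprod(pca$rotation), 10)

# Distances between cells are the same in gene space and in PC space
all.equal(
    dist(scale(toy, center=TRUE, scale=FALSE)),
    dist(pca$x),
    check.attributes=FALSE)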

Principal Components are ordered by how much variation they explain. The first n components are the best possible approximation of the data with n components, in terms of total squared error. We can simplify the data by discarding all but the first few PCs. Here, try hiding PC3 by unchecking its checkbox.
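Continuing the sketch, the proportion of variation explained by each PC, and the total squared error from keeping only the first two PCs, can be checked directly:

# Proportion of total variation explained by each PC
pca$sdev^2 / sum(pca$sdev^2)

# Approximate the data using only the first two PCs
approx2 <- pca$x[,1:2] %*% t(pca$rotation[,1:2])
approx2 <- sweep(approx2, 2, pca$center, "+")

# The total squared error equals the variation assigned to the discarded PC3
sum((toy - approx2)^2)
(nrow(toy) - 1) * pca$sdev[3]^2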

The most variation does not necessarily mean the most interesting structure. Here PC1 and PC3 actually show more interesting structure. Try hiding PC2 by unchecking its checkbox. This works ok here, but will not work well in general – biological phenomena of interest may be mixed up across several PCs.


UMAP is good at revealing interesting structure.

UMAP tries to put points that are nearest neighbours in the high-dimensional space close to each other in a 2D layout. Because it works from the nearest neighbour graph, it follows the topology of the data.
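For a rough sense of the knobs involved, here is a sketch using uwot (the same package used in the code further down); the matrix is a stand-in, and ret_nn is used only to peek at the neighbour graph UMAP builds:

library(uwot)

demo_mat <- matrix(rnorm(600), ncol=3)   # stand-in data: 200 points x 3 dimensions

result <- umap(
    demo_mat,
    n_neighbors = 30,    # how many nearest neighbours define the graph
    min_dist = 0.5,      # how tightly points can pack in the 2D layout
    ret_nn = TRUE)       # also return the nearest neighbour graph

str(result$nn)           # indices and distances of each point's neighbours
head(result$embedding)   # the 2D layout itself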

UMAP is a non-linear transformation: straight lines may become curved, or even torn apart.

The UMAP below was computed from all three PCs. Because I am mean, I found an example where the UMAP went a bit weird. Can you work out what has happened?


Code to produce these widgets:

## 2022-11-07 This code requires the development versions of langevitour and plotly.R
# remotes::install_github("pfh/langevitour")
# remotes::install_github("plotly/plotly.R")

library(langevitour)
library(uwot)
library(crosstalk)
library(plotly)
library(htmltools)

# Simulate some data: 2,000 cells in three underlying dimensions,
# with two clusters along the third dimension
set.seed(563)
n <- 2000
mat <- cbind(
    rnorm(n),
    rnorm(n),
    rnorm(n)*0.3 + rep(c(-1,1),n/2))

# Each simulated gene is a linear combination of the underlying dimensions
gene_mat <- mat %*% cbind(
    gene1 = c(1,1,0),
    gene2 = c(1,-1,0),
    gene3 = c(1,0,2))


# Get principal components
pcs <- prcomp(gene_mat)

# Calculate UMAP layout
layout <- umap(pcs$x, min_dist=0.5)


# Object that allows widgets to share selections
shared <- SharedData$new(as.data.frame(layout))


# langevitour widget: cells in gene space, with the PC axes overlaid as extra axes
langevitour_widget <- langevitour(
    gene_mat, 
    extraAxes=pcs$rotation, 
    axisColors=rep(c("#008800","#880000"),each=3), 
    scale=8, pointSize=1, link=shared, 
    state=list(damping=-1),
    width="600", height="600")

# UMAP scatterplot, linked to the tour via the shared-data object
umap_plot <- ggplot(shared) + 
    aes(x=V1,y=V2,text="") + 
    geom_point(size=0.5) + 
    coord_fixed() +
    theme_bw() +
    labs(x="",y="",title="UMAP layout")

umap_widget <- ggplotly(umap_plot, tooltip="text", width=500, height=500) |> 
    style(unselected=list(marker=list(opacity=1))) |> # Don't double-fade unselected points
    highlight(on="plotly_selected", off="plotly_deselect")

# Show the two widgets side by side
browsable(div(
    style="display: grid; grid-template-columns: 1fr 1fr;", 
    langevitour_widget, umap_widget))

About nearest neighbour graphs

The UMAP layout is based on the nearest neighbour graph. For each cell, the (by default) 30 nearest neighbours are found (the n.neighbors parameter in RunUMAP). This tends to capture the topology of shapes in high-dimensional space. It also adapts to how dense each cluster is, so, as a feature, UMAP erases differences in cluster density: points within each cluster end up packed together at about the same density (the min.dist parameter in RunUMAP).
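To make the graph concrete for the simulated cells above, here is a sketch using the FNN package (just to show what a 30-nearest-neighbours graph looks like; it is not exactly what UMAP computes internally):

library(FNN)

# Find the 30 nearest neighbours of each simulated cell, in PC space
knn <- get.knn(pcs$x, k=30)

# knn$nn.index: one row per cell, listing the indices of its 30 neighbours
str(knn$nn.index)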

Similarly, graph-based clustering methods such as Louvain clustering use nearest neighbour connections to find clusters that follow the topology of the data. Notice that in the two clusters above, some pairs of points within a cluster are further apart than some pairs of points in different clusters. A purely distance-based clustering method such as k-means would perform poorly here.
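As a sketch of the difference, continuing from the neighbour graph above and assuming the igraph package (a real pipeline would more likely use something like Seurat's FindNeighbors() and FindClusters()):

library(igraph)

# Turn the neighbour indices into an undirected graph on the cells
edges <- cbind(rep(1:n, each=30), as.vector(t(knn$nn.index)))
knn_graph <- simplify(graph_from_edgelist(edges, directed=FALSE))

# Louvain clustering uses only the graph's connections
louvain <- cluster_louvain(knn_graph)
table(membership(louvain))

# k-means uses only straight-line distances in PC space
km <- kmeans(pcs$x, centers=2)
table(km$cluster)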

Louvain clustering can handle clusters with complex curved shapes, and clusters of different size and density.

Further reading: