4 Load data

Let’s get started with a single cell introduction

4.1 Setup the Seurat Object

4.2 The data set

The dataset used in this workshop is a modified version derived from this study (see here). It has been adapted to introduce additional complexity for instructional purposes. Please refrain from drawing any biological conclusions from this data as it does not represent real experimental results.

This dataset represents human peripheral blood mononuclear cells (PBMCs), pooled from eight individual donors. Genetic differences among donors enable the identification of some cell doublets, enhancing data complexity. It includes two single-cell sequencing batches, one of which was stimulated with IFN-beta. Additionally, mitochondrial expression levels have been introduced to demonstrate how to interpret and apply mitochondrial thresholds for filtering purposes.

Note: What does the data look like?

What do the input files look like? It varies, but this is the output of the CellRanger pipleine, described here

├── analysis
│   ├── clustering
│   ├── diffexp
│   ├── pca
│   ├── tsne
│   └── umap
├── cloupe.cloupe
├── filtered_feature_bc_matrix
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── filtered_feature_bc_matrix.h5.  --> matrix to read in h5 format
├── metrics_summary.csv
├── molecule_info.h5
├── possorted_genome_bam.bam
├── possorted_genome_bam.bam.bai
├── raw_feature_bc_matrix
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── raw_feature_bc_matrix.h5 
└── web_summary.html

We start by reading in the data. There are several options for loading the data. The Read10X() function reads in the output of the cellranger pipeline from 10X, returning a unique molecular identified (UMI) count matrix. The values in this matrix represent the number of molecules for each feature (i.e. gene; row) that are detected in each cell (column).

We next use the count matrix to create a Seurat object. The object serves as a container that contains both data (like the count matrix) and analysis (like PCA, or clustering results) for a single-cell dataset. For a technical discussion of the Seurat object structure, check out the GitHub Wiki. For example, the count matrix is stored in pbmc@assays$RNA@counts.

4.3 Different ways of loading the data

Example 1. Load your data using the path to the folder: filtered_feature_bc_matrix that is in the output folder of your cellranger run using the Read10X function.

# Load the PBMC dataset
 pbmc.data <- Read10X(data.dir = "outs/filtered_feature_bc_matrix")
# Initialize the Seurat object with the raw (non-normalized data).
 seurat_object <- CreateSeuratObject(counts = pbmc.data, min.cells = 3, min.features = 200)

Example 2. Load your data directing the ReadMtx function to each of the relevant files in the filtered_feature_bc_matrix folder in the outputs from your cellranger run. MTX is a simple text format for sparse matrices.

expression_matrix <- ReadMtx(
  mtx = "outs/filtered_feature_bc_matrix/count_matrix.mtx.gz", features = "outs/filtered_feature_bc_matrix/features.tsv.gz",
  cells = "outs/filtered_feature_bc_matrix/barcodes.tsv.gz"
)
seurat_object <- CreateSeuratObject(counts = expression_matrix)

Example 3. Load your data using the Read10X_h5 function to each of the relevant HDF5 files. HDF5 is an efficient binary format.

pbmc.data <- Read10X_h5("data/filtered_feature_bc_matrix.h5")
metadata <- read.table("data/metadata.txt")
  
seurat_object <- CreateSeuratObject(counts = pbmc.data ,
                                 assay = "RNA", project = 'pbmc')
#> Warning: Feature names cannot have underscores ('_'),
#> replacing with dashes ('-')

# Use AddMetaData to add new meta data to object
seurat_object  <- AddMetaData(object = seurat_object, metadata = metadata)
What does data in a count matrix look like?
# Lets examine a few genes in the first thirty cells
pbmc.data[c("CD3D","TCL1A","MS4A1"), 1:30]
#> 3 x 30 sparse Matrix of class "dgCMatrix"
#>   [[ suppressing 30 column names 'AGGGCGCTATTTCC-1', 'GGAGACGATTCGTT-1', 'CACCGTTGTCGTAG-1' ... ]]
#>                                                            
#> CD3D  . 2 . . . 5 . . . . 2 . . . 1 . . 1 1 . . . 1 . 1 3 .
#> TCL1A . . . . . . . . . . . . . . . . . . . . . . . 3 . . .
#> MS4A1 . . . . . . . . . . . . . . . . . . . . . . . 1 . . .
#>            
#> CD3D  2 . .
#> TCL1A . . .
#> MS4A1 . . .

The . values in the matrix represent 0s (no molecules detected). Since most values in an scRNA-seq matrix are 0, Seurat uses a sparse-matrix representation whenever possible. This results in significant memory and speed savings for Drop-seq/inDrop/10x data.

dense.size <- object.size(as.matrix(pbmc.data))
#> Warning in asMethod(object): sparse->dense coercion:
#> allocating vector of size 1.3 GiB
dense.size
#> 1428252816 bytes
sparse.size <- object.size(pbmc.data)
sparse.size
#> 39615768 bytes
dense.size / sparse.size
#> 36.1 bytes