Normalizations and PCA
Avoid dense matrices whenever possible. Put normalizations that preserve sparsity (0 values stay 0) before normalizations that break sparsity (e.g. adding values to each row/column). A typical RNA-seq matrix has <5% non-zero entries, so your code will operate on 20x more entries with a dense matrix.
-
For most operations, we recommend using lazy evaluation to avoid creating intermediate matrices. The one common exception to this rule is when running PCA. Because PCA requires looping through the matrix several hundred times, it is often faster to write the matrix to disk once just before PCA rather than recalculating the entries on each PCA iteration.
- For storage efficiency, keep any sparsity-breaking normalizations
delayed, but store all the sparse normalizations in a temporary location
with
write_matrix_dir()
then apply the sparsity-breaking normalizations
- For storage efficiency, keep any sparsity-breaking normalizations
delayed, but store all the sparse normalizations in a temporary location
with
Adding values to the rows/columns of a matrix has very little overhead for PCA because it translates into a pre or post processing step before each mat-vec multiply iteration. As a sparsity-breaking operation, adding a vector to the matrix causes most other operations to become more expensive, however.
Storage order
- Sparse matrices can be stored in a row-major or column-major
orientation with BPCells. Along the indexed dimension (e.g. rows for
row-major), BPCells can efficiently seek to a selected column without
reading the whole matrix. This has performance implications for certain
operations:
- Marker features can only be computed on a matrix indexed by gene/feature.
- Sparse matrix multiplication can only be performed between matrices with the same storage order
- Sparse matrix multiplication performance can change dramatically depending on the storage order and relative matrix size/sparsity. For column-major matrices, the left matrix should be fast to load and contain few delayed operations, while the right matrix can be slow to load and contain many delayed operations. For row-major matrices the left/right preferences are reversed.
- You can check the storage order for a matrix by printing it out in the R terminal
- When calling the
t()
function, BPCells just flips a boolean flag for whether the matrix is row-major or column-major. This does not affect the underlying storage order. - To adjust the underlying storage order, call
transpose_storage_order()
. This is a slower operation, that requires writing a new copy of the data to disk.
Other tips
- Use a single call to
matrix_stats()
to calculate mean + variance in a single pass through the matrix when possible. See the function reference for details. - For ATAC-seq data, you can calculate variable features on the tile matrix without ever saving it to disk. This allows you to subset to variable tiles and create a peak matrix with just your variable tiles for some space savings.