BPCells matrices are stored in sparse format, meaning only the non-zero entries are stored. Matrices can store integer counts data or decimal numbers (float or double). See details for more information.
Usage
write_matrix_memory(mat, compress = TRUE)
write_matrix_dir(
  mat,
  dir,
  compress = TRUE,
  buffer_size = 8192L,
  overwrite = FALSE
)
open_matrix_dir(dir, buffer_size = 8192L)
write_matrix_hdf5(
  mat,
  path,
  group,
  compress = TRUE,
  buffer_size = 8192L,
  chunk_size = 1024L,
  overwrite = FALSE,
  gzip_level = 0L
)
open_matrix_hdf5(path, group, buffer_size = 16384L)
Arguments
- compress
Whether or not to compress the data.
- dir
Directory to save the data into
- buffer_size
For performance tuning only. The number of items to be buffered in memory before calling writes to disk.
- overwrite
If TRUE, write to a temp dir then overwrite existing data. Alternatively, pass a temp path as a string to customize the temp dir location.
- path
Path to the hdf5 file on disk
- group
The group within the hdf5 file to write the data to. If writing to an existing hdf5 file this group must not already be in use
- chunk_size
For performance tuning only. The chunk size used for the HDF5 array storage.
- gzip_level
Gzip compression level. Default is 0 (no compression). Gzip is recommended when both compression and compatibility with outside programs are required. Otherwise, compress=TRUE is recommended, as it is >10x faster with often similar compression levels.
- mat
Input matrix, either IterableMatrix or dgCMatrix
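As a rough sketch of how the HDF5 arguments fit together, the example below writes one matrix into two groups of the same file, once with the default compress = TRUE and once with gzip only, then reopens one group. The file path, group names, and random test matrix are illustrative assumptions, and the Matrix package is assumed to be installed.

```r
library(BPCells)
library(Matrix)

set.seed(1)
# Illustrative sparse matrix with non-integer values (a dgCMatrix)
mat <- rsparsematrix(200, 50, density = 0.1)

# Hypothetical file location; any writable path would do
h5_path <- tempfile(fileext = ".h5")

# Default BPCells compression, stored under one group
bp <- write_matrix_hdf5(mat, h5_path, group = "bitpacked", compress = TRUE)

# Gzip only, stored under a second, unused group of the same file
gz <- write_matrix_hdf5(
  mat, h5_path,
  group = "gzipped",
  compress = FALSE,
  gzip_level = 4L
)

# Reopen one group later and convert back to a dgCMatrix to check the contents
reopened <- open_matrix_hdf5(h5_path, group = "gzipped")
round_trip <- as(reopened, "dgCMatrix")
```

Each call uses a different group because a group in an existing file must not already be in use.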
Details
Storage locations
Matrices can be stored in a directory on disk, in memory, or in an HDF5 file. Saving in a directory on disk is a good default for local analysis, as it provides the best I/O performance and lowest memory usage. The HDF5 format allows saving within existing hdf5 files to group data together, and the in memory format provides the fastest performance in the event memory usage is unimportant.
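The directory and in-memory options can be sketched as follows. This is a minimal example under stated assumptions: the counts matrix is randomly generated for illustration, the directory name is hypothetical, and the Matrix package is assumed to be installed.

```r
library(BPCells)
library(Matrix)

set.seed(42)
# Illustrative counts-like sparse matrix (values are small positive whole numbers)
counts <- rsparsematrix(
  100, 20,
  density = 0.2,
  rand.x = function(n) rpois(n, 3) + 1
)

# Hypothetical directory; write_matrix_dir() saves the matrix into it
mat_dir <- tempfile("counts_bpcells_")
on_disk <- write_matrix_dir(counts, mat_dir)

# A later session can reopen the same directory without rewriting it
reopened <- open_matrix_dir(mat_dir)

# In-memory copy, here without compression
in_mem <- write_matrix_memory(counts, compress = FALSE)

# Convert back to dgCMatrix to compare against the input
from_disk <- as(reopened, "dgCMatrix")
from_mem <- as(in_mem, "dgCMatrix")
```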
Bitpacking Compression
For typical RNA counts matrices holding integer counts, bitpacking compression takes 6-8x less space than an R dgCMatrix and 4-6x less than a scipy csc_matrix. The compression is more effective when the count values in the matrix are small and when the rows of the matrix are sorted by rowMeans. In tests on RNA-seq data, optimal row ordering saved up to 40% of storage space.
For non-integer data matrices, bitpacking compression is much less effective, as it can only be applied to the row indices of each entry but not to the values themselves. There will still be some space savings, but far less than for counts matrices.
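One way to see the effect on a given dataset is to write the same integer matrix with and without compression and compare the on-disk sizes. The sketch below makes several assumptions: the data are simulated, dir_size() is a hypothetical helper defined here, and the matrix is converted to integer storage with convert_matrix_type() because a dgCMatrix holds its values as doubles. The ratio it reports depends on the data and is not expected to match the figures quoted above.

```r
library(BPCells)
library(Matrix)

set.seed(7)
# Simulated counts-like matrix (illustrative only)
counts <- rsparsematrix(
  2000, 500,
  density = 0.05,
  rand.x = function(n) rpois(n, 2) + 1
)

# Store the values as integers so the values can be compressed
# as well as the indices
int_mat <- convert_matrix_type(as(counts, "IterableMatrix"), "uint32_t")

# Hypothetical helper: total size in bytes of the files in a directory
dir_size <- function(d) sum(file.size(list.files(d, full.names = TRUE)))

packed_dir <- tempfile("packed_")
raw_dir <- tempfile("raw_")
write_matrix_dir(int_mat, packed_dir, compress = TRUE)
write_matrix_dir(int_mat, raw_dir, compress = FALSE)

packed_bytes <- dir_size(packed_dir)
raw_bytes <- dir_size(raw_dir)

# Ratio of uncompressed to compressed size for this particular matrix
compression_ratio <- raw_bytes / packed_bytes
```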