Matrix Storage Format • BPCells

This provides an overview of the BPCells compressed and uncompressed matrix formats as they stand as of April 2023. While BPCells is pre-1.0 these formats may be updated, though with a goal of maintaining backwards read-compatibility across updates.

Matrix Logical Storage Layout

For data storage, we use a storage abstraction of named data arrays, stored as e.g. a single group hdf5 or a directory of files. The matrix format is a compressed sparse column/row (CSC/CSR) format with the following data arrays:

Name	Type	Length
`val`	uint32/float32/float64	# non-zeros
`index`	uint32	# non-zeros
`idxptr`	uint64	# cols + 1
`shape`	uint32	2
`row_names`	string	0 or #rows
`col_names`	string	0 or #cols
`storage_order`	string	1

The interpretation of each array is as follows:

val - Values of non-zero entries in increasing order of (column, row) position of the non-zero value.
index - index[i] provides the 0-based row index for the value found in val[i] (or column index for row-major storage order)
idxptr - The indexes in idx and val for the entries in column j can be found from idxptr[j] to idxptr[j+1] - 1 , inclusive. (or row j for row-major storage order)
shape - number of rows in matrix, followed by number of columns
row_names - Names for each row of the matrix (optional)
col_names - Names for each column of the matrix (optional)
storage_order- col for compressed-sparse-column, or row for compressed-sparse-row

Bitpacked compressed matrices consist of the following modifications:

val: For unsigned 32-bit integers, we replace val with val_data, val_idx, and val_idx_offsets corresponding to a BP-128m1 encoding as described below. The total number of values is already stored as the last value in idxptr. For 32-bit and 64-bit floats val remains unchanged.
index: We replace the index array with a BP-128d1z encoded data in arrays index_data, index_idx, index_idx_offsets, and index_starts

Each matrix is stored as a single directory, HDF5 group, or R S4 object. The storage format for each matrix is encoded as a version string. The current version string is of the format [compression]-[datatype]-matrix-v2, where [compression] can be either packed or unpacked, and [datatype] can be one of uint, float, and double corresponding to 32-bit unsigned integer, 32-bit float, and 64-bit double respectively. In v1 formats, the only difference is that idxptr had type uint32.

Bitpacking formats

Our bitpacked formats are based on the formats described in a paper by Lemire and Boytsov.

BP-128

The vanilla BP-128 format is stored in 3 arrays as follows:

data - stream of bitpacked data, represented as 32-bit integers with the interleaved bit layout as shown in Lemire and Boytsov figure 6. A chunk of 128 32-bit input integers with $B$ bits per integer will be stored using $4B$ 32-bit integers holding the bitpacked data.
idx - list of 32-bit integers, where the encoded data for integers index 128*i to 128*i + 127 can be found in data from index idx[i] to index idx[i+1]-1. For lists with $2^{32}$ (4 billion) entries or greater, idx stores the index modulo $2^{32}$
idx_offsets - list of 64-bit integers, where the values of idx with indices from idx_offsets[i] to idx_offsets[i+1]-1 should have i*(2^32) added to them.

BP-128m1

This is the same as BP-128, but with 1 subtracted from each value prior to compression

BP-128d1

Equivalent to the BP-128* algorithm from Lemire and Boytsov where integers are difference encoded prior to bitpacking. This is best for lists of sorted integers.

data - Encoding as with vanilla BP-128, but we do difference encoding prior to bitpacking: $x_{0}^{\prime}=0$ , $x_{1}^{\prime}=x_{1}-x_{0}$ , $x_{2}^{\prime}=x_{2}-x_{1}$ , …, $x_{127}^{\prime}=x_{127}-x_{126}$
idx, idx_offsets - identical to BP-128
starts - list of 32-bit integers, where starts[i] is the decoded value for the integer at index 128*i

BP-128d1z

Similar to BP128d1 but with zigzag encoding applied after difference encoding. This is best for lists of close but not fully sorted runs of integers.

data - Encoding as with BP-128d1, but between difference encoding and bitpacking, the results are zigzag encoded, where $zigzag(x)=2x$ if $x\geq0$ , and $zigzag(x)=-2x-1$ if $x<0$ .
idx, idx_offsets - identical to BP-128
starts - identical to BP128-d1

The core bitpacking code can be found in src/bitpacking/bp128.cpp in the github repository.

Physical storage layout

The abstraction of named data arrays can be realized by a few different formats. The three currently supported by BPCells are:

Directory of files format:

An array of numbers is stored as a single file with an 8-byte header, followed by the data values in little-endian binary format. Unsigned integers are encoded according to standard little-endian representation, and 32-bit and 64-bit floating point numbers as IEEE-754 format. Header values are 8-byte ASCII text as follows: unsigned 32-bit integer UINT32v1, unsigned 64-bit integer UINT64v1, 32-bit float FLOATSv1, 64-bit float DOUBLEv1. Arrays of strings are stored as ASCII text with one array value per line with no header. The version string is stored as a file named version containing the version string followed by a newline.

Hdf5 file format:

Arrays of numbers are stored as HDF5 datasets using the built-in HDF5 encoding format. Arrays of strings are stored as HDF5 variable length string datasets.

The version string is stored as a version attribute on the HDF5 group

R object format:

Strings are stored as native R character arrays. Unsigned integers and 32-bit floats are stored in native R integer arrays by bitcasting the R signed integers into the required data types. 64-bit floats are stored in native R numeric arrays. 64-bit integers are stored as doubles in R numeric arrays. This reduces the highest representable value from $2^{64}-1$ to $2^{53}-1$ (about 9 quadrillion), which we do not expect to pose practical problems. Named collections of arrays are stored in R lists (when writing) or S4 objects (when reading). The version string is stored as a string vector named version of length 1.