Regress out the effects of confounding variables using a linear least squares regression model.
Usage
regress_out(mat, latent_data, prediction_axis = c("row", "col"))
Arguments
- mat: Input IterableMatrix
- latent_data: Data to regress out, as a data.frame where each column is a variable to regress out.
- prediction_axis: Which axis corresponds to prediction outputs from the linear models (e.g. the gene axis in typical single cell analysis). Options include "row" (default) and "col".
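A minimal usage sketch, assuming BPCells is attached, that a dgCMatrix can be coerced with as(x, "IterableMatrix"), and with hypothetical metadata columns (pct_mito, batch):

library(BPCells)
library(Matrix)

raw <- abs(rsparsematrix(nrow = 200, ncol = 500, density = 0.05))
mat <- as(raw, "IterableMatrix")            # genes x cells

# One row per cell (matrix column); each column is a variable to regress out.
latent_data <- data.frame(
  pct_mito = runif(ncol(mat)),
  batch    = factor(sample(c("A", "B"), ncol(mat), replace = TRUE))
)

# Genes are on rows, so the default prediction_axis = "row" applies.
corrected <- regress_out(mat, latent_data, prediction_axis = "row")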
Details
Conceptually, regress_out fits a linear least squares model for each row of the matrix (or each column if prediction_axis is "col"). The predictors for every model are the columns of latent_data, and each model predicts the values in the corresponding row (or column) of mat. After fitting, regress_out subtracts the model predictions from the input values, retaining only the effects that are not explained by the variables in latent_data.
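For intuition, the same residuals can be reproduced on a small dense matrix with base R. This is an illustrative sketch only (it assumes prediction_axis = "row", an input small enough for as.matrix(), and the mat and latent_data objects from the usage sketch above), not how regress_out computes at scale:

# Fit one least squares model per gene (row) and keep the residuals.
# lm() looks up y in the calling environment; the predictors come from latent_data.
dense <- as.matrix(mat)   # only feasible for small inputs
resid_by_row <- t(apply(dense, 1, function(y) {
  residuals(lm(y ~ ., data = latent_data))
}))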
These models can be fit efficiently because they all share the same predictor data, so most of the work for the closed-form best-fit solution is shared: a QR factorization of the model matrix and a dense matrix-vector multiply are sufficient to fully calculate the residual values.
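A sketch of that shared closed-form solve, reusing dense and latent_data from the previous sketch (an illustration of the idea, not BPCells' internal code):

# One QR factorization of the shared model matrix serves every regression.
X <- model.matrix(~ ., data = latent_data)      # observations x n_features
Q <- qr.Q(qr(X))                                # thin orthonormal factor
Y <- t(dense)                                   # observations in rows
residual_mat <- t(Y - Q %*% crossprod(Q, Y))    # back to genes x cells
# residual_mat agrees with resid_by_row up to numerical precision.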
Efficiency considerations: Because the output matrix is dense rather than sparse, mean and variance calculations may run comparatively slowly. However, PCA and matrix/vector multiply operations can be performed at nearly the same cost as on the input matrix thanks to mathematical simplifications. Memory usage scales with n_features * (nrow(mat) + ncol(mat)). Generally, n_features == ncol(latent_data), but each category of a categorical variable in latent_data is expanded into its own indicator variable. Memory usage will therefore be higher when using categorical input variables with many (e.g. >100) distinct values.
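To see why categorical variables inflate n_features, compare against base R's model matrix expansion (shown with model.matrix() purely for illustration; BPCells' exact encoding of the levels may differ):

demo <- data.frame(
  batch = factor(c("A", "B", "C", "A")),
  depth = c(1.2, 0.8, 1.5, 1.0)
)
ncol(model.matrix(~ ., data = demo))
#> [1] 4   # intercept + two batch indicators + depth; grows with the number of levels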