Regress out the effects of confounding variables using a linear least squares regression model.
Usage
regress_out(mat, latent_data, prediction_axis = c("row", "col"))
Arguments
- mat: Input IterableMatrix
- latent_data: Data to regress out, as a data.frame where each column is a variable to regress out.
- prediction_axis: Which axis corresponds to prediction outputs from the linear models (e.g. the gene axis in typical single cell analysis). Options include "row" (default) and "col".
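A minimal usage sketch, assuming BPCells is attached, that a dgCMatrix can be coerced with as(x, "IterableMatrix"), and with hypothetical metadata columns (pct_mito, batch):

library(BPCells)
library(Matrix)

raw <- abs(rsparsematrix(nrow = 200, ncol = 500, density = 0.05))
mat <- as(raw, "IterableMatrix")            # genes x cells

# One row per cell (matrix column); each column is a variable to regress out.
latent_data <- data.frame(
  pct_mito = runif(ncol(mat)),
  batch    = factor(sample(c("A", "B"), ncol(mat), replace = TRUE))
)

# Genes are on rows, so the default prediction_axis = "row" applies.
corrected <- regress_out(mat, latent_data, prediction_axis = "row")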
Details
Conceptually, regress_out fits a linear least squares model for each row of the matrix (or each column if prediction_axis is "col"). The predictors for every model are the columns of latent_data, and each model predicts the values in the corresponding row (or column) of mat. After fitting, regress_out subtracts the model predictions from the input values, retaining only the effects that are not explained by the variables in latent_data.
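For intuition, the same residuals can be reproduced on a small dense matrix with base R. This is an illustrative sketch only (it assumes prediction_axis = "row", an input small enough for as.matrix(), and the mat and latent_data objects from the usage sketch above), not how regress_out computes at scale:

# Fit one least squares model per gene (row) and keep the residuals.
# lm() looks up y in the calling environment; the predictors come from latent_data.
dense <- as.matrix(mat)   # only feasible for small inputs
resid_by_row <- t(apply(dense, 1, function(y) {
  residuals(lm(y ~ ., data = latent_data))
}))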
These models can be fit efficiently because they all share the same predictor data, so most of the work for the closed-form best-fit solution is shared: a QR factorization of the model matrix and a dense matrix-vector multiply are sufficient to fully calculate the residual values.
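A sketch of that shared closed-form solve, reusing dense and latent_data from the previous sketch (an illustration of the idea, not BPCells' internal code):

# One QR factorization of the shared model matrix serves every regression.
X <- model.matrix(~ ., data = latent_data)      # observations x n_features
Q <- qr.Q(qr(X))                                # thin orthonormal factor
Y <- t(dense)                                   # observations in rows
residual_mat <- t(Y - Q %*% crossprod(Q, Y))    # back to genes x cells
# residual_mat agrees with resid_by_row up to numerical precision.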
Efficiency considerations: Because the output matrix is dense rather than sparse, mean and variance calculations may run comparatively slowly. However, PCA and matrix/vector multiply operations can be performed at nearly the same cost as on the input matrix thanks to mathematical simplifications. Memory usage scales with n_features * (nrow(mat) + ncol(mat)). Generally, n_features == ncol(latent_data), but each category of a categorical variable in latent_data is expanded into its own indicator variable. Memory usage will therefore be higher when using categorical input variables with many (e.g. >100) distinct values.
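To see why categorical variables inflate n_features, compare against base R's model matrix expansion (shown with model.matrix() purely for illustration; BPCells' exact encoding of the levels may differ):

demo <- data.frame(
  batch = factor(c("A", "B", "C", "A")),
  depth = c(1.2, 0.8, 1.5, 1.0)
)
ncol(model.matrix(~ ., data = demo))
#> [1] 4   # intercept + two batch indicators + depth; grows with the number of levels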