Jon Stearley
2007-Feb-01 01:59 UTC
[R] memory-efficient column aggregation of a sparse matrix
I need to sum the columns of a sparse matrix according to a factor - ie given a sparse matrix X and a factor fac of length ncol(X), sum the elements by column factors and return the sparse matrix Y of size nrow(X) by nlevels(f). The appended code does the job, but is unacceptably memory-bound because tapply() uses a non-sparse representation. Can anyone suggest a more memory and cpu efficient approach? Eg, a sparse matrix tapply method? Thanks. -- +--------------------------------------------------------------+ | Jon Stearley (505) 845-7571 (FAX 844-9297) | | Sandia National Laboratories Scalable Systems Integration | +--------------------------------------------------------------+ # x and y are of SparseM class matrix.csr "aggregate.csr" <- function(x, fac) { # make a vector indicating the row of each nonzero rows <- integer(length=length(x at ra)) rows[x at ia[1:nrow(x)]] <- 1 # put a 1 at start of each row rows <- as.integer(cumsum(rows)) # and finish with a cumsum # make a vector indicating the column factor of each nonzero f <- fac[x at ja] # aggregate by row,f y <- tapply(x at ra, list(rows,f), sum) # sparsify it y[is.na(y)] <- 0 # change tapply NAs to as.matrix.csr 0s y <- as.matrix.csr(y) y }
Douglas Bates
2007-Feb-01 13:22 UTC
[R] memory-efficient column aggregation of a sparse matrix
On 1/31/07, Jon Stearley <jrstear at sandia.gov> wrote:> I need to sum the columns of a sparse matrix according to a factor - > ie given a sparse matrix X and a factor fac of length ncol(X), sum > the elements by column factors and return the sparse matrix Y of size > nrow(X) by nlevels(f). The appended code does the job, but is > unacceptably memory-bound because tapply() uses a non-sparse > representation. Can anyone suggest a more memory and cpu efficient > approach? Eg, a sparse matrix tapply method? Thanks.This is the sort of operation that is much more easily performed in the triplet representation of a sparse matrix where each nonzero element is represented by its row index, column index and value. Using that representation you could map the column indices according to the factor then convert back to one of the other representations. The only question would be what to do about nonzeros in different columns of the original matrix that get mapped to the same element in the result. It turns out that in the sparse matrix code used by the Matrix package the triplet representation allows for duplicate index positions with the convention that the resulting value at a position is the sum of the values of any triplets with that index pair. If you decide to use this approach please be aware that the indices for the triplet representation in the Matrix package are 0-based (as in C code) not 1-based (as in R code). (I imagine that Martin is thinking "we really should change that" as he reads this part.)> > -- > +--------------------------------------------------------------+ > | Jon Stearley (505) 845-7571 (FAX 844-9297) | > | Sandia National Laboratories Scalable Systems Integration | > +--------------------------------------------------------------+ > > > # x and y are of SparseM class matrix.csr > "aggregate.csr" <- > function(x, fac) { > # make a vector indicating the row of each nonzero > rows <- integer(length=length(x at ra)) > rows[x at ia[1:nrow(x)]] <- 1 # put a 1 at start of each row > rows <- as.integer(cumsum(rows)) # and finish with a cumsum > > # make a vector indicating the column factor of each nonzero > f <- fac[x at ja] > > # aggregate by row,f > y <- tapply(x at ra, list(rows,f), sum) > > # sparsify it > y[is.na(y)] <- 0 # change tapply NAs to as.matrix.csr 0s > y <- as.matrix.csr(y) > > y > } >
roger koenker
2007-Feb-01 13:30 UTC
[R] memory-efficient column aggregation of a sparse matrix
Doug is right, I think, that this would be easier with full indexing using the matrix.coo classe, if you want to use SparseM. But then the tapply seems to be the way to go. url: www.econ.uiuc.edu/~roger Roger Koenker email rkoenker at uiuc.edu Department of Economics vox: 217-333-4558 University of Illinois fax: 217-244-6678 Champaign, IL 61820 On Feb 1, 2007, at 7:22 AM, Douglas Bates wrote:> On 1/31/07, Jon Stearley <jrstear at sandia.gov> wrote: >> I need to sum the columns of a sparse matrix according to a factor - >> ie given a sparse matrix X and a factor fac of length ncol(X), sum >> the elements by column factors and return the sparse matrix Y of size >> nrow(X) by nlevels(f). The appended code does the job, but is >> unacceptably memory-bound because tapply() uses a non-sparse >> representation. Can anyone suggest a more memory and cpu efficient >> approach? Eg, a sparse matrix tapply method? Thanks. > > This is the sort of operation that is much more easily performed in > the triplet representation of a sparse matrix where each nonzero > element is represented by its row index, column index and value. > Using that representation you could map the column indices according > to the factor then convert back to one of the other representations. > The only question would be what to do about nonzeros in different > columns of the original matrix that get mapped to the same element in > the result. It turns out that in the sparse matrix code used by the > Matrix package the triplet representation allows for duplicate index > positions with the convention that the resulting value at a position > is the sum of the values of any triplets with that index pair. > > If you decide to use this approach please be aware that the indices > for the triplet representation in the Matrix package are 0-based (as > in C code) not 1-based (as in R code). (I imagine that Martin is > thinking "we really should change that" as he reads this part.) > >> >> -- >> +--------------------------------------------------------------+ >> | Jon Stearley (505) 845-7571 (FAX 844-9297) | >> | Sandia National Laboratories Scalable Systems Integration | >> +--------------------------------------------------------------+ >> >> >> # x and y are of SparseM class matrix.csr >> "aggregate.csr" <- >> function(x, fac) { >> # make a vector indicating the row of each nonzero >> rows <- integer(length=length(x at ra)) >> rows[x at ia[1:nrow(x)]] <- 1 # put a 1 at start of each row >> rows <- as.integer(cumsum(rows)) # and finish with a cumsum >> >> # make a vector indicating the column factor of each nonzero >> f <- fac[x at ja] >> >> # aggregate by row,f >> y <- tapply(x at ra, list(rows,f), sum) >> >> # sparsify it >> y[is.na(y)] <- 0 # change tapply NAs to as.matrix.csr 0s >> y <- as.matrix.csr(y) >> >> y >> } >>