thr3ads.net - R help - [R] memory-efficient column aggregation of a sparse matrix [Feb 2007]

If this information is useful, please help other people find it:
Share via:

Jon Stearley

2007-Feb-01 01:59 UTC

[R] memory-efficient column aggregation of a sparse matrix

I need to sum the columns of a sparse matrix according to a factor -  
ie given a sparse matrix X and a factor fac of length ncol(X), sum  
the elements by column factors and return the sparse matrix Y of size  
nrow(X) by nlevels(f).  The appended code does the job, but is  
unacceptably memory-bound because tapply() uses a non-sparse  
representation.  Can anyone suggest a more memory and cpu efficient  
approach?  Eg, a sparse matrix tapply method?  Thanks.

-- 
+--------------------------------------------------------------+
| Jon Stearley                  (505) 845-7571  (FAX 844-9297) |
| Sandia National Laboratories  Scalable Systems Integration   |
+--------------------------------------------------------------+


# x and y are of SparseM class matrix.csr
"aggregate.csr" <-
function(x, fac) {
         # make a vector indicating the row of each nonzero
         rows <- integer(length=length(x at ra))
         rows[x at ia[1:nrow(x)]] <- 1 # put a 1 at start of each row
         rows <- as.integer(cumsum(rows)) # and finish with a cumsum

         # make a vector indicating the column factor of each nonzero
         f <- fac[x at ja]

         # aggregate by row,f
         y <- tapply(x at ra, list(rows,f), sum)

         # sparsify it
         y[is.na(y)] <- 0  # change tapply NAs to as.matrix.csr 0s
         y <- as.matrix.csr(y)

         y
}

Douglas Bates

2007-Feb-01 13:22 UTC

head link

[R] memory-efficient column aggregation of a sparse matrix

On 1/31/07, Jon Stearley <jrstear at sandia.gov>
wrote:> I need to sum the columns of a sparse matrix according to a factor -
> ie given a sparse matrix X and a factor fac of length ncol(X), sum
> the elements by column factors and return the sparse matrix Y of size
> nrow(X) by nlevels(f).  The appended code does the job, but is
> unacceptably memory-bound because tapply() uses a non-sparse
> representation.  Can anyone suggest a more memory and cpu efficient
> approach?  Eg, a sparse matrix tapply method?  Thanks.
This is the sort of operation that is much more easily performed in
the triplet representation of a sparse matrix where each nonzero
element is represented by its row index, column index and value.
Using that representation you could map the column indices according
to the factor then convert back to one of the other representations.
The only question would be what to do about nonzeros in different
columns of the original matrix that get mapped to the same element in
the result.  It turns out that in the sparse matrix code used by the
Matrix package the triplet representation allows for duplicate index
positions with the convention that the resulting value at a position
is the sum of the values of any triplets with that index pair.

If you decide to use this approach please be aware that the indices
for the triplet representation in the Matrix package are 0-based (as
in C code) not 1-based (as in R code).  (I imagine that Martin is
thinking "we really should change that" as he reads this part.)
>
> --
> +--------------------------------------------------------------+
> | Jon Stearley                  (505) 845-7571  (FAX 844-9297) |
> | Sandia National Laboratories  Scalable Systems Integration   |
> +--------------------------------------------------------------+
>
>
> # x and y are of SparseM class matrix.csr
> "aggregate.csr" <-
> function(x, fac) {
>          # make a vector indicating the row of each nonzero
>          rows <- integer(length=length(x at ra))
>          rows[x at ia[1:nrow(x)]] <- 1 # put a 1 at start of each row
>          rows <- as.integer(cumsum(rows)) # and finish with a cumsum
>
>          # make a vector indicating the column factor of each nonzero
>          f <- fac[x at ja]
>
>          # aggregate by row,f
>          y <- tapply(x at ra, list(rows,f), sum)
>
>          # sparsify it
>          y[is.na(y)] <- 0  # change tapply NAs to as.matrix.csr 0s
>          y <- as.matrix.csr(y)
>
>          y
> }
>

roger koenker

2007-Feb-01 13:30 UTC

head link

[R] memory-efficient column aggregation of a sparse matrix

Doug is right, I think, that this would be easier with full indexing
using the  matrix.coo classe, if you want to use SparseM.  But
then the tapply seems to be the way to go.

url:    www.econ.uiuc.edu/~roger            Roger Koenker
email    rkoenker at uiuc.edu            Department of Economics
vox:     217-333-4558                University of Illinois
fax:       217-244-6678                Champaign, IL 61820


On Feb 1, 2007, at 7:22 AM, Douglas Bates wrote:
> On 1/31/07, Jon Stearley <jrstear at sandia.gov> wrote:
>> I need to sum the columns of a sparse matrix according to a factor -
>> ie given a sparse matrix X and a factor fac of length ncol(X), sum
>> the elements by column factors and return the sparse matrix Y of size
>> nrow(X) by nlevels(f).  The appended code does the job, but is
>> unacceptably memory-bound because tapply() uses a non-sparse
>> representation.  Can anyone suggest a more memory and cpu efficient
>> approach?  Eg, a sparse matrix tapply method?  Thanks.
>
> This is the sort of operation that is much more easily performed in
> the triplet representation of a sparse matrix where each nonzero
> element is represented by its row index, column index and value.
> Using that representation you could map the column indices according
> to the factor then convert back to one of the other representations.
> The only question would be what to do about nonzeros in different
> columns of the original matrix that get mapped to the same element in
> the result.  It turns out that in the sparse matrix code used by the
> Matrix package the triplet representation allows for duplicate index
> positions with the convention that the resulting value at a position
> is the sum of the values of any triplets with that index pair.
>
> If you decide to use this approach please be aware that the indices
> for the triplet representation in the Matrix package are 0-based (as
> in C code) not 1-based (as in R code).  (I imagine that Martin is
> thinking "we really should change that" as he reads this part.)
>
>>
>> --
>> +--------------------------------------------------------------+
>> | Jon Stearley                  (505) 845-7571  (FAX 844-9297) |
>> | Sandia National Laboratories  Scalable Systems Integration   |
>> +--------------------------------------------------------------+
>>
>>
>> # x and y are of SparseM class matrix.csr
>> "aggregate.csr" <-
>> function(x, fac) {
>>          # make a vector indicating the row of each nonzero
>>          rows <- integer(length=length(x at ra))
>>          rows[x at ia[1:nrow(x)]] <- 1 # put a 1 at start of each
row
>>          rows <- as.integer(cumsum(rows)) # and finish with a cumsum
>>
>>          # make a vector indicating the column factor of each nonzero
>>          f <- fac[x at ja]
>>
>>          # aggregate by row,f
>>          y <- tapply(x at ra, list(rows,f), sum)
>>
>>          # sparsify it
>>          y[is.na(y)] <- 0  # change tapply NAs to as.matrix.csr 0s
>>          y <- as.matrix.csr(y)
>>
>>          y
>> }
>>

R help - Feb 2007 - memory-efficient column aggregation of a sparse matrix

[R] memory-efficient column aggregation of a sparse matrix

[R] memory-efficient column aggregation of a sparse matrix

[R] memory-efficient column aggregation of a sparse matrix