tomdharray at gmail.com
2016-Jan-13 01:50 UTC
[R] Scaling rows of a large Matrix::sparseMatrix()
Hello R-Users,
I'm looking for a way to scale the rows of a sparse matrix M with about
57,000 rows, 14,000 columns, and 238,000 non-zero matrix elements; see
example code below.
Usually I'd use the base::scale() function (see sample code), but it
freezes my computer. The same happens when I run a for loop over
the matrix rows.
Converting with as.matrix() yields a 5.8 GB dense object, which
appears to be too large for scale().
So my question is: How can the rows of a large sparse matrix be
efficiently scaled?
Thanks and regards,
Dirk
### Hardware/Session Info
Intel Core i7 w/ 12 GB RAM
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS
### Example Code
library(Matrix)
set.seed(42)
## These are example values for my real "problem matrix"
N_ROW <- 56743
N_COL <- 13648
SIZE <- 238283
PROB <- c(0.050, 0.050, 0.099, 0.149, 0.198, 0.178, 0.119,
          0.079, 0.0297, 0.0198, 0.001, 0.001, 0.001)
## get some random values to populate the sparse matrix
x <- do.call(
  what = rbind,
  args = lapply(X = 1:N_ROW,
                FUN = function(i)
                  expand.grid(i,
                              sample(x = 1:N_COL,
                                     size = sample(1:15, 1),
                                     replace = TRUE)
                  )
  )
)
x[,3] <- sample(x = 1:13, size = nrow(x),
                replace = TRUE, prob = PROB)
## build the sparse matrix
M <- Matrix::sparseMatrix(
  dims = c(N_ROW, N_COL),
  i = x[,1],
  j = x[,2],
  x = x[,3]
)
print(format(object.size(M), units = "auto"))
## *******************************************
## Scaling the rows of M
## scale() makes my computer freeze
# M <- t(scale(t(M), center = FALSE, scale = Matrix::rowSums(M)))
## this is not elegant at all and takes forever
# rwsms <- Matrix::rowSums(M)
# for (i in 1:nrow(M)) M[i,] <- M[i,]/rwsms[[i]]
> On 13 Jan 2016, at 02:50, tomdharray at gmail.com wrote:
>
> So my question is: How can the rows of a large sparse matrix be
> efficiently scaled?
If you're not picky about the particular storage format, the "wordspace"
package
http://wordspace.r-forge.r-project.org/
has an efficient scaleMargins() function, which can be made to do what you
need in combination with rowNorms() and colNorms(); cf. the trivial
implementation of normalize.rows().
These functions only work with a dgCMatrix and will try to coerce any other
sparseMatrix to this format.
Best,
Stefan
Hello, Dirk,
maybe I'm missing something, but to avoid your for-loop approach, doesn't
M <- M/Matrix::rowSums(M)
do what you want?
Hth -- Gerrit
---------------------------------------------------------------------
Dr. Gerrit Eichner                   Mathematical Institute, Room 212
gerrit.eichner at math.uni-giessen.de   Justus-Liebig-University Giessen
Tel: +49-(0)641-99-32104          Arndtstr. 2, 35392 Giessen, Germany
Fax: +49-(0)641-99-32109          http://www.uni-giessen.de/eichner
---------------------------------------------------------------------
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
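To see what the one-liner does, here is a small sketch on a toy matrix (M_small and its values are made up for illustration, not taken from the thread). Dividing by a vector with one element per row recycles that vector down the columns, so every entry in row i is divided by the i-th row sum. The sketch assumes no row is empty, since an all-zero row would be divided by zero.

```r
library(Matrix)

# Toy 3 x 4 sparse matrix (illustrative values only).
M_small <- sparseMatrix(
  i = c(1, 1, 2, 3, 3),
  j = c(1, 3, 2, 1, 4),
  x = c(1, 3, 5, 2, 2),
  dims = c(3, 4)
)

# Row sums are 4, 5, 4; the vector is recycled column-wise,
# so row i is divided by its own sum.
S <- M_small / Matrix::rowSums(M_small)

# Every row of the scaled matrix should now sum to 1.
print(Matrix::rowSums(S))
```

On a matrix of the size described in the thread the result may not stay sparse, depending on the Matrix version, so it seems worth checking memory use on a subset first.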
tomdharray at gmail.com
2016-Jan-14 01:11 UTC
[R] Scaling rows of a large Matrix::sparseMatrix()
Hello Gerrit,
Thanks. Your proposal works in general, but I get memory allocation
errors with my actual 57,000 x 14,000 matrix.
The fix which I now use is to scale the data before I build the matrix;
see below.
Cheers,
Dirk
## Code Start -----------------------------
library(parallel)
## append a 4th column: each value divided by the sum of its row's values
rowscale <- function(.x) cbind(.x[,1:3], .x[,3] / sum(.x[,3]))
## one data frame per matrix row (column 1 holds the row index)
y <- split(x = x, f = x[,1])
## "PSOCK" is the socket type built into parallel; "SOCK" requires snow
localSocketCluster <- parallel::makeCluster(spec = 4, type = "PSOCK")
y <- parallel::parLapply(cl = localSocketCluster, X = y, fun = rowscale)
parallel::stopCluster(cl = localSocketCluster)
x <- do.call(what = rbind, args = y)
## build the sparse matrix
M <- Matrix::sparseMatrix(dims = c(N_ROW, N_COL),
                          i = x[,1], j = x[,2], x = x[,4])
## Code End -----------------------------
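The same pre-scaling idea can probably be written without splitting the data or starting a cluster. The sketch below is an assumption about an alternative, not the code used in the thread: it uses base R's ave() to compute, for every triplet, the sum of all values that share its row index. The data frame x_small and its column names are invented for illustration and mirror the layout of x (row index, column index, value).

```r
library(Matrix)

# Toy triplets in the same layout as `x` in the thread:
# i = row index, j = column index, v = value (illustrative values only).
x_small <- data.frame(
  i = c(1, 1, 2, 3, 3),
  j = c(1, 3, 2, 1, 4),
  v = c(1, 3, 5, 2, 2)
)

# ave() returns the per-row total aligned with each triplet,
# so one vectorized division scales every value by its row sum.
x_small$scaled <- x_small$v / ave(x_small$v, x_small$i, FUN = sum)

# Build the sparse matrix from the already scaled values.
M_small <- sparseMatrix(
  dims = c(3, 4),
  i = x_small$i, j = x_small$j, x = x_small$scaled
)
print(Matrix::rowSums(M_small))
```

Because sparseMatrix() adds up the values of repeated (i, j) pairs, the rows still sum to 1 when the triplets contain duplicates.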
On 16-01-13 03:23 AM, Gerrit Eichner wrote:
> Hello, Dirk,
>
> maybe I'm missing something, but to avoid your for-loop-approach doesn't
>
> M <- M/Matrix::rowSums(M)
>
> do what you want?
>
> Hth -- Gerrit
>
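For the memory allocation errors mentioned in this message, one option that was not proposed in the thread is to scale the finished matrix by multiplying it from the left with a sparse diagonal matrix of reciprocal row sums. The helper name scale_rows_sparse is hypothetical, and treating all-zero rows as "leave unchanged" is an assumption of this sketch; Diagonal() and rowSums() are from the Matrix package.

```r
library(Matrix)

# Hypothetical helper (not from the thread): scale each row of a sparse
# matrix to sum 1 by left-multiplying with a sparse diagonal matrix.
scale_rows_sparse <- function(M) {
  rs <- Matrix::rowSums(M)
  # Assumption: all-zero rows are left as they are instead of dividing by zero.
  inv <- ifelse(rs == 0, 0, 1 / rs)
  Matrix::Diagonal(x = inv) %*% M
}

# Toy 4 x 4 matrix with an empty third row (illustrative values only).
M_small <- sparseMatrix(
  i = c(1, 1, 2, 4, 4),
  j = c(1, 3, 2, 1, 4),
  x = c(1, 3, 5, 2, 2),
  dims = c(4, 4)
)

S <- scale_rows_sparse(M_small)
print(S)
print(Matrix::rowSums(S))
```

The product only touches the stored non-zero entries, so the result should remain a sparse matrix and no dense intermediate of the full dimensions is needed.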