tomdharray at gmail.com
2016-Jan-13 01:50 UTC
[R] Scaling rows of a large Matrix::sparseMatrix()
Hello R-Users,

I'm looking for a way to scale the rows of a sparse matrix M with about
57,000 rows, 14,000 columns, and 238,000 non-zero matrix elements; see
example code below.

Usually I'd use the base::scale() function (see sample code), but it
freezes my computer. The same happens when I try to run a for loop over
the matrix rows.

The conversion with as.matrix() yields a 5.8 Gb large object, which
appears too large for scale().

So my question is: How can the rows of a large sparse matrix be
efficiently scaled?

Thanks and regards,

Dirk


### Hardware/Session Info
Intel Core i7 w/ 12 Gb RAM
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

### Example Code
library(Matrix)
set.seed(42)

## These are exemplary values for my real "problem matrix"
N_ROW <- 56743
N_COL <- 13648
SIZE  <- 238283
PROB  <- c(0.050, 0.050, 0.099, 0.149, 0.198, 0.178, 0.119,
           0.079, 0.0297, 0.0198, 0.001, 0.001, 0.001)

## get some random values to populate the sparse matrix
x <- do.call(
  what = rbind,
  args = lapply(X = 1:N_ROW,
                FUN = function(i)
                  expand.grid(i,
                              sample(x = 1:N_COL,
                                     size = sample(1:15, 1),
                                     replace = TRUE)
                  )
  )
)
x[,3] <- sample(x = 1:13, size = nrow(x),
                replace = TRUE, prob = PROB)

## build the sparse matrix
M <- Matrix::sparseMatrix(
  dims = c(N_ROW, N_COL),
  i = x[,1],
  j = x[,2],
  x = x[,3]
)
print(format(object.size(M), units = "auto"))

## *******************************************
## Scaling the rows of M

## scale() lets my computer freeze
# M <- scale(t(M), center = FALSE, scale(Matrix::rowSums(M)))

## this appears to be not elegant at all and takes forever
# rwsms <- Matrix::rowSums(M)
# for (i in 1:nrow(M)) M[i,] <- M[i,]/rwsms[[i]]
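For scale, the 5.8 Gb figure reported above follows from the matrix dimensions alone, since a dense double matrix needs 8 bytes per cell whether the cell is zero or not. The sketch below reproduces that arithmetic and sets it against a rough estimate of the compressed sparse-column payload. Two assumptions are made here that are not in the post: SIZE is taken as the number of stored entries, and per-object overhead is ignored.

```r
# Dimensions taken from the example code in the post.
N_ROW <- 56743
N_COL <- 13648
NNZ   <- 238283   # assumption: SIZE from the post = number of stored entries

# Dense storage: 8 bytes per double, for every cell.
dense_bytes <- N_ROW * N_COL * 8
dense_gib   <- dense_bytes / 2^30

# Rough compressed sparse-column payload (assumption, overhead ignored):
# 8 bytes per value + 4 bytes per row index, plus N_COL + 1 column pointers.
sparse_bytes <- NNZ * (8 + 4) + (N_COL + 1) * 4
sparse_mib   <- sparse_bytes / 2^20

cat(sprintf("dense:  %.1f GiB\n", dense_gib))    # dense:  5.8 GiB
cat(sprintf("sparse: %.1f MiB\n", sparse_mib))   # sparse: 2.8 MiB
```

Any approach that materialises the dense form, even temporarily, has to find those gigabytes, which is consistent with the freeze described in the post.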
> On 13 Jan 2016, at 02:50, tomdharray at gmail.com wrote:
>
> So my question is: How can the rows of a large sparse matrix be
> efficiently scaled?

If you're not picky about the particular storage format, the "wordspace"
package

    wordspace.r-forge.r-project.org

has an efficient scaleMargins() function, which can be made to do what
you need in combination with rowNorms() and colNorms(); cf. the trivial
implementation of normalize.rows().

These functions only work with a dgCMatrix and will try to coerce any
other sparseMatrix to this format.

Best,
Stefan
Hello, Dirk,

maybe I'm missing something, but to avoid your for-loop-approach doesn't

M <- M/Matrix::rowSums(M)

do what you want?

 Hth -- Gerrit

---------------------------------------------------------------------
Dr. Gerrit Eichner                   Mathematical Institute, Room 212
gerrit.eichner at math.uni-giessen.de   Justus-Liebig-University Giessen
Tel: +49-(0)641-99-32104          Arndtstr. 2, 35392 Giessen, Germany
Fax: +49-(0)641-99-32109            uni-giessen.de/eichner
---------------------------------------------------------------------

> Hello R-Users,
>
> I'm looking for a way to scale the rows of a sparse matrix M with about
> 57,000 rows, 14,000 columns, and 238,000 non-zero matrix elements; see
> example code below.
>
> Usually I'd use the base::scale() function (see sample code), but it
> freezes my computer. The same happens when I try to run a for loop over
> the matrix rows.
>
> The conversion with as.matrix() yields a 5.8 Gb large object, which
> appears too large for scale().
>
>
> So my question is: How can the rows of a large sparse matrix be
> efficiently scaled?
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
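A related formulation of the row-wise division suggested above is to left-multiply by a diagonal matrix of reciprocal row sums, which keeps the result in sparse form. This is a minimal sketch and was not proposed in the thread. The small random matrix, its dimensions, and the guard for empty rows are illustrative assumptions.

```r
library(Matrix)

set.seed(1)
# Hypothetical small stand-in for the real 57,000 x 14,000 matrix.
n_row <- 1000; n_col <- 500; nnz <- 5000
M <- sparseMatrix(i = sample(n_row, nnz, replace = TRUE),
                  j = sample(n_col, nnz, replace = TRUE),
                  x = as.numeric(sample(1:13, nnz, replace = TRUE)),
                  dims = c(n_row, n_col))

rs <- Matrix::rowSums(M)
nonempty <- rs > 0
# Guard (an addition for this sketch): rows without entries have sum 0;
# dividing by 1 instead leaves them as all-zero rows.
rs[!nonempty] <- 1

# Left-multiplying by a diagonal matrix scales row i by its i-th diagonal entry.
M_scaled <- Diagonal(x = 1 / rs) %*% M

# Every non-empty row should now sum to 1.
print(range(Matrix::rowSums(M_scaled)[nonempty]))
```

Whether this avoids the memory problem on a given matrix depends on the Matrix version and the available RAM, so it is worth trying on a subset of rows first.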
tomdharray at gmail.com
2016-Jan-14 01:11 UTC
[R] Scaling rows of a large Matrix::sparseMatrix()
Hello Gerrit,

Thanks. Your proposal works in general, but I get memory allocation
errors with my actual 57,000 x 14,000 matrix. The fix I now use is to
scale the data before I build the matrix; see below.

Cheers,

Dirk

## Code Start -----------------------------
library(parallel)

## append a 4th column: each value divided by the sum of values in its row
rowscale <- function(.x) cbind(.x[,1:3], .x[,3] / sum(.x[,3]))

## one data frame of (row, column, value) triplets per matrix row
y <- split(x = x, f = x[,1])

## "PSOCK" ships with parallel; type = "SOCK" needs the snow package
localSocketCluster <- parallel::makeCluster(spec = 4, type = "PSOCK")
y <- parallel::parLapply(cl = localSocketCluster, X = y, fun = rowscale)
parallel::stopCluster(cl = localSocketCluster)
x <- do.call(what = rbind, args = y)

## build the sparse matrix from the scaled values in column 4
M <- Matrix::sparseMatrix(dims = c(N_ROW, N_COL),
                          i = x[,1], j = x[,2], x = x[,4])
## Code End -----------------------------


On 16-01-13 03:23 AM, Gerrit Eichner wrote:
> Hello, Dirk,
>
> maybe I'm missing something, but to avoid your for-loop-approach doesn't
>
> M <- M/Matrix::rowSums(M)
>
> do what you want?
>
>  Hth -- Gerrit
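The pre-scaling idea in the message above (divide each value by its row total before building the matrix) can also be written as one vectorized step, without splitting the table or starting a cluster. The sketch below uses base R's ave() for the per-row totals. The data frame `trip` and its column names are hypothetical stand-ins for the `x` table in the thread, and the sizes are chosen only for illustration.

```r
library(Matrix)

set.seed(42)
# Hypothetical small triplet table (row index, column index, value).
n_row <- 200; n_col <- 50; n_entries <- 1000
trip <- data.frame(i = sample(n_row, n_entries, replace = TRUE),
                   j = sample(n_col, n_entries, replace = TRUE),
                   v = as.numeric(sample(1:13, n_entries, replace = TRUE)))

# ave() returns, for each entry, the sum of v over all entries that share
# its row index, so the scaling is a single vectorized division.
trip$v_scaled <- trip$v / ave(trip$v, trip$i, FUN = sum)

# Build the matrix from the already scaled values. Repeated (i, j) pairs are
# summed by sparseMatrix(), so each non-empty row still sums to 1.
M <- sparseMatrix(i = trip$i, j = trip$j, x = trip$v_scaled,
                  dims = c(n_row, n_col))

rs <- Matrix::rowSums(M)
print(range(rs[rs > 0]))
```

Because the division happens on the triplet table, no intermediate matrix is created beyond the final sparse one.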