Ortiz-Bobea, Ariel
2014-May-01 19:48 UTC
[R] speeding up applying hist() over rows of a matrix
Hello everyone,

I'm trying to construct bins for each row in a matrix, using apply() in combination with hist(). Performing this binning for a 10K-by-50 matrix takes about 5 seconds, but only 0.5 seconds for a 1K-by-500 matrix with the same number of elements. Since the time scales with the number of rows rather than with the amount of data, this suggests the bottleneck is the per-row overhead of apply() rather than the calculations going on inside hist().

My initial idea is to process as many columns at once as makes sense for the intended use. However, I still have a very large number of rows to process, and I would appreciate any feedback on how to speed this up.

Any thoughts?

Thanks,

Ariel

Here is the illustration:

# create data
m1 <- matrix(10*rnorm(50*10^4), ncol=50)    # 10,000 rows x 50 columns
m2 <- matrix(10*rnorm(50*10^4), ncol=500)   # 1,000 rows x 500 columns

# compute bins
bins <- seq(-100, 100, 1)
system.time({ out1 <- t(apply(m1, 1, function(x) hist(x, breaks=bins, plot=FALSE)$counts)) })
system.time({ out2 <- t(apply(m2, 1, function(x) hist(x, breaks=bins, plot=FALSE)$counts)) })

---
Ariel Ortiz-Bobea
Fellow
Resources for the Future
1616 P Street, N.W.
Washington, DC 20036
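One rough way to check that the time goes into the per-row calls rather than the binning arithmetic (a sketch reusing the m1 and bins defined above; this check is not part of the original message) is to compare repeated single-row hist() calls against one vectorized cut() over all of the data:

# Sketch only: isolate the per-call overhead by binning the same 50 values
# once per row of m1 (10,000 hist() calls) ...
x <- m1[1, ]
system.time(for (i in seq_len(nrow(m1))) hist(x, breaks = bins, plot = FALSE))
# ... versus binning all 500,000 values in a single vectorized call.
system.time(cut(as.vector(m1), breaks = bins))

If the per-row overhead dominates, the loop should take on the order of the 5 seconds reported above, while the single cut() call should be far cheaper.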
William Dunlap
2014-May-02 16:23 UTC
[R] speeding up applying hist() over rows of a matrix
Your original code, as a function of 'm' and 'bins', is

f0 <- function (m, bins) {
    t(apply(m, 1, function(x) hist(x, breaks = bins, plot = FALSE)$counts))
}

and the time it takes to run on your m1 is about 5 s. on my machine:

> system.time(r0 <- f0(m1, bins))
   user  system elapsed
   4.95    0.00    5.02

hist(x, breaks=bins) is essentially tabulate(cut(x, bins), nbins=length(bins)-1). See how much it speeds things up by replacing hist() with tabulate(cut()):

f1 <- function (m, bins) {
    nbins <- length(bins) - 1L
    t(apply(m, 1, function(x) tabulate(cut(x, bins), nbins = nbins)))
}

That doesn't help with the time, but it does give the same output:

> system.time(r1 <- f1(m1, bins))
   user  system elapsed
   4.85    0.10    5.35
> identical(r0, r1)
[1] TRUE

Now try speeding it up by calling cut() on the whole matrix first and then applying tabulate() to each row, as in

f2 <- function (m, bins) {
    nbins <- length(bins) - 1L
    m <- array(as.integer(cut(m, bins)), dim = dim(m))
    t(apply(m, 1, tabulate, nbins = nbins))
}

That saves quite a bit of time and gives the same output:

> system.time(r2 <- f2(m1, bins))
   user  system elapsed
   0.25    0.00    0.25
> identical(r0, r2)
[1] TRUE

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Thu, May 1, 2014 at 12:48 PM, Ortiz-Bobea, Ariel <Ortiz-Bobea at rff.org> wrote:
> [original message quoted above]
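The same idea can be pushed one step further by removing the per-row apply() entirely: bin every element with one cut() call, combine the row index and bin index into a single integer code, and count all (row, bin) pairs with a single tabulate(). This is only a sketch going beyond the thread; the function name f3 and the indexing scheme are not from the original messages.

# Sketch only (not from the thread): fully vectorized row-wise binning.
f3 <- function(m, bins) {
    nbins <- length(bins) - 1L
    idx <- as.integer(cut(m, bins))                  # bin index per element, NA if out of range
    code <- (as.integer(row(m)) - 1L) * nbins + idx  # unique code for each (row, bin) pair
    counts <- tabulate(code, nbins = nrow(m) * nbins)  # NA codes are silently ignored
    matrix(counts, nrow = nrow(m), ncol = nbins, byrow = TRUE)
}

Assuming all values fall inside the bin range (as they do for the m1 above), f3(m1, bins) should give the same counts as f2(m1, bins); elements outside the range would simply be dropped rather than triggering the error that hist() would raise.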