Ortiz-Bobea, Ariel
2014-May-01 19:48 UTC
[R] speeding up applying hist() over rows of a matrix
Hello everyone,
I'm trying to construct bins for each row in a matrix. I'm using apply()
in combination with hist() to do this. Performing this binning for a 10K-by-50
matrix takes about 5 seconds, but only 0.5 seconds for a 1K-by-500 matrix. This
suggests the bottleneck is accessing rows in apply() rather than the
calculations going on inside hist().
My initial idea is to process as many columns at once as makes sense for the
intended use. However, I still have many rows to process, and I would
appreciate any feedback on how to speed this up.
Any thoughts?
Thanks,
Ariel
Here is the illustration:
# create data
m1 <- matrix(10*rnorm(50*10^4), ncol=50)
m2 <- matrix(10*rnorm(50*10^4), ncol=500)
# compute bins
bins <- seq(-100,100,1)
system.time({ out1 <- t(apply(m1, 1, function(x) hist(x, breaks=bins, plot=FALSE)$counts)) })
system.time({ out2 <- t(apply(m2, 1, function(x) hist(x, breaks=bins, plot=FALSE)$counts)) })
---
Ariel Ortiz-Bobea
Fellow
Resources for the Future
1616 P Street, N.W.
Washington, DC 20036
William Dunlap
2014-May-02 16:23 UTC
[R] speeding up applying hist() over rows of a matrix
Your original code, as a function of 'm' and 'bins' is
f0 <- function (m, bins) {
  t(apply(m, 1, function(x) hist(x, breaks = bins, plot = FALSE)$counts))
}
and the time it takes to run on your m1 is about 5 s. on my machine:
> system.time(r0 <- f0(m1, bins))
   user  system elapsed
   4.95    0.00    5.02
hist(x,breaks=bins) is essentially tabulate(cut(x,bins),nbins=length(bins)-1).
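That equivalence is easy to check on a single vector (a quick sketch added for illustration, not from the original message; it assumes the data fall strictly inside the bin range, since cut() returns NA for values outside it):

```r
set.seed(1)
x <- 10 * rnorm(50)              # one row's worth of data
bins <- seq(-100, 100, 1)
h.counts <- hist(x, breaks = bins, plot = FALSE)$counts
t.counts <- tabulate(cut(x, bins), nbins = length(bins) - 1L)
identical(h.counts, t.counts)
```

Both are integer vectors of length length(bins) - 1, so identical() compares them exactly.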
See how much it speeds things up by replacing hist() with tabulate(cut()):
f1 <- function (m, bins) {
  nbins <- length(bins) - 1L
  t(apply(m, 1, function(x) tabulate(cut(x, bins), nbins = nbins)))
}
That doesn't help with the time, but it does give the same output:
> system.time(r1 <- f1(m1, bins))
   user  system elapsed
   4.85    0.10    5.35
> identical(r0, r1)
[1] TRUE
Now try speeding it up by calling cut() on the whole matrix first
and then applying tabulate to each row, as in
f2 <- function (m, bins) {
  nbins <- length(bins) - 1L
  m <- array(as.integer(cut(m, bins)), dim = dim(m))
  t(apply(m, 1, tabulate, nbins = nbins))
}
That saves quite a bit of time and gives the same output:
> system.time(r2 <- f2(m1, bins))
   user  system elapsed
   0.25    0.00    0.25
> identical(r0, r2)
[1] TRUE
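Going one step further than the reply does, the apply() loop can be removed entirely by folding the row number and the bin code into a single index and calling tabulate() once over the whole matrix. f3 below is a hypothetical sketch (not from the original thread); like f2 it assumes all values fall inside the bin range, since cut() returns NA otherwise:

```r
f3 <- function (m, bins) {
  nbins <- length(bins) - 1L
  # bin code for every element of the matrix in a single cut() call
  codes <- as.integer(cut(m, bins))
  # fold (row, bin) into one index: row + (bin - 1) * nrow(m)
  idx <- row(m) + (codes - 1L) * nrow(m)
  # one tabulate() over all (row, bin) pairs, reshaped to rows-by-bins
  matrix(tabulate(idx, nbins = nrow(m) * nbins), nrow = nrow(m))
}
```

On the same m1 this should agree with r2, e.g. identical(r2, f3(m1, bins)), while avoiding the per-row function-call overhead of apply() altogether.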
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Thu, May 1, 2014 at 12:48 PM, Ortiz-Bobea, Ariel <Ortiz-Bobea at rff.org> wrote:
> Hello everyone,
>
> I'm trying to construct bins for each row in a matrix. I'm using apply()
> in combination with hist() to do this. [...]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help