Henrik Bengtsson
2012-Oct-03 00:19 UTC
[Rd] Fastest non-overlapping binning mean function out there?
Hi,

I'm looking for a super-duper fast mean/sum binning implementation available
in R, and before implementing z = binnedMeans(x, y) in native code myself,
does anyone know of an existing function/package for this? I'm sure it
already exists.

So, given data (x,y) and B bins bx[1] < bx[2] < ... < bx[B] < bx[B+1], I'd
like to calculate the binned means (or sums) 'z' such that
z[1] = mean(x[bx[1] <= x & x < bx[2]]), z[2] = mean(x[bx[2] <= x & x < bx[3]]),
..., z[B]. Let's assume there are no missing values and 'x' and 'bx' are
already ordered. The length of 'x' is on the order of 10,000 million. The
number of elements in each bin varies.

Thanks,

Henrik
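For reference, a direct but slow R spelling of the binning Henrik describes
might look like the sketch below; the name binnedMeans and the half-open bins
[bx[b], bx[b+1]) are taken from his description, while the implementation
itself is only illustrative (it rescans x once per bin, which is exactly the
cost a fast version needs to avoid).

  ## Slow reference implementation of the binning described above:
  ## bin b is the half-open interval [bx[b], bx[b+1]).
  binnedMeans <- function(x, bx) {
    B <- length(bx) - 1L
    vapply(seq_len(B),
           function(b) mean(x[bx[b] <= x & x < bx[b + 1L]]),
           numeric(1L))
  }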
Hervé Pagès
2012-Oct-03 01:11 UTC
[Rd] Fastest non-overlapping binning mean function out there?
Hi Henrik,

On 10/02/2012 05:19 PM, Henrik Bengtsson wrote:
> Hi,
>
> I'm looking for a super-duper fast mean/sum binning implementation
> available in R, and before implementing z = binnedMeans(x, y) in native
> code myself, does anyone know of an existing function/package for
> this? I'm sure it already exists. So, given data (x,y) and B bins
> bx[1] < bx[2] < ... < bx[B] < bx[B+1], I'd like to calculate the
> binned means (or sums) 'z' such that z[1] = mean(x[bx[1] <= x & x <
> bx[2]]), z[2] = mean(x[bx[2] <= x & x < bx[3]]), ..., z[B]. Let's
> assume there are no missing values and 'x' and 'bx' are already
> ordered. The length of 'x' is on the order of 10,000 million. The
> number of elements in each bin varies.

You didn't say if you have a lot of bins or not. If you don't have a lot of
bins (e.g. < 10000), something like

  aggregate(x, by=list(bin=findInterval(x, bx)), FUN=mean)

might not be too bad:

  > x <- seq(0, 8, by=0.1)
  > bx <- c(2, 2.5, 4, 5.8)
  > aggregate(x, by=list(bin=findInterval(x, bx)), FUN=mean)
    bin    x
  1   0 0.95
  2   1 2.20
  3   2 3.20
  4   3 4.85
  5   4 6.90

I didn't try it on a 10,000-million-element vector though (and I've no idea
how I could do this).

H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
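Hervé's aggregate() call routes the grouping through data-frame machinery; as
an illustrative sketch (not part of his message), the same findInterval()
grouping can also be handed to tapply(), which tends to carry less overhead
when there are many bins:

  ## Same grouping idea with less machinery than aggregate(); an
  ## untested-at-scale sketch rather than a benchmarked recommendation.
  g <- findInterval(x, bx)   # bin index 0..length(bx) for each element of x
  z <- tapply(x, g, mean)    # binned means, named by bin index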
Terry Therneau
2012-Oct-03 12:50 UTC
[Rd] Fastest non-overlapping binning mean function out there?
Look at rowsum. It's pretty fast C code.

Terry T

On 10/03/2012 05:00 AM, r-devel-request at r-project.org wrote:
> Hi,
>
> I'm looking for a super-duper fast mean/sum binning implementation
> available in R, and before implementing z = binnedMeans(x, y) in native
> code myself, does anyone know of an existing function/package for
> this? I'm sure it already exists. So, given data (x,y) and B bins
> bx[1] < bx[2] < ... < bx[B] < bx[B+1], I'd like to calculate the
> binned means (or sums) 'z' such that z[1] = mean(x[bx[1] <= x & x <
> bx[2]]), z[2] = mean(x[bx[2] <= x & x < bx[3]]), ..., z[B]. Let's
> assume there are no missing values and 'x' and 'bx' are already
> ordered. The length of 'x' is on the order of 10,000 million. The
> number of elements in each bin varies.
>
> Thanks,
>
> Henrik
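Terry's reply does not include code; one way rowsum() could be wired up for
binned means, borrowing the findInterval() grouping from Hervé's reply, is
sketched below. The combination is illustrative rather than something taken
from the thread.

  ## Binned means via rowsum(): a single pass over a two-column matrix
  ## yields per-bin sums and counts, and the means are their ratio.
  g <- findInterval(x, bx)                    # bin index for each element of x
  totals <- rowsum(cbind(sum = x, n = 1), group = g)
  z <- totals[, "sum"] / totals[, "n"]        # means, named by bin index

Note that rowsum() returns one row per non-empty bin only, so empty bins have
to be added back explicitly if they must appear in the output.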
Martin Morgan
2012-Oct-03 13:47 UTC
[Rd] Fastest non-overlapping binning mean function out there?
On 10/02/2012 05:19 PM, Henrik Bengtsson wrote:
> Hi,
>
> I'm looking for a super-duper fast mean/sum binning implementation
> available in R, and before implementing z = binnedMeans(x, y) in native
> code myself, does anyone know of an existing function/package for
> this? I'm sure it already exists. So, given data (x,y) and B bins
> bx[1] < bx[2] < ... < bx[B] < bx[B+1], I'd like to calculate the
> binned means (or sums) 'z' such that z[1] = mean(x[bx[1] <= x & x <
> bx[2]]), z[2] = mean(x[bx[2] <= x & x < bx[3]]), ..., z[B]. Let's
> assume there are no missing values and 'x' and 'bx' are already
> ordered. The length of 'x' is on the order of 10,000 million. The
> number of elements in each bin varies.

Since x and bx are ordered (sorting x would be expensive), the C code seems
pretty straight-forward and memory-efficient -- create a result vector as
long as bx, then iterate through x, accumulating n and its running sum until
x[i] >= the current bin's upper boundary bx[j]. (I think R's implementation
of mean() actually pays more attention to numerical error, but with this much
data...)

library(inline)
binmean <- cfunction(signature(x="numeric", bx="numeric"), "
    int nx = Rf_length(x), nb = Rf_length(bx), i, j, n;
    SEXP ans = PROTECT(NEW_NUMERIC(nb));
    double sum, *xp = REAL(x), *bxp = REAL(bx), *ansp = REAL(ans);

    sum = j = n = 0;
    for (i = 0; i < nx; ++i) {
        /* close every bin whose upper boundary x[i] has passed */
        while (xp[i] >= bxp[j]) {
            ansp[j++] = n > 0 ? sum / n : 0;
            sum = n = 0;
        }
        n += 1;            /* accumulate count and sum for the current bin */
        sum += xp[i];
    }
    ansp[j] = n > 0 ? sum / n : 0;   /* flush the final bin */
    UNPROTECT(1);
    return ans;
")

with a test case

  nx <- 4e7
  nb <- 1e3
  x <- sort(runif(nx))
  bx <- do.call(seq, c(as.list(range(x)), length.out=nb))

leading to

  > bx1 <- c(bx[-1], bx[nb] + 1)
  > system.time(res <- binmean(x, bx1))
     user  system elapsed
    0.052   0.000   0.050

Martin

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
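Because x is already sorted, an R-only alternative (again not from the
thread) can recover the same per-bin sums and counts from prefix sums and
findInterval(), trading Martin's C loop for a cumsum() whose rounding error
grows with the length of x. Using the x, bx1 and binmean() defined above,
something along these lines should agree with the C version up to that
rounding:

  ## R-only sketch exploiting the sorted 'x': prefix sums + findInterval()
  ## give per-bin sums and counts without an explicit loop over elements.
  binmeanR <- function(x, bx1) {
    below  <- findInterval(bx1, x, left.open = TRUE)  # count of x strictly below each upper boundary
    counts <- diff(c(0L, below))
    prefix <- c(0, cumsum(x))                         # prefix[k + 1] = sum of the first k elements
    sums   <- diff(prefix[c(1L, below + 1L)])
    ifelse(counts > 0, sums / counts, 0)              # empty bins -> 0, matching binmean()
  }

  all.equal(binmeanR(x, bx1), binmean(x, bx1), tolerance = 1e-6)

Here left.open = TRUE makes findInterval() count elements strictly below each
boundary, so values sitting exactly on a boundary land in the same bin as the
C code puts them.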