Henrik Bengtsson
2012-Oct-03 00:19 UTC
[Rd] Fastest non-overlapping binning mean function out there?
Hi,

I'm looking for a super-duper fast mean/sum binning implementation available
in R, and before implementing z = binnedMeans(x, y) in native code myself,
does anyone know of an existing function/package for this? I'm sure it
already exists.

So, given data (x,y) and B bins bx[1] < bx[2] < ... < bx[B] < bx[B+1], I'd
like to calculate the binned means (or sums) 'z' such that
z[1] = mean(x[bx[1] <= x & x < bx[2]]), z[2] = mean(x[bx[2] <= x & x < bx[3]]),
..., z[B]. Let's assume there are no missing values and 'x' and 'bx' are
already ordered. The length of 'x' is on the order of 10,000 million. The
number of elements in each bin varies.

Thanks,

Henrik
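For reference, a direct but slow R spelling of the binning Henrik describes
might look like the sketch below; the name binnedMeans and the half-open bins
[bx[b], bx[b+1]) are taken from his description, while the implementation
itself is only illustrative (it rescans x once per bin, which is exactly the
cost a fast version needs to avoid).

  ## Slow reference implementation of the binning described above:
  ## bin b is the half-open interval [bx[b], bx[b+1]).
  binnedMeans <- function(x, bx) {
    B <- length(bx) - 1L
    vapply(seq_len(B),
           function(b) mean(x[bx[b] <= x & x < bx[b + 1L]]),
           numeric(1L))
  }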
Hervé Pagès
2012-Oct-03 01:11 UTC
[Rd] Fastest non-overlapping binning mean function out there?
Hi Henrik,

On 10/02/2012 05:19 PM, Henrik Bengtsson wrote:
> Hi,
>
> I'm looking for a super-duper fast mean/sum binning implementation
> available in R, and before implementing z = binnedMeans(x, y) in native
> code myself, does anyone know of an existing function/package for
> this? I'm sure it already exists. So, given data (x,y) and B bins
> bx[1] < bx[2] < ... < bx[B] < bx[B+1], I'd like to calculate the
> binned means (or sums) 'z' such that z[1] = mean(x[bx[1] <= x & x <
> bx[2]]), z[2] = mean(x[bx[2] <= x & x < bx[3]]), ..., z[B]. Let's
> assume there are no missing values and 'x' and 'bx' are already
> ordered. The length of 'x' is on the order of 10,000 million. The
> number of elements in each bin varies.

You didn't say if you have a lot of bins or not. If you don't have a lot of
bins (e.g. < 10000), something like

  aggregate(x, by=list(bin=findInterval(x, bx)), FUN=mean)

might not be too bad:

  > x <- seq(0, 8, by=0.1)
  > bx <- c(2, 2.5, 4, 5.8)
  > aggregate(x, by=list(bin=findInterval(x, bx)), FUN=mean)
    bin    x
  1   0 0.95
  2   1 2.20
  3   2 3.20
  4   3 4.85
  5   4 6.90

I didn't try it on a 10,000-million-element vector though (and I've no idea
how I could do this).

H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
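Hervé's aggregate() call routes the grouping through data-frame machinery; as
an illustrative sketch (not part of his message), the same findInterval()
grouping can also be handed to tapply(), which tends to carry less overhead
when there are many bins:

  ## Same grouping idea with less machinery than aggregate(); an
  ## untested-at-scale sketch rather than a benchmarked recommendation.
  g <- findInterval(x, bx)   # bin index 0..length(bx) for each element of x
  z <- tapply(x, g, mean)    # binned means, named by bin index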
Terry Therneau
2012-Oct-03 12:50 UTC
[Rd] Fastest non-overlapping binning mean function out there?
Look at rowsum. It's pretty fast C code.

Terry T

On 10/03/2012 05:00 AM, r-devel-request at r-project.org wrote:
> Hi,
>
> I'm looking for a super-duper fast mean/sum binning implementation
> available in R, and before implementing z = binnedMeans(x, y) in native
> code myself, does anyone know of an existing function/package for
> this? I'm sure it already exists. So, given data (x,y) and B bins
> bx[1] < bx[2] < ... < bx[B] < bx[B+1], I'd like to calculate the
> binned means (or sums) 'z' such that z[1] = mean(x[bx[1] <= x & x <
> bx[2]]), z[2] = mean(x[bx[2] <= x & x < bx[3]]), ..., z[B]. Let's
> assume there are no missing values and 'x' and 'bx' are already
> ordered. The length of 'x' is on the order of 10,000 million. The
> number of elements in each bin varies.
>
> Thanks,
>
> Henrik
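Terry's reply does not include code; one way rowsum() could be wired up for
binned means, borrowing the findInterval() grouping from Hervé's reply, is
sketched below. The combination is illustrative rather than something taken
from the thread.

  ## Binned means via rowsum(): a single pass over a two-column matrix
  ## yields per-bin sums and counts, and the means are their ratio.
  g <- findInterval(x, bx)                    # bin index for each element of x
  totals <- rowsum(cbind(sum = x, n = 1), group = g)
  z <- totals[, "sum"] / totals[, "n"]        # means, named by bin index

Note that rowsum() returns one row per non-empty bin only, so empty bins have
to be added back explicitly if they must appear in the output.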
Martin Morgan
2012-Oct-03 13:47 UTC
[Rd] Fastest non-overlapping binning mean function out there?
On 10/02/2012 05:19 PM, Henrik Bengtsson wrote:
> Hi,
>
> I'm looking for a super-duper fast mean/sum binning implementation
> available in R, and before implementing z = binnedMeans(x, y) in native
> code myself, does anyone know of an existing function/package for
> this? I'm sure it already exists. So, given data (x,y) and B bins
> bx[1] < bx[2] < ... < bx[B] < bx[B+1], I'd like to calculate the
> binned means (or sums) 'z' such that z[1] = mean(x[bx[1] <= x & x <
> bx[2]]), z[2] = mean(x[bx[2] <= x & x < bx[3]]), ..., z[B]. Let's
> assume there are no missing values and 'x' and 'bx' are already
> ordered. The length of 'x' is on the order of 10,000 million. The
> number of elements in each bin varies.

Since x and bx are ordered (sorting x would be expensive), the C code seems
pretty straight-forward and memory-efficient -- create a result vector as
long as bx, then iterate through x, accumulating n and its running sum until
x[i] >= the current bin's upper boundary bx[j]. (I think R's implementation
of mean() actually pays more attention to numerical error, but with this much
data...)

library(inline)
binmean <- cfunction(signature(x="numeric", bx="numeric"), "
    int nx = Rf_length(x), nb = Rf_length(bx), i, j, n;
    SEXP ans = PROTECT(NEW_NUMERIC(nb));
    double sum, *xp = REAL(x), *bxp = REAL(bx), *ansp = REAL(ans);

    sum = j = n = 0;
    for (i = 0; i < nx; ++i) {
        /* close every bin whose upper boundary x[i] has passed */
        while (xp[i] >= bxp[j]) {
            ansp[j++] = n > 0 ? sum / n : 0;
            sum = n = 0;
        }
        n += 1;            /* accumulate count and sum for the current bin */
        sum += xp[i];
    }
    ansp[j] = n > 0 ? sum / n : 0;   /* flush the final bin */
    UNPROTECT(1);
    return ans;
")

with a test case

  nx <- 4e7
  nb <- 1e3
  x <- sort(runif(nx))
  bx <- do.call(seq, c(as.list(range(x)), length.out=nb))

leading to

  > bx1 <- c(bx[-1], bx[nb] + 1)
  > system.time(res <- binmean(x, bx1))
     user  system elapsed
    0.052   0.000   0.050

Martin

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
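Because x is already sorted, an R-only alternative (again not from the
thread) can recover the same per-bin sums and counts from prefix sums and
findInterval(), trading Martin's C loop for a cumsum() whose rounding error
grows with the length of x. Using the x, bx1 and binmean() defined above,
something along these lines should agree with the C version up to that
rounding:

  ## R-only sketch exploiting the sorted 'x': prefix sums + findInterval()
  ## give per-bin sums and counts without an explicit loop over elements.
  binmeanR <- function(x, bx1) {
    below  <- findInterval(bx1, x, left.open = TRUE)  # count of x strictly below each upper boundary
    counts <- diff(c(0L, below))
    prefix <- c(0, cumsum(x))                         # prefix[k + 1] = sum of the first k elements
    sums   <- diff(prefix[c(1L, below + 1L)])
    ifelse(counts > 0, sums / counts, 0)              # empty bins -> 0, matching binmean()
  }

  all.equal(binmeanR(x, bx1), binmean(x, bx1), tolerance = 1e-6)

Here left.open = TRUE makes findInterval() count elements strictly below each
boundary, so values sitting exactly on a boundary land in the same bin as the
C code puts them.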