Tadeáš Palusga
2015-Mar-05 14:55 UTC
[Rd] Performance issue in stats:::weighted.mean.default method
Hi,

I'm using this mailing list for the first time and I hope this is the
right one. I don't think the following is a bug, but it may be a
performance issue.

In my opinion, there is no need to filter by [w != 0] in the last sum of
the weighted.mean.default method defined in
src/library/stats/R/weighted.mean.R. There is no need to do it because
you can always sum zero terms, and the filtering is expensive (see the
following benchmark snippet):

library(microbenchmark)
x <- sample(500, 5000, replace = TRUE)
w <- sample(1000, 5000, replace = TRUE)/1000 *
     ifelse((sample(10, 5000, replace = TRUE) - 1) > 0, 1, 0)
fun.new  <- function(x, w) {sum(x*w)/sum(w)}
fun.orig <- function(x, w) {sum(x*w[w != 0])/sum(w)}
print(microbenchmark(
    ORIGFN = fun.orig(x, w),
    NEWFN  = fun.new(x, w),
    times = 1000))

# results:
# Unit: microseconds
#    expr     min       lq      mean  median      uq      max neval
#  ORIGFN 190.889 194.6590 210.08952 198.847 202.928 1779.789  1000
#   NEWFN  20.857  21.7175  24.61149  22.080  22.594 1744.014  1000

So my suggestion is to remove the w != 0 check:

Index: weighted.mean.R
===================================================================
--- weighted.mean.R	(revision 67941)
+++ weighted.mean.R	(working copy)
@@ -29,7 +29,7 @@
         stop("'x' and 'w' must have the same length")
     w <- as.double(w) # avoid overflow in sum for integer weights.
     if (na.rm) { i <- !is.na(x); w <- w[i]; x <- x[i] }
-    sum((x*w)[w != 0])/sum(w) # --> NaN in empty case
+    sum(x*w)/sum(w) # --> NaN in empty case
 }

 ## see note for ?mean.Date

I hope I'm not missing something - I really don't see the reason to have
this filtering here.

BR

Tadeas 'donarus' Palusga
Prof Brian Ripley
2015-Mar-05 17:49 UTC
[Rd] Performance issue in stats:::weighted.mean.default method
On 05/03/2015 14:55, Tadeáš Palusga wrote:
> Hi,
> I'm using this mailing list for the first time and I hope this is the
> right one. I don't think that the following is a bug but it can be a
> performance issue.
>
> By my opinion, there is no need to filter by [w != 0] in last sum of
> weighted.mean.default method defined in
> src/library/stats/R/weighted.mean.R. There is no need to do it because
> you can always sum zero numbers and filtering is too expensive (see
> following benchmark snippet)

But 0*x is not necessarily 0, so there is a need to do it ... see

> w <- c(0, 1)
> x <- c(Inf, 1)
> weighted.mean(x, w)
[1] 1
> fun.new(x, w)
[1] NaN

> [...]
>
> I hope i'm not missing something - I really don't see the reason to have
> this filtration here.
>
> BR
>
> Tadeas 'donarus' Palusga
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford
1 South Parks Road, Oxford OX1 3TG, UK
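Ripley's objection is a direct consequence of IEEE 754 arithmetic: 0 * Inf
(and 0 * NaN) is NaN, and NaN propagates through a sum, so dropping the
zero-weight products before summing really does change the result. The
thread's examples are in R, but any IEEE 754 environment behaves the same
way; here is a minimal cross-language illustration in Python:

```python
import math

# In IEEE 754 arithmetic, 0 * Inf is NaN, and NaN propagates through sums.
x = [math.inf, 1.0]
w = [0.0, 1.0]

# Summing all products forms 0.0 * inf = nan, which poisons the mean.
unfiltered = sum(xi * wi for xi, wi in zip(x, w)) / sum(w)

# Skipping zero-weight terms first (as weighted.mean's [w != 0] filter
# does) never forms the 0 * Inf product, so the finite answer survives.
filtered = sum(xi * wi for xi, wi in zip(x, w) if wi != 0) / sum(w)

print(unfiltered)  # nan
print(filtered)    # 1.0
```

This is why the filter is a correctness feature, not just an optimization:
the zero weights still count in the denominator, but their products must
never be materialized.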
Henrik Bengtsson
2015-Mar-05 19:39 UTC
[Rd] Performance issue in stats:::weighted.mean.default method
See weightedMean() in the matrixStats package. It's optimized for data
type, speed and memory, and implemented in native code, so it can avoid
some of these intermediate copies. It's a few times faster than
weighted.mean[.default]():

library(matrixStats)
library(microbenchmark)
n <- 5000
x <- sample(500, n, replace = TRUE)
w <- sample(1000, n, replace = TRUE)/1000 *
     ifelse((sample(10, n, replace = TRUE) - 1) > 0, 1, 0)
fun.new  <- function(x, w) {sum(x*w)/sum(w)}
fun.orig <- function(x, w) {sum(x*w[w != 0])/sum(w)}
stats <- microbenchmark(
    weightedMean(x, w),
    weighted.mean(x, w),
    ORIGFN = fun.orig(x, w),
    NEWFN  = fun.new(x, w),
    times = 1000
)

> print(stats, digits = 3)
Unit: microseconds
                expr   min    lq  mean median    uq    max neval
  weightedMean(x, w)  28.7  31.7  33.4   32.9  33.8   81.7  1000
 weighted.mean(x, w) 129.6 141.6 149.6  143.7 147.1 2332.9  1000
              ORIGFN 205.7 222.0 235.0  225.4 231.4 2655.8  1000
               NEWFN  38.9  42.3  44.3   42.8  43.6  385.8  1000

Relative performance will vary with n = length(x).

The weightedMean() function handles zero-weight Inf values:

> w <- c(0, 1)
> x <- c(Inf, 1)
> weighted.mean(x, w)
[1] 1
> fun.new(x, w)
[1] NaN
> weightedMean(x, w)
[1] 1

You'll find more benchmark results on weightedMean() vs weighted.mean() at
https://github.com/HenrikBengtsson/matrixStats/wiki/weightedMean

/Henrik

On Thu, Mar 5, 2015 at 9:49 AM, Prof Brian Ripley
<ripley at stats.ox.ac.uk> wrote:
> On 05/03/2015 14:55, Tadeáš Palusga wrote:
>> Hi,
>> I'm using this mailing list for the first time and I hope this is the
>> right one. I don't think that the following is a bug but it can be a
>> performance issue.
>>
>> By my opinion, there is no need to filter by [w != 0] in last sum of
>> weighted.mean.default method defined in
>> src/library/stats/R/weighted.mean.R. There is no need to do it because
>> you can always sum zero numbers and filtering is too expensive (see
>> following benchmark snippet)
>
> But 0*x is not necessarily 0, so there is a need to do it ...
> [...]
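For readers following the thread outside R, the behaviour the discussion
converges on (keep zero weights in the denominator, but never multiply
them against x) can be sketched in Python. This is a hypothetical helper
written for illustration, not part of R or matrixStats, and it mirrors
only the core of stats:::weighted.mean.default:

```python
import math

def weighted_mean(x, w, na_rm=False):
    """Sketch of weighted.mean.default's core logic: optionally drop
    NaN values of x (with their weights), then sum only the products
    with nonzero weight, so a zero weight attached to an Inf or NaN
    value cannot poison the numerator."""
    if len(x) != len(w):
        raise ValueError("'x' and 'w' must have the same length")
    if na_rm:
        pairs = [(xi, wi) for xi, wi in zip(x, w) if not math.isnan(xi)]
    else:
        pairs = list(zip(x, w))
    total_w = sum(wi for _, wi in pairs)
    # Zero weights still count toward the denominator (contributing 0),
    # but their 0 * x products are never formed.
    return sum(xi * wi for xi, wi in pairs if wi != 0) / total_w

print(weighted_mean([math.inf, 1.0], [0.0, 1.0]))  # 1.0
```

One divergence worth noting: with an all-zero weight vector, R's version
returns NaN (0/0), whereas this Python sketch raises ZeroDivisionError;
it is an illustration of the filtering logic, not a drop-in replacement.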
Tadeáš Palusga
2015-Mar-05 19:54 UTC
[Rd] Performance issue in stats:::weighted.mean.default method
Oops, such an amateur mistake. Thanks a lot for your quick response.

Regards

TP

On 03/05/2015 06:49 PM, Prof Brian Ripley wrote:
> On 05/03/2015 14:55, Tadeáš Palusga wrote:
>> Hi,
>> I'm using this mailing list for the first time and I hope this is the
>> right one. [...]
>>
>> By my opinion, there is no need to filter by [w != 0] in last sum of
>> weighted.mean.default method defined in
>> src/library/stats/R/weighted.mean.R. [...]
>
> But 0*x is not necessarily 0, so there is a need to do it ... see
>
> > w <- c(0, 1)
> > x <- c(Inf, 1)
> > weighted.mean(x, w)
> [1] 1
> > fun.new(x, w)
> [1] NaN
> [...]