Viechtbauer, Wolfgang (SP)
2021-Sep-02 10:55 UTC
[Rd] sum() and mean() for (ALTREP) integer sequences
Hi all,

I am trying to understand the performance of functions applied to integer sequences. Consider the following:

### begin example ###

library(lobstr)
library(microbenchmark)

x <- sample(1e6)
obj_size(x)
# 4,000,048 B

y <- 1:1e6
obj_size(y)
# 680 B

# So we can see that 'y' uses ALTREP. These are, as expected, the same:

sum(x)
# [1] 500000500000
sum(y)
# [1] 500000500000

# For 'x', we have to go through the trouble of actually summing up 1e6 integers.
# For 'y', knowing its form, we really just need to do:

1e6*(1e6+1)/2
# [1] 500000500000

# which should be a whole lot faster. And indeed, it is:

microbenchmark(sum(x), sum(y))

# Unit: nanoseconds
#    expr    min       lq      mean   median       uq    max neval cld
#  sum(x) 533452 595204.5 634266.90 613102.5 638271.5 978519   100   b
#  sum(y)    183    245.5    446.09    338.5    447.0   3233   100  a

# Now what about mean()?

mean(x)
# [1] 500000.5
mean(y)
# [1] 500000.5

# which is the same as

(1e6+1)/2
# [1] 500000.5

# But this surprised me:

microbenchmark(mean(x), mean(y))

# Unit: microseconds
#     expr      min        lq     mean   median       uq      max neval cld
#  mean(x)  935.389  943.4795 1021.423  954.689  985.122 2065.974   100  a
#  mean(y) 3500.262 3581.9530 3814.664 3637.984 3734.598 5866.768   100   b

### end example ###

So why is mean() on an ALTREP sequence slower when sum() is faster?

And more generally, when using sum() on an ALTREP integer sequence, does R actually use something like n*(n+1)/2 (or, generalized to a sequence a:b, (a+b)*(b-a+1)/2) to compute the sum? If so, why not (it seems) for mean()?

Best,
Wolfgang
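As a footnote to the example above: one can confirm that 'y' is a compact ALTREP sequence with the internal inspect utility (output abridged below; the exact format is version-dependent), and the closed forms for a sequence a:b can be written so the intermediate values stay in range. The names seq_sum and seq_mean are invented here for illustration only; they are not functions R uses internally.

.Internal(inspect(y))
# prints a line marked INTSXP ending in "1 : 1000000 (compact)"
# on builds where compact integer sequences are enabled

# hypothetical closed forms for a sequence a:b
seq_sum  <- function(a, b) (a + b) * (b - a + 1) / 2
seq_mean <- function(a, b) a + (b - a) / 2  # avoids overflow in (a + b)
seq_sum(1, 1e6)  # [1] 500000500000, matches sum(y)
seq_mean(1, 1e6) # [1] 500000.5, matches mean(y)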
On 02/09/2021 6:55 a.m., Viechtbauer, Wolfgang (SP) wrote:

> [...]
>
> So why is mean() on an ALTREP sequence slower when sum() is faster?
>
> And more generally, when using sum() on an ALTREP integer sequence, does R actually use something like n*(n+1)/2 (or, generalized to a sequence a:b, (a+b)*(b-a+1)/2) to compute the sum? If so, why not (it seems) for mean()?

The mean.default function looks like this:

function (x, trim = 0, na.rm = FALSE, ...)
{
    if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
        warning("argument is not numeric or logical: returning NA")
        return(NA_real_)
    }
    if (na.rm)
        x <- x[!is.na(x)]
    if (!is.numeric(trim) || length(trim) != 1L)
        stop("'trim' must be numeric of length one")
    n <- length(x)
    if (trim > 0 && n) {
        if (is.complex(x))
            stop("trimmed means are not defined for complex data")
        if (anyNA(x))
            return(NA_real_)
        if (trim >= 0.5)
            return(stats::median(x, na.rm = FALSE))
        lo <- floor(n * trim) + 1
        hi <- n + 1 - lo
        x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi]
    }
    .Internal(mean(x))
}

So it does fixups for trimming and NA removal, then calls an internal function. The internal function is the first part of do_summary, here:

https://github.com/wch/r-source/blob/f9c955fc6699a1f0482e4281ba658215c0e0b949/src/main/summary.c#L541-L556

It dispatches to a separate function for the mean of each type. The real_mean function here:

https://github.com/wch/r-source/blob/f9c955fc6699a1f0482e4281ba658215c0e0b949/src/main/summary.c#L476-L515

makes a big effort to avoid overflows. So I suspect the reason mean.default doesn't use sum(x)/length(x) at the end is that on a long vector sum(x) could overflow when mean(x) shouldn't.

So why not take the ALTREP into account? I suspect it's just too much trouble for a rare case.

Duncan Murdoch
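To make the overflow concern concrete, here is a minimal illustration in plain R; it uses a length-2 vector rather than a long one, but the mechanism is the same. sum() on a regular integer vector overflows the 32-bit integer range and returns NA, while the internal mean accumulates in (long) double and keeps the value:

big <- c(.Machine$integer.max, 1L)
sum(big)   # NA, with an "integer overflow" warning
mean(big)  # [1] 1073741824
sum(as.numeric(big)) / length(big)  # same value, computed in doubles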
GILLIBERT, Andre
2021-Sep-02 19:54 UTC
[Rd] sum() and mean() for (ALTREP) integer sequences
Hello,

Please find a long response below.

== Difference between mean(x) and sum(x)/length(x) ==

At the core, mean(x) and sum(x)/length(x) work very differently for real numbers. mean() is more accurate when applied to a vector with a small variance but a very high mean, especially on platforms without extended precision numbers (i.e. where long double is 64 bits rather than 80 bits).

On a Windows 10 computer (with 80-bit long double):

k = 1e6
base = 2^51
realmean = mean(1:k)
sqa = (base+1):(base+k)        # with ALTREP
sq  = (base+1):(base+k) + 0.0  # without ALTREP
mean(sq) - base - realmean    # correctly returns zero
sum(sq)/k - base - realmean   # incorrectly returns -0.5
sum(sqa)/k - base - realmean  # correctly returns zero

On a GNU/Linux x86_64 computer with extended precision floating point disabled, the difference is worse:

mean(sq) - base - realmean    # correctly returns zero
sum(sq)/k - base - realmean   # incorrectly returns -1180
sum(sqa)/k - base - realmean  # correctly returns zero

Therefore (without ALTREP) sum() can be inaccurate, due to cumulative floating point rounding errors. The good accuracy of mean() is due to a two-pass algorithm in the real_mean() function in src/main/summary.c. The algorithm can be summarized by the following equivalent R code:

badmean  = function(v) { sum(v)/length(v) }
goodmean = function(v) { center = badmean(v); center + badmean(v - center) }
goodmean(sq) - base - realmean  # correctly returns zero

== Should mean() call ALTINTEGER_SUM? ==

As you noticed in the examples above, sum() is much more accurate with ALTREPs than with the naive algorithm, because there are no cumulative rounding errors.

Moreover, if we focus on INTSXP, the maximum possible value is lower: INT_MAX = 2^31-1. The sum of a large sequence (e.g. 1:(2^31-1)) can still overflow the exact integer range (0 to 2^53) of an FP64, and the division does not always round back to the correct value.

bad = 1:189812531L
mean(bad) - sum(bad)/length(bad)  # returns -1.5e-08 on a platform with FP80
mean(bad) == 94906266             # TRUE: the correct value (the actual result is an integer)

The implementation of mean() on INTSXP does not use the two-pass trick, but relies on a LDOUBLE (typically FP80 on the x86 platform) that is large enough to avoid rounding bugs even with huge integer sequences. Unfortunately, the ALTINTEGER_SUM interface returns at best an FP64, and so would not return the FP64 closest to the actual mean for some sequences. Adding a MEAN function to the ALTREP interface could solve the problem.

== Can mean() performance be improved easily? ==

The mean() implementation for integers supports ALTREPs with poor iteration performance, using the slow INTEGER_ELT() macro. Moreover, it converts each integer to LDOUBLE, which is slow. It can be improved using ITERATE_BY_REGION0 and a 64-bit integer accumulator (if available) that cannot overflow on small batches (size = 512).

# before patching (on Ubuntu x86_64 Silvermont Celeron J1900)
x = 1:1e8
y = 1:1e8 + 0L
system.time(mean(x))  # user 1.33 seconds
system.time(mean(y))  # user 0.32 seconds

# after patching (on Ubuntu x86_64 Silvermont Celeron J1900)
x = 1:1e8
y = 1:1e8 + 0L
system.time(mean(x))  # user 0.29 seconds: more than 4x faster
system.time(mean(y))  # user 0.18 seconds: 1.7x faster!

(patch attached to this message)

The patch is not optimal. It should ideally use isum() and risum(), but these functions are a mess and need refactoring. For instance, they take a 'call' argument and may display a warning message related to the sum() function.
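To make the FP64 limit in the ALTINTEGER_SUM discussion concrete, a quick sanity check in plain R (nothing beyond base arithmetic is assumed):

n <- 189812531
n * (n + 1) / 2         # about 1.8e16, the exact sum of 1:n
2^53                    # 9007199254740992, the limit up to which every
                        # integer is exactly representable in an FP64
n * (n + 1) / 2 > 2^53  # TRUE: the sum cannot be held exactly in an FP64,
                        # so sum(bad)/length(bad) need not round to the true mean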
--
Sincerely,
Andre GILLIBERT