Viechtbauer, Wolfgang (SP)
2021-Sep-02 10:55 UTC
[Rd] sum() and mean() for (ALTREP) integer sequences
Hi all,

I am trying to understand the performance of functions applied to integer sequences. Consider the following:

### begin example ###

library(lobstr)
library(microbenchmark)

x <- sample(1e6)
obj_size(x)
# 4,000,048 B

y <- 1:1e6
obj_size(y)
# 680 B

# So we can see that 'y' uses ALTREP. These are, as expected, the same:

sum(x)
# [1] 500000500000
sum(y)
# [1] 500000500000

# For 'x', we have to go through the trouble of actually summing up 1e6 integers.
# For 'y', knowing its form, we really just need to do:

1e6*(1e6+1)/2
# [1] 500000500000

# which should be a whole lot faster. And indeed, it is:

microbenchmark(sum(x), sum(y))

# Unit: nanoseconds
#    expr    min       lq      mean   median       uq    max neval cld
#  sum(x) 533452 595204.5 634266.90 613102.5 638271.5 978519   100   b
#  sum(y)    183    245.5    446.09    338.5    447.0   3233   100  a

# Now what about mean()?

mean(x)
# [1] 500000.5
mean(y)
# [1] 500000.5

# which is the same as

(1e6+1)/2
# [1] 500000.5

# But this surprised me:

microbenchmark(mean(x), mean(y))

# Unit: microseconds
#     expr      min        lq     mean   median       uq      max neval cld
#  mean(x)  935.389  943.4795 1021.423  954.689  985.122 2065.974   100  a
#  mean(y) 3500.262 3581.9530 3814.664 3637.984 3734.598 5866.768   100   b

### end example ###

So why is mean() on an ALTREP sequence slower when sum() is faster?

And more generally, when using sum() on an ALTREP integer sequence, does R actually use something like n*(n+1)/2 (or, generalized to a sequence a:b, (a+b)*(b-a+1)/2) to compute the sum? If so, why not (it seems) for mean()?

Best,
Wolfgang
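As a footnote to the example above: one can confirm that 'y' is a compact ALTREP sequence with the internal inspect utility (output abridged below; the exact format is version-dependent), and the closed forms for a sequence a:b can be written so the intermediate values stay in range. The names seq_sum and seq_mean are invented here for illustration only; they are not functions R uses internally.

.Internal(inspect(y))
# prints a line marked INTSXP ending in "1 : 1000000 (compact)"
# on builds where compact integer sequences are enabled

# hypothetical closed forms for a sequence a:b
seq_sum  <- function(a, b) (a + b) * (b - a + 1) / 2
seq_mean <- function(a, b) a + (b - a) / 2  # avoids overflow in (a + b)
seq_sum(1, 1e6)  # [1] 500000500000, matches sum(y)
seq_mean(1, 1e6) # [1] 500000.5, matches mean(y)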
On 02/09/2021 6:55 a.m., Viechtbauer, Wolfgang (SP) wrote:

> [...]
>
> So why is mean() on an ALTREP sequence slower when sum() is faster?
>
> And more generally, when using sum() on an ALTREP integer sequence, does R actually use something like n*(n+1)/2 (or, generalized to a sequence a:b, (a+b)*(b-a+1)/2) to compute the sum? If so, why not (it seems) for mean()?

The mean.default function looks like this:

function (x, trim = 0, na.rm = FALSE, ...)
{
    if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
        warning("argument is not numeric or logical: returning NA")
        return(NA_real_)
    }
    if (na.rm)
        x <- x[!is.na(x)]
    if (!is.numeric(trim) || length(trim) != 1L)
        stop("'trim' must be numeric of length one")
    n <- length(x)
    if (trim > 0 && n) {
        if (is.complex(x))
            stop("trimmed means are not defined for complex data")
        if (anyNA(x))
            return(NA_real_)
        if (trim >= 0.5)
            return(stats::median(x, na.rm = FALSE))
        lo <- floor(n * trim) + 1
        hi <- n + 1 - lo
        x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi]
    }
    .Internal(mean(x))
}

So it does fixups for trimming and NA removal, then calls an internal function. The internal function is the first part of do_summary, here:

https://github.com/wch/r-source/blob/f9c955fc6699a1f0482e4281ba658215c0e0b949/src/main/summary.c#L541-L556

It dispatches to a separate function for the mean of each type. The real_mean function here:

https://github.com/wch/r-source/blob/f9c955fc6699a1f0482e4281ba658215c0e0b949/src/main/summary.c#L476-L515

makes a big effort to avoid overflows. So I suspect the reason mean.default doesn't use sum(x)/length(x) at the end is that on a long vector sum(x) could overflow when mean(x) shouldn't.

So why not take the ALTREP into account? I suspect it's just too much trouble for a rare case.

Duncan Murdoch
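To make the overflow concern concrete, here is a minimal illustration in plain R; it uses a length-2 vector rather than a long one, but the mechanism is the same. sum() on a regular integer vector overflows the 32-bit integer range and returns NA, while the internal mean accumulates in (long) double and keeps the value:

big <- c(.Machine$integer.max, 1L)
sum(big)   # NA, with an "integer overflow" warning
mean(big)  # [1] 1073741824
sum(as.numeric(big)) / length(big)  # same value, computed in doubles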
GILLIBERT, Andre
2021-Sep-02 19:54 UTC
[Rd] sum() and mean() for (ALTREP) integer sequences
Hello,

Please find a long response below.

== Difference between mean(x) and sum(x)/length(x) ==

At the core, mean(x) and sum(x)/length(x) work very differently for real numbers. mean() is more accurate when applied to a vector with a small variance but a very high mean, especially on platforms without extended precision numbers (i.e. where long double is 64 bits rather than 80 bits).

On a Windows 10 computer (with 80-bit long double):

k = 1e6
base = 2^51
realmean = mean(1:k)
sqa = (base+1):(base+k)        # with ALTREP
sq  = (base+1):(base+k) + 0.0  # without ALTREP
mean(sq) - base - realmean    # correctly returns zero
sum(sq)/k - base - realmean   # incorrectly returns -0.5
sum(sqa)/k - base - realmean  # correctly returns zero

On a GNU/Linux x86_64 computer with extended precision floating point disabled, the difference is worse:

mean(sq) - base - realmean    # correctly returns zero
sum(sq)/k - base - realmean   # incorrectly returns -1180
sum(sqa)/k - base - realmean  # correctly returns zero

Therefore (without ALTREP) sum() can be inaccurate, due to cumulative floating point rounding errors. The good accuracy of mean() is due to a two-pass algorithm in the real_mean() function in src/main/summary.c. The algorithm can be summarized by the following equivalent R code:

badmean  = function(v) { sum(v)/length(v) }
goodmean = function(v) { center = badmean(v); center + badmean(v - center) }
goodmean(sq) - base - realmean  # correctly returns zero

== Should mean() call ALTINTEGER_SUM? ==

As you noticed in the examples above, sum() is much more accurate with ALTREPs than with the naive algorithm, because there are no cumulative rounding errors.

Moreover, if we focus on INTSXP, the maximum possible value is lower: INT_MAX = 2^31-1. The sum of a large sequence (e.g. 1:(2^31-1)) can still overflow the exact integer range (0 to 2^53) of an FP64, and the division does not always round back to the correct value.

bad = 1:189812531L
mean(bad) - sum(bad)/length(bad)  # returns -1.5e-08 on a platform with FP80
mean(bad) == 94906266             # TRUE: the correct value (the actual result is an integer)

The implementation of mean() on INTSXP does not use the two-pass trick, but relies on a LDOUBLE (typically FP80 on the x86 platform) that is large enough to avoid rounding bugs even with huge integer sequences. Unfortunately, the ALTINTEGER_SUM interface returns at best an FP64, and so would not return the FP64 closest to the actual mean for some sequences. Adding a MEAN function to the ALTREP interface could solve the problem.

== Can mean() performance be improved easily? ==

The mean() implementation for integers supports ALTREPs with poor iteration performance, using the slow INTEGER_ELT() macro. Moreover, it converts each integer to LDOUBLE, which is slow. It can be improved using ITERATE_BY_REGION0 and a 64-bit integer accumulator (if available) that cannot overflow on small batches (size = 512).

# before patching (on Ubuntu x86_64 Silvermont Celeron J1900)
x = 1:1e8
y = 1:1e8 + 0L
system.time(mean(x))  # user 1.33 seconds
system.time(mean(y))  # user 0.32 seconds

# after patching (on Ubuntu x86_64 Silvermont Celeron J1900)
x = 1:1e8
y = 1:1e8 + 0L
system.time(mean(x))  # user 0.29 seconds: more than 4x faster
system.time(mean(y))  # user 0.18 seconds: 1.7x faster!

(patch attached to this message)

The patch is not optimal. It should ideally use isum() and risum(), but these functions are a mess and need refactoring. For instance, they take a 'call' argument and may display a warning message related to the sum() function.
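To make the FP64 limit in the ALTINTEGER_SUM discussion concrete, a quick sanity check in plain R (nothing beyond base arithmetic is assumed):

n <- 189812531
n * (n + 1) / 2         # about 1.8e16, the exact sum of 1:n
2^53                    # 9007199254740992, the limit up to which every
                        # integer is exactly representable in an FP64
n * (n + 1) / 2 > 2^53  # TRUE: the sum cannot be held exactly in an FP64,
                        # so sum(bad)/length(bad) need not round to the true mean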
--
Sincerely,
Andre GILLIBERT