Henrik Bengtsson
2015-Jun-01 00:02 UTC
[Rd] sum(..., na.rm=FALSE): Summing over NA_real_ values much more expensive than non-NAs for na.rm=FALSE? Hmm...
I'm observing that base::sum(x, na.rm=FALSE) for typeof(x) == "double" is much more time consuming when there are missing values than when there are not. I'm observing this on both Windows and Linux, and it's quite surprising to me. Currently, my main suspect is the settings used when R was built. The second suspect is my brain. I hope that someone can clarify the results below and confirm whether or not they see the same. Note, this is for "doubles", so I'm not expecting early stopping as for "integers" (where testing for NA is cheap).

On R 3.2.0, on Windows (using the official CRAN builds), on Linux (locally built), and on OS X (official AT&T builds), I get:

> x <- rep(0, 1e8)
> stopifnot(typeof(x) == "double")
> system.time(sum(x, na.rm=FALSE))
   user  system elapsed
   0.19    0.01    0.20

> y <- rep(NA_real_, 1e8)
> stopifnot(typeof(y) == "double")
> system.time(sum(y, na.rm=FALSE))
   user  system elapsed
   9.54    0.00    9.55

> z <- x; z[length(z)/2] <- NA_real_
> stopifnot(typeof(z) == "double")
> system.time(sum(z, na.rm=FALSE))
   user  system elapsed
   4.49    0.00    4.51

Following the source code, I'm pretty sure the code performing the calculation
(https://github.com/wch/r-source/blob/trunk/src/main/summary.c#L112-L128) is:

static Rboolean rsum(double *x, R_xlen_t n, double *value, Rboolean narm)
{
    LDOUBLE s = 0.0;
    Rboolean updated = FALSE;
    for (R_xlen_t i = 0; i < n; i++) {
        if (!narm || !ISNAN(x[i])) {
            if (!updated) updated = TRUE;
            s += x[i];
        }
    }
    if (s > DBL_MAX) *value = R_PosInf;
    else if (s < -DBL_MAX) *value = R_NegInf;
    else *value = (double) s;
    return updated;
}

In other words, when na.rm=FALSE, the inner for loop:

    for (R_xlen_t i = 0; i < n; i++) {
        if (!narm || !ISNAN(x[i])) {
            if (!updated) updated = TRUE;
            s += x[i];
        }
    }

should effectively become (because the !ISNAN(x[i]) test does not make a difference):

    for (R_xlen_t i = 0; i < n; i++) {
        if (!narm) {
            if (!updated) updated = TRUE;
            s += x[i];
        }
    }

That is, sum(x, na.rm=FALSE) basically spends its time on `s += x[i]`. Now, I have always been under the impression that summing with NAs is *not* more expensive than summing over regular (double) values, which is confirmed by the example below, but the benchmarking above disagrees. It looks like there is a big overhead in keeping track of the sum `s` being NA, which is supported by the fact that summing over 'z' costs about half of 'y'.

Now, I *cannot* reproduce the above using the following 'inline' example:

> sum2 <- inline::cfunction(sig=c(x="double", narm="logical"), body='
  double *x_ = REAL(x);
  int narm_ = asLogical(narm);
  int n = length(x);
  double sum = 0;
  for (R_xlen_t i = 0; i < n; i++) {
    if (!narm_ || !ISNAN(x_[i])) sum += x_[i];
  }
  return ScalarReal(sum);
')

> x <- rep(0, 1e8)
> stopifnot(typeof(x) == "double")
> system.time(sum2(x, narm=FALSE))
   user  system elapsed
   0.16    0.00    0.16

> y <- rep(NA_real_, 1e8)
> stopifnot(typeof(y) == "double")
> system.time(sum2(y, narm=FALSE))
   user  system elapsed
   0.16    0.00    0.15

> z <- x; z[length(z)/2] <- NA_real_
> stopifnot(typeof(z) == "double")
> system.time(sum2(z, narm=FALSE))
   user  system elapsed
   0.16    0.00    0.15

This is why I suspect it's related to how R was configured when it was built. What's going on? Can someone please shed some light on this?

Thanks

Henrik
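For reference, a minimal control variant of sum2() (a sketch only; the function name sum2b and the note on its timing are illustrative additions, not from the original post) drops the ISNAN() test entirely and sums unconditionally. If the reasoning above is right, an all-NA vector should still sum at full speed with a plain 'double' accumulator, i.e. the per-element NaN test is not where the time goes:

sum2b <- inline::cfunction(sig=c(x="double"), body='
  double *x_ = REAL(x);
  R_xlen_t n = xlength(x);
  double sum = 0;
  for (R_xlen_t i = 0; i < n; i++) {
    sum += x_[i];   /* unconditional add; an NA/NaN simply propagates */
  }
  return ScalarReal(sum);
')

## With y <- rep(NA_real_, 1e8) as above, sum2b(y) should take roughly the
## same time as sum2(y, narm=FALSE) and return NA.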
Henrik Bengtsson
2015-Jun-01 01:08 UTC
[Rd] sum(..., na.rm=FALSE): Summing over NA_real_ values much more expensive than non-NAs for na.rm=FALSE? Hmm...
This is a great example of how you cannot figure it out after spending two hours troubleshooting, but a few minutes after you post to R-devel, it just jumps out at you (is there a word for this other than "impatient"?). Let me answer my own question.

The discrepancy between my sum2() code and the internal code for base::sum() is that the latter uses LDOUBLE = long double (on some systems it's only double, cf. https://github.com/wch/r-source/blob/trunk/src/nmath/nmath.h#L28-L33), whereas my sum2() code uses double. So using long double, I can reproduce the penalty of having NA_real_ with na.rm=FALSE:

> sum3 <- inline::cfunction(sig=c(x="double", narm="logical"), body='
  #define LDOUBLE long double
  double *x_ = REAL(x);
  int narm_ = asLogical(narm);
  int n = length(x);
  LDOUBLE sum = 0.0;
  for (R_xlen_t i = 0; i < n; i++) {
    if (!narm_ || !ISNAN(x_[i])) sum += x_[i];
  }
  return ScalarReal((double)sum);
')

> x <- rep(0, 1e8)
> stopifnot(typeof(x) == "double")
> system.time(sum3(x, narm=FALSE))
   user  system elapsed
   0.40    0.00    0.44

> y <- rep(NA_real_, 1e8)
> stopifnot(typeof(y) == "double")
> system.time(sum3(y, narm=FALSE))
   user  system elapsed
   9.80    0.00    9.84

> z <- x; z[length(z)/2] <- NA_real_
> stopifnot(typeof(z) == "double")
> system.time(sum3(z, narm=FALSE))
   user  system elapsed
   4.49    0.00    4.50

This might even be what the following comment refers to:

/* Required by C99 but might be slow */
#ifdef HAVE_LONG_DOUBLE
# define LDOUBLE long double
#else
# define LDOUBLE double
#endif

So now I should rephrase my question: Is there a way to avoid this penalty when using 'long double'? Is this something the compiler can be clever about, or is the only solution to not use 'long double'?

/Henrik
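The most likely explanation for why the penalty follows LDOUBLE rather than any R-specific build setting is the hardware: on typical x86 CPUs, 'long double' arithmetic is performed on the x87 FPU, where operations with NaN (and denormal) operands fall into a slow, microcode-assisted path, whereas plain 'double' additions compile to SSE2 instructions that handle NaN operands at full speed. A standalone C sketch (illustrative only; the program, file name, and compiler flags are assumptions, not taken from the thread) reproduces the effect outside of R entirely:

/* nan_sum.c -- compare double vs. long double accumulation over NaNs.
   Compile with, e.g.:  gcc -std=c99 -O2 nan_sum.c -o nan_sum            */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#define N 100000000UL

/* Sum with a plain double accumulator (SSE2 on typical x86-64 builds). */
static void sum_double(const double *x) {
    clock_t t0 = clock();
    double s = 0.0;
    for (size_t i = 0; i < N; i++) s += x[i];
    printf("double      accumulator: %.2fs (sum = %f)\n",
           (double)(clock() - t0) / CLOCKS_PER_SEC, s);
}

/* Sum with a long double accumulator (x87 on x86). */
static void sum_long_double(const double *x) {
    clock_t t0 = clock();
    long double s = 0.0L;
    for (size_t i = 0; i < N; i++) s += x[i];
    printf("long double accumulator: %.2fs (sum = %Lf)\n",
           (double)(clock() - t0) / CLOCKS_PER_SEC, s);
}

int main(void) {
    double *x = malloc(N * sizeof *x);
    if (x == NULL) return 1;

    for (size_t i = 0; i < N; i++) x[i] = 0.0;   /* all finite */
    puts("all finite:");
    sum_double(x);
    sum_long_double(x);

    x[0] = NAN;   /* the NaN propagates from the very first addition on */
    puts("NaN in first element:");
    sum_double(x);
    sum_long_double(x);

    free(x);
    return 0;
}

On hardware where this holds, the 'long double' run over the NaN-containing input is dramatically slower while the 'double' run is unaffected, matching the difference between base::sum() (LDOUBLE accumulator) and sum2() (double accumulator) above.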
Henrik Bengtsson
2015-Jun-06 04:28 UTC
[Rd] sum(..., na.rm=FALSE): Summing over NA_real_ values much more expensive than non-NAs for na.rm=FALSE? Hmm...
I should summarize what I've learned after my more recent post on this:

With 'long double', it is extremely expensive to keep "track of" non-finites (NA, NaN and Inf) when doing basic arithmetic. The most extreme cost is seen when the first value is non-finite, because the penalty is then incurred in all of the following iterations.

EXAMPLE:

## All finite values
> x <- rep(0.1, 1e8)
> system.time(sum(x, na.rm=FALSE))
   user  system elapsed
   0.28    0.00    0.28

## First value is NA (maximum penalty)
> y <- x; y[1] <- NA
> system.time(sum(y, na.rm=FALSE))
   user  system elapsed
  23.47    0.00   23.93
(sic!)

## Last value is NA (minimum penalty)
> z <- x; z[length(z)] <- NA
> system.time(sum(z, na.rm=FALSE))
   user  system elapsed
   0.35    0.00    0.34

## Silly, but proof of concept by summing rev(y)
## such that the NA ends up last
> system.time(sum(rev(y), na.rm=FALSE))
   user  system elapsed
   2.40    0.85    3.31

This penalty of having non-finite values is only observed with 'long double', not with 'double' (see previous message).

WORKAROUND: To work around this penalty, one can test the running result for being non-finite and return early, e.g. if na.rm=FALSE then as soon as the running sum is NA_REAL, return NA_REAL. Unfortunately, testing for NA_REAL is itself expensive, so checking on every iteration would add too much overhead. A better approach is to test for early stopping only at some interval. For example:

> sum4 <- inline::cfunction(sig=c(x="double", narm="logical"), body='
  #define LDOUBLE long double
  double *x_ = REAL(x);
  int narm_ = asLogical(narm);
  int n = length(x);
  LDOUBLE sum = 0.0;
  for (R_xlen_t i = 0; i < n; i++) {
    /* If narm=FALSE, always sum */
    if (!narm_ || !ISNAN(x_[i])) sum += x_[i];
    /* Early-stopping check when sum is non-finite */
    if (!R_FINITE(sum)) break;
  }
  return ScalarReal((double)sum);
')

## See previous email for the non-early-stopping version
> system.time(sum3(x, narm=FALSE))
   user  system elapsed
   1.05    0.00    1.05

## Early stopping every iteration adds quite a bit
## of overhead if never used
> system.time(sum4(x, narm=FALSE))
   user  system elapsed
   1.79    0.00    1.86

## Early stopping once in a while
> sum5 <- inline::cfunction(sig=c(x="double", narm="logical"), body='
  #define LDOUBLE long double
  double *x_ = REAL(x);
  int narm_ = asLogical(narm);
  int n = length(x);
  LDOUBLE sum = 0.0;
  for (R_xlen_t i = 0; i < n; i++) {
    /* If narm=FALSE, always sum */
    if (!narm_ || !ISNAN(x_[i])) sum += x_[i];
    /* Early-stopping check when sum is non-finite */
    if (i % 1048576 == 0 && !R_FINITE(sum)) break;
  }
  return ScalarReal((double)sum);
')

## Low extra cost even without non-finites
> system.time(sum5(x, narm=FALSE))
   user  system elapsed
   1.05    0.00    1.06

## ...and can still do early stopping
> system.time(sum5(y, narm=FALSE))
   user  system elapsed
      0       0       0

/Henrik
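A possible refinement of the sum5() idea (a sketch only; sum6 is a name introduced here for illustration, not code from the thread): instead of evaluating `i % 1048576 == 0` for every element, process the vector in fixed-size chunks and test the running sum once per chunk, so the hot inner loop carries no per-element early-stopping check at all:

sum6 <- inline::cfunction(sig=c(x="double", narm="logical"), body='
  #define CHUNK 1048576
  double *x_ = REAL(x);
  int narm_ = asLogical(narm);
  R_xlen_t n = xlength(x);
  long double sum = 0.0;
  for (R_xlen_t i = 0; i < n; ) {
    R_xlen_t end = i + CHUNK;
    if (end > n) end = n;
    /* Hot inner loop: no per-element early-stopping test */
    for (; i < end; i++) {
      if (!narm_ || !ISNAN(x_[i])) sum += x_[i];
    }
    /* One early-stopping test per chunk */
    if (!R_FINITE((double)sum)) break;
  }
  return ScalarReal((double)sum);
')

The chunk size is a trade-off: a larger chunk means less checking overhead in the all-finite case, at the cost of up to one extra chunk of slow long double additions after the sum has already become non-finite.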