thr3ads.net - R devel - [Rd] sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31 [Jan 2018]

Henrik Bengtsson

2018-Jan-25 17:30 UTC

[Rd] sum() returns NA on a long logical vector when nb of TRUE values exceeds 2^31

Just following up on this old thread since matrixStats 0.53.0 is now
out, which supports this use case:
> x <- rep(TRUE, times = 2^31)
> y <- sum(x)
> y
[1] NA
Warning message:
In sum(x) : integer overflow - use sum(as.numeric(.))
> y <- matrixStats::sum2(x, mode = "double")
> y
[1] 2147483648
> str(y)
 num 2.15e+09

No coercion is taking place, so the memory overhead is zero:
> profmem::profmem(y <- matrixStats::sum2(x, mode = "double"))
Rprofmem memory profiling of:
y <- matrixStats::sum2(x, mode = "double")

Memory allocations:
      bytes calls
total     0

/Henrik
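
For readers without matrixStats, the same idea (accumulate the count in a double without coercing the whole vector at once) can be sketched in base R. This is only an illustration: the name count_true_chunked and the chunk_size argument are hypothetical, not part of base R or matrixStats, and unlike sum2() it does allocate one chunk-sized copy at a time.

```r
# Hypothetical helper (not from the thread): count TRUE values chunk by
# chunk, accumulating in a double so the total cannot overflow the
# integer range.
count_true_chunked <- function(x, chunk_size = 1e6, na.rm = FALSE) {
  stopifnot(is.logical(x),
            chunk_size >= 1, chunk_size <= .Machine$integer.max)
  n <- length(x)
  total <- 0        # double accumulator
  start <- 1
  while (start <= n) {
    end <- min(start + chunk_size - 1, n)
    # Each partial sum fits in an integer because a chunk holds at most
    # .Machine$integer.max elements.
    total <- total + sum(x[start:end], na.rm = na.rm)
    start <- end + 1
  }
  total
}

# Small-scale usage; a real long vector would need several GB of RAM.
count_true_chunked(c(TRUE, FALSE, TRUE, NA, TRUE), chunk_size = 2, na.rm = TRUE)
```

Because every partial sum is an exact integer and the running total stays far below 2^53, the result is exact, so the rounding question raised for the nx * mean(x) workaround quoted below does not arise here.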

On Fri, Jun 2, 2017 at 1:58 PM, Henrik Bengtsson
<henrik.bengtsson at gmail.com> wrote:
> I second this feature request (it's understandable that this and
> possibly other parts of the code were left behind / forgotten after the
> introduction of long vectors).
>
> I think mean() avoids full copies, so in the meanwhile, you can work
> around this limitation using:
>
> countTRUE <- function(x, na.rm = FALSE) {
>   nx <- length(x)
>   if (nx < .Machine$integer.max) return(sum(x, na.rm = na.rm))
>   nx * mean(x, na.rm = na.rm)
> }
>
> (not sure if one needs to worry about rounding errors, i.e. where
> n %% 1 != 0)
>
> x <- rep(TRUE, times = .Machine$integer.max+1)
> object.size(x)
> ## 8589934632 bytes
>
> p <- profmem::profmem( n <- countTRUE(x) )
> str(n)
> ## num 2.15e+09
> print(n == .Machine$integer.max + 1)
> ## [1] TRUE
>
> print(p)
> ## Rprofmem memory profiling of:
> ## n <- countTRUE(x)
> ##
> ## Memory allocations:
> ##      bytes calls
> ## total     0
>
>
> FYI / related: I've just updated matrixStats::sum2() to support
> logicals (develop branch) and I'll also try to update
> matrixStats::count() to count beyond .Machine$integer.max.
>
> /Henrik
>
> On Fri, Jun 2, 2017 at 4:05 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
>> Hi,
>>
>> I have a long numeric vector 'xx' and I want to use sum() to count
>> the number of elements that satisfy some criteria like non-zero
>> values or values lower than a certain threshold etc...
>>
>> The problem is: sum() returns an NA (with a warning) if the count
>> exceeds .Machine$integer.max (2^31 - 1). For example:
>>
>>   > xx <- runif(3e9)
>>   > sum(xx < 0.9)
>>   [1] NA
>>   Warning message:
>>   In sum(xx < 0.9) : integer overflow - use sum(as.numeric(.))
>>
>> This already takes a long time and doing sum(as.numeric(.)) would
>> take even longer and require allocation of 24 GB of memory just to
>> store an intermediate numeric vector made of 0s and 1s. Plus, having
>> to do sum(as.numeric(.)) every time I need to count things is not
>> convenient and is easy to forget.
>>
>> It seems that sum() on a logical vector could be modified to return
>> the count as a double when it cannot be represented as an integer.
>> Note that length() already does this so that wouldn't create a
>> precedent. Also and FWIW prod() avoids the problem by always returning
>> a double, whatever the type of the input is (except on a complex
>> vector).
>>
>> I can provide a patch if this change sounds reasonable.
>>
>> Cheers,
>> H.
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fredhutch.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
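
The 24 GB figure in Hervé's original report, and the object size Henrik prints for his test vector, both follow from R's per-element storage sizes (4 bytes for logical and integer, 8 bytes for double). A small back-of-the-envelope check, with variable names chosen here only for illustration:

```r
# Storage arithmetic only; no large vectors are allocated.
n <- 3e9                        # length of 'xx' in the original report
bytes_per_logical <- 4          # R stores logicals as 32-bit ints
bytes_per_double  <- 8

gb_double  <- n * bytes_per_double  / 1e9   # 24: the as.numeric() intermediate
gb_logical <- n * bytes_per_logical / 1e9   # 12: the logical vector xx < 0.9

# Henrik's test vector: 2^31 logicals, payload only (object.size() adds
# a small header on top of this).
payload_bytes <- 2^31 * bytes_per_logical   # 8589934592
```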

Martin Maechler

2018-Jan-27 11:06 UTC

>>>>> Henrik Bengtsson <henrik.bengtsson at gmail.com>
>>>>>     on Thu, 25 Jan 2018 09:30:42 -0800 writes:
    > Just following up on this old thread since matrixStats 0.53.0 is now
    > out, which supports this use case:

    >> x <- rep(TRUE, times = 2^31)

    >> y <- sum(x)
    >> y
    > [1] NA
    > Warning message:
    > In sum(x) : integer overflow - use sum(as.numeric(.))

    >> y <- matrixStats::sum2(x, mode = "double")
    >> y
    > [1] 2147483648
    >> str(y)
    > num 2.15e+09

    > No coercion is taking place, so the memory overhead is zero:

    >> profmem::profmem(y <- matrixStats::sum2(x, mode = "double"))
    > Rprofmem memory profiling of:
    > y <- matrixStats::sum2(x, mode = "double")

    > Memory allocations:
    > bytes calls
    > total     0

    > /Henrik

Thank you, Henrik, for the reminder.

Back in June, I had mentioned to Hervé and R-devel that
'logical' should continue to be treated as 'integer', as in all
arithmetic in (S and) R.  Hervé did mention the isum()
function in the C code, which is relevant here and which does have
a LONG INT counter already -- *but* if we consider that sum()
has '...', i.e. a conceptually arbitrary number of long vector
integer arguments, that counter won't suffice even there.

Before talking about implementation / patch, I think we should
consider 2 possible goals of a change (I agree the status quo
is not a real option):

1) sum(x) for logical and integer x  would return a double
      in any case and overflow should not happen (except in
      the case where the result would be larger than
      .Machine$double.xmax, which I think will not be possible
      even with "arbitrary" nargs() of sum).

2) sum(x) for logical and integer x  should return an integer in
       all cases where there is no overflow, including returning
       NA_integer_ in case of NAs.
   If there would be an overflow, it must be detected "in time"
   and the result should be double.

The big advantage of 2) is that it is backward compatible in 99.x %
of use cases, and another advantage is that it may be a very small
bit more efficient.  Also, in the case of "counting" (logical),
it is nice to get an integer instead of double when we can --
entirely analogously to the behavior of length() which returns
integer whenever possible.

The advantage of 1) is uniformity.

We should (at least provisionally) decide between 1) and 2) and then
go for that.  It could be that going for 1) may have bad
compatibility consequences in package space, because indeed we
have documented that sum() returns an integer for logical and
integer arguments.

I currently don't really have time to
{work on implementing + dealing with the consequences}
for either ..

Martin
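
As a rough illustration of the return-type rule in option 2) above, here is a minimal base-R sketch. The function name sum_int_or_double is hypothetical, and it coerces to double for simplicity, which is exactly the memory cost a C-level change would avoid; it only shows which type comes back in which case and is not the actual R implementation.

```r
# Hypothetical helper illustrating option 2): integer result when it fits,
# double result when an integer would overflow.
sum_int_or_double <- function(x, na.rm = FALSE) {
  stopifnot(is.logical(x) || is.integer(x))
  s <- sum(as.numeric(x), na.rm = na.rm)   # double accumulation (allocates a copy)
  if (is.na(s)) return(NA_integer_)        # NAs still yield NA_integer_
  if (abs(s) <= .Machine$integer.max) as.integer(s) else s
}

typeof(sum_int_or_double(c(TRUE, TRUE, FALSE)))          # "integer"
typeof(sum_int_or_double(c(.Machine$integer.max, 1L)))   # "double"
```

This mirrors the analogy with length() drawn above: callers get an integer in the common case and only see a double when the integer range is exceeded.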

Hervé Pagès

2018-Jan-30 21:30 UTC

Hi Martin, Henrik,

Thanks for the follow up.

@Martin: I vote for 2) without *any* hesitation :-)

(and uniformity could be restored at some point in the
future by having prod(), rowSums(), colSums(), and others
align with the behavior of length() and sum())

Cheers,
H.


