thr3ads.net - R devel - [Rd] sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31 [Feb 2018]

Hervé Pagès

2018-Jan-30 21:30 UTC

[Rd] sum() returns NA on a long logical vector when nb of TRUE values exceeds 2^31

Hi Martin, Henrik,

Thanks for the follow up.

@Martin: I vote for 2) without *any* hesitation :-)

(and uniformity could be restored at some point in the
future by having prod(), rowSums(), colSums(), and others
align with the behavior of length() and sum())

Cheers,
H.
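
For concreteness, here is a minimal C sketch of what option 2) in Martin's message quoted below asks for: accumulate in 64 bits, return an integer when the total fits, and a double otherwise. It is not R's implementation; `sum_result` and `sum_int()` are made-up names, and NA handling is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: either a 32-bit integer or a double. */
typedef struct {
    bool    is_double;  /* true when the total did not fit in an integer */
    int32_t i;          /* valid when is_double is false */
    double  d;          /* valid when is_double is true */
} sum_result;

/*
 * Sum n 32-bit integers (logicals would be 0/1 values) in a 64-bit
 * accumulator.  Assumes n is small enough that the accumulator itself
 * cannot overflow; NA values are not handled in this sketch.
 */
static sum_result sum_int(const int32_t *x, size_t n)
{
    int64_t total = 0;
    for (size_t k = 0; k < n; k++)
        total += x[k];

    sum_result r = { false, 0, 0.0 };
    /* R reserves INT_MIN for NA_integer_, so the usable range is symmetric. */
    if (total >= -(int64_t)INT32_MAX && total <= (int64_t)INT32_MAX) {
        r.i = (int32_t)total;
    } else {
        r.is_double = true;
        r.d = (double)total;
    }
    return r;
}
```

Summing {1, 2, 3} stays an integer here, while summing {INT32_MAX, 1} gives 2^31, which no longer fits, so the sketch reports a double instead of NA.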


On 01/27/2018 03:06 AM, Martin Maechler wrote:
>>>>>> Henrik Bengtsson <henrik.bengtsson at gmail.com>
>>>>>>      on Thu, 25 Jan 2018 09:30:42 -0800 writes:
> 
>      > Just following up on this old thread since matrixStats 0.53.0 is now
>      > out, which supports this use case:
> 
>      >> x <- rep(TRUE, times = 2^31)
> 
>      >> y <- sum(x)
>      >> y
>      > [1] NA
>      > Warning message:
>      > In sum(x) : integer overflow - use sum(as.numeric(.))
> 
>      >> y <- matrixStats::sum2(x, mode = "double")
>      >> y
>      > [1] 2147483648
>      >> str(y)
>      > num 2.15e+09
> 
>      > No coercion is taking place, so the memory overhead is zero:
> 
>      >> profmem::profmem(y <- matrixStats::sum2(x, mode = "double"))
>      > Rprofmem memory profiling of:
>      > y <- matrixStats::sum2(x, mode = "double")
> 
>      > Memory allocations:
>      > bytes calls
>      > total     0
> 
>      > /Henrik
> 
> Thank you, Henrik, for the reminder.
> 
> Back in June, I had mentioned to Hervé and R-devel that
> 'logical' should remain to be treated as 'integer' as in all
> arithmetic in (S and) R.     Hervé did mention the isum()
> function in the C code which is relevant here .. which does have
> a LONG INT counter already -- *but* if we consider that sum()
> has '...' i.e. a conceptually arbitrary number of long vector
> integer arguments that counter won't suffice even there.
> 
> Before talking about implementation / patch, I think we should
> consider 2 possible goals of a change --- I agree the status quo
> is not a real option
> 
> 1) sum(x) for logical and integer x  would return a double
>        in any case and overflow should not happen (unless for
>        the case where the result would be larger than
>        .Machine$double.xmax, which I think will not be possible
>        even with "arbitrary" nargs() of sum).
> 
> 2) sum(x) for logical and integer x  should return an integer in
>         all cases there is no overflow, including returning
>         NA_integer_ in case of NAs.
>     If there would be an overflow it must be detected "in time"
>     and the result should be double.
> 
> The big advantage of 2) is that it is back compatible in 99.x %
> of use cases, and another advantage that it may be a very small
> bit more efficient.  Also, in the case of "counting" (logical),
> it is nice to get an integer instead of double when we can --
> entirely analogously to the behavior of length() which returns
> integer whenever possible.
> 
> The advantage of 1) is uniformity.
> 
> We should (at least provisionally) decide between 1) and 2) and then go for that.
> It could be that going for 1) may have bad
> compatibility-consequences in package space, because indeed we
> had documented sum() would be integer for logical and integer arguments.
> 
> I currently don't really have time to
> {work on implementing + dealing with the consequences}
> for either ..
> 
> Martin
> 
>      > On Fri, Jun 2, 2017 at 1:58 PM, Henrik Bengtsson
>      > <henrik.bengtsson at gmail.com> wrote:
>      >> I second this feature request (it's understandable that this and
>      >> possibly other parts of the code were left behind / forgotten after the
>      >> introduction of long vectors).
>      >>
>      >> I think mean() avoids full copies, so in the meanwhile, you can work
>      >> around this limitation using:
>      >>
>      >> countTRUE <- function(x, na.rm = FALSE) {
>      >>   nx <- length(x)
>      >>   if (nx < .Machine$integer.max) return(sum(x, na.rm = na.rm))
>      >>   nx * mean(x, na.rm = na.rm)
>      >> }
>      >>
>      >> (not sure if one needs to worry about rounding errors, i.e. where n %% 1 != 0)
>      >>
>      >> x <- rep(TRUE, times = .Machine$integer.max+1)
>      >> object.size(x)
>      >> ## 8589934632 bytes
>      >>
>      >> p <- profmem::profmem( n <- countTRUE(x) )
>      >> str(n)
>      >> ## num 2.15e+09
>      >> print(n == .Machine$integer.max + 1)
>      >> ## [1] TRUE
>      >>
>      >> print(p)
>      >> ## Rprofmem memory profiling of:
>      >> ## n <- countTRUE(x)
>      >> ##
>      >> ## Memory allocations:
>      >> ##      bytes calls
>      >> ## total     0
>      >>
>      >>
>      >> FYI / related: I've just updated matrixStats::sum2() to support
>      >> logicals (develop branch) and I'll also try to update
>      >> matrixStats::count() to count beyond .Machine$integer.max.
>      >>
>      >> /Henrik
>      >>
>      >> On Fri, Jun 2, 2017 at 4:05 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
>      >>> Hi,
>      >>>
>      >>> I have a long numeric vector 'xx' and I want to use sum() to count
>      >>> the number of elements that satisfy some criteria like non-zero
>      >>> values or values lower than a certain threshold etc...
>      >>>
>      >>> The problem is: sum() returns an NA (with a warning) if the count
>      >>> is greater than 2^31. For example:
>      >>>
>      >>> > xx <- runif(3e9)
>      >>> > sum(xx < 0.9)
>      >>> [1] NA
>      >>> Warning message:
>      >>> In sum(xx < 0.9) : integer overflow - use sum(as.numeric(.))
>      >>>
>      >>> This already takes a long time and doing sum(as.numeric(.)) would
>      >>> take even longer and require allocation of 24 GB of memory just to
>      >>> store an intermediate numeric vector made of 0s and 1s. Plus, having
>      >>> to do sum(as.numeric(.)) every time I need to count things is not
>      >>> convenient and is easy to forget.
>      >>>
>      >>> It seems that sum() on a logical vector could be modified to return
>      >>> the count as a double when it cannot be represented as an integer.
>      >>> Note that length() already does this so that wouldn't create a
>      >>> precedent. Also and FWIW prod() avoids the problem by always returning
>      >>> a double, whatever the type of the input is (except on a complex
>      >>> vector).
>      >>>
>      >>> I can provide a patch if this change sounds reasonable.
>      >>>
>      >>> Cheers,
>      >>> H.
>      >>>
>      >>> --
>      >>> Hervé Pagès
>      
> 
-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

Martin Maechler

2018-Feb-01 15:34 UTC

[Rd] sum() returns NA on a long logical vector when nb of TRUE values exceeds 2^31

>>>>> Hervé Pagès <hpages at fredhutch.org>
>>>>>     on Tue, 30 Jan 2018 13:30:18 -0800 writes:
    > Hi Martin, Henrik,
    > Thanks for the follow up.

    > @Martin: I vote for 2) without *any* hesitation :-)

    > (and uniformity could be restored at some point in the
    > future by having prod(), rowSums(), colSums(), and others
    > align with the behavior of length() and sum())

As a matter of fact, I had procrastinated and worked at
implementing '2)' already a bit on the weekend and made it work
- more or less.  It needs a bit more work, and I had also been considering
replacing the numbers in the current overflow check

	if (ii++ > 1000) {	 \
	    ii = 0;							\
	    if (s > 9000000000000000L || s < -9000000000000000L) {	\
		if(!updated) updated = TRUE;				\
		*value = NA_INTEGER;					\
		warningcall(call, _("integer overflow - use sum(as.numeric(.))")); \
		return updated;						\
	    }								\
	}								\

i.e. I thought of tweaking the '1000' and '9000000000000000L',
but decided, for the moment, to leave these and add comments there
about why.
They may look arbitrary, but are not at all: If you multiply
them (which looks correct, if we check the sum 's' only every 1000-th
time ... ((still not sure they *are* correct))) you get  9*10^18
which is only slightly smaller than  2^63 - 1  (about 9.22*10^18),
which may be the maximal "LONG_INT" integer we have.

So, in the end, at least for now, we do not quite go all the way
but overflow a bit earlier, ... but do potentially gain a bit of
speed, notably with the ITERATE_BY_REGION(..) macros
(which I did not show above).

Will hopefully become available in R-devel real soon now.

Martin
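
To make the arithmetic behind the '1000' and '9000000000000000L' above concrete, here is a small C sketch of a periodic check of this kind. It is not the macro from R's sources: `checked_sum()`, `worst_case_magnitude()`, `CHECK_EVERY` and `SUM_LIMIT` are illustrative names, and the sketch tests after every 1000th addition rather than with `ii++ > 1000`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants mirroring the numbers in the quoted macro. */
#define CHECK_EVERY 1000
#define SUM_LIMIT   INT64_C(9000000000000000)   /* 9 * 10^15 */

/*
 * Sum n 32-bit integers into a 64-bit accumulator, testing the running
 * total only once per CHECK_EVERY additions.  Returns false (leaving
 * *out untouched) as soon as a check sees |s| > SUM_LIMIT.
 */
static bool checked_sum(const int32_t *x, size_t n, int64_t *out)
{
    int64_t s = 0;
    int since_check = 0;

    for (size_t k = 0; k < n; k++) {
        s += x[k];
        if (++since_check >= CHECK_EVERY) {
            since_check = 0;
            if (s > SUM_LIMIT || s < -SUM_LIMIT)
                return false;
        }
    }
    *out = s;
    return true;
}

/*
 * Upper bound on |s| in checked_sum(): at most SUM_LIMIT after a check
 * that passed, plus CHECK_EVERY further addends, each of magnitude at
 * most 2^31, before the next check runs.
 */
static int64_t worst_case_magnitude(void)
{
    return SUM_LIMIT + (int64_t)CHECK_EVERY * (INT64_C(1) << 31);
}
```

Under these assumptions the accumulator never exceeds about 9.002*10^15 in magnitude, so in this sketch the 64-bit accumulator itself stays well below 2^63 - 1 between checks.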

Martin Maechler

2018-Feb-05 12:43 UTC

[Rd] sum() returns NA on a long logical vector when nb of TRUE values exceeds 2^31

After finishing that... I challenged myself that one should be able to do
better, namely "no overflow" (because of large/many
integer/logical), and so introduced  irsum()  which uses a double 
precision accumulator for integer/logical  ... but would really
only be used when the 64-bit int accumulator would get close to
overflow.
The resulting code is not really beautiful, and also contains
a comment     " (a waste, rare; FIXME ?) "
If anybody feels like finding a more elegant version without the
"waste" case, go ahead and be our guest ! 

Testing the code does need access to a platform with enough GB
RAM, say 32 (and I have run the checks only on servers with >
100 GB RAM). This concerns the new checks at the (current) end
of <R-devel_R>/tests/reg-large.R

In R-devel svn rev >= 74208  for a few minutes now.

Martin
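
The idea described in this message, falling back to a double precision accumulator only when the 64-bit integer one gets close to overflow, can be sketched in C as follows. This is a guess at the general shape, not the irsum() code committed to R-devel; `sum_with_double_fallback()` and its `guard` parameter are hypothetical, and NA handling is omitted.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sum n 32-bit integers and return the total as a double.  The work is
 * done exactly in a 64-bit integer; whenever the running sum exceeds
 * 'guard' in magnitude it is moved into a double and the integer
 * accumulator restarts from zero.
 *
 * Assumes 0 <= guard and guard + 2^31 <= INT64_MAX, so that one more
 * addition after a passed check cannot overflow the integer.
 */
static double sum_with_double_fallback(const int32_t *x, size_t n,
                                       int64_t guard)
{
    double  spilled = 0.0;  /* totals moved out of the integer accumulator */
    int64_t s = 0;          /* exact 64-bit running sum */

    for (size_t k = 0; k < n; k++) {
        s += x[k];
        if (s > guard || s < -guard) {
            spilled += (double)s;
            s = 0;
        }
    }
    return spilled + (double)s;
}
```

With a realistic guard such as 9000000000000000 the double accumulator is touched only for enormous totals; passing a tiny guard (say 10) is a cheap way to exercise the spill path on small inputs.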
