Hervé Pagès
2018-Jan-30 21:30 UTC
[Rd] sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31
Hi Martin, Henrik,

Thanks for the follow up.

@Martin: I vote for 2) without *any* hesitation :-)

(and uniformity could be restored at some point in the
future by having prod(), rowSums(), colSums(), and others
align with the behavior of length() and sum())

Cheers,
H.

On 01/27/2018 03:06 AM, Martin Maechler wrote:
>>>>>> Henrik Bengtsson <henrik.bengtsson at gmail.com>
>>>>>> on Thu, 25 Jan 2018 09:30:42 -0800 writes:
>
> > Just following up on this old thread since matrixStats 0.53.0 is now
> > out, which supports this use case:
>
> >> x <- rep(TRUE, times = 2^31)
>
> >> y <- sum(x)
> >> y
> > [1] NA
> > Warning message:
> > In sum(x) : integer overflow - use sum(as.numeric(.))
>
> >> y <- matrixStats::sum2(x, mode = "double")
> >> y
> > [1] 2147483648
> >> str(y)
> > num 2.15e+09
>
> > No coercion is taking place, so the memory overhead is zero:
>
> >> profmem::profmem(y <- matrixStats::sum2(x, mode = "double"))
> > Rprofmem memory profiling of:
> > y <- matrixStats::sum2(x, mode = "double")
>
> > Memory allocations:
> > bytes calls
> > total 0
>
> > /Henrik
>
> Thank you, Henrik, for the reminder.
>
> Back in June, I had mentioned to Hervé and R-devel that
> 'logical' should continue to be treated as 'integer' as in all
> arithmetic in (S and) R. Hervé did mention the isum()
> function in the C code which is relevant here .. which does have
> a LONG INT counter already -- *but* if we consider that sum()
> has '...' i.e. a conceptually arbitrary number of long vector
> integer arguments that counter won't suffice even there.
>
> Before talking about implementation / patch, I think we should
> consider 2 possible goals of a change --- I agree the status quo
> is not a real option
>
> 1) sum(x) for logical and integer x would return a double
>    in any case and overflow should not happen (unless for
>    the case where the result would be larger than
>    .Machine$double.xmax which I think will not be possible
>    even with "arbitrary" nargs() of sum).
>
> 2) sum(x) for logical and integer x should return an integer in
>    all cases where there is no overflow, including returning
>    NA_integer_ in case of NAs.
>    If there would be an overflow it must be detected "in time"
>    and the result should be double.
>
> The big advantage of 2) is that it is back compatible in 99.x %
> of use cases, and another advantage that it may be a very small
> bit more efficient. Also, in the case of "counting" (logical),
> it is nice to get an integer instead of double when we can --
> entirely analogously to the behavior of length() which returns
> integer whenever possible.
>
> The advantage of 1) is uniformity.
>
> We should (at least provisionally) decide between 1) and 2) and then go for that.
> It could be that going for 1) may have bad
> compatibility-consequences in package space, because indeed we
> had documented sum() would be integer for logical and integer arguments.
>
> I currently don't really have time to
> {work on implementing + dealing with the consequences}
> for either ..
>
> Martin
>
> > On Fri, Jun 2, 2017 at 1:58 PM, Henrik Bengtsson
> > <henrik.bengtsson at gmail.com> wrote:
> >> I second this feature request (it's understandable that this and
> >> possibly other parts of the code were left behind / forgotten after the
> >> introduction of long vectors).
> >>
> >> I think mean() avoids full copies, so in the meanwhile, you can work
> >> around this limitation using:
> >>
> >> countTRUE <- function(x, na.rm = FALSE) {
> >>   nx <- length(x)
> >>   if (nx < .Machine$integer.max) return(sum(x, na.rm = na.rm))
> >>   nx * mean(x, na.rm = na.rm)
> >> }
> >>
> >> (not sure if one needs to worry about rounding errors, i.e. where n %% 1 != 0)
> >>
> >> x <- rep(TRUE, times = .Machine$integer.max+1)
> >> object.size(x)
> >> ## 8589934632 bytes
> >>
> >> p <- profmem::profmem( n <- countTRUE(x) )
> >> str(n)
> >> ## num 2.15e+09
> >> print(n == .Machine$integer.max + 1)
> >> ## [1] TRUE
> >>
> >> print(p)
> >> ## Rprofmem memory profiling of:
> >> ## n <- countTRUE(x)
> >> ##
> >> ## Memory allocations:
> >> ## bytes calls
> >> ## total 0
> >>
> >>
> >> FYI / related: I've just updated matrixStats::sum2() to support
> >> logicals (develop branch) and I'll also try to update
> >> matrixStats::count() to count beyond .Machine$integer.max.
> >>
> >> /Henrik
> >>
> >> On Fri, Jun 2, 2017 at 4:05 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
> >>> Hi,
> >>>
> >>> I have a long numeric vector 'xx' and I want to use sum() to count
> >>> the number of elements that satisfy some criteria like non-zero
> >>> values or values lower than a certain threshold etc...
> >>>
> >>> The problem is: sum() returns an NA (with a warning) if the count
> >>> is 2^31 or greater. For example:
> >>>
> >>> > xx <- runif(3e9)
> >>> > sum(xx < 0.9)
> >>> [1] NA
> >>> Warning message:
> >>> In sum(xx < 0.9) : integer overflow - use sum(as.numeric(.))
> >>>
> >>> This already takes a long time and doing sum(as.numeric(.)) would
> >>> take even longer and require allocation of 24 GB of memory just to
> >>> store an intermediate numeric vector made of 0s and 1s. Plus, having
> >>> to do sum(as.numeric(.)) every time I need to count things is not
> >>> convenient and is easy to forget.
> >>>
> >>> It seems that sum() on a logical vector could be modified to return
> >>> the count as a double when it cannot be represented as an integer.
> >>> Note that length() already does this so that wouldn't create a
> >>> precedent. Also and FWIW prod() avoids the problem by always returning
> >>> a double, whatever the type of the input is (except on a complex
> >>> vector).
> >>>
> >>> I can provide a patch if this change sounds reasonable.
> >>>
> >>> Cheers,
> >>> H.
> >>>
> >>> --
> >>> Hervé Pagès
> >

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
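As a rough illustration of goal 2) discussed in this message (keep the integer type whenever the total fits, switch to double only on overflow), here is a minimal C sketch. It is not R's actual sum() code: the type `sum_result` and the function `sum_int_or_double` are hypothetical names made up for this example, NA handling is left out entirely, and a 32-bit `int` is assumed.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: exactly one of as_int / as_double is valid. */
typedef struct {
    int is_double;      /* 0: use as_int, 1: use as_double */
    int as_int;
    double as_double;
} sum_result;

/* Sum n ints in a 64-bit accumulator; keep the integer type when the
 * total fits in an int, otherwise fall back to double.
 * Assumption: n < 2^32, so the 64-bit accumulator itself cannot
 * overflow. NA values are not handled in this sketch. */
sum_result sum_int_or_double(const int *x, size_t n)
{
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += x[i];

    sum_result r = {0, 0, 0.0};
    if (s >= INT_MIN && s <= INT_MAX) {
        r.as_int = (int) s;          /* no overflow: stay integer */
    } else {
        r.is_double = 1;             /* overflow detected: return double */
        r.as_double = (double) s;
    }
    return r;
}
```

Summing `{1, 2, 3}` stays in the integer branch, while summing two copies of `INT_MAX` takes the double branch, which mirrors how length() switches type only when it has to.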
Martin Maechler
2018-Feb-01 15:34 UTC
[Rd] sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31
>>>>> Hervé Pagès <hpages at fredhutch.org>
>>>>> on Tue, 30 Jan 2018 13:30:18 -0800 writes:

> Hi Martin, Henrik,
> Thanks for the follow up.

> @Martin: I vote for 2) without *any* hesitation :-)

> (and uniformity could be restored at some point in the
> future by having prod(), rowSums(), colSums(), and others
> align with the behavior of length() and sum())

As a matter of fact, I had procrastinated and worked at
implementing '2)' already a bit on the weekend and made it work
- more or less. It needs a bit more work, and I had also been considering
replacing the numbers in the current overflow check

    if (ii++ > 1000) {                                          \
        ii = 0;                                                 \
        if (s > 9000000000000000L || s < -9000000000000000L) {  \
            if(!updated) updated = TRUE;                        \
            *value = NA_INTEGER;                                \
            warningcall(call, _("integer overflow - use sum(as.numeric(.))")); \
            return updated;                                     \
        }                                                       \
    }                                                           \

i.e. think of tweaking the '1000' and '9000000000000000L',
but decided to leave these and add comments there about why. For
the moment.
They may look arbitrary, but are not at all: If you multiply
them (which looks correct, if we check the sum 's' only every 1000-th
time ...((still not sure they *are* correct))) you get 9*10^18
which is only slightly smaller than 2^63 - 1 which may be the
maximal "LONG_INT" integer we have.

So, in the end, at least for now, we do not quite go all the way
but overflow a bit earlier,... but do potentially gain a bit of
speed, notably with the ITERATE_BY_REGION(..) macros
(which I did not show above).

Will hopefully become available in R-devel real soon now.

Martin

> Cheers,
> H.
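One way to sanity-check the two constants in the overflow check quoted in this message is to ask how far a signed 64-bit accumulator can drift between two checks. The C sketch below takes that reading, which is an assumption of this example and differs from multiplying the two constants: each 32-bit int adds at most 2^31 in magnitude, so a check that passes at `threshold` stays safe as long as `threshold + period * 2^31` does not exceed `INT64_MAX`. The helper name `periodic_check_is_safe` is hypothetical and is not part of R's sources.

```c
#include <stdint.h>

/* Largest magnitude a single 32-bit int can contribute: 2^31. */
#define MAX_STEP INT64_C(2147483648)

/* Returns 1 if testing |s| <= threshold only once every `period`
 * additions can never let a signed 64-bit accumulator overflow
 * before the next test, and 0 otherwise. */
int periodic_check_is_safe(int64_t threshold, int64_t period)
{
    if (threshold < 0 || period <= 0)
        return 0;
    /* Guard the multiplication period * MAX_STEP itself. */
    if (period > INT64_MAX / MAX_STEP)
        return 0;
    int64_t growth = period * MAX_STEP;   /* worst-case drift between tests */
    return threshold <= INT64_MAX - growth;
}
```

Under this reading, a threshold of 9000000000000000 with a period of roughly 1000 leaves a very wide margin, whereas a threshold a few units below `INT64_MAX` would not be safe with the same period.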
Martin Maechler
2018-Feb-05 12:43 UTC
[Rd] sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31
>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>> on Thu, 1 Feb 2018 16:34:04 +0100 writes:

> >>>>> Hervé Pagès <hpages at fredhutch.org>
> >>>>> on Tue, 30 Jan 2018 13:30:18 -0800 writes:
>
> > Hi Martin, Henrik,
> > Thanks for the follow up.
>
> > @Martin: I vote for 2) without *any* hesitation :-)
>
> > (and uniformity could be restored at some point in the
> > future by having prod(), rowSums(), colSums(), and others
> > align with the behavior of length() and sum())
>
> As a matter of fact, I had procrastinated and worked at
> implementing '2)' already a bit on the weekend and made it work
> - more or less. It needs a bit more work, and I had also been considering
> replacing the numbers in the current overflow check
>
>     if (ii++ > 1000) {                                          \
>         ii = 0;                                                 \
>         if (s > 9000000000000000L || s < -9000000000000000L) {  \
>             if(!updated) updated = TRUE;                        \
>             *value = NA_INTEGER;                                \
>             warningcall(call, _("integer overflow - use sum(as.numeric(.))")); \
>             return updated;                                     \
>         }                                                       \
>     }                                                           \
>
> i.e. think of tweaking the '1000' and '9000000000000000L',
> but decided to leave these and add comments there about why. For
> the moment.
> They may look arbitrary, but are not at all: If you multiply
> them (which looks correct, if we check the sum 's' only every 1000-th
> time ...((still not sure they *are* correct))) you get 9*10^18
> which is only slightly smaller than 2^63 - 1 which may be the
> maximal "LONG_INT" integer we have.
>
> So, in the end, at least for now, we do not quite go all the way
> but overflow a bit earlier,... but do potentially gain a bit of
> speed, notably with the ITERATE_BY_REGION(..) macros
> (which I did not show above).
>
> Will hopefully become available in R-devel real soon now.
>
> Martin

After finishing that... I challenged myself that one should be able to do
better, namely "no overflow" (because of large/many integer/logical),
and so introduced irsum() which uses a double precision accumulator
for integer/logical ... but would really only be used when the 64-bit
int accumulator would get close to overflow.

The resulting code is not really beautiful, and also contains a
comment " (a waste, rare; FIXME ?) "
If anybody feels like finding a more elegant version without the
"waste" case, go ahead and be our guest !

Testing the code does need access to a platform with enough GB
RAM, say 32 (and I have run the checks only on servers with > 100 GB RAM).
This concerns the new checks at the (current) end of
<R-devel_R>/tests/reg-large.R

In R-devel svn rev >= 74208 for a few minutes now.

Martin
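The idea described in this message, a double precision accumulator that only comes into play once the 64-bit integer accumulator gets close to overflow, can be sketched in C as follows. This is not the irsum() that went into R-devel: the function name `sum_with_double_spill` and its `limit` parameter are hypothetical, the check runs on every element rather than periodically, and NA handling is omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum n ints without integer overflow: accumulate exactly in 64 bits
 * and, once the partial sum's magnitude exceeds `limit`, move it into
 * a double accumulator and restart the integer sum from zero.
 * Assumption: 0 <= limit <= INT64_MAX - 2^31, so the one addition
 * that follows a passed check cannot overflow. */
double sum_with_double_spill(const int *x, size_t n, int64_t limit)
{
    int64_t s = 0;        /* exact integer partial sum */
    double spilled = 0.0; /* partial sums that grew past the limit */

    for (size_t i = 0; i < n; i++) {
        s += x[i];
        if (s > limit || s < -limit) {
            spilled += (double) s;
            s = 0;
        }
    }
    return spilled + (double) s;
}
```

Passing a deliberately small `limit` (for example 7) exercises the spill path on tiny inputs, which avoids needing tens of GB of RAM just to see the fallback run; with a realistic limit such as 9000000000000000 the double accumulator would almost never be touched.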