Michael Chirico
2021-Jun-20 22:20 UTC
[Rd] Should last default to .Machine$integer.max-1 for substring()
Currently, substring defaults to last=1000000L, which strongly suggests the intent is to default to "nchar(x)" without having to compute/allocate that up front. Unfortunately, this default makes no sense for "very large" strings which may exceed 1000000L in "width". The max width of a string is .Machine$integer.max-1: # works x = strrep(" ", .Machine$integer.max-1L) # fails x = strrep(" ", .Machine$integer.max) Error in strrep(" ", .Machine$integer.max) : 'Calloc' could not allocate memory (18446744071562067968 of 1 bytes) (see also the comment in src/main/character.c: "Character strings in R are less than 2^31-1 bytes, so we use int not size_t.") So it seems to me either .Machine$integer.max or .Machine$integer.max-1L would be a more sensible default. Am I missing something? Mike C
brodie gaslam
2021-Jun-21 01:28 UTC
[Rd] Should last default to .Machine$integer.max-1 for substring()
> On Sunday, June 20, 2021, 6:21:22 PM EDT, Michael Chirico <michaelchirico4 at gmail.com> wrote: > > Currently, substring defaults to last=1000000L, which strongly > suggests the intent is to default to "nchar(x)" without having to > compute/allocate that up front. > > Unfortunately, this default makes no sense for "very large" strings > which may exceed 1000000L in "width". > > The max width of a string is .Machine$integer.max-1:I think the max width is .Machine$integer.max.? What happened below is a bug due to buffer overflow in `strrep`:> # works > x = strrep(" ", .Machine$integer.max-1L) > # fails > x = strrep(" ", .Machine$integer.max) > Error in strrep(" ", .Machine$integer.max) : >?? 'Calloc' could not allocate memory (18446744071562067968 of 1 bytes)Notice the very large number that was tried to be Calloc'ed.? That's (size_t) -1. The problem is (src/include/R_ext/RS.h at 85): ??? #define CallocCharBuf(n) (char *) R_chk_calloc((R_SIZE_T) ((n)+1), sizeof(char)) The `((n) + 1)` overflows `int` and produces -1 (well, undefined behavior so who knows), which when cast to size_t produces that very large number which can't be allocated. I think this should be: ??? #define CallocCharBuf(n) (char *) R_chk_calloc(((R_SIZE_T)(n))+1, sizeof(char)) I can reproduce the failure before the change.? After the change I get: ??? > x = strrep(" ", .Machine$integer.max) ??? Error in strrep(" ", .Machine$integer.max) : ????? 'Calloc' could not allocate memory (2147483648 of 1 bytes) I believe this to be the expected result on a machine that doesn't have enough memory to allocate INT_MAX + 1 bytes, as happens to be the case on my R build system (it's a VM that gets 2GB total as the host machine can barely spare that to begin with).> (see also the comment in src/main/character.c: "Character strings in R > are less than 2^31-1 bytes, so we use int not size_t.")FWIW WRE states:> Note that R character strings are restricted to 2^31 - 1 bytesThis is INT_MAX or .Machine$integer.max, at least on machines for which `int` is 32 bits, which I think typical for machines R builds on.?? From having looked at the code a while ago I think WRE is right (so maybe the comment in the code is wrong), but it was a while ago and I haven't tried to allocate an INT_MAX long string. Sorry this doesn't answer your original question. Best, Brodie.> > > So it seems to me either .Machine$integer.max or > .Machine$integer.max-1L would be a more sensible default. Am I missing > something? > > Mike C > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel >
Martin Maechler
2021-Jun-21 07:35 UTC
[Rd] Should last default to .Machine$integer.max-1 for substring()
>>>>> Michael Chirico >>>>> on Sun, 20 Jun 2021 15:20:26 -0700 writes:> Currently, substring defaults to last=1000000L, which > strongly suggests the intent is to default to "nchar(x)" > without having to compute/allocate that up front. > Unfortunately, this default makes no sense for "very > large" strings which may exceed 1000000L in "width". Yes; and I tend to agree with you that this default is outdated (Remember : R was written to work and run on 2 (or 4?) MB of RAM on the student lab Macs in Auckland in ca 1994). > The max width of a string is .Machine$integer.max-1: (which Brodie showed was only almost true) > So it seems to me either .Machine$integer.max or > .Machine$integer.max-1L would be a more sensible default. Am I missing > something? The "drawback" is of course that .Machine$integer.max is still a function call (as R beginners may forget) contrary to <nnnnn>L, but that may even be inlined by the byte compiler (? how would we check ?) and even if it's not, it does more clearly convey the concept and idea *and* would probably even port automatically if ever integer would be increased in R. Martin