Martin Maechler
2021-Jun-21 08:32 UTC
[Rd] Should last default to .Machine$integer.max-1 for substring()
>>>>> Tomas Kalibera
>>>>>     on Mon, 21 Jun 2021 10:08:37 +0200 writes:

> On 6/21/21 9:35 AM, Martin Maechler wrote:
>>>>>>> Michael Chirico
>>>>>>>     on Sun, 20 Jun 2021 15:20:26 -0700 writes:
>> > Currently, substring defaults to last=1000000L, which
>> > strongly suggests the intent is to default to "nchar(x)"
>> > without having to compute/allocate that up front.
>>
>> > Unfortunately, this default makes no sense for "very
>> > large" strings which may exceed 1000000L in "width".
>>
>> Yes; and I tend to agree with you that this default is outdated
>> (Remember : R was written to work and run on 2 (or 4?) MB of RAM on the
>> student lab Macs in Auckland in ca 1994).
>>
>> > The max width of a string is .Machine$integer.max-1:
>>
>> (which Brodie showed was only almost true)
>>
>> > So it seems to me either .Machine$integer.max or
>> > .Machine$integer.max-1L would be a more sensible default. Am I missing
>> > something?
>>
>> The "drawback" is of course that .Machine$integer.max is still
>> a function call (as R beginners may forget) contrary to <nnnnn>L,
>> but that may even be inlined by the byte compiler (? how would we check ?)
>> and even if it's not, it does more clearly convey the concept
>> and idea *and* would probably even port automatically if ever
>> integer would be increased in R.

> We still have the problem that we need to count characters, not bytes,
> if we want the default semantics of "until the end of the string".

> I think we would have to fix this either by really using
> "nchar(type="c"))" or by using e.g. NULL and then treating this as a
> special case, that would be probably faster.

> Tomas

You are right, as always, Tomas.
I agree that would be better and we should do it if/when we change
the default there.

Martin
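To make the two issues raised above concrete, here is a small illustrative sketch (not part of the original messages). It passes last = 1000000L explicitly to mimic the default discussed in this thread, since the default in any given R version may differ, and it uses a non-ASCII character to show why counting characters and counting bytes are not the same thing. The variable names are made up for the example.

```r
# A string one character longer than the historical default for `last`.
x <- strrep("a", 1000001L)

# Passing the discussed default explicitly: the final character is dropped.
truncated <- substring(x, 1L, 1000000L)
nchar(truncated)   # 1000000

# Passing the true length keeps the whole string.
full <- substring(x, 1L, nchar(x, type = "chars"))
nchar(full)        # 1000001

# Characters versus bytes: "\u00e9" is one character stored as two UTF-8
# bytes, so a byte count is not a valid "end of string" position for
# substring(), which indexes by character.
e_acute <- "\u00e9"
n_chars <- nchar(e_acute, type = "chars")   # 1
n_bytes <- nchar(e_acute, type = "bytes")   # 2
```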
Michael Chirico
2021-Jun-21 17:21 UTC
[Rd] Should last default to .Machine$integer.max-1 for substring()
Thanks all, great points well taken. Indeed it seems the default of
1000000 predates SVN tracking in 1997.

I think a NULL default behaving as "end of string" regardless of
encoding makes sense and avoids the overheads of a $ call and a much
heavier nchar() calculation.

Mike C
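As a rough user-level illustration of the "NULL means end of string" semantics proposed in this thread, the sketch below wraps substring() in a hypothetical helper, substring_to_end(), which is not an R function. It only emulates the behaviour: it still calls nchar(type = "chars"), whereas the thread suggests special-casing NULL inside substring() itself precisely to avoid that cost.

```r
# Hypothetical wrapper (name and signature are assumptions for illustration):
# last = NULL is interpreted as "up to the end of each string".
substring_to_end <- function(text, first = 1L, last = NULL) {
  if (is.null(last)) {
    # Emulation only: count characters (not bytes) to find the end.
    last <- nchar(text, type = "chars")
  }
  substring(text, first, last)
}

# Works past the 1000000-character mark.
long <- strrep("a", 1000001L)
n_long <- nchar(substring_to_end(long))

# Character-based, so multibyte strings are handled by character position.
tail_part <- substring_to_end("h\u00e9llo", 2L)

# An explicit `last` behaves like plain substring().
mid_part <- substring_to_end("hello", 2L, 3L)
```

A real implementation along the lines suggested above would handle the NULL case in the underlying C code, so the end position never has to be computed up front.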