Tomas Kalibera
2021-Jun-21 08:08 UTC
[Rd] Should last default to .Machine$integer.max-1 for substring()
On 6/21/21 9:35 AM, Martin Maechler wrote:
>>>>>> Michael Chirico
>>>>>>     on Sun, 20 Jun 2021 15:20:26 -0700 writes:
>
>     > Currently, substring defaults to last=1000000L, which
>     > strongly suggests the intent is to default to "nchar(x)"
>     > without having to compute/allocate that up front.
>
>     > Unfortunately, this default makes no sense for "very
>     > large" strings which may exceed 1000000L in "width".
>
> Yes; and I tend to agree with you that this default is outdated
> (Remember : R was written to work and run on 2 (or 4?) MB of RAM on the
> student lab Macs in Auckland in ca 1994).
>
>     > The max width of a string is .Machine$integer.max-1:
>
> (which Brodie showed was only almost true)
>
>     > So it seems to me either .Machine$integer.max or
>     > .Machine$integer.max-1L would be a more sensible default. Am I missing
>     > something?
>
> The "drawback" is of course that .Machine$integer.max is still
> a function call (as R beginners may forget) contrary to <nnnnn>L,
> but that may even be inlined by the byte compiler (? how would we check ?)
> and even if it's not, it does more clearly convey the concept
> and idea *and* would probably even port automatically if ever
> integer would be increased in R.

We still have the problem that we need to count characters, not bytes,
if we want the default semantics of "until the end of the string".

I think we would have to fix this either by really using
"nchar(type="c")" or by using e.g. NULL and then treating this as a
special case, which would probably be faster.

Tomas

>
> Martin
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
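
For illustration only, the "NULL as a special case" idea from the message above could be sketched at the R level roughly as follows. The wrapper name `substring_to_end` is hypothetical and not part of base R, and an actual fix would presumably live inside `substring()` or its C code rather than in a wrapper. The sketch only shows that the end position has to come from a character count, not a byte count.

```r
# Hypothetical wrapper (not base R): last = NULL means "up to the end of
# each string", computed per element from the number of characters.
substring_to_end <- function(text, first, last = NULL) {
  if (is.null(last)) {
    # type = "chars" counts characters; type = "bytes" would over-count
    # for multibyte encodings such as UTF-8.
    last <- nchar(text, type = "chars")
  }
  substring(text, first, last)
}

x <- "na\u00efve"            # the \u escape yields a UTF-8 string
nchar(x, type = "chars")     # 5
nchar(x, type = "bytes")     # 6, because the i-diaeresis takes two bytes
substring_to_end(x, 3)       # from the third character to the end
```

Computing `nchar()` up front costs an extra pass over the strings, which is presumably why handling NULL directly in the internal code is expected to be faster.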
Martin Maechler
2021-Jun-21 08:32 UTC
[Rd] Should last default to .Machine$integer.max-1 for substring()
>>>>> Tomas Kalibera
>>>>>     on Mon, 21 Jun 2021 10:08:37 +0200 writes:

    > On 6/21/21 9:35 AM, Martin Maechler wrote:
    >>>>>>> Michael Chirico
    >>>>>>>     on Sun, 20 Jun 2021 15:20:26 -0700 writes:
    >> > Currently, substring defaults to last=1000000L, which
    >> > strongly suggests the intent is to default to "nchar(x)"
    >> > without having to compute/allocate that up front.
    >>
    >> > Unfortunately, this default makes no sense for "very
    >> > large" strings which may exceed 1000000L in "width".
    >>
    >> Yes; and I tend to agree with you that this default is outdated
    >> (Remember : R was written to work and run on 2 (or 4?) MB of RAM on the
    >> student lab Macs in Auckland in ca 1994).
    >>
    >> > The max width of a string is .Machine$integer.max-1:
    >>
    >> (which Brodie showed was only almost true)
    >>
    >> > So it seems to me either .Machine$integer.max or
    >> > .Machine$integer.max-1L would be a more sensible default. Am I missing
    >> > something?
    >>
    >> The "drawback" is of course that .Machine$integer.max is still
    >> a function call (as R beginners may forget) contrary to <nnnnn>L,
    >> but that may even be inlined by the byte compiler (? how would we check ?)
    >> and even if it's not, it does more clearly convey the concept
    >> and idea *and* would probably even port automatically if ever
    >> integer would be increased in R.

    > We still have the problem that we need to count characters, not bytes,
    > if we want the default semantics of "until the end of the string".

    > I think we would have to fix this either by really using
    > "nchar(type="c")" or by using e.g. NULL and then treating this as a
    > special case, which would probably be faster.

    > Tomas

You are right, as always, Tomas.
I agree that would be better and we should do it if/when we change
the default there.

Martin
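
As a small illustration of the concern raised at the start of the thread, the snippet below passes the two candidate values for `last` explicitly, so it does not depend on whatever default a given R version uses. The test string is made up for the example.

```r
# A string longer than the historical cut-off of 1000000 characters.
x <- strrep("a", 1500000L)

# With last = 1000000L the result is cut off after one million characters.
nchar(substring(x, 1, 1000000L))               # 1000000

# With last = .Machine$integer.max the whole string is returned.
nchar(substring(x, 1, .Machine$integer.max))   # 1500000
```

Both calls specify the end position in characters, so the character-versus-byte question discussed above only matters once the default is meant to express "until the end of the string".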