thr3ads.net - R devel - [Rd] Should last default to .Machine$integer.max-1 for substring() [Jun 2021]


Martin Maechler

2021-Jun-21 08:32 UTC

[Rd] Should last default to .Machine$integer.max-1 for substring()

>>>>> Tomas Kalibera 
>>>>>     on Mon, 21 Jun 2021 10:08:37 +0200 writes:
    > On 6/21/21 9:35 AM, Martin Maechler wrote:
    >>>>>>> Michael Chirico
    >>>>>>> on Sun, 20 Jun 2021 15:20:26 -0700 writes:
    >> > Currently, substring defaults to last=1000000L, which
    >> > strongly suggests the intent is to default to "nchar(x)"
    >> > without having to compute/allocate that up front.
    >> 
    >> > Unfortunately, this default makes no sense for "very
    >> > large" strings which may exceed 1000000L in "width".
    >> 
    >> Yes; and I tend to agree with you that this default is outdated
    >> (Remember: R was written to work and run on 2 (or 4?) MB of RAM on the
    >> student lab Macs in Auckland in ca 1994).
    >> 
    >> > The max width of a string is .Machine$integer.max-1:
    >> 
    >> (which Brodie showed was only almost true)
    >> 
    >> > So it seems to me either .Machine$integer.max or
    >> > .Machine$integer.max-1L would be a more sensible default. Am I missing
    >> > something?
    >> 
    >> The "drawback" is of course that .Machine$integer.max is still
    >> a function call (as R beginners may forget) contrary to <nnnnn>L,
    >> but that may even be inlined by the byte compiler (? how would we check ?)
    >> and even if it's not, it does more clearly convey the concept
    >> and idea *and* would probably even port automatically if ever
    >> integer would be increased in R.

    > We still have the problem that we need to count characters, not bytes,
    > if we want the default semantics of "until the end of the string".

    > I think we would have to fix this either by really using
    > "nchar(type="c")" or by using e.g. NULL and then treating this as a
    > special case, that would be probably faster.

    > Tomas

You are right, as always, Tomas.
I agree that would be better and we should do it if/when we change
the default there.

Martin
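
For readers following the thread, here is a minimal R sketch of the two points agreed on in this message: a fixed `last` of 1000000L truncates longer strings, and "until the end of the string" has to be counted in characters rather than bytes. The variables `x` and `y` are illustrative only, and `last` is passed explicitly so that the result does not depend on the default of whichever R version is installed.

```r
# A string longer than the 1000000L default discussed in the thread.
x <- strrep("a", 1500000L)

# With last = 1000000L the result is cut off at one million characters.
nchar(substring(x, 1L, 1000000L))   # 1000000
# With last = nchar(x) the whole string comes back.
nchar(substring(x, 1L, nchar(x)))   # 1500000

# Characters versus bytes: "\u00e9" (e with acute accent) is stored as
# UTF-8, so it is one character but two bytes.
y <- "\u00e9\u00e9\u00e9"
nchar(y, type = "chars")   # 3
nchar(y, type = "bytes")   # 6
```

Because substring() indexes by character, any replacement default has to mean "at least as many characters as the string holds"; a byte count is only guaranteed to match the character count for single-byte text.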

Michael Chirico

2021-Jun-21 17:21 UTC


Thanks all, great points well taken. Indeed it seems the default of
1000000 predates SVN tracking in 1997.

I think a NULL default behaving as "end of string" regardless of
encoding makes sense and avoids the overheads of a $ call and a much
heavier nchar() calculation.

Mike C
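
As an illustration of the semantics proposed here, the following is a rough R-level sketch of a NULL default meaning "end of string". The wrapper name `substring_to_end` is hypothetical and not part of base R. Note that this sketch still calls nchar() when `last` is NULL, so it shows only the intended behaviour; the speed benefit mentioned in the thread would presumably require treating NULL as a special case inside substring()'s internal code instead.

```r
# Hypothetical wrapper (not base R): last = NULL means "until the end
# of the string", counted in characters rather than bytes.
substring_to_end <- function(text, first, last = NULL) {
  if (is.null(last)) {
    # Resolve the special case per element; nchar() is vectorised.
    last <- nchar(text, type = "chars")
  }
  substring(text, first, last)
}

# Works past the 1000000-character mark without an explicit last.
long <- strrep("a", 1500000L)
nchar(substring_to_end(long, 1L))      # 1500000

# Counts characters, so multi-byte text is not cut short.
substring_to_end("h\u00e9llo", 2L)

# An explicit last behaves exactly like substring().
substring_to_end("hello", 2L, 3L)      # "el"
```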

