thr3ads.net - R devel - [Rd] Should last default to .Machine$integer.max-1 for substring() [Jun 2021]

Michael Chirico

2021-Jun-21 17:21 UTC

[Rd] Should last default to .Machine$integer.max-1 for substring()

Thanks all, great points well taken. Indeed it seems the default of
1000000 predates SVN tracking in 1997.

I think a NULL default behaving as "end of string" regardless of
encoding makes sense and avoids the overheads of a $ call and a much
heavier nchar() calculation.

Mike C
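
A minimal R-level sketch of the "NULL means end of string" idea follows. The name substring_to_end is hypothetical and not part of base R. Unlike the proposal above, which would presumably handle NULL internally, this emulation still calls nchar() whenever last is omitted, so it shows the intended semantics rather than the performance benefit.

```r
# Hypothetical wrapper (not base R): last = NULL means "to the end of the string".
substring_to_end <- function(text, first, last = NULL) {
  if (is.null(last)) {
    # Count characters, not bytes, so multibyte strings are handled correctly.
    last <- nchar(text, type = "chars")
  }
  substring(text, first, last)
}

# A string longer than the historical 1000000L cutoff.
long_string <- strrep("a", 1000005L)

# With last = 1000000L the result is silently truncated to 1000000 characters.
truncated <- substring(long_string, 1, 1000000L)

# With the NULL default the full string is returned.
complete <- substring_to_end(long_string, 1)
```

Here nchar(truncated) is 1000000 while nchar(complete) is 1000005, which is the discrepancy the thread is about.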

On Mon, Jun 21, 2021 at 1:32 AM Martin Maechler
<maechler at stat.math.ethz.ch> wrote:
>
> >>>>> Tomas Kalibera
> >>>>>     on Mon, 21 Jun 2021 10:08:37 +0200 writes:
>
>     > On 6/21/21 9:35 AM, Martin Maechler wrote:
>     >>>>>>> Michael Chirico
>     >>>>>>> on Sun, 20 Jun 2021 15:20:26 -0700 writes:
>     >> > Currently, substring defaults to last=1000000L, which
>     >> > strongly suggests the intent is to default to "nchar(x)"
>     >> > without having to compute/allocate that up front.
>     >>
>     >> > Unfortunately, this default makes no sense for "very
>     >> > large" strings which may exceed 1000000L in "width".
>     >>
>     >> Yes; and I tend to agree with you that this default is outdated
>     >> (Remember: R was written to work and run on 2 (or 4?) MB of RAM on the
>     >> student lab Macs in Auckland in ca 1994).
>     >>
>     >> > The max width of a string is .Machine$integer.max-1:
>     >>
>     >> (which Brodie showed was only almost true)
>     >>
>     >> > So it seems to me either .Machine$integer.max or
>     >> > .Machine$integer.max-1L would be a more sensible default. Am I missing
>     >> > something?
>     >>
>     >> The "drawback" is of course that .Machine$integer.max is still
>     >> a function call (as R beginners may forget) contrary to <nnnnn>L,
>     >> but that may even be inlined by the byte compiler (? how would we check ?)
>     >> and even if it's not, it does more clearly convey the concept
>     >> and idea *and* would probably even port automatically if ever
>     >> integer would be increased in R.
>
>     > We still have the problem that we need to count characters, not bytes,
>     > if we want the default semantics of "until the end of the string".
>
>     > I think we would have to fix this either by really using
>     > "nchar(type="c"))" or by using e.g. NULL and then treating this as a
>     > special case, that would be probably faster.
>
>     > Tomas
>
> You are right, as always, Tomas.
> I agree that would be better and we should do it if/when we change
> the default there.
>
> Martin
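
The characters-versus-bytes point raised in the quoted discussion can be illustrated with a short sketch. The variable names are made up for the example, and the string is written with a \u escape so that it is stored as UTF-8 regardless of the session locale.

```r
# "h\u00e9llo" is "hello" with an e-acute: 5 characters, but 6 bytes in UTF-8.
word <- "h\u00e9llo"

n_chars <- nchar(word, type = "chars")   # 5: what substring() positions refer to
n_bytes <- nchar(word, type = "bytes")   # 6: the accented letter takes two bytes

# substring() indexes by character, so an "end of string" default has to be
# expressed as a character count rather than a byte count.
tail_by_chars <- substring(word, 2, n_chars)
```

Using the byte count as the end position would still return the right tail here, because substring() clamps positions past the end, but it would no longer be the true character length of the string.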

Bill Dunlap

2021-Jun-21 19:25 UTC


[Rd] Should last default to .Machine$integer.max-1 for substring()

NULL cannot be in an integer or numeric vector so it would not be a good
fit for substring's 'first' or 'last' argument (or substr's 'start' and
'stop').  Also, it is conceivable that string lengths may be 64 bit
integers in the future, so why not use Inf as the default?  Then the
following would give 4 identical results with no warning:

> substring("abcde", 3, c(10, 2^31-1, 2^31, Inf))
[1] "cde" "cde" NA    NA
Warning message:
In substring("abcde", 3, c(10, 2^31 - 1, 2^31, Inf)) :
  NAs introduced by coercion to integer range

-Bill
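
As a rough illustration of the Inf suggestion, the sketch below clamps 'last' to the integer range before handing it to substring(). The wrapper name substring_clamped is hypothetical, and clamping at the R level is only one assumed way to get this behaviour; the thread does not say how it would be implemented in base R.

```r
# Hypothetical wrapper (not base R): accept Inf or values beyond the integer
# range for 'last' and treat them as "to the end of the string".
substring_clamped <- function(text, first, last = Inf) {
  # pmin() maps Inf and 2^31 down to .Machine$integer.max, so the coercion
  # to integer inside substring() no longer produces NA or a warning.
  last <- pmin(last, .Machine$integer.max)
  substring(text, first, last)
}

result <- substring_clamped("abcde", 3, c(10, 2^31 - 1, 2^31, Inf))
```

With this wrapper all four elements of result are "cde", and omitting 'last' altogether gives the same tail.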

