On 7/29/24 09:37, Ivan Krylov via R-devel wrote:
> On Sun, 28 Jul 2024 20:02:21 -0400
> Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>
>> gsub("^([0-9]{,5}).*","\\1","123456789")
>> [1] "123456"
> This is in TRE itself: for "^([0-9]{,1})" tre_regexecb returns
> {.rm_so = 0, .rm_eo = 1}, matching "1", but for "^([0-9]{,2})" and
> above it returns an off-by-one result, {.rm_so = 0, .rm_eo = 3}.
>
> Compiling with TRE_DEBUG, I see it parsed correctly:
>
> catenation, sub 0, 0 tags
> assertions: bol
> iteration {-1, 2}, sub -1, 0 tags, greedy
> literal (0, 9) (48, 57), pos 0, sub -1, 0 tags
>
> ...but after tre_expand_ast I see
>
> catenation, sub 0, 1 tags
> assertions: bol
> catenation, sub -1, 1 tags
> tag 0
> union, sub -1, 0 tags
> literal empty
> catenation, sub -1, 0 tags
> literal (0, 9) (48, 57), pos 2, sub -1, 0 tags
> union, sub -1, 0 tags
> literal empty
> catenation, sub -1, 0 tags
> literal (0, 9) (48, 57), pos 1, sub -1, 0 tags
> union, sub -1, 0 tags
> literal empty
> literal (0, 9) (48, 57), pos 0, sub -1, 0 tags
>
> ...which has one too many copies of "literal (0,9)". I think it's due
> to the expansion loop on line 942 of src/extra/tre/tre-compile.c being
>
> for (j = iter->min; j < iter->max; j++)
>
> ...where 'min' is -1 to denote no minimum. This is further confirmed by
> "{0,3}", "{1,3}", "{2,3}", "{3,3}" all working correctly.
>
> Neither TRE documentation [1] nor POSIX [2] specifies the {,n} syntax:
> from my reading, it looks like if the upper boundary is specified, the
> lower boundary must be specified too. But if we do want to fix this, it
> will have to be a special case for iter->min == -1.

Thanks. It seems that TRE is now maintained upstream again, so it would
be best to discuss this with the TRE maintainers directly (if it is not
already solved by https://github.com/laurikari/tre/pull/98).
The same applies to any other open TRE issues.

Best
Tomas