Thanks, Tomas. Do note that my original post also mentioned a bug or
documentation error in PCRE for this regexp:
> - perl = TRUE does *not* give the documented result on at least one
> system (which is "123456789", because "{,5}" is documented to not be a
> quantifier, so it should only match the literal string "{,5}").
Duncan
On 2024-08-01 6:49 a.m., Tomas Kalibera wrote:
> On 7/29/24 09:37, Ivan Krylov via R-devel wrote:
>> On Sun, 28 Jul 2024 20:02:21 -0400,
>> Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>>
>>> gsub("^([0-9]{,5}).*","\\1","123456789")
>>> [1] "123456"
>> This is in TRE itself: for "^([0-9]{,1})" tre_regexecb returns
>> {.rm_so = 0, .rm_eo = 1}, matching "1", but for "^([0-9]{,2})" and
>> above it returns an off-by-one result, {.rm_so = 0, .rm_eo = 3}.
>>
>> Compiling with TRE_DEBUG, I see it parsed correctly:
>>
>> catenation, sub 0, 0 tags
>> assertions: bol
>> iteration {-1, 2}, sub -1, 0 tags, greedy
>> literal (0, 9) (48, 57), pos 0, sub -1, 0 tags
>>
>> ...but after tre_expand_ast I see
>>
>> catenation, sub 0, 1 tags
>> assertions: bol
>> catenation, sub -1, 1 tags
>> tag 0
>> union, sub -1, 0 tags
>> literal empty
>> catenation, sub -1, 0 tags
>> literal (0, 9) (48, 57), pos 2, sub -1, 0 tags
>> union, sub -1, 0 tags
>> literal empty
>> catenation, sub -1, 0 tags
>> literal (0, 9) (48, 57), pos 1, sub -1, 0 tags
>> union, sub -1, 0 tags
>> literal empty
>> literal (0, 9) (48, 57), pos 0, sub -1, 0 tags
>>
>> ...which has one too many copies of "literal (0,9)". I think it's due
>> to the expansion loop on line 942 of src/extra/tre/tre-compile.c being
>>
>> for (j = iter->min; j < iter->max; j++)
>>
>> ...where 'min' is -1 to denote no minimum. This is further confirmed by
>> "{0,3}", "{1,3}", "{2,3}", "{3,3}" all working correctly.
>>
>> Neither TRE documentation [1] nor POSIX [2] specifies the {,n} syntax:
>> from my reading, it looks like if the upper boundary is specified, the
>> lower boundary must be specified too. But if we do want to fix this, it
>> will have to be a special case for iter->min == -1.
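
To make the off-by-one concrete, here is a minimal sketch of the loop
arithmetic described above. It is a simplified model, not the TRE source:
the function names are hypothetical, and clamping -1 to 0 is only one
possible way to special-case iter->min == -1, not necessarily what
upstream does.

```c
#include <assert.h>

/* Simplified model of the expansion loop quoted above; NOT the actual
 * TRE code. Both function names are hypothetical. */

/* Count the optional copies produced by the loop as written:
 *     for (j = min; j < max; j++)
 * min == -1 stands for "no minimum"; max is assumed to be finite. */
int optional_copies_as_written(int min, int max)
{
    int copies = 0;
    assert(max >= 0);               /* finite upper bound only */
    for (int j = min; j < max; j++)
        copies++;
    return copies;
}

/* Same loop, but with "no minimum" (-1) treated as a minimum of 0.
 * This is an assumed fix, shown only to illustrate the special case. */
int optional_copies_clamped(int min, int max)
{
    int copies = 0;
    int start = (min < 0) ? 0 : min;
    assert(max >= 0);
    for (int j = start; j < max; j++)
        copies++;
    return copies;
}
```

With min = -1 and max = 5 the first function counts 6 optional copies,
which agrees with the "123456" result above; the clamped version counts 5.
For explicit minimums such as {0,3} or {3,3} the two functions agree.
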
>
> Thanks. It seems that TRE is now maintained again upstream, so it would
> be best to discuss this with TRE maintainers directly (if not already
> solved by https://github.com/laurikari/tre/pull/98).
>
> The same applies to any other open TRE issues.
>
> Best Tomas
>