https://bugs.r-project.org/show_bug.cgi?id=16745 (from 2016, still labelled
"UNCONFIRMED") contains some other examples of strsplit misbehaving when
using 0-length perl look-behinds.  E.g.,
> strsplit(split="[[:<:]]", "One, two; three!", perl=TRUE)[[1]]
 [1] "O"  "n"  "e"  ", " "t"  "w"  "o"  "; " "t"  "h"  "r"  "e"  "e"  "!"
> gsub(pattern="[[:<:]]", "#", "One, two; three!", perl=TRUE)
[1] "#One, #two; #three!"
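As a possible workaround, here is a minimal sketch that finds the match positions with gregexpr() and cuts the string there with substring(), instead of relying on strsplit()'s own scanning. The name split_at_matches is made up for illustration (it is not a base R function), and the sketch assumes that gregexpr(perl=TRUE) reports zero-length matches with a match.length of 0 and sees the whole string when evaluating look-behinds.

```r
# Hypothetical helper, not part of base R: split ONE string at the matches
# that gregexpr() reports, rather than letting strsplit() scan the string.
split_at_matches <- function(x, pattern) {
  stopifnot(is.character(x), length(x) == 1L)
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  if (m[1] == -1L) return(x)              # no match: nothing to split
  starts <- as.integer(m)                 # 1-based start of each match
  lens   <- attr(m, "match.length")       # 0 for zero-length matches
  # Each piece runs from just after the previous match to just before the next.
  piece_from <- c(1L, starts + lens)
  piece_to   <- c(starts - 1L, nchar(x))
  substring(x, piece_from, piece_to)      # from > to yields ""
}

split_at_matches("One, two; three!", "[[:<:]]")
split_at_matches("split here --><-- split here", "(?=<--)|(?<=-->)")
```

Unlike Perl's split(), this keeps an empty first piece when the pattern matches at the very start of the string, so drop it if you do not want it.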
The bug report includes the comment:
It may be possible that strsplit is not using the startoffset argument
to pcre_exec
  pcre/pcre/doc/html/pcreapi.html
    A non-zero starting offset is useful when searching for another match
    in the same subject by calling pcre_exec() again after a previous
    success. Setting startoffset differs from just passing over a
    shortened string and setting PCRE_NOTBOL in the case of a pattern that
    begins with any kind of lookbehind.
or it could be something else.
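To make that idea concrete, here is a rough emulation under the assumption that each search is run on the not-yet-consumed remainder of the string, so that look-behinds cannot see the text already split off. emulate_strsplit is a hypothetical name and this is not R's C code; it only follows the loop described in the Details of ?strsplit as I read it, plus an assumed one-character step whenever a match ends at the very start of the remainder.

```r
# Hypothetical emulation, NOT R's actual implementation: search the remainder
# of the string from scratch each time, which hides earlier text from
# look-behind assertions.
emulate_strsplit <- function(x, pattern) {
  stopifnot(is.character(x), length(x) == 1L)
  out  <- character(0)
  rest <- x
  while (nzchar(rest)) {
    m <- regexpr(pattern, rest, perl = TRUE)
    if (m == -1L) {                       # no further match: keep the remainder
      out <- c(out, rest)
      break
    }
    end <- m + attr(m, "match.length") - 1L   # last matched position (m - 1 if empty)
    if (end >= 1L) {
      # The match ends past the start: emit the text before it, drop through its end.
      out  <- c(out, substr(rest, 1L, m - 1L))
      rest <- substr(rest, end + 1L, nchar(rest))
    } else {
      # Zero-length match at the very start: emit one character and move on.
      out  <- c(out, substr(rest, 1L, 1L))
      rest <- substr(rest, 2L, nchar(rest))
    }
  }
  out
}

emulate_strsplit("split here --><-- split here", "(?=<--)|(?<=-->)")
```

If the assumption holds, this gives the same three pieces that strsplit() returns for the 'split here' example in the quoted message, which would be consistent with the startoffset explanation without proving it.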
On Fri, May 5, 2023 at 3:25 AM Ivan Krylov <krylov.r00t at gmail.com> wrote:
> On Thu, 4 May 2023 23:59:33 +0300
> Leonard Mada via R-help <r-help at r-project.org> wrote:
>
> > strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
> > # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"
> >
> > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
> > # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"
> >
> > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])", perl=T)
> > # "a"    "bc"   ","    "def"  ","    ""     "adef" ","    "," "gh"
> >
> >
> > Is this correct?
>
> Perl seems to return the results you expect:
>
> $ perl -E '
>  say("$_:\n ", join " ", map qq["$_"], split $_, q[a bc,def, adef ,,gh])
>  for (
>   qr[ |(?=,)|(?<=,)(?![ ])],
>   qr[ |(?<! )(?=,)|(?<=,)(?![ ])],
>   qr[ |(?<! )(?=,)|(?<=,)(?=[^ ])]
> )'
> (?^u: |(?=,)|(?<=,)(?![ ])):
>  "a" "bc" "," "def" "," "adef" "," "," "gh"
> (?^u: |(?<! )(?=,)|(?<=,)(?![ ])):
>  "a" "bc" "," "def" "," "adef" "," "," "gh"
> (?^u: |(?<! )(?=,)|(?<=,)(?=[^ ])):
>  "a" "bc" "," "def" "," "adef" "," "," "gh"
>
> The same thing happens when I ask R to replace the separators instead
> of splitting by them:
>
> sapply(setNames(nm = c(
>  " |(?=,)|(?<=,)(?![ ])",
>  " |(?<! )(?=,)|(?<=,)(?![ ])",
>  " |(?<! )(?=,)|(?<=,)(?=[^ ])")
> ), gsub, '[]', "a bc,def, adef ,,gh", perl = TRUE)
> #               |(?=,)|(?<=,)(?![ ])         |(?<! )(?=,)|(?<=,)(?![ ])
> # "a[]bc[],[]def[],[]adef[],[],[]gh" "a[]bc[],[]def[],[]adef[],[],[]gh"
> #        |(?<! )(?=,)|(?<=,)(?=[^ ])
> # "a[]bc[],[]def[],[]adef[],[],[]gh"
>
> I think that something strange happens when the delimiter pattern
> matches more than once in the same place:
>
> gsub(
>  '(?=<--)|(?<=-->)', '[]', 'split here --><-- split here',
>  perl = TRUE
> )
> # [1] "split here -->[]<-- split here"
>
> (Both Perl's split() and s///g agree with R's gsub() here, although I
> would have accepted "split here -->[][]<-- split here" too.)
>
> On the other hand, the following doesn't look right:
>
> strsplit(
>  'split here --><-- split here', '(?=<--)|(?<=-->)',
>  perl = TRUE
> )
> # [[1]]
> # [1] "split here -->" "<"              "-- split here"
>
> The "<" is definitely not followed by "<--", and the rightmost "--" is
> definitely not preceded by "-->".
>
> Perhaps strsplit() incorrectly advances the match position after one
> match?
>
> --
> Best regards,
> Ivan