https://bugs.r-project.org/show_bug.cgi?id=16745 (from 2016, still labelled
"UNCONFIRMED") contains some other examples of strsplit misbehaving when
using 0-length Perl look-behinds. E.g.,
> strsplit(split="[[:<:]]", "One, two; three!", perl=TRUE)[[1]]
[1] "O"  "n"  "e"  ", " "t"  "w"  "o"  "; " "t"  "h"  "r"  "e"  "e"  "!"
> gsub(pattern="[[:<:]]", "#", "One, two; three!", perl=TRUE)
[1] "#One, #two; #three!"
The bug report includes the comment:
  It may be possible that strsplit is not using the startoffset argument
  to pcre_exec
    pcre/pcre/doc/html/pcreapi.html
    A non-zero starting offset is useful when searching for another match
    in the same subject by calling pcre_exec() again after a previous
    success. Setting startoffset differs from just passing over a
    shortened string and setting PCRE_NOTBOL in the case of a pattern that
    begins with any kind of lookbehind.
or it could be something else.
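For what it is worth, the algorithm described in ?strsplit (find a match,
emit the text to its left, drop the match and everything before it, then
search the shortened remainder) can be emulated with regexpr(). The sketch
below is only a model, not R's actual C code: emulate_strsplit is a
hypothetical helper, and the rule that an empty match at the very start of
the remainder consumes one character is my reading of the documented
"split into single characters" behaviour. Because each search sees only the
remainder, a look-behind cannot see text that was already consumed, which is
the situation the pcreapi excerpt warns about.

```r
# Hypothetical model of strsplit(perl = TRUE): every search is run on the
# *shortened* string, so look-behinds lose their left context.
emulate_strsplit <- function(x, pattern) {
  tokens <- character(0)
  while (nzchar(x)) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)                 # 1-based start, -1 if no match
    if (start == -1L) {
      tokens <- c(tokens, x)               # no more separators: keep the rest
      break
    }
    end <- start - 1L + attr(m, "match.length")  # chars consumed up to match end
    if (end > 0L) {
      # Emit the text left of the match, then drop it together with the match.
      tokens <- c(tokens, substr(x, 1L, start - 1L))
      x <- substr(x, end + 1L, nchar(x))
    } else {
      # Assumption: an empty match at position 1 consumes a single character.
      tokens <- c(tokens, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  tokens
}

x <- "split here --><-- split here"
p <- "(?=<--)|(?<=-->)"
emulate_strsplit(x, p)
identical(emulate_strsplit(x, p), strsplit(x, p, perl = TRUE)[[1]])
```

If the model is right, emulate_strsplit() should agree with strsplit() on
Ivan's example quoted below, including the stray "<" piece, whereas gsub(),
which presumably keeps searching within the full subject, does not show the
problem.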
On Fri, May 5, 2023 at 3:25 AM Ivan Krylov <krylov.r00t at gmail.com> wrote:
> On Thu, 4 May 2023 23:59:33 +0300
> Leonard Mada via R-help <r-help at r-project.org> wrote:
>
> > strsplit("a bc,def, adef ,,gh", " |(?=,)|(?<=,)(?![ ])", perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?![ ])", perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> > strsplit("a bc,def, adef ,,gh", " |(?<! )(?=,)|(?<=,)(?=[^ ])",
> > perl=T)
> > # "a" "bc" "," "def" "," "" "adef" "," "," "gh"
> >
> >
> > Is this correct?
>
> Perl seems to return the results you expect:
>
> $ perl -E '
> say("$_:\n ", join " ", map qq["$_"], split $_, q[a bc,def, adef ,,gh])
> for (
> qr[ |(?=,)|(?<=,)(?![ ])],
> qr[ |(?<! )(?=,)|(?<=,)(?![ ])],
> qr[ |(?<! )(?=,)|(?<=,)(?=[^ ])]
> )'
> (?^u: |(?=,)|(?<=,)(?![ ])):
> "a" "bc" "," "def" "," "adef" "," "," "gh"
> (?^u: |(?<! )(?=,)|(?<=,)(?![ ])):
> "a" "bc" "," "def" "," "adef" "," "," "gh"
> (?^u: |(?<! )(?=,)|(?<=,)(?=[^ ])):
> "a" "bc" "," "def" "," "adef" "," "," "gh"
>
> The same thing happens when I ask R to replace the separators instead
> of splitting by them:
>
> sapply(setNames(nm = c(
> " |(?=,)|(?<=,)(?![ ])",
> " |(?<! )(?=,)|(?<=,)(?![ ])",
> " |(?<! )(?=,)|(?<=,)(?=[^ ])")
> ), gsub, '[]', "a bc,def, adef ,,gh", perl = TRUE)
> #              |(?=,)|(?<=,)(?![ ])         |(?<! )(?=,)|(?<=,)(?![ ])
> # "a[]bc[],[]def[],[]adef[],[],[]gh" "a[]bc[],[]def[],[]adef[],[],[]gh"
> #       |(?<! )(?=,)|(?<=,)(?=[^ ])
> # "a[]bc[],[]def[],[]adef[],[],[]gh"
>
> I think that something strange happens when the delimiter pattern
> matches more than once in the same place:
>
> gsub(
> '(?=<--)|(?<=-->)', '[]', 'split here --><-- split here',
> perl = TRUE
> )
> # [1] "split here -->[]<-- split here"
>
> (Both Perl's split() and s///g agree with R's gsub() here, although I
> would have accepted "split here -->[][]<-- split here" too.)
>
> On the other hand, the following doesn't look right:
>
> strsplit(
> 'split here --><-- split here', '(?=<--)|(?<=-->)',
> perl = TRUE
> )
> # [[1]]
> # [1] "split here -->" "<" "-- split here"
>
> The "<" is definitely not followed by "<--", and the rightmost "--" is
> definitely not preceded by "-->".
>
> Perhaps strsplit() incorrectly advances the match position after one
> match?
>
> --
> Best regards,
> Ivan