thr3ads.net - R devel - [Rd] gsub('(.).(.)(.)', '\\3\\2\\1', 'gsub') [Mar 2009]


Wacek Kusnierczyk

2009-Mar-22 00:36 UTC

[Rd] gsub('(.).(.)(.)', '\\3\\2\\1', 'gsub')

there seems to be something wrong with r's regexing.  consider the
following example:

    gregexpr('a*|b', 'ab')
    # positions: 1 2
    # lengths: 1 1

    gsub('a*|b', '.', 'ab')
    # ..

where the pattern matches any number of 'a's or one b, and replaces the
match with a dot, globally.  the answer is correct (assuming a dfa
engine).  however,

    gregexpr('a*|b', 'ab', perl=TRUE)
    # positions: 1 2
    # lengths: 1 0

    gsub('a*|b', '.', 'ab', perl=TRUE)
    # .b.

where the pattern is identical, but the result is wrong.  perl uses an
nfa (if it used a dfa, the result would still be wrong), and in the
above example it should find *four* matches, collectively including
*all* letters in the input, thus producing *four* dots (and *only* dots)
in the output:

    perl -le '
       $input = qq|ab|;
       print qq|match: "$_"| foreach $input =~ /a*|b/g;
       $input =~ s/a*|b/./g;
       print qq|output: "$input"|;'
    # match: "a"
    # match: ""
    # match: "b"
    # match: ""
    # output: "...."
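
the same convention can be checked in a third engine.  below is a minimal sketch, assuming python 3.7 or later (python is not used in the original report; its re module documents that a non-empty match may start right after an empty one).  the helper name spans is made up for illustration, and the starts it reports are zero-based, unlike gregexpr's one-based positions:

```python
import re

def spans(pattern, text):
    """Return (start, length) for every match, in the order found."""
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(pattern, text)]

# enumerate the matches, then do the global substitution
print(spans(r'a*|b', 'ab'))
print(re.sub(r'a*|b', '.', 'ab'))
```

on such a python this should list four matches ('a', an empty match before 'b', 'b', and an empty match at the end) and print four dots, in agreement with the perl output above.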

since with perl=TRUE both gregexpr and gsub seem to use pcre, i've
checked the example with pcretest, and also with a trivial c program
(available on demand) using the pcre api;  there were four matches,
exactly as in the perl bit above.

the results above are surprising, and suggest a bug in r's use of pcre
rather than in pcre itself.  possibly, the issue is that when an empty
string is matched (with a*, for example), the next attempt is not trying
to match a non-empty string at the same position, but rather an empty
string again at the next position.  for example,

    gsub('a|b|c', '.', 'abc', perl=TRUE)
    # "...", correct

    gsub('a*|b|c', '.', 'abc', perl=TRUE)
    # ".b.c.", wrong

    gsub('a|b*|c', '.', 'abc', perl=TRUE)
    # "..c.", wrong (but now only 'c' remains)

    gsub('a|b*|c', '.', 'aba', perl=TRUE)
    # "...", incidentally correct
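
the difference between the two conventions can be modelled outside r.  the python below is only a sketch, not r's or pcre's code, and both function names are invented here.  sub_retry follows the convention the perl run shows (after an empty match, retry at the same offset for a non-empty match before moving on).  sub_skip is a guess at a loop that instead copies one character after an empty match and ignores an empty match that touches the previous match; it is chosen only because it reproduces the outputs reported above, so the real cause may differ.  slicing the input is a simplification that changes the meaning of ^ and of lookbehinds in the pattern.

```python
import re

def sub_retry(pattern, repl, text):
    """Global substitution, perl-style handling of empty matches."""
    anywhere = re.compile(pattern)
    # same pattern, but the trailing lookbehind rejects an empty match
    # when it is applied at the start of a slice
    nonempty_here = re.compile(r'(?:%s)(?<=[\s\S])' % pattern)
    out, pos, retry = [], 0, False
    while True:
        rest = text[pos:]
        if retry:
            retry = False
            m = nonempty_here.match(rest)    # anchored, must consume input
            if m is None:
                if not rest:
                    break
                out.append(rest[0])          # nothing here: copy one char
                pos += 1
                continue
        else:
            m = anywhere.search(rest)
            if m is None:
                out.append(rest)
                break
        out.append(rest[:m.start()])
        out.append(repl)
        pos += m.end()
        retry = (m.end() == m.start())       # empty match: retry same offset
    return ''.join(out)

def sub_skip(pattern, repl, text):
    """Hypothetical loop without the retry step (an assumption, see above)."""
    rx = re.compile(pattern)
    out, pos, last_end = [], 0, -1
    while True:
        m = rx.search(text, pos)
        if m is None:
            out.append(text[pos:])
            break
        out.append(text[pos:m.start()])
        if m.end() > last_end:               # ignore an empty match that
            out.append(repl)                 # touches the previous match
            last_end = m.end()
        pos = m.end()
        if pos >= len(text):
            break
        if m.end() == m.start():             # empty match: copy one char
            out.append(text[pos])
            pos += 1
    return ''.join(out)

for pat, s in [('a*|b', 'ab'), ('a*|b|c', 'abc'),
               ('a|b*|c', 'abc'), ('a|b*|c', 'aba')]:
    print(pat, s, sub_retry(pat, '.', s), sub_skip(pat, '.', s))
```

for the four inputs listed, sub_skip gives ".b.", ".b.c.", "..c." and "...", the same strings as the perl=TRUE calls above, while sub_retry gives "...." for the first one, as perl does.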


without detailed analysis of the code, i guess the bug is located
somewhere in src/main/pcre.c, and is distributed among the do_p*
functions, so that multiple fixes may be needed.

vQ
