waku at idi.ntnu.no
2009-Mar-22 00:40 UTC
[Rd] gsub('(.).(.)(.)', '\\3\\2\\1', 'gsub') (PR#13617)
Full_Name: Wacek Kusnierczyk
Version: 2.10.0 r48181
OS: Ubuntu 8.04 Linux 32bit
Submission from: (NULL) (129.241.199.135)

there seems to be something wrong with r's regexing. consider the following example:

    gregexpr('a*|b', 'ab')
    # positions: 1 2
    # lengths: 1 1
    gsub('a*|b', '.', 'ab')
    # ..

where the pattern matches any number of 'a's or one 'b', and replaces each match with a dot, globally. the answer is correct (assuming a dfa engine). however,

    gregexpr('a*|b', 'ab', perl=TRUE)
    # positions: 1 2
    # lengths: 1 0
    gsub('a*|b', '.', 'ab', perl=TRUE)
    # .b.

where the pattern is identical, but the result is wrong. perl uses an nfa (if it used a dfa, the result would still be wrong), and in the above example it should find *four* matches, collectively including *all* letters in the input, thus producing *four* dots (and *only* dots) in the output:

    perl -le '
        $input = qq|ab|;
        print qq|match: "$_"| foreach $input =~ /a*|b/g;
        $input =~ s/a*|b/./g;
        print qq|output: "$input"|;'
    # match: "a"
    # match: ""
    # match: "b"
    # match: ""
    # output: "...."

since with perl=TRUE both gregexpr and gsub seem to use pcre, i've checked the example with pcretest, and also with a trivial c program (available on demand) using the pcre api; there were four matches, exactly as in the perl bit above.

the results above are surprising, and suggest a bug in r's use of pcre rather than in pcre itself. possibly, the issue is that when an empty string is matched (with a*, for example), the next attempt does not try to match a non-empty string at the same position, but instead tries to match an empty string again at the next position.
for example,

    gsub('a|b|c', '.', 'abc', perl=TRUE)
    # "...", correct
    gsub('a*|b|c', '.', 'abc', perl=TRUE)
    # ".b.c.", wrong
    gsub('a|b*|c', '.', 'abc', perl=TRUE)
    # "..c.", wrong (but now only 'c' remains)
    gsub('a|b*|c', '.', 'aba', perl=TRUE)
    # "...", incidentally correct

without detailed analysis of the code, i guess the bug is located somewhere in src/main/pcre.c, and is distributed among the do_p* functions, so that multiple fixes may be needed.
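to make the suspected fix concrete, here is a minimal sketch of a perl-style global matching loop: after an empty match, it first retries at the same position, requiring a non-empty match anchored there, and only steps forward one character if that retry fails. this is an illustration under stated assumptions, not r's actual code: python's re module stands in for pcre, the "non-empty, anchored" retry is emulated by matching against a slice with a trailing (?<!\A), and the names search, global_spans and substitute are hypothetical helpers.

```python
import re


def search(pattern, text, pos, nonempty_anchored=False):
    # Return (start, end) of the next match at or after pos, or None.
    # Assumption: Python's re stands in for PCRE here.
    # Caveat: slicing hides the text before pos from anchors and lookbehinds
    # inside the pattern itself; that is fine for the simple patterns below.
    rest = text[pos:]
    if nonempty_anchored:
        # re.match anchors at the start of the slice; the trailing (?<!\A)
        # rejects an empty match and lets the engine backtrack into the
        # other alternatives of the pattern.
        m = re.match(r"(?:%s)(?<!\A)" % pattern, rest)
    else:
        m = re.search(pattern, rest)
    return None if m is None else (pos + m.start(), pos + m.end())


def global_spans(pattern, text):
    # Collect all matches the way a perl-style /g loop would.
    spans, pos, after_empty = [], 0, False
    while pos <= len(text):
        span = search(pattern, text, pos, nonempty_anchored=after_empty)
        if span is None:
            if not after_empty:
                break            # ordinary search failed: no more matches
            pos += 1             # nothing non-empty here: step one character
            after_empty = False
            continue
        spans.append(span)
        pos = span[1]
        after_empty = span[0] == span[1]
    return spans


def substitute(pattern, repl, text):
    # Replace every span found by global_spans with a literal replacement.
    out, last = [], 0
    for start, end in global_spans(pattern, text):
        out.append(text[last:start])
        out.append(repl)
        last = end
    out.append(text[last:])
    return "".join(out)


if __name__ == "__main__":
    print([("ab")[s:e] for s, e in global_spans("a*|b", "ab")])  # ['a', '', 'b', '']
    print(substitute("a*|b", ".", "ab"))     # ....
    print(substitute("a|b|c", ".", "abc"))   # ...
```

on 'ab' with the pattern a*|b this loop finds the same four matches as the perl one-liner above and produces four dots. the key step is the retry after an empty match: without it, the 'b' at that position is never offered to the second alternative.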