rthompso at aecom.yu.edu
2008-Dec-12 17:05 UTC
[Rd] gregexpr - match overlap mishandled (PR#13391)
Full_Name: Reid Thompson
Version: 2.8.0 RC (2008-10-12 r46696)
OS: darwin9.5.0
Submission from: (NULL) (129.98.107.177)

The gregexpr() function does NOT return a complete list of global matches as it should. This occurs when a pattern matches two overlapping portions of a string: only the first match is returned.

The following function call demonstrates this error (although this is not how I initially discovered the problem):

gregexpr("11221122", paste(rep("1122", 10), collapse=""))

Instead of returning 9 matches as one would expect, only 5 matches are returned . . .

[[1]]
[1]  1  9 17 25 33
attr(,"match.length")
[1] 8 8 8 8 8

You will note, essentially, that the entire first match is then excluded from subsequent matching.
Prof Brian Ripley
2008-Dec-12 19:23 UTC
[Rd] gregexpr - match overlap mishandled (PR#13391)
Please do your own homework: the help page says

     For 'gregexpr' a list of the same length as 'text' each element of
     which is an integer vector as in 'regexpr', except that the
     starting positions of every (disjoint) match are given.
                                  ^^^^^^^^

If that is still not clear enough for you, please ask your supervisor for remedial help.

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
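To make the "(disjoint)" wording concrete, here is a minimal sketch of the scan the help page describes: find a match, skip past all of its characters, and search again in what remains. The loop and the names `starts` and `offset` are illustrative assumptions, not gregexpr()'s actual implementation.

```r
x <- paste(rep("1122", 10), collapse = "")
pattern <- "11221122"

starts <- integer(0)   # start positions found so far
offset <- 0L           # number of characters already consumed

repeat {
  # Search only the part of the string that has not been consumed yet.
  m <- regexpr(pattern, substring(x, offset + 1L))
  if (as.integer(m) == -1L) break
  # Translate the position back into the coordinates of the full string.
  starts <- c(starts, offset + as.integer(m))
  # Consume everything up to and including the end of this match.
  offset <- offset + as.integer(m) + attr(m, "match.length") - 1L
}

starts
```

Because each 8-character match is consumed whole, the next search cannot begin inside it, which is why the occurrences starting at 5, 13, 21 and 29 are not reported.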
Greg.Snow at imail.org
2008-Dec-12 21:35 UTC
[Rd] gregexpr - match overlap mishandled (PR#13391)
Where do you get "should" and "expect" from? All the regular expression tools that I am familiar with only match non-overlapping patterns unless you do extra to specify otherwise. One of the standard references for regular expressions if you really want to understand what is going on is "Mastering Regular Expressions" by Jeffrey Friedl. You should really read through that book before passing judgment on the correctness of an implementation.

If you want the overlaps, you need to come up with a regular expression that will match without consuming all of the string. Here is one way to do it with your example:

> gregexpr("1122(?=1122)", paste(rep("1122", 10), collapse=""), perl=TRUE)
[[1]]
[1]  1  5  9 13 17 21 25 29 33
attr(,"match.length")
[1] 4 4 4 4 4 4 4 4 4

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111
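The lookahead idea can be taken one step further by wrapping the whole pattern in a zero-width assertion, so that nothing is consumed at all. The sketch below is only an illustration under two assumptions: the pattern is a plain literal with no regular expression metacharacters, and the names `disjoint`, `overlapping` and `by_window` are made up for this example. The reported match.length is 0 in this form, so the length has to be taken from the pattern itself.

```r
x <- paste(rep("1122", 10), collapse = "")
pattern <- "11221122"

# Default behaviour: each match consumes its characters, so only
# disjoint matches are reported.
disjoint <- as.integer(gregexpr(pattern, x)[[1]])

# Zero-width lookahead around the whole pattern: the engine tests for the
# pattern at each position without consuming it, so overlapping
# occurrences are reported too.
overlapping <- as.integer(
  gregexpr(paste0("(?=", pattern, ")"), x, perl = TRUE)[[1]]
)

# Cross-check without regular expressions: compare every window of the
# pattern's length against the literal pattern.
n <- nchar(pattern)
candidates <- seq_len(nchar(x) - n + 1L)
by_window <- candidates[substring(x, candidates, candidates + n - 1L) == pattern]

disjoint
overlapping
identical(overlapping, by_window)
```

The overlapping starts should agree with the nine positions in the output above, while the default call gives the five disjoint ones from the original report.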
Gabor Grothendieck
2008-Dec-12 22:58 UTC
[Rd] gregexpr - match overlap mishandled (PR#13391)
On Fri, Dec 12, 2008 at 4:35 PM, <Greg.Snow at imail.org> wrote:
> do extra to specify otherwise. One of the standard references for regular
> expressions if you really want to understand what is going on is "Mastering
> Regular Expressions" by Jeffrey Friedl. You should really read through th

There are also regular expression links in the Links box of the gsubfn home page:

http://gsubfn.googlecode.com