Full_Name: Peter Dolan
Version: 2.5.1
OS: Windows
Submission from: (NULL) (128.193.227.43)

gregexpr does not find all matching substrings if the substrings overlap:

> gregexpr("abab","ababab")
[[1]]
[1] 1
attr(,"match.length")
[1] 4

It does work correctly in Version 2.3.1 under linux.
If you want all the matches (including overlaps), then you could try one of these:

> gregexpr("(?=abab)","ababab",perl=TRUE)
[[1]]
[1] 1 3
attr(,"match.length")
[1] 0 0

> gregexpr("ab(?=ab)","ababab",perl=TRUE)
[[1]]
[1] 1 3
attr(,"match.length")
[1] 2 2

The book "Mastering Regular Expressions" by Jeffrey Friedl has a lot of detail on the hows and whys of regular expression matching.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111

> -----Original Message-----
> From: r-devel-bounces at r-project.org
> [mailto:r-devel-bounces at r-project.org] On Behalf Of
> dolanp at science.oregonstate.edu
> Sent: Wednesday, October 10, 2007 8:36 AM
> To: r-devel at stat.math.ethz.ch
> Cc: R-bugs at biostat.ku.dk
> Subject: [Rd] gregexpr (PR#9965)
>
> Full_Name: Peter Dolan
> Version: 2.5.1
> OS: Windows
> Submission from: (NULL) (128.193.227.43)
>
> gregexpr does not find all matching substrings if the
> substrings overlap:
>
> > gregexpr("abab","ababab")
> [[1]]
> [1] 1
> attr(,"match.length")
> [1] 4
>
> It does work correctly in Version 2.3.1 under linux.
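The zero-width lookahead suggested above can be wrapped in a small helper. This is only a sketch: the name overlapping_starts is hypothetical (it is not part of base R), and it assumes the pattern is a valid Perl-style regular expression that can be placed inside "(?=...)" unchanged.

```r
# Hypothetical helper (not in base R): start positions of every match of
# `pattern` in `text`, overlapping matches included.
overlapping_starts <- function(pattern, text) {
  # Wrap the pattern in a zero-width lookahead so that a match consumes no
  # characters and the search resumes one character further on.
  rx <- paste0("(?=", pattern, ")")
  hits <- gregexpr(rx, text, perl = TRUE)[[1]]
  # gregexpr() reports -1 when nothing matches; return an empty vector instead.
  if (hits[1] == -1L) integer(0) else as.integer(hits)
}

overlapping_starts("abab", "ababab")  # 1 3, as in the lookahead example above
overlapping_starts("abab", "xyz")     # integer(0)
```

Because the lookahead matches have length zero, the length of each match has to be recovered separately if it is needed (for a fixed string it is simply nchar(pattern)).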
This was a deliberate change for R 2.4.0, with SVN log:

r38145 | rgentlem | 2006-05-20 23:58:14 +0100 (Sat, 20 May 2006) | 2 lines

fixing gregexpr infelicity

So it seems the author of gregexpr believed that the bug was in 2.3.1, not 2.5.1.

On Wed, 10 Oct 2007, dolanp at science.oregonstate.edu wrote:

> Full_Name: Peter Dolan
> Version: 2.5.1
> OS: Windows
> Submission from: (NULL) (128.193.227.43)
>
> gregexpr does not find all matching substrings if the substrings overlap:
>
>> gregexpr("abab","ababab")
> [[1]]
> [1] 1
> attr(,"match.length")
> [1] 4
>
> It does work correctly in Version 2.3.1 under linux.

'correctly' is a matter of definition, I believe: this could be considered to be vaguely worded in the help.

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
Yes, we had originally wanted it to find all matches, but user complaints that it did not perform as Perl does were taken to prevail. There are different ways to do this, but it seems the notion that one should not start looking for the next match until after the previous one is more common.

I did consciously decide not to have a switch, and instead we wrote something that does what we wanted it to do and put it in the Biostrings package (from Bioconductor) as gregexpr2 (sorry, but only fixed = TRUE is supported, since that is all we needed).

best wishes
Robert

Prof Brian Ripley wrote:
> This was a deliberate change for R 2.4.0 with SVN log:
>
> r38145 | rgentlem | 2006-05-20 23:58:14 +0100 (Sat, 20 May 2006) | 2 lines
> fixing gregexpr infelicity
>
> So it seems the author of gregexpr believed that the bug was in 2.3.1, not
> 2.5.1.

--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org
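For readers without Biostrings installed, the "find all matches" behaviour for a fixed string can be approximated in base R. The sketch below is not the Biostrings implementation; the function name overlapping_fixed is hypothetical, and it assumes a single text string and a single literal (non-regex) pattern.

```r
# Hypothetical base-R sketch (not the Biostrings code): start positions of all
# occurrences of a literal pattern, overlapping occurrences included.
overlapping_fixed <- function(pattern, text) {
  n <- nchar(text)
  m <- nchar(pattern)
  # An empty pattern, or one longer than the text, cannot match.
  if (m == 0L || m > n) return(integer(0))
  # Every position at which a window of width m still fits in the text.
  starts <- seq_len(n - m + 1L)
  # Extract each window and keep the starts whose window equals the pattern.
  windows <- substring(text, starts, starts + m - 1L)
  starts[windows == pattern]
}

overlapping_fixed("abab", "ababab")  # 1 3
```

This checks every start position, so its cost grows with nchar(text) times nchar(pattern); that is acceptable for short strings but is likely to be slow on long sequences.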