rthompso at aecom.yu.edu
2008-Dec-12 17:05 UTC
[Rd] gregexpr - match overlap mishandled (PR#13391)
Full_Name: Reid Thompson
Version: 2.8.0 RC (2008-10-12 r46696)
OS: darwin9.5.0
Submission from: (NULL) (129.98.107.177)
the gregexpr() function does NOT return a complete list of global matches as it
should. this occurs when a pattern matches two overlapping portions of a
string, only the first match is returned.
the following function call demonstrates this error (although this is not how I
initially discovered the problem):
gregexpr("11221122", paste(rep("1122", 10), collapse=""))
instead of returning 9 matches as one would expect, only 5 matches are returned
. . .
[[1]]
[1] 1 9 17 25 33
attr(,"match.length")
[1] 8 8 8 8 8
you will note, essentially, that the entire first match is then excluded from
subsequent matching
Prof Brian Ripley
2008-Dec-12 19:23 UTC
[Rd] gregexpr - match overlap mishandled (PR#13391)
Please do your own homework: the help page says
     For 'gregexpr' a list of the same length as 'text' each element of
     which is an integer vector as in 'regexpr', except that the
     starting positions of every (disjoint) match are given.
                                  ^^^^^^^^
If that is still not clear enough for you, please ask your supervisor for
remedial help.
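For what it's worth, the "disjoint" wording can be illustrated with a rough sketch in plain R. The helper below, scan_starts(), is hypothetical: it is not part of base R and is not how gregexpr() is implemented. It assumes a non-empty fixed-string pattern and simply repeats regexpr(), resuming either just past the previous match (disjoint) or one character after its start (overlapping):

```r
# Hypothetical helper for illustration only; assumes a non-empty fixed string.
scan_starts <- function(pattern, text, overlap = FALSE) {
  starts <- integer(0)
  offset <- 0L                      # characters of `text` already skipped
  repeat {
    pos <- regexpr(pattern, substring(text, offset + 1L), fixed = TRUE)
    if (pos == -1L) break           # no further match in the remainder
    start <- offset + as.integer(pos)
    starts <- c(starts, start)
    # Disjoint: resume after the end of this match.
    # Overlapping: resume one character after its start.
    offset <- if (overlap) start else start + attr(pos, "match.length") - 1L
  }
  starts
}

x <- paste(rep("1122", 10), collapse = "")
scan_starts("11221122", x)                  # 1 9 17 25 33
scan_starts("11221122", x, overlap = TRUE)  # 1 5 9 13 17 21 25 29 33
```

The first call reproduces the five starting positions from the report; the second gives the nine that the submitter was expecting.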
On Fri, 12 Dec 2008, rthompso at aecom.yu.edu wrote:
> Full_Name: Reid Thompson
> Version: 2.8.0 RC (2008-10-12 r46696)
> OS: darwin9.5.0
> Submission from: (NULL) (129.98.107.177)
>
>
> the gregexpr() function does NOT return a complete list of global matches as it
> should. this occurs when a pattern matches two overlapping portions of a
> string, only the first match is returned.
>
> the following function call demonstrates this error (although this is not how I
> initially discovered the problem):
> gregexpr("11221122", paste(rep("1122", 10), collapse=""))
>
> instead of returning 9 matches as one would expect, only 5 matches are returned
> . . .
>
> [[1]]
> [1] 1 9 17 25 33
> attr(,"match.length")
> [1] 8 8 8 8 8
>
> you will note, essentially, that the entire first match is then excluded from
> subsequent matching
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
Greg.Snow at imail.org
2008-Dec-12 21:35 UTC
[Rd] gregexpr - match overlap mishandled (PR#13391)
Where do you get "should" and "expect" from? All the
regular expression tools that I am familiar with match only non-overlapping
patterns unless you do something extra to specify otherwise. If you really want
to understand what is going on, one of the standard references for regular
expressions is "Mastering Regular Expressions" by Jeffrey Friedl. You
should really read through that book before passing judgment on the correctness
of an implementation.
If you want the overlaps, you need to come up with a regular expression that
will match without consuming all of the string. Here is one way to do it with
your example:
> gregexpr("1122(?=1122)", paste(rep("1122", 10), collapse=""), perl=TRUE)
[[1]]
[1] 1 5 9 13 17 21 25 29 33
attr(,"match.length")
[1] 4 4 4 4 4 4 4 4 4
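A variation on the same idea, offered as a sketch rather than a recommendation: putting the whole pattern inside the lookahead makes each match zero-width, so a start position is reported for every overlapping occurrence and the matched text can be recovered with substring(). The variable names are illustrative only, and the paste0() construction assumes the pattern contains no regular expression metacharacters:

```r
x   <- paste(rep("1122", 10), collapse = "")
pat <- "11221122"   # assumed to contain no regex metacharacters

# Zero-width lookahead: asserts that `pat` begins here but consumes nothing,
# so the next search resumes one character later.
m <- gregexpr(paste0("(?=", pat, ")"), x, perl = TRUE)[[1]]

starts <- as.integer(m)                                  # drop match attributes
hits   <- substring(x, starts, starts + nchar(pat) - 1L) # full-length matches

print(starts)
print(hits)
```

Here the reported match.length is 0 rather than 4, so the length of each occurrence has to come from nchar(pat).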
--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111
Gabor Grothendieck
2008-Dec-12 22:58 UTC
[Rd] gregexpr - match overlap mishandled (PR#13391)
On Fri, Dec 12, 2008 at 4:35 PM, <Greg.Snow at imail.org> wrote:
> do extra to specify otherwise. One of the standard references for regular
> expressions if you really want to understand what is going on is "Mastering
> Regular Expressions" by Jeffrey Friedl. You should really read through th
There are also regular expression links in the Links box of the gsubfn home page:
http://gsubfn.googlecode.com