dietmar.schindler at manroland-web.com
2017-Apr-04 08:45 UTC
[Rd] Bug report: POSIX regular expression doesn't match for somewhat higher values of upper bound
Dear Sirs,

while

> regexpr('(.{1,2})\\1', 'foo')
[1] 2
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE

yields the correct match, an incremented upper bound in

> regexpr('(.{1,3})\\1', 'foo')
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE

incorrectly yields no match.

R versions tested:
  2.11.1 on i486-pc-linux-gnu
  2.15.1 on x86_64-pc-linux-gnu
  3.2.1 on i386-w64-mingw32
  3.2.1 on x86_64-w64-mingw32
  3.3.3 on x86_64-w64-mingw32

--
Best regards,
Dietmar Schindler
________________________________
manroland web systems GmbH -- Managing Director: Alexander Wassermann
Registered Office: Augsburg -- Trade Register: AG Augsburg -- HRB-No.: 26816 -- VAT: DE281389840

Confidentiality note:
This eMail and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you are not the intended recipient, you are hereby notified that any use or dissemination of this communication is strictly prohibited. If you have received this eMail in error, then please delete this eMail.
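The match expected in the report can be checked without any regex engine: '(.{1,n})\\1' should match wherever some substring of 1 to n characters is immediately followed by a copy of itself. A minimal sketch of that check, assuming plain single-byte text; the helper name first_repeat is hypothetical and not part of R:

```r
# Hypothetical helper (not part of R): find the leftmost position at which a
# substring of length 1..n is immediately followed by a copy of itself,
# i.e. where '(.{1,n})\\1' is expected to match.
# Returns c(start, match.length), or c(-1, -1) if no such position exists.
first_repeat <- function(x, n) {
  len <- nchar(x)
  for (start in seq_len(len)) {
    # the group can use at most half of the remaining characters;
    # longer groups are tried first, mirroring greedy matching with backtracking
    kmax <- min(n, (len - start + 1) %/% 2)
    for (k in rev(seq_len(kmax))) {
      a <- substr(x, start, start + k - 1)
      b <- substr(x, start + k, start + 2 * k - 1)
      if (identical(a, b)) return(c(start, 2 * k))
    }
  }
  c(-1, -1)
}

first_repeat("foo", 2)
first_repeat("foo", 3)
```

Both calls find the repeated "o" starting at position 2 with length 2, so raising the upper bound from 2 to 3 should not change the result for 'foo'.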
Martin Maechler
2017-Apr-05 09:15 UTC
[Rd] Bug report: POSIX regular expression doesn't match for somewhat higher values of upper bound
>>>>> <dietmar.schindler at manroland-web.com>
>>>>>     on Tue, 4 Apr 2017 08:45:30 +0000 writes:

    > Dear Sirs,
    > while

    >> regexpr('(.{1,2})\\1', 'foo')
    > [1] 2
    > attr(,"match.length")
    > [1] 2
    > attr(,"useBytes")
    > [1] TRUE

    > yields the correct match, an incremented upper bound in

    >> regexpr('(.{1,3})\\1', 'foo')
    > [1] -1
    > attr(,"match.length")
    > [1] -1
    > attr(,"useBytes")
    > [1] TRUE

    > incorrectly yields no match.

Hmm, yes, I would also say that this is incorrect
(though I'm always cautious: The ?regex help page explicitly
mentions greedy repetitions, and these can "bite you" ..)

The behavior is also different from the perl=TRUE one which is
correct (according to the above understanding).

Using grep() instead of regexpr() makes the behavior easier to parse.
The following code

----------------------------------------------------------------------

tx <- c("ab","abc", paste0("foo", c("", "b", "o", "bar", "oofy")))
setNames(nchar(tx), tx)
##     ab     abc     foo    foob    fooo  foobar foooofy
##      2       3       3       4       4       6       7

grep1r <- function(n, txt, ...) {
    pattern <- paste0('(.{1,',n,'})\\1', collapse="") ## can have empty n
    ans <- grep(pattern, txt, value=TRUE, ...)
    cat(sprintf("pattern '%s' : ", pattern)); print(ans, quote=FALSE)
    invisible(ans)
}

grep1r({}, tx)# '.{1,}' : because of _greedy_ matching there is __no__ repetition!
grep1r(100,tx)# i.e., these both give an empty match : character(0)

## matching at most once:
grep1r(1, tx)# matches all 5 starting with "foo"
grep1r(2, tx)# ditto : all have more than 2 chars
grep1r(3, tx)# not "foo": those with more than 3 chars
grep1r(4, tx)# .. those with more than 4 characters
grep1r(5, tx)# .. those with more than 5 characters
grep1r(6, tx)# .. those with more than 6 characters
grep1r(7, tx)# NONE (= those with more than 7 characters)

for(p in c(FALSE,TRUE)) {
    cat("\ngrep(*, perl =", p, ") :\n")
    for(n in c(list(NULL), 1:7))
        grep1r(n, tx, perl = p)
}

----------------------------------------------------------------------

ends with

> for(p in c(FALSE,TRUE)) {
+     cat("\ngrep(*, perl =", p, ") :\n")
+     for(n in c(list(NULL), 1:7))
+         grep1r(n, tx, perl = p)
+ }

grep(*, perl = FALSE ) :
pattern '(.{1,})\1' : character(0)
pattern '(.{1,1})\1' : [1] foo foob fooo foobar foooofy
pattern '(.{1,2})\1' : [1] foo foob fooo foobar foooofy
pattern '(.{1,3})\1' : [1] foob fooo foobar foooofy
pattern '(.{1,4})\1' : [1] foobar foooofy
pattern '(.{1,5})\1' : [1] foobar foooofy
pattern '(.{1,6})\1' : [1] foooofy
pattern '(.{1,7})\1' : character(0)

grep(*, perl = TRUE ) :
pattern '(.{1,})\1' : [1] foo foob fooo foobar foooofy
pattern '(.{1,1})\1' : [1] foo foob fooo foobar foooofy
pattern '(.{1,2})\1' : [1] foo foob fooo foobar foooofy
pattern '(.{1,3})\1' : [1] foo foob fooo foobar foooofy
pattern '(.{1,4})\1' : [1] foo foob fooo foobar foooofy
pattern '(.{1,5})\1' : [1] foo foob fooo foobar foooofy
pattern '(.{1,6})\1' : [1] foo foob fooo foobar foooofy
pattern '(.{1,7})\1' : [1] foo foob fooo foobar foooofy
>
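The same comparison can be condensed into two logical matrices, which makes the disagreements between the default engine and perl=TRUE easy to locate. This is only a sketch: the helper name match_table is hypothetical, and the perl=FALSE results may depend on the R version and its bundled regex library.

```r
tx <- c("ab", "abc", paste0("foo", c("", "b", "o", "bar", "oofy")))

# Hypothetical helper: logical matrix with one row per string and one
# column per upper bound n, TRUE where '(.{1,n})\\1' matches.
# Extra arguments (e.g. perl = TRUE) are passed on to grepl().
match_table <- function(txt, ns = 1:7, ...) {
  res <- sapply(ns, function(n)
    grepl(sprintf("(.{1,%d})\\1", n), txt, ...))
  dimnames(res) <- list(txt, paste0("n=", ns))
  res
}

pcre <- match_table(tx, perl = TRUE)   # PCRE
tre  <- match_table(tx, perl = FALSE)  # default (non-Perl) engine

# row/column indices of the (string, n) pairs on which the engines disagree
which(pcre != tre, arr.ind = TRUE)
```

With perl=TRUE every string starting with "foo" matches for every bound, while "ab" and "abc" never do, since neither contains an adjacent repeated substring.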
dietmar.schindler at manroland-web.com
2017-Apr-11 06:16 UTC
[Rd] Bug report: POSIX regular expression doesn't match for somewhat higher values of upper bound
> From: Martin Maechler [mailto:maechler at stat.math.ethz.ch]
> Sent: Wednesday, 5 April 2017 11:15
>
> >>>>> <dietmar.schindler at manroland-web.com>
> >>>>>     on Tue, 4 Apr 2017 08:45:30 +0000 writes:
>
>     > Dear Sirs,
>     > while
>
>     >> regexpr('(.{1,2})\\1', 'foo')
>     > [1] 2
>     > attr(,"match.length")
>     > [1] 2
>     > attr(,"useBytes")
>     > [1] TRUE
>
>     > yields the correct match, an incremented upper bound in
>
>     >> regexpr('(.{1,3})\\1', 'foo')
>     > [1] -1
>     > attr(,"match.length")
>     > [1] -1
>     > attr(,"useBytes")
>     > [1] TRUE
>
>     > incorrectly yields no match.
>
> Hmm, yes, I would also say that this is incorrect
> (though I'm always cautious: The ?regex help page explicitly
> mentions greedy repetitions, and these can "bite you" ..)
>
> The behavior is also different from the perl=TRUE one which is
> correct (according to the above understanding).
>
> ...

Shouldn't this be submitted on R's Bugzilla then (which I as a non-member can't)?

--
Best regards,
Dietmar Schindler