Dear list!

> gregexpr("a+(b+)", "abcdaabbc")
[[1]]
[1] 1 5
attr(,"match.length")
[1] 2 4

What I want is the offsets of the matches for the group (b+), i.e. 2 and 7, not the offsets of the complete matches. Is there a way in R to get that?

I know about gsubfn and strapply, but they only give me the strings matched by groups, not their offsets.

I could write something myself that first takes the above matches ("ab" and "aabb") and then searches again using only the group (b+). For this to work, I'd have to parse the regular expression and search several times (> 2, for nested groups) instead of just once. But I'm sure there is a better way to do this.

Thanks for any suggestion!

  Titus
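The two-pass workaround described in this message can be sketched as follows. This is only an illustration under stated assumptions: the inner pattern "b+" is typed in by hand rather than parsed out of the full expression, there is a single non-nested group, and the string contains at least one match. The variable names are made up for the example.

```r
text <- "abcdaabbc"

# Pass 1: locate the complete matches of the full pattern.
full <- gregexpr("a+(b+)", text)[[1]]
full_start <- as.integer(full)
full_len <- attr(full, "match.length")

# Extract the matched substrings ("ab" and "aabb").
pieces <- substring(text, full_start, full_start + full_len - 1)

# Pass 2: search each piece with the group's own pattern
# (assumption: the group is written out by hand as "b+").
inner <- regexpr("b+", pieces)

# Convert offsets inside each piece back to offsets in the original string.
group_start <- full_start + as.integer(inner) - 1L
group_len <- as.integer(attr(inner, "match.length"))

print(group_start)
print(group_len)
```

Searching a piece again with the group alone is not equivalent to the original match in general (the group pattern may match earlier inside the piece than the group actually did), which is one reason this approach does not scale to arbitrary or nested groups.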
Try this:

> x <- gregexpr("a+(b+)", "abcdaabbcaaacaaab")
> justA <- gregexpr("a+", "abcdaabbcaaacaaab")
> # find matches in 'x' for 'justA'
> indx <- which(justA[[1]] %in% x[[1]])
> # now determine where 'b' starts
> justA[[1]][indx] + attr(justA[[1]], 'match.length')[indx]
[1]  2  7 17

-- 
Jim Holtman
Cincinnati, OH

What is the problem that you are trying to solve?
Have you tried:

gregexpr("b+", "abcdaabbc")

-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O
On Mon, Sep 27, 2010 at 7:16 PM, Henrique Dallazuanna <wwwhsd at gmail.com> wrote:
> gregexpr("b+", "abcdaabbc")

But this would also match the third occurrence of b+ in "abcdaabbcbb". In this example I'm only interested in b+ if it's preceded by a+.

  Titus
Try this zero-width negative lookahead expression:

> gregexpr("(?!a+)(b+)", "abcdaabbc", perl = TRUE)
[[1]]
[1] 2 7
attr(,"match.length")
[1] 1 2

See ?regexp for more info.

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
On Mon, Sep 27, 2010 at 7:29 PM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
>> gregexpr("(?!a+)(b+)", "abcdaabbc", perl = TRUE)
> [[1]]
> [1] 2 7
> attr(,"match.length")
> [1] 1 2

Thanks Gabor, but this gives me the same result as

  gregexpr("b+", "abcdaabbc", perl = TRUE)

which is wrong if the string is "abcdaabbcbbb".

  Titus
On Mon, Sep 27, 2010 at 1:34 PM, Titus von der Malsburg <malsburg at gmail.com> wrote:
> Thanks Gabor, but this gives me the same result as
>
>   gregexpr("b+", "abcdaabbc", perl = TRUE)
>
> which is wrong if the string is "abcdaabbcbbb".

Sorry, try this:

> gregexpr("(?<=a)b+", "abcdaabbcbbb", perl = TRUE)
[[1]]
[1] 2 7
attr(,"match.length")
[1] 1 2

Note that it does not give the same answer as:

> gregexpr("b+", "abcdaabbcbbb", perl = TRUE)
[[1]]
[1]  2  7 10
attr(,"match.length")
[1] 1 2 3

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
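For comparison, the three patterns discussed in the thread can be run side by side on the same string. The sketch below only restates what the messages show; the variable names are made up for the example. The lookahead (?!a+) asserts that the text starting at the current position is not a run of a's, which is always true where a b begins, so it filters nothing. The lookbehind (?<=a) asserts that the character just before the match is an a, which is the condition actually wanted.

```r
text <- "abcdaabbcbbb"

# Plain b+ finds every run of b's, including the unwanted one at offset 10.
plain <- as.integer(gregexpr("b+", text, perl = TRUE)[[1]])

# Negative lookahead: checks what FOLLOWS the current position, so it
# behaves like plain b+ here.
lookahead <- as.integer(gregexpr("(?!a+)(b+)", text, perl = TRUE)[[1]])

# Positive lookbehind: checks what PRECEDES the match, so only runs of b's
# directly after an a are kept.
lookbehind <- as.integer(gregexpr("(?<=a)b+", text, perl = TRUE)[[1]])

print(plain)
print(lookahead)
print(lookbehind)
```

The lookbehind works here because "preceded by a+" reduces to "preceded by a single a". PCRE lookbehind assertions have traditionally required a fixed-length pattern, so this trick may not carry over to groups whose required prefix has variable length.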
What Titus wants to do is akin to retrieving capturing groups from a Matcher object in Java.

Some time ago I also thought there must be an existing, elegant solution to this and searched for one, including looking at the sources (albeit with not much expertise), but came up blank. I also looked at the stringr package (which is nice), but it doesn't quite do it either.

Michael
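For readers on a more recent version of R than the one current when this thread was written, the following is a minimal sketch of reading group offsets directly from the match object. It assumes that gregexpr() with perl = TRUE attaches "capture.start" and "capture.length" attributes (one row per match, one column per group), as described in ?regexpr; the variable names are made up for the example.

```r
text <- "abcdaabbcbbb"

# One search with the full pattern; perl = TRUE is needed for the
# capture.* attributes (assumption: an R version that provides them).
m <- gregexpr("a+(b+)", text, perl = TRUE)[[1]]

# Column 1 corresponds to the first (and here only) capturing group.
group_start <- as.integer(attr(m, "capture.start")[, 1])
group_len <- as.integer(attr(m, "capture.length")[, 1])

print(group_start)
print(group_len)
```

Because the offsets come from the same match that found a+(b+), the trailing "bbb" that is not preceded by an a is excluded without a second search or a lookbehind.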