Stefan Th. Gries
2008-Nov-30 19:33 UTC
[R] Regex: workaround for variable length negative lookbehind
Hi all

I have the following regular expression problem: I want to find complete elements of a vector that end in a repeated character but where the repetition doesn't make up the whole word. That is, for the vector vec:

    vec<-c("aaaa", "baaa", "bbaa", "bbba", "baamm", "aa")

I would like to get

    "baaa"
    "bbaa"
    "baamm"

From tools where negative lookbehind can involve variable lengths, one would think this would work:

    grep("(?<!(?:\\1|^))(.)\\1{1,}$", vec, perl=T)

But then R doesn't like it that much ... I also know I can get it like this:

    whole.word.rep <- grep("^(.)\\1{1,}$", vec, perl=T) # 1 6
    rep.at.end <- grep("(.)\\1{1,}$", vec, perl=T) # 1 2 3 5 6
    setdiff(rep.at.end, whole.word.rep) # 2 3 5

But is there a one-line grep thingy to do this?

Thx for any pointers,
STG
Stefan Evert
2008-Nov-30 19:59 UTC
[R] Regex: workaround for variable length negative lookbehind
Hi Stefan! :-)

> From tools where negative lookbehind can involve variable lengths, one
> would think this would work:
>
>     grep("(?<!(?:\\1|^))(.)\\1{1,}$", vec, perl=T)
>
> But then R doesn't like it that much ...

It's really the PCRE library that doesn't like your regexp, not R. The problem is that negative lookbehind is only possible with a fixed-length expression, and since \1 may hold an arbitrary string, the PCRE library can't be sure it's just a single character. I'm also surprised that you're allowed to use \1 before defining it.

> But is there a one-line grep thingy to do this?

Can't think of a one-liner, but a three-line solution you can easily enough wrap in a small function:

    vec<-c("aaaa", "baaa", "bbaa", "bbba", "baamm", "aa")
    idx.1 <- grep("(.)\\1$", vec)
    idx.2 <- grep("^(.)\\1*$", vec)
    vec[setdiff(idx.1, idx.2)]

Cheers,
Stefan

-- 
The wonders of Googleology (episode 1)
"from collectibles to cars"
    84,700,000 -- Google
     9,443,672 -- Google N-grams (Web 1T5)
             1 -- ukWaC
[ stefan.evert at uos.de | http://purl.org/stefan.evert ]
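A minimal sketch of the "small function" mentioned above. The function name ends_in_partial_repeat is hypothetical (it does not appear in the thread), and perl = TRUE is added only to match the original poster's calls.

```r
# Hypothetical helper (name not from the thread): return the elements of x
# that end in a repeated character without consisting of that character only.
ends_in_partial_repeat <- function(x) {
  # indices of elements whose last two characters are identical
  rep.at.end <- grep("(.)\\1$", x, perl = TRUE)
  # indices of elements made up of a single character throughout
  whole.word.rep <- grep("^(.)\\1*$", x, perl = TRUE)
  x[setdiff(rep.at.end, whole.word.rep)]
}

vec <- c("aaaa", "baaa", "bbaa", "bbba", "baamm", "aa")
ends_in_partial_repeat(vec)
# [1] "baaa"  "bbaa"  "baamm"
```

Keeping the two grep() calls separate avoids the lookbehind entirely, at the cost of scanning the vector twice.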
Gabor Grothendieck
2008-Nov-30 20:26 UTC
[R] Regex: workaround for variable length negative lookbehind
Try this:

    > vec <- c("aaaa", "baaa", "bbaa", "bbba", "baamm", "aa")
    > grep("^(?!(.)\\1{1,}$).*(.)\\2{1,}$", vec, perl = TRUE)
    [1] 2 3 5

The (?!...) succeeds only if the string is not all the same character, and since that consumes no characters, matching restarts at the beginning to match anything followed by repeated characters to the end.
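To illustrate the explanation above, the sketch below tests the two conditions separately and compares the result with the single lookahead pattern. The variable names not.all.same, rep.at.end and one.pattern are chosen here for illustration. Note that the group inside the negative lookahead counts as group 1, which is why the second capture is referenced as \\2.

```r
vec <- c("aaaa", "baaa", "bbaa", "bbba", "baamm", "aa")

# Condition 1: the string is NOT one character repeated from start to end
not.all.same <- !grepl("^(.)\\1{1,}$", vec, perl = TRUE)

# Condition 2: the string ends in a repeated character
rep.at.end <- grepl("(.)\\1{1,}$", vec, perl = TRUE)

# The lookahead pattern checks both conditions in a single pass
one.pattern <- grepl("^(?!(.)\\1{1,}$).*(.)\\2{1,}$", vec, perl = TRUE)

identical(one.pattern, not.all.same & rep.at.end)
# [1] TRUE
which(one.pattern)
# [1] 2 3 5
```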
Stefan Evert
2008-Nov-30 20:59 UTC
[R] Regex: workaround for variable length negative lookbehind
>> But is there a one-line grep thingy to do this?
>
> Can't think of a one-liner, but a three-line solution you can easily
> enough wrap in a small function:
>
>     vec<-c("aaaa", "baaa", "bbaa", "bbba", "baamm", "aa")
>     idx.1 <- grep("(.)\\1$", vec)
>     idx.2 <- grep("^(.)\\1*$", vec)
>     vec[setdiff(idx.1, idx.2)]

Oops, my bad, that solution was in Stefan's original mail already. I got his example mixed up with some other text and thought he was talking about something different.

Still, I think it's better to write a few lines of R code than to abuse regular expressions to do something they were never intended to do. How do other people on this list feel about that issue?

Sorry again, and next time I'll drink a cup of coffee _before_ I post!

Stefan
Wacek Kusnierczyk
2008-Dec-01 05:20 UTC
[R] Regex: workaround for variable length negative lookbehind
Gabor Grothendieck wrote:
> Try this:
>
>     > vec <- c("aaaa", "baaa", "bbaa", "bbba", "baamm", "aa")
>     > grep("^(?!(.)\\1{1,}$).*(.)\\2{1,}$", vec, perl = TRUE)

or even

    grep("^(?!(.)\\1+$).*(.)\\2+$", vec, perl = TRUE)

vQ
Gabor Grothendieck
2008-Dec-01 11:18 UTC
[R] Regex: workaround for variable length negative lookbehind
On Mon, Dec 1, 2008 at 12:20 AM, Wacek Kusnierczyk <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
> Gabor Grothendieck wrote:
>> Try this:
>>
>>     > vec <- c("aaaa", "baaa", "bbaa", "bbba", "baamm", "aa")
>>     > grep("^(?!(.)\\1{1,}$).*(.)\\2{1,}$", vec, perl = TRUE)
>
> or even
>
>     grep("^(?!(.)\\1+$).*(.)\\2+$", vec, perl = TRUE)

Or combining the previous simplification I posted with yours:

    > grep("^(?!(.)\\1+$).*(.)\\2$", vec, perl = TRUE)
    [1] 2 3 5
Wacek Kusnierczyk
2008-Dec-01 11:31 UTC
[R] Regex: workaround for variable length negative lookbehind
Gabor Grothendieck wrote:
> On Mon, Dec 1, 2008 at 12:20 AM, Wacek Kusnierczyk
> <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no> wrote:
>> Gabor Grothendieck wrote:
>>> Try this:
>>>
>>>     > vec <- c("aaaa", "baaa", "bbaa", "bbba", "baamm", "aa")
>>>     > grep("^(?!(.)\\1{1,}$).*(.)\\2{1,}$", vec, perl = TRUE)
>>
>> or even
>>
>>     grep("^(?!(.)\\1+$).*(.)\\2+$", vec, perl = TRUE)
>
> Or combining the previous simplification I posted with yours:
>
>     > grep("^(?!(.)\\1+$).*(.)\\2$", vec, perl = TRUE)

ha! you can even make it a tiny little bit faster:

    grep("^(?!(.)\\1+$).+(.)\\2$", vec, perl = TRUE)

;)

vQ
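As a quick sanity check, the sketch below runs the four one-line variants from this thread on the same input and confirms they select the same elements. The labels in patterns and the extra strings appended to vec are assumptions added here for illustration; no timing is measured.

```r
vec <- c("aaaa", "baaa", "bbaa", "bbba", "baamm", "aa")

# The four variants posted in the thread (labels are illustrative only)
patterns <- c(
  braces   = "^(?!(.)\\1{1,}$).*(.)\\2{1,}$",
  plus     = "^(?!(.)\\1+$).*(.)\\2+$",
  one.pair = "^(?!(.)\\1+$).*(.)\\2$",
  dot.plus = "^(?!(.)\\1+$).+(.)\\2$"
)

# Extra edge cases not in the original example (assumed test inputs)
test.vec <- c(vec, "a", "", "ab", "abb", "aab", "xyzz")

# One index vector per pattern
hits <- lapply(patterns, grep, x = test.vec, perl = TRUE)

# TRUE if every variant returns the same indices as the first one
all(sapply(hits, identical, hits[[1]]))
test.vec[hits[[1]]]
```

The last variant can use .+ instead of .* because a string consisting only of the doubled pair is already rejected by the lookahead.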