Stefan Th. Gries
2006-Aug-16 09:12 UTC
[R] Regular expressions: retrieving matches depending on intervening strings [Follow-up]
Dear all,

This is a follow-up to an earlier posting today regarding a regular expression question. In the meantime, this is the best approximation I could come up with; it should give you a better idea of what I am talking about.

a<-c("<w AT0>a <w NN1>blockage <w CJC>and <w DT0>that<c PUN>.",
"<w AT0>a <w NN1>blockage <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
"<w AT0>a <w NN1>blockage <w CJC>and<c PUN>, <w DT0>that<c PUN>.",
"<w AT0>a <w NN1>blockage <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.")
matches<-gregexpr("<w CJC>[^<]+(?:<[^wc].*?>.*?)*<w DT0>that", a, perl=TRUE)
starts<-unlist(matches) # start positions (-1 where there is no match)
lengths<-unlist(lapply(matches, attr, "match.length")) # pick out only the match lengths
stops<-starts+lengths-1
substr(a, starts, stops)

What is still missing is that the disallowed string is not just "<[wc]" but "<[wc] " (with a trailing space), and I don't know how to do that. Any ideas (preferably with lookarounds)?

Thanks a bunch,
STG

--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------

ORIGINAL MESSAGE
> Dear all
>
> I again have a regular expression question. I have this character vector a:
>
> a<-c("<w AT0>a <w NN1>blockage <w CJC>and <w DT0>that<c PUN>.",
> "<w AT0>a <w NN1>blockage <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
> "<w AT0>a <w NN1>blockage <w CJC>and<c PUN>, <w DT0>that<c PUN>.",
> "<w AT0>a <w NN1>blockage <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.")
>
> I would like to retrieve those elements of a in which "<w CJC>" and "<w DT0>" are
>
> - directly adjacent, as in a[1] or
> - not interrupted by "<[wc] ", as in a[2]
>
> And, of these elements I would like to consume all characters from the "<" in "<w CJC" to the last character after "<w DT0>" that is not a "<".
> For example, if I were only searching a[1], I would like something like this:
>
> matches<-gregexpr("<w CJC>[^<]+?<w DT0>[^<]+", a[1], perl=TRUE)
> substr(a[1], unlist(matches), unlist(matches)+attr(matches[[1]], "match.length")-1)
>
> I have been fiddling around with negative lookahead but I really can't get my head around this. Any pointers would be greatly appreciated.
>
> Thanks a lot,
> STG
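
One possible way to express the constraint from the message with a negative lookahead is sketched below. It is only a sketch, not a tested general solution: the pattern and the variable names `pattern`, `m`, and `hits` are illustrative assumptions and do not come from the thread. The idea is to consume one character at a time after "<w CJC>", but only while the current position does not start "<w " or "<c ", and then to require "<w DT0>" followed by its non-"<" characters. `regmatches()` is used in place of the manual `substr()` arithmetic; it is part of base R from version 2.14.0 onward.

```r
a <- c("<w AT0>a <w NN1>blockage <w CJC>and <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and<c PUN>, <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.")

# Illustrative pattern (an assumption, not from the original thread):
#   <w CJC>            literal opening tag
#   (?:(?!<[wc] ).)*   any characters, as long as none of them begins "<w " or "<c "
#   <w DT0>[^<]+       the closing tag plus everything up to the next "<"
pattern <- "<w CJC>(?:(?!<[wc] ).)*<w DT0>[^<]+"

m <- gregexpr(pattern, a, perl = TRUE)

# regmatches() returns a list with one character vector per element of a;
# elements without a match yield character(0).
hits <- regmatches(a, m)
print(hits)
```

With this pattern, a[1] and a[2] should each yield one match, while a[3] (interrupted by "<c PUN>") and a[4] (interrupted by "<w AJ0>") should yield none. Other tags that merely begin with "w" or "c" but are not followed by a space would still be allowed between the two tags.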