Stefan Th. Gries
2006-Aug-16 07:17 UTC
[R] Regular expressions: retrieving matches depending on intervening strings
Dear all,

I have another regular expression question. I have this character vector a:

  a <- c("<w AT0>a <w NN1>blockage <w CJC>and <w DT0>that<c PUN>.",
         "<w AT0>a <w NN1>blockage <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
         "<w AT0>a <w NN1>blockage <w CJC>and<c PUN>, <w DT0>that<c PUN>.",
         "<w AT0>a <w NN1>blockage <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.")

I would like to retrieve those elements of a in which "<w CJC>" and "<w DT0>" are

- directly adjacent, as in a[1], or
- not interrupted by "<[wc] ", as in a[2].

Of these elements, I would like to consume all characters from the "<" in "<w CJC" to the last character after "<w DT0>" that is not a "<". For example, if I were only searching a[1], I would like something like this:

  matches <- gregexpr("<w CJC>[^<]+?<w DT0>[^<]+", a[1], perl=TRUE)
  substr(a[1], unlist(matches), unlist(matches) + attr(matches[[1]], "match.length") - 1)

I have been fiddling around with negative lookahead, but I really can't get my head around this. Any pointers would be greatly appreciated.

Thanks a lot,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
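
One possible way to express the condition described above is to apply a negative lookahead at every character between the two tags, so that the match may cross anything except the start of another "<w " or "<c " tag. The following is only a sketch under that reading of the question, not a definitive answer. The names pattern, m, hits and matched are illustrative, and regmatches() is assumed to be available (base R 2.14.0 or later).

```r
a <- c("<w AT0>a <w NN1>blockage <w CJC>and <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and<c PUN>, <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.")

# (?:(?!<[wc] ).)* consumes one character at a time, but only while the
# current position is not the start of "<w " or "<c ". Other tags such as
# "<ptr ...>" are therefore allowed between "<w CJC>" and "<w DT0>".
# [^<]+ then consumes everything after "<w DT0>" up to the next "<".
pattern <- "<w CJC>(?:(?!<[wc] ).)*<w DT0>[^<]+"

# Which elements of a contain a match at all?
matched <- grepl(pattern, a, perl = TRUE)
which(matched)
# [1] 1 2

# Extract the matched substrings; non-matching elements contribute nothing.
m <- gregexpr(pattern, a, perl = TRUE)
hits <- unlist(regmatches(a, m))
hits
# [1] "<w CJC>and <w DT0>that"
# [2] "<w CJC>and <ptr target=KB2LC003><w DT0>that"
```

The lookahead has to be inside the repeated group: a single lookahead placed directly after "<w CJC>" would only check the first following position and would not rule out a "<w " or "<c " tag further along, as in a[3] and a[4].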