Stefan Th. Gries
2006-Aug-16 07:17 UTC
[R] Regular expressions: retrieving matches depending on intervening strings
Dear all
I again have a regular expression question. I have this character vector a:
a <- c("<w AT0>a <w NN1>blockage <w CJC>and <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and<c PUN>, <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.")
I would like to retrieve those elements of a in which "<w CJC>" and "<w DT0>" are either
- directly adjacent (no other tag between them), as in a[1], or
- separated only by material that contains no "<[wc] " tag, as in a[2]
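A minimal sketch of how this selection criterion might be tested, assuming each string of a sits on a single line and that "not interrupted" means no "<w " or "<c " tag opens between the two tags. The names sel_pat and keep are illustrative only and not part of the original question.

```r
a <- c("<w AT0>a <w NN1>blockage <w CJC>and <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and<c PUN>, <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.")

# "(?!<[wc] )." accepts one character only if no "<w " or "<c " tag
# starts at that position, so the gap between the two tags cannot
# contain another word or punctuation tag (assumed reading of the rule).
sel_pat <- "<w CJC>(?:(?!<[wc] ).)*<w DT0>"

keep <- grepl(sel_pat, a, perl = TRUE)
print(keep)   # TRUE for a[1] and a[2] only
```

Under these assumptions a[keep] would hold just the first two elements.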
And of these elements, I would like to match all characters from the
"<" in "<w CJC>" up to, but not including, the first "<" that follows
"<w DT0>". For example, if I were searching only a[1], I would use
something like this:
matches <- gregexpr("<w CJC>[^<]+?<w DT0>[^<]+", a[1], perl=TRUE)
# attr(), not attributes(), retrieves a single named attribute
starts <- unlist(matches)
substr(a[1], starts, starts + attr(matches[[1]], "match.length") - 1)
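A minimal sketch of how the extraction might be extended to the whole vector with negative lookahead, offered as one possible reading of the requirement rather than a definitive solution. It assumes each string of a sits on a single line, and it uses regmatches() from base R instead of the substr() arithmetic; the names pat and hits are illustrative only.

```r
a <- c("<w AT0>a <w NN1>blockage <w CJC>and <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and<c PUN>, <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.")

# After "<w CJC>", consume characters one at a time, but only while no
# "<w " or "<c " tag begins at the current position; then require
# "<w DT0>" followed by the run of non-"<" characters after it.
pat <- "<w CJC>(?:(?!<[wc] ).)*<w DT0>[^<]+"

# gregexpr() finds every match per element; regmatches() pulls the
# matched substrings out, giving character(0) where nothing matched.
hits <- regmatches(a, gregexpr(pat, a, perl = TRUE))
print(unlist(hits))
```

With the four example strings this should yield "<w CJC>and <w DT0>that" for a[1] and "<w CJC>and <ptr target=KB2LC003><w DT0>that" for a[2], and nothing for a[3] and a[4].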
I have been fiddling around with negative lookahead but I really can't get
my head around this. Any pointers would be greatly appreciated. Thanks a lot,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries