thr3ads.net - R help - [R] Regular expressions: retrieving matches depending on intervening strings [Follow-up] [Aug 2006]

Stefan Th. Gries

2006-Aug-16 09:12 UTC

[R] Regular expressions: retrieving matches depending on intervening strings [Follow-up]

Dear all

This is a follow-up to an earlier posting today regarding a regular expression
question. In the meantime, this is the best approximation I could come up with;
it should give you a better idea of what I am talking about.

a<-c("<w AT0>a <w NN1>blockage <w CJC>and <w DT0>that<c PUN>.",
     "<w AT0>a <w NN1>blockage <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
     "<w AT0>a <w NN1>blockage <w CJC>and<c PUN>, <w DT0>that<c PUN>.",
     "<w AT0>a <w NN1>blockage <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.")
matches<-gregexpr("<w CJC>[^<]+(?:<[^wc].*?>.*?)*<w DT0>that", a, perl=TRUE)
starts<-unlist(matches)
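# take only the "match.length" attribute; attributes() may return other attributes as well
lengths<-unlist(lapply(matches, attr, "match.length"))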
stops<-starts+lengths-1
substr(a, starts, stops)

What is still missing is that the string to disallow between the two tags is
not just "<[wc]" but "<[wc] " (with the trailing space), and I don't know how
to express that. Any ideas (preferably with lookarounds)?
Thanks a bunch,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------
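
One possible way to express the "<[wc] " restriction with a lookaround is sketched below. It is only a sketch under assumptions, not a confirmed answer from the thread: it assumes every tag has the form "<...>" with no ">" inside it, it places a negative lookahead right after each intervening "<", and it uses regmatches() from current base R instead of the substr() arithmetic above. The names "pattern" and "hits" are made up for illustration.

```r
# Sample vector from the post, with the wrapped lines rejoined.
a <- c("<w AT0>a <w NN1>blockage <w CJC>and <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and<c PUN>, <w DT0>that<c PUN>.",
       "<w AT0>a <w NN1>blockage <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.")

# "<w CJC>" plus its word, then any number of tags that do NOT begin
# with "<w " or "<c " (the negative lookahead "(?![wc] )"), each followed
# by optional non-tag text, then "<w DT0>" and everything up to the next "<".
pattern <- "<w CJC>[^<]+(?:<(?![wc] )[^>]*>[^<]*)*<w DT0>[^<]+"

matches <- gregexpr(pattern, a, perl = TRUE)

# regmatches() returns one character vector per element of a;
# elements without a match yield character(0).
hits <- regmatches(a, matches)

unlist(hits)                        # the matched substrings
which(sapply(hits, length) > 0)     # indices of the elements of a that matched
```

With the four sample strings, only a[1] and a[2] should match, since a[3] has "<c " and a[4] has "<w " between the two tags. Because each intervening tag is consumed with "[^>]*" and the text after it with "[^<]*", the pattern cannot skip over a disallowed tag the way ".*?" can.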


ORIGINAL MESSAGE
> Dear all
>
> I again have a regular expression question. I have this character vector a:
>
> a<-c("<w AT0>a <w NN1>blockage <w CJC>and <w DT0>that<c PUN>.",
> "<w AT0>a <w NN1>blockage <w CJC>and <ptr target=KB2LC003><w DT0>that<c PUN>.",
> "<w AT0>a <w NN1>blockage <w CJC>and<c PUN>, <w DT0>that<c PUN>.",
> "<w AT0>a <w NN1>blockage <w CJC>and <w AJ0>hungry <w DT0>that<c PUN>.")
>
> I would like to retrieve those elements of a in which "<w CJC>" and "<w DT0>" are
>
> - directly adjacent, as in a[1] or
> - not interrupted by "<[wc] ", as in a[2]
>
> And, of these elements I would like to consume all characters from the
> "<" in "<w CJC" to the last character after "<w DT0>" that is not a "<".
> For example, if I was only searching a[1], I would like something like this:
>
> matches<-gregexpr("<w CJC>[^<]+?<w DT0>[^<]+", a[1], perl=TRUE)
> substr(a[1], unlist(matches), unlist(matches)+attr(matches[[1]], "match.length")-1)
>
> I have been fiddling around with negative lookahead but I really can't
> get my head around this. Any pointers would be greatly appreciated. Thanks a lot,
> STG
