thr3ads.net - R help - [R] grepping and splitting (with R 2.1.1) [Sep 2005]

Stefan Th. Gries

2005-Sep-12 14:24 UTC

[R] grepping and splitting (with R 2.1.1)

Hi R experts

I have the following regular expression problem. I am writing a basic corpus
retrieval program, i.e. a concordancer/function where a user enters
- a set or a directory of text files to search;
- a regular expression to search for in these files.

I want to provide an output in which the matches of the regular expression are
listed in one central column and the neighboring columns given the words before
and after the matching word. For example, a concordance of the word
"the" for the previous sentence with a user-defined span of 3 would
look like this:
-3	-2	-1	0	1	2	3
output	in	which	the	matches	of	the
the	matches	of	the	regular	expression	are
central	column	and	the	neighboring	columns	given
neighboring	columns	given	the	words	before	and
before	and	after	the	matching	word	.
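
For plain, untagged text, the display above can be sketched roughly as follows.
This is only an illustration, not my actual program: the helper name kwic, its
arguments, and the assumption that tokens are separated by single spaces (with
the final period as its own token) are all made up for the example.

```r
# Hypothetical helper: one row per hit, columns -span..span.
kwic <- function(line, pattern, span = 3) {
  tokens <- strsplit(line, " +")[[1]]          # assumption: space-separated tokens
  hits <- grep(pattern, tokens, perl = TRUE)   # positions of matching tokens
  out <- matrix("", nrow = length(hits), ncol = 2 * span + 1,
                dimnames = list(NULL, as.character(-span:span)))
  for (r in seq_along(hits)) {
    idx <- hits[r] + (-span:span)              # token positions for this row
    ok <- idx >= 1 & idx <= length(tokens)     # cells outside the line stay empty
    out[r, ok] <- tokens[idx[ok]]
  }
  out
}

sentence <- paste("I want to provide an output in which the matches of the",
                  "regular expression are listed in one central column and",
                  "the neighboring columns given the words before and after",
                  "the matching word .")
res <- kwic(sentence, "^the$", span = 3)
# Write the rows as tab-separated columns.
write.table(res, sep = "\t", quote = FALSE, row.names = FALSE)
```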

As you can see, there may be multiple hits per line. This all works perfectly
well as long as the regular expression matches exactly one of the elements
that are separated into columns in the table. Unfortunately, apart from
'normal' text files, I also have text files in which every word is
preceded by a tag giving its word class, for example

a<-c("<w TO0>to <w VV1>find <w VVN>expected <w TO0>to <w VV2>skivvy <w DT0>much <c PUN>.",
     "<w VVN>seen <w TO0>to <w VV3>kill <w DT0>many")

Now, as long as the regular expression entered by the user is something like
   b<-"<w TO0>to"
or even
   b<-"(?Ui)<w VVN>[^<]*<"
this works fine: I identify hits using grep(b, a, perl=T), split up the line
using strsplit, and provide as many words before and after my search string as
are necessary (and available in the line).
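
A minimal sketch of that simple case, for the first pattern above. The variable
names (hit_lines, tokens, out) are only illustrative, and because the split is
on "<w " alone, the punctuation tag "<c PUN>." stays attached to the token
before it.

```r
a <- c("<w TO0>to <w VV1>find <w VVN>expected <w TO0>to <w VV2>skivvy <w DT0>much <c PUN>.",
       "<w VVN>seen <w TO0>to <w VV3>kill <w DT0>many")
b <- "<w TO0>to"
span <- 3

hit_lines <- grep(b, a, perl = TRUE)     # lines containing at least one hit
line <- a[hit_lines[1]]

parts <- strsplit(line, "<w ")[[1]]      # the separator itself is dropped
parts <- parts[parts != ""]              # discard the empty leading piece
tokens <- paste("<w ", sub(" +$", "", parts), sep = "")  # put the separator back

hits <- grep(b, tokens, perl = TRUE)     # token positions matching the pattern
out <- matrix("", nrow = length(hits), ncol = 2 * span + 1,
              dimnames = list(NULL, as.character(-span:span)))
for (r in seq_along(hits)) {
  idx <- hits[r] + (-span:span)
  ok <- idx >= 1 & idx <= length(tokens) # only positions available in the line
  out[r, ok] <- tokens[idx[ok]]
}
```

This only works because each hit is exactly one token after the split.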

But if the regular expression entered by a user (when prompted by scan(nmax=1,
what="char")) is
   b<-"(?Ui)(<w TO0>to <w VV.>[^<]*<)"
I run into several related problems. grep only tells me which lines contain a
hit (which is how I identified the lines in the first place), and regexpr only
gives me the first hit in each line, but for the desired output I need all the
hits per line together with their context. Obviously, when I split up the line
using strsplit with "<w " as the separator, so that I can get all hits and all
words for the columns -3 to -1 and 1 to 3, the expression matched by the search
string b is split up as well and can no longer be put into one tab-separated
central column. I also don't seem to be able to extract all hits, store them,
and insert them again at a later stage. Basically, I need to split the element
of the vector containing at least one match into x parts, where x is the number
of hits plus the number of elements obtained when the surrounding material is
split up, so that I can generate the following kind of display (I leave aside
the issue of spaces for now and transpose the above kind of display for
expository reasons):

(the first hit in a[1])
-3	
-2	
-1	
0	<w TO0>to <w VV1>find
1	<w VVN>expected
2	<w TO0>to
3	<w VV2>skivvy

and the next line of the output would be the second hit in a[1]:

-3	<w TO0>to
-2	<w VV1>find
-1	<w VVN>expected
0	<w TO0>to <w VV2>skivvy
1	<w DT0>much
2	<c PUN>.
3	

and the next line would be the only hit in a[2]. The short question after this
long intro is: is there any way of splitting up the elements containing
matches in this way?
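
To make the intended result concrete, here is a rough sketch of one possible
approach. It assumes a newer R than 2.1.1, since as far as I know gregexpr and
regmatches only appeared in later releases. The helpers tokenise and concord
are hypothetical names, and the sketch further assumes that a "<" at the end of
a hit opens the next tag and therefore belongs to the right-hand context.

```r
a <- c("<w TO0>to <w VV1>find <w VVN>expected <w TO0>to <w VV2>skivvy <w DT0>much <c PUN>.",
       "<w VVN>seen <w TO0>to <w VV3>kill <w DT0>many")
b <- "(?Ui)(<w TO0>to <w VV.>[^<]*<)"

# Hypothetical helper: split tagged text into "<tag>word" tokens.
tokenise <- function(txt) {
  tok <- regmatches(txt, gregexpr("<[^>]*>[^<]*", txt, perl = TRUE))[[1]]
  sub(" +$", "", tok)                     # drop trailing spaces
}

# Hypothetical helper: one row per hit, the whole hit kept in column "0".
concord <- function(line, pattern, span = 3) {
  cols <- as.character(-span:span)
  m <- gregexpr(pattern, line, perl = TRUE)[[1]]   # all hits in the line
  if (m[1] == -1) return(matrix("", 0, length(cols), dimnames = list(NULL, cols)))
  starts <- as.integer(m)
  ends <- starts + attr(m, "match.length") - 1
  out <- matrix("", length(starts), length(cols), dimnames = list(NULL, cols))
  for (r in seq_along(starts)) {
    hit <- substr(line, starts[r], ends[r])
    # Assumption: a trailing "<" opens the next tag, so hand it back.
    if (grepl("<$", hit)) {
      ends[r] <- ends[r] - 1
      hit <- substr(line, starts[r], ends[r])
    }
    left <- tail(tokenise(substr(line, 1, starts[r] - 1)), span)
    right <- head(tokenise(substr(line, ends[r] + 1, nchar(line))), span)
    out[r, span + 1] <- sub(" +$", "", hit)
    if (length(left)) out[r, (span - length(left) + 1):span] <- left
    if (length(right)) out[r, span + 1 + seq_along(right)] <- right
  }
  out
}

res <- do.call(rbind, lapply(a, concord, pattern = b))
```

The idea is to locate every hit by its character position first and to split
only the material to the left and right of it, so the hit itself is never
broken up by the tokenisation.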

I use R 2.1.1 on a Windows XP Pro SP2 machine (with Perl 5.8.7 in case that
matters for PCRE). Thanks,
STG

