Hi all,
I'm trying to write a function that will search and extract from a long
character string, but with a twist: I want to use the characters before and
the characters after what I want to extract as reference points. For
example, say I'm working with data entries that look like this:
Drink=Coffee:Location=Office:Time=Morning:Market=Flat
Drink=Water:Location=Office:Time=Afternoon:Market=Up
Drink=Water:Location=Gym:Time=Evening:Market=Closed
Drink=Wine:Location=Restaurant:Time=LateEvening:Market=Closed
...
For my function, I'd like to find what's located between "Location=" and
":Time=" in every instance, and extract it, to return something like
"Office, Office, Gym, Restaurant".
In a previous discussion I found
(http://tolstoy.newcastle.edu.au/R/help/05/03/0344.html), someone wrote a
function where you could find and substitute characters in a string, based
on "pre" and "post" variables:
interp <- function(x, e = parent.frame(), pre = "\\$", post = "") {
    for (el in ls(e)) {
        tag <- paste(pre, el, post, sep = "")
        if (length(grep(tag, x))) x <- gsub(tag, eval(parse(text = el), e), x)
    }
    x
}
I'm not sure how to modify it, however, to do what I want it to do. Any
suggestions?
Thanks in advance,
Andrew
On Jul 15, 2010, at 9:48 AM, AndrewPage wrote:
> For my function, I'd like to find what's located between "Location=" and
> ":Time=" in every instance, and extract it, to return something like
> "Office, Office, Gym, Restaurant".

> Vec
[1] "Drink=Coffee:Location=Office:Time=Morning:Market=Flat"
[2] "Drink=Water:Location=Office:Time=Afternoon:Market=Up"
[3] "Drink=Water:Location=Gym:Time=Evening:Market=Closed"
[4] "Drink=Wine:Location=Restaurant:Time=LateEvening:Market=Closed"

> gsub(".*Location=(.+):Time=.*", "\\1", Vec)
[1] "Office"     "Office"     "Gym"        "Restaurant"

This returns the back reference within the parens, found between the two
bounding sets of characters.

HTH,

Marc Schwartz
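Since the original question asked for a function driven by "pre" and "post" reference points, the gsub() idea above could be wrapped in a small helper. The following is only a sketch: the name extract_between and its arguments are hypothetical and do not come from this thread. It treats both markers as literal text (fixed = TRUE) rather than as regular expressions, and it returns NA when either marker is missing, whereas gsub() hands back the input string unchanged when the pattern does not match.

```r
# Hypothetical helper (not from the thread): return the text found between
# the literal markers `pre` and `post` in each element of `x`.
extract_between <- function(x, pre, post) {
  vapply(x, function(s) {
    # Position of the first occurrence of `pre`, or -1 if absent
    start <- as.integer(regexpr(pre, s, fixed = TRUE))
    if (start == -1L) return(NA_character_)
    # Everything after the end of `pre`
    rest <- substring(s, start + nchar(pre))
    # Position of the first occurrence of `post` in the remainder
    end <- as.integer(regexpr(post, rest, fixed = TRUE))
    if (end == -1L) return(NA_character_)
    substring(rest, 1, end - 1)
  }, character(1), USE.NAMES = FALSE)
}

entries <- c("Drink=Coffee:Location=Office:Time=Morning:Market=Flat",
             "Drink=Water:Location=Office:Time=Afternoon:Market=Up",
             "Drink=Water:Location=Gym:Time=Evening:Market=Closed",
             "Drink=Wine:Location=Restaurant:Time=LateEvening:Market=Closed")

locations <- extract_between(entries, "Location=", ":Time=")
print(locations)
```

Because the helper stops at the first occurrence of `post` after `pre`, it should not over-match the way a greedy `.+` can if the closing marker happens to appear more than once in an entry.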
On Thu, Jul 15, 2010 at 10:48 AM, AndrewPage <savejarvis at yahoo.com> wrote:
> For my function, I'd like to find what's located between "Location=" and
> ":Time=" in every instance, and extract it, to return something like
> "Office, Office, Gym, Restaurant".

The strapply function in gsubfn can do that. By default it returns the
back reference, i.e. the part of the regular expression between parentheses:

> s <- c("Drink=Coffee:Location=Office:Time=Morning:Market=Flat",
+        "Drink=Water:Location=Office:Time=Afternoon:Market=Up",
+        "Drink=Water:Location=Gym:Time=Evening:Market=Closed",
+        "Drink=Wine:Location=Restaurant:Time=LateEvening:Market=Closed")
>
> library(gsubfn)
> strapply(s, "Location=(.*):Time", simplify = TRUE)
[1] "Office"     "Office"     "Gym"        "Restaurant"
>
> # since we know that the field we want is composed of
> # word characters and followed by a non-word character
> # we can even avoid specifying :Time by specifying
> # word characters (\\w+) instead:
>
> strapply(s, "Location=(\\w+)", simplify = TRUE)
[1] "Office"     "Office"     "Gym"        "Restaurant"

See http://gsubfn.googlecode.com for more.
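For anyone who would rather not load a package, roughly the same result as the second strapply() call can probably be had in base R. This is only a sketch, assuming (as above) that the wanted field consists of word characters; the variable names are illustrative. It uses a Perl-style lookbehind so that only the text following "Location=" is part of the match:

```r
s <- c("Drink=Coffee:Location=Office:Time=Morning:Market=Flat",
       "Drink=Water:Location=Office:Time=Afternoon:Market=Up",
       "Drink=Water:Location=Gym:Time=Evening:Market=Closed",
       "Drink=Wine:Location=Restaurant:Time=LateEvening:Market=Closed")

# (?<=Location=) is a lookbehind: it requires "Location=" immediately before
# the match without including it; \\w+ then matches the word characters.
m <- regexpr("(?<=Location=)\\w+", s, perl = TRUE)
locations <- regmatches(s, m)
print(locations)

# Collapse into a single comma-separated string, as in the original request
joined <- paste(locations, collapse = ", ")
print(joined)
```

One caveat: regmatches() drops elements that have no match, so the result can be shorter than the input if some entries lack a "Location=" field.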