Hi all,
I would like to detect all strings in the vector 'content' that
contain the strings from the vector 'search'. Here is a code example:
content <- data.frame(urls=c(
"http://www.google.com/search?source=ig&hl=en&rlz=&=&q=stuff&aq=f&aqi=g10&aql=&oq=&gs_rfai=CrrIS3",
"http://search.yahoo.com/search;_ylt=Atvki9MVpnxuEcPmXLEWgMqbvZx4?p=stuff&toggle=1")
)
search <- data.frame(signatures=c("http://www.google.com/search"))
subset(content, search$signatures %in% content$urls)
I am not getting an error, but the result is empty:
[1] urls
<0 rows> (or 0-length row.names)
What I would like to achieve is the return of
"http://www.google.com/search?source=ig&hl=en&rlz=&=&q=stuff&aq=f&aqi=g10&aql=&oq=&gs_rfai=CrrIS3".
Is that possible? In practice I would like to run this over 1000s of
strings in 'content' and 100s of strings in 'search'. Could I run into
performance issues with this approach and, if so, are there better
ways?
Best,
Ralf
On Jul 13, 2010, at 8:22 AM, Ralf B wrote:
> I would like to detect all strings in the vector 'content' that
> contain the strings from the vector 'search'. [...]

Well, %in% checks whether an element is a member of a set; it is not a
substring operator. To get the result you want, try

    content[grepl(search$signatures, content$urls),]

For multiple search strings you could try

    sapply(search$signatures, grepl, x=content$urls)

Nikhil Kaza
Asst. Professor, City and Regional Planning
University of North Carolina
nikhil.list at gmail.com
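To make the multi-signature case concrete, here is a minimal sketch of how the grepl() suggestion could be extended to return every URL that contains at least one signature. It assumes the columns are plain character vectors (hence stringsAsFactors = FALSE) and that the signatures should be matched literally (hence fixed = TRUE); the second signature is hypothetical and only there to show more than one search string.

```r
# Data from the question; stringsAsFactors = FALSE keeps the columns as character.
content <- data.frame(urls = c(
  "http://www.google.com/search?source=ig&hl=en&rlz=&=&q=stuff&aq=f&aqi=g10&aql=&oq=&gs_rfai=CrrIS3",
  "http://search.yahoo.com/search;_ylt=Atvki9MVpnxuEcPmXLEWgMqbvZx4?p=stuff&toggle=1"
), stringsAsFactors = FALSE)

search <- data.frame(signatures = c(
  "http://www.google.com/search",
  "http://www.bing.com/search"  # hypothetical extra signature, not in the original post
), stringsAsFactors = FALSE)

# One logical vector per signature: TRUE where the URL contains that signature.
# fixed = TRUE treats the signature as a literal substring, not a regular expression.
per_signature <- lapply(search$signatures, function(sig) {
  grepl(sig, content$urls, fixed = TRUE)
})

# Element-wise OR across signatures: TRUE where a URL matches at least one.
matched <- Reduce(`|`, per_signature)

# drop = FALSE keeps the result a data frame even though there is only one column.
result <- content[matched, , drop = FALSE]
print(result$urls)
```

This does one vectorised grepl() call per signature, so a few hundred signatures against a few thousand URLs means a few hundred calls. I have not timed it, so treat the performance side as something to measure on your own data.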
Ralf B wrote:
> I would like to detect all strings in the vector 'content' that
> contain the strings from the vector 'search'. [...]

The high-level concept you need is called "Regular Expressions". R
supports these through several functions; see ?regex.
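One caveat when treating URLs as regular expressions: characters such as "." and "?" have special meaning in a pattern. The sketch below, which uses a made-up look-alike URL purely for illustration, compares a plain regex match, a literal match with fixed = TRUE, and a regex with escaped dots and a "^" anchor.

```r
urls <- c("http://www.google.com/search?q=stuff",
          "http://wwwXgoogle.com/search?q=stuff")  # hypothetical look-alike URL

# As a regular expression, "." matches any character, so both URLs match.
regex_hits <- grepl("www.google.com", urls)

# With fixed = TRUE the pattern is a literal substring; only the first URL matches.
fixed_hits <- grepl("www.google.com", urls, fixed = TRUE)

# Escaped dots plus a "^" anchor: the match must begin at the start of the string.
anchored_hits <- grepl("^http://www\\.google\\.com/search", urls)

print(regex_hits)
print(fixed_hits)
print(anchored_hits)
```

If the signatures are always literal prefixes or substrings, fixed = TRUE is probably the simpler choice; a real regular expression is only needed when the signatures contain wildcards or anchors.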