Christopher W. Ryan
2015-Jul-09 02:23 UTC
[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe
Running R 3.1.1 on Windows 7

I want to identify as a case any record in a dataframe that contains any
of several keywords in any of several variables.

Example:

# create a dataframe with 4 variables and 10 records
v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
        "big black dog", "waffle the hamster", "benny likes food a lot",
        "hello world", "yellow giraffe with a long neck", "black bear")
v3 <- c("harry potter", "hermione grainger", "ronald weasley",
        "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
        "white dress robes", "gandalf the white", "gandalf the grey")
zz <- data.frame(v1=rnorm(10), v2=v2, v3=v3, v4=rpois(10, lambda=2),
                 stringsAsFactors=FALSE)
str(zz)
zz

# here are the keywords
alarm.words <- c("red", "green", "turtle", "gandalf")

# For each row/record, I want to test whether the string in v2 or the
# string in v3 contains any of the strings in alarm.words. And then if so,
# set zz$v5=TRUE for that record.

# I'm thinking the str_detect function in the stringr package ought to
# be able to help, perhaps with some use of apply over the rows, but I
# obviously misunderstand something about how str_detect works

library(stringr)

str_detect(zz[,2:3], alarm.words)     # error: the target of the search
                                      # must be a vector, not multiple
                                      # columns

str_detect(zz[1:4,2:3], alarm.words)  # same error

str_detect(zz[,2], alarm.words)       # error, length of alarm.words
                                      # is less than the number of
                                      # rows I am using for the
                                      # comparison

str_detect(zz[1:4,2], alarm.words)    # works as hoped when
length(alarm.words)                   # confining nrows
                                      # to the length of alarm.words

str_detect(zz, alarm.words)           # obviously not right

# maybe I need apply() ?
my.f <- function(x){str_detect(x, alarm.words)}

apply(zz[,2], 1, my.f)      # again, a mismatch in lengths
                            # between alarm.words and that
                            # in which I am searching for
                            # matching strings

apply(zz, 2, my.f)          # now I'm getting somewhere
apply(zz[1:4,], 2, my.f)    # but still only works with 4
                            # rows of the dataframe

# perhaps %in% could do the job?

Appreciate any advice.

--Chris Ryan
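A note on the length mismatch above: as far as I understand the stringr documentation, str_detect() is vectorised over both the strings and the patterns, so string i is tested against pattern i only. That is a pairwise comparison, not "does any keyword occur in this string". The sketch below is not from the thread and uses base R only; mapply(grepl, ...) is my stand-in for the pairwise behaviour, and the four-element vector x is a reordered subset of v2 chosen for illustration. Collapsing the keywords into one alternation pattern tests every string against all keywords at once.

```r
alarm.words <- c("red", "green", "turtle", "gandalf")

# Illustrative strings (a reordered subset of v2 from the post)
x <- c("green turtle", "white bird", "blue bird", "quick brown fox")

# Pairwise: string i is tested against keyword i only.
# "green turtle" is compared with "red" alone, so it is missed.
pairwise <- mapply(grepl, alarm.words, x, USE.NAMES = FALSE)

# One alternation pattern, "red|green|turtle|gandalf":
# every string is tested against all keywords.
any.word <- grepl(paste(alarm.words, collapse = "|"), x)

print(pairwise)
print(any.word)
```

The same collapsed pattern could presumably be passed to str_detect() as a single pattern, since a length-one pattern is applied to every string.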
John Fox
2015-Jul-09 13:05 UTC
[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe
Dear Chris,

If I understand correctly what you want, how about the following?

> rows <- apply(zz[, 2:3], 1, function(x) any(sapply(alarm.words, grepl, x=x)))
> zz[rows, ]
          v1                              v2                v3 v4
3  -1.022329                    green turtle    ronald weasley  2
6   0.336599              waffle the hamster        red sparks  1
9  -1.631874 yellow giraffe with a long neck gandalf the white  1
10  1.130622                      black bear  gandalf the grey  2

I hope this helps,
 John

------------------------------------------------
John Fox, Professor
McMaster University
Hamilton, Ontario, Canada
http://socserv.mcmaster.ca/jfox/
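The original question asked for a flag column, zz$v5, rather than a subset. A minimal sketch of that last step, assuming the same apply()/grepl() approach as in the reply above: the numeric columns v1 and v4 are left out here so the example is deterministic, and the columns are selected by name instead of by position. Neither choice comes from the thread.

```r
v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
        "big black dog", "waffle the hamster", "benny likes food a lot",
        "hello world", "yellow giraffe with a long neck", "black bear")
v3 <- c("harry potter", "hermione grainger", "ronald weasley",
        "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
        "white dress robes", "gandalf the white", "gandalf the grey")
zz <- data.frame(v2 = v2, v3 = v3, stringsAsFactors = FALSE)
alarm.words <- c("red", "green", "turtle", "gandalf")

# For each row, test both strings against each keyword; TRUE if anything hits
hit <- apply(zz[, c("v2", "v3")], 1,
             function(x) any(sapply(alarm.words, grepl, x = x)))

# Store the flag as a plain logical column
zz$v5 <- as.vector(hit)

flagged <- which(zz$v5)
print(flagged)
```

Rows 3, 6, 9 and 10 are flagged, which agrees with the rows printed in the reply.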
Bert Gunter
2015-Jul-09 15:51 UTC
[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe
Here's a way to do it that uses %in% (i.e. match() ) and uses only a
single, not a double, loop. It should be more efficient.

> sapply(strsplit(do.call(paste,zz[,2:3]),"[[:space:]]+"),
+    function(x)any(x %in% alarm.words))
 [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE

The idea is to paste the strings in each row (do.call allows an
arbitrary number of columns) into a single string and then use
strsplit to break the string into individual "words" on whitespace.
Then the matching is vectorized with the any( %in% ... ) call.

Cheers,
Bert

Bert Gunter

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
   -- Clifford Stoll
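One difference between the two replies is worth spelling out: grepl() matches substrings, while strsplit() plus %in% matches whole whitespace-delimited words. On the data in this thread both give the same rows, but they can disagree on other data. The sketch below uses made-up strings (not from the thread) to show the difference, along with a word-boundary regular expression as a possible middle ground. The \\b pattern with perl = TRUE is my addition, not something either reply proposed.

```r
alarm.words <- c("red", "green", "turtle", "gandalf")

# Made-up strings chosen to separate the three behaviours
x <- c("red sparks", "hundred acres", "green turtle", "gandalf, the grey")

# Substring match: "hundred" contains "red", so it is flagged
substring.hit <- grepl(paste(alarm.words, collapse = "|"), x)

# Whole-word match on whitespace: "gandalf," (with the comma) is not
# identical to "gandalf", so it is missed
word.hit <- sapply(strsplit(x, "[[:space:]]+"),
                   function(w) any(w %in% alarm.words))

# Word-boundary regex: ignores "hundred" but still finds "gandalf,"
boundary.pattern <- paste0("\\b(", paste(alarm.words, collapse = "|"), ")\\b")
boundary.hit <- grepl(boundary.pattern, x, perl = TRUE)

print(rbind(substring.hit, word.hit, boundary.hit))
```

Which behaviour is right depends on whether keywords embedded in longer words, or followed by punctuation, should count as hits.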