Bert Gunter
2015-Jul-09 17:52 UTC
[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe
Yup, that does it. Let grep figure out what's a word rather than doing
it manually. Forgot about "\b".

Cheers,
Bert

Bert Gunter

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
   -- Clifford Stoll

On Thu, Jul 9, 2015 at 10:30 AM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
> Just add a word break marker before and after:
>
> zz$v5 <- grepl( paste0( "\\b(", paste0( alarm.words, collapse="|" ), ")\\b" ), do.call( paste, zz[ , 2:3 ] ) )
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>                                       Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> ---------------------------------------------------------------------------
> Sent from my phone. Please excuse my brevity.
>
> On July 9, 2015 10:12:23 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>> Jeff:
>>
>> Well, it would be much better (no loops!) except, I think, for one
>> issue: "red" would match "barred" and I don't think that this is what
>> is wanted: the matches should be on whole "words" not just string
>> patterns.
>>
>> So you would need to fix up the matching pattern to make this work,
>> but it may be a little tricky, as arbitrary whitespace characters,
>> e.g. " " or "\n" etc. could be in the strings to be matched separating
>> the words or ending the "sentence." I'm sure it can be done, but I'll
>> leave it to you or others to figure it out.
>>
>> Of course, if my diagnosis is wrong or silly, please point this out.
>>
>> Cheers,
>> Bert
>>
>> On Thu, Jul 9, 2015 at 9:34 AM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
>>> I think grep is better suited to this:
>>>
>>> zz$v5 <- grepl( paste0( alarm.words, collapse="|" ), do.call( paste, zz[ , 2:3 ] ) )
>>>
>>> Sent from my phone. Please excuse my brevity.
>>>
>>> On July 9, 2015 8:51:10 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>>> Here's a way to do it that uses %in% (i.e. match() ) and uses only a
>>>> single, not a double, loop. It should be more efficient.
>>>>
>>>> > sapply(strsplit(do.call(paste,zz[,2:3]),"[[:space:]]+"),
>>>> +        function(x)any(x %in% alarm.words))
>>>>
>>>>  [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
>>>>
>>>> The idea is to paste the strings in each row (do.call allows an
>>>> arbitrary number of columns) into a single string and then use
>>>> strsplit to break the string into individual "words" on whitespace.
>>>> Then the matching is vectorized with the any( %in% ... ) call.
>>>>
>>>> Cheers,
>>>> Bert
>>>>
>>>> On Thu, Jul 9, 2015 at 6:05 AM, John Fox <jfox at mcmaster.ca> wrote:
>>>>> Dear Chris,
>>>>>
>>>>> If I understand correctly what you want, how about the following?
>>>>>
>>>>> > rows <- apply(zz[, 2:3], 1, function(x) any(sapply(alarm.words, grepl, x=x)))
>>>>> > zz[rows, ]
>>>>>
>>>>>           v1                              v2                v3 v4
>>>>> 3  -1.022329                    green turtle    ronald weasley  2
>>>>> 6   0.336599              waffle the hamster        red sparks  1
>>>>> 9  -1.631874 yellow giraffe with a long neck gandalf the white  1
>>>>> 10  1.130622                      black bear  gandalf the grey  2
>>>>>
>>>>> I hope this helps,
>>>>> John
>>>>>
>>>>> ------------------------------------------------
>>>>> John Fox, Professor
>>>>> McMaster University
>>>>> Hamilton, Ontario, Canada
>>>>> http://socserv.mcmaster.ca/jfox/
>>>>>
>>>>> On Wed, 08 Jul 2015 22:23:37 -0400
>>>>> "Christopher W. Ryan" <cryan at binghamton.edu> wrote:
>>>>>> Running R 3.1.1 on windows 7
>>>>>>
>>>>>> I want to identify as a case any record in a dataframe that contains any
>>>>>> of several keywords in any of several variables.
>>>>>>
>>>>>> Example:
>>>>>>
>>>>>> # create a dataframe with 4 variables and 10 records
>>>>>> v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
>>>>>>         "big black dog", "waffle the hamster", "benny likes food a lot",
>>>>>>         "hello world", "yellow giraffe with a long neck", "black bear")
>>>>>> v3 <- c("harry potter", "hermione grainger", "ronald weasley",
>>>>>>         "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
>>>>>>         "white dress robes", "gandalf the white", "gandalf the grey")
>>>>>> zz <- data.frame(v1=rnorm(10), v2=v2, v3=v3, v4=rpois(10, lambda=2),
>>>>>>                  stringsAsFactors=FALSE)
>>>>>> str(zz)
>>>>>> zz
>>>>>>
>>>>>> # here are the keywords
>>>>>> alarm.words <- c("red", "green", "turtle", "gandalf")
>>>>>>
>>>>>> # For each row/record, I want to test whether the string in v2 or the
>>>>>> # string in v3 contains any of the strings in alarm.words. And then if so,
>>>>>> # set zz$v5=TRUE for that record.
>>>>>>
>>>>>> # I'm thinking the str_detect function in the stringr package ought to
>>>>>> # be able to help, perhaps with some use of apply over the rows, but I
>>>>>> # obviously misunderstand something about how str_detect works
>>>>>>
>>>>>> library(stringr)
>>>>>>
>>>>>> str_detect(zz[,2:3], alarm.words)     # error: the target of the search
>>>>>>                                       # must be a vector, not multiple
>>>>>>                                       # columns
>>>>>>
>>>>>> str_detect(zz[1:4,2:3], alarm.words)  # same error
>>>>>>
>>>>>> str_detect(zz[,2], alarm.words)       # error, length of alarm.words
>>>>>>                                       # is less than the number of
>>>>>>                                       # rows I am using for the
>>>>>>                                       # comparison
>>>>>>
>>>>>> str_detect(zz[1:4,2], alarm.words)    # works as hoped when
>>>>>> length(alarm.words)                   # confining nrows
>>>>>>                                       # to the length of alarm.words
>>>>>>
>>>>>> str_detect(zz, alarm.words)           # obviously not right
>>>>>>
>>>>>> # maybe I need apply() ?
>>>>>> my.f <- function(x){str_detect(x, alarm.words)}
>>>>>>
>>>>>> apply(zz[,2], 1, my.f)     # again, a mismatch in lengths
>>>>>>                            # between alarm.words and that
>>>>>>                            # in which I am searching for
>>>>>>                            # matching strings
>>>>>>
>>>>>> apply(zz, 2, my.f)         # now I'm getting somewhere
>>>>>> apply(zz[1:4,], 2, my.f)   # but still only works with 4
>>>>>>                            # rows of the dataframe
>>>>>>
>>>>>> # perhaps %in% could do the job?
>>>>>>
>>>>>> Appreciate any advice.
>>>>>>
>>>>>> --Chris Ryan
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
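For anyone following along, here is a minimal sketch of what the word break markers change, run on a small made-up vector rather than Chris's data. The names txt, pat_plain and pat_word are illustrative only and do not appear in the thread; the sketch assumes R's default regex engine, where "\\b" matches at the edge of a word.

```r
# Toy strings: "barred" and "evergreen" contain alarm words only as substrings.
alarm.words <- c("red", "green", "turtle", "gandalf")
txt <- c("barred owl", "red sparks", "evergreen tree", "gandalf the grey")

# Plain alternation: matches the keywords anywhere, including inside other words.
pat_plain <- paste0(alarm.words, collapse = "|")
plain_hits <- grepl(pat_plain, txt)

# Same alternation wrapped in \\b( ... )\\b: matches whole words only.
pat_word <- paste0("\\b(", pat_plain, ")\\b")
word_hits <- grepl(pat_word, txt)

# Side-by-side comparison of the two patterns.
print(data.frame(txt, plain_hits, word_hits))
```

With the plain pattern all four strings are flagged; with the word-break version only "red sparks" and "gandalf the grey" are, which is the behaviour Bert was asking for.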
Christopher W Ryan
2015-Jul-09 18:48 UTC
[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe
Thanks everyone. John's original solution worked great. And with
27,000 records, 65 alarm.words, and 6 columns to search, it takes only
about 15 seconds. That is certainly adequate for my needs. But I
will try out the other strategies too.

And thanks also for lots of new R things to learn: grep, grepl,
do.call . . . that's always a bonus!

--Chris Ryan
John Fox
2015-Jul-09 19:24 UTC
[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe
Dear Christopher,

My usual orientation to this kind of one-off problem is that I'm looking
for a simple correct solution. Computing time is usually much smaller than
programming time.

That said, Bert Gunter's solution was about 5 times faster than mine in a
simple check that I ran with microbenchmark, and Jeff Newmiller's solution
was about 10 times faster. Both Bert's and Jeff's (eventual) solutions
protect against partial (rather than full-word) matches, while mine doesn't
(though it could easily be modified to do that).

Best,
John
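As a rough sketch of the modification John mentions, one option is to wrap each keyword in word break markers before passing it to grepl. This is one possible reading, not necessarily what John had in mind. The code below uses Chris's example strings (v1 and v4 are left out because they are not searched), and the names word.pats, fox, gunter and newmiller are made up for illustration. It also checks that the three approaches from the thread agree on this data.

```r
# Chris's example strings; only the two searched columns are kept.
v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
        "big black dog", "waffle the hamster", "benny likes food a lot",
        "hello world", "yellow giraffe with a long neck", "black bear")
v3 <- c("harry potter", "hermione grainger", "ronald weasley",
        "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
        "white dress robes", "gandalf the white", "gandalf the grey")
zz <- data.frame(v2 = v2, v3 = v3, stringsAsFactors = FALSE)
alarm.words <- c("red", "green", "turtle", "gandalf")

# John's row-wise approach, with each keyword wrapped in \\b ... \\b
# so that only whole-word matches count (assumed modification).
word.pats <- paste0("\\b", alarm.words, "\\b")
fox <- unname(apply(zz[, c("v2", "v3")], 1,
                    function(x) any(sapply(word.pats, grepl, x = x))))

# Bert's approach: paste the columns, split on whitespace, match with %in%.
gunter <- sapply(strsplit(do.call(paste, zz[, c("v2", "v3")]), "[[:space:]]+"),
                 function(x) any(x %in% alarm.words))

# Jeff's approach: a single word-bounded alternation over the pasted columns.
newmiller <- grepl(paste0("\\b(", paste0(alarm.words, collapse = "|"), ")\\b"),
                   do.call(paste, zz[, c("v2", "v3")]))

# All three should flag the same rows on this data.
stopifnot(identical(fox, gunter), identical(gunter, newmiller))
which(newmiller)
```

If the microbenchmark package is installed, the three expressions can be passed to microbenchmark::microbenchmark() to compare timings on your own data. Relative speeds will depend on the number of rows, columns and keywords.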