Christopher W Ryan
2015-Jul-10 17:30 UTC
[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe
Interesting thoughts about the partial-word matches, and speed.

On another real data set, about 73,000 records and 6 columns to search through for matches (one column of which contains very long character strings--several paragraphs each), I ran both John's and Bert's solutions. John's was noticeably slower, although still quite tolerable. There were a different number of matches, though:

       oic.2
oic     FALSE  TRUE   Sum
  FALSE 74939     0 74939
  TRUE    274   927  1201
  Sum   75213   927 76140

where oic is the logical vector generated by John's solution, and oic.2 is the logical vector generated by Bert's solution. Bert's solution detected about 77% (927 of 1201) of the cases detected by John's.

I'm still exploring why that might be. One possible explanation, for at least part of the difference, is the issue of partial-word matches. Substantively, I am searching ambulance run records for words related to opioid overdose, and I've noticed that the medics often spell heroin as "heroine". So in this context, I like partial-word matches--I want to pick up records that (partially) match "heroin" because it is contained in the word "heroine".

There may be other things going on too.

Thanks.

--Chris

On Thu, Jul 9, 2015 at 3:24 PM, John Fox <jfox at mcmaster.ca> wrote:
> Dear Christopher,
>
> My usual orientation to this kind of one-off problem is that I'm looking for a simple correct solution. Computing time is usually much smaller than programming time.
>
> That said, Bert Gunter's solution was about 5 times faster in a simple check that I ran with microbenchmark, and Jeff Newmiller's solution was about 10 times faster. Both Bert's and Jeff's (eventual) solution protect against partial (rather than full-word) matches, while mine doesn't (though it could easily be modified to do that).
>
> Best,
>  John
>
>> -----Original Message-----
>> From: Christopher W Ryan [mailto:cryan at binghamton.edu]
>> Sent: July-09-15 2:49 PM
>> To: Bert Gunter
>> Cc: Jeff Newmiller; R Help; John Fox
>> Subject: Re: [R] detecting any element in a vector of strings, appearing
>> anywhere in any of several character variables in a dataframe
>>
>> Thanks everyone. John's original solution worked great. And with
>> 27,000 records, 65 alarm.words, and 6 columns to search, it takes only
>> about 15 seconds. That is certainly adequate for my needs. But I
>> will try out the other strategies too.
>>
>> And thanks also for lots of new R things to learn--grep, grepl,
>> do.call . . . that's always a bonus!
>>
>> --Chris Ryan
>>
>> On Thu, Jul 9, 2015 at 1:52 PM, Bert Gunter <bgunter.4567 at gmail.com>
>> wrote:
>> > Yup, that does it. Let grep figure out what's a word rather than doing
>> > it manually. Forgot about "\b"
>> >
>> > Cheers,
>> > Bert
>> >
>> > Bert Gunter
>> >
>> > "Data is not information. Information is not knowledge. And knowledge
>> > is certainly not wisdom."
>> >    -- Clifford Stoll
>> >
>> > On Thu, Jul 9, 2015 at 10:30 AM, Jeff Newmiller
>> > <jdnewmil at dcn.davis.ca.us> wrote:
>> >> Just add a word break marker before and after:
>> >>
>> >> zz$v5 <- grepl( paste0( "\\b(", paste0( alarm.words, collapse="|" ), ")\\b" ), do.call( paste, zz[ , 2:3 ] ) )
>> >> ---------------------------------------------------------------------------
>> >> Jeff Newmiller                        The     .....       .....  Go Live...
>> >> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>> >>                                       Live:   OO#.. Dead: OO#..  Playing
>> >> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> >> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> >> ---------------------------------------------------------------------------
>> >> Sent from my phone. Please excuse my brevity.
>> >>
>> >> On July 9, 2015 10:12:23 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>> >>>Jeff:
>> >>>
>> >>>Well, it would be much better (no loops!) except, I think, for one
>> >>>issue: "red" would match "barred" and I don't think that this is what
>> >>>is wanted: the matches should be on whole "words" not just string
>> >>>patterns.
>> >>>
>> >>>So you would need to fix up the matching pattern to make this work,
>> >>>but it may be a little tricky, as arbitrary whitespace characters,
>> >>>e.g. " " or "\n" etc. could be in the strings to be matched separating
>> >>>the words or ending the "sentence." I'm sure it can be done, but I'll
>> >>>leave it to you or others to figure it out.
>> >>>
>> >>>Of course, if my diagnosis is wrong or silly, please point this out.
>> >>>
>> >>>Cheers,
>> >>>Bert
>> >>>
>> >>>On Thu, Jul 9, 2015 at 9:34 AM, Jeff Newmiller
>> >>><jdnewmil at dcn.davis.ca.us> wrote:
>> >>>> I think grep is better suited to this:
>> >>>>
>> >>>> zz$v5 <- grepl( paste0( alarm.words, collapse="|" ), do.call( paste, zz[ , 2:3 ] ) )
>> >>>>
>> >>>> On July 9, 2015 8:51:10 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>> >>>>>Here's a way to do it that uses %in% (i.e.
>> >>>>>match() ) and uses only a single, not a double, loop. It should be
>> >>>>>more efficient.
>> >>>>>
>> >>>>>> sapply(strsplit(do.call(paste,zz[,2:3]),"[[:space:]]+"),
>> >>>>>+ function(x)any(x %in% alarm.words))
>> >>>>>
>> >>>>> [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
>> >>>>>
>> >>>>>The idea is to paste the strings in each row (do.call allows an
>> >>>>>arbitrary number of columns) into a single string and then use
>> >>>>>strsplit to break the string into individual "words" on whitespace.
>> >>>>>Then the matching is vectorized with the any( %in% ... ) call.
>> >>>>>
>> >>>>>Cheers,
>> >>>>>Bert
>> >>>>>
>> >>>>>On Thu, Jul 9, 2015 at 6:05 AM, John Fox <jfox at mcmaster.ca> wrote:
>> >>>>>> Dear Chris,
>> >>>>>>
>> >>>>>> If I understand correctly what you want, how about the following?
>> >>>>>>
>> >>>>>> > rows <- apply(zz[, 2:3], 1, function(x) any(sapply(alarm.words, grepl, x=x)))
>> >>>>>> > zz[rows, ]
>> >>>>>>
>> >>>>>>           v1                              v2                v3 v4
>> >>>>>> 3  -1.022329                    green turtle    ronald weasley  2
>> >>>>>> 6   0.336599              waffle the hamster        red sparks  1
>> >>>>>> 9  -1.631874 yellow giraffe with a long neck gandalf the white  1
>> >>>>>> 10  1.130622                      black bear  gandalf the grey  2
>> >>>>>>
>> >>>>>> I hope this helps,
>> >>>>>>  John
>> >>>>>>
>> >>>>>> ------------------------------------------------
>> >>>>>> John Fox, Professor
>> >>>>>> McMaster University
>> >>>>>> Hamilton, Ontario, Canada
>> >>>>>> http://socserv.mcmaster.ca/jfox/
>> >>>>>>
>> >>>>>> On Wed, 08 Jul 2015 22:23:37 -0400
>> >>>>>> "Christopher W. Ryan" <cryan at binghamton.edu> wrote:
>> >>>>>>> Running R 3.1.1 on windows 7
>> >>>>>>>
>> >>>>>>> I want to identify as a case any record in a dataframe that contains any
>> >>>>>>> of several keywords in any of several variables.
>> >>>>>>>
>> >>>>>>> Example:
>> >>>>>>>
>> >>>>>>> # create a dataframe with 4 variables and 10 records
>> >>>>>>> v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
>> >>>>>>>         "big black dog", "waffle the hamster", "benny likes food a lot",
>> >>>>>>>         "hello world", "yellow giraffe with a long neck", "black bear")
>> >>>>>>> v3 <- c("harry potter", "hermione grainger", "ronald weasley",
>> >>>>>>>         "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
>> >>>>>>>         "white dress robes", "gandalf the white", "gandalf the grey")
>> >>>>>>> zz <- data.frame(v1=rnorm(10), v2=v2, v3=v3, v4=rpois(10, lambda=2),
>> >>>>>>>                  stringsAsFactors=FALSE)
>> >>>>>>> str(zz)
>> >>>>>>> zz
>> >>>>>>>
>> >>>>>>> # here are the keywords
>> >>>>>>> alarm.words <- c("red", "green", "turtle", "gandalf")
>> >>>>>>>
>> >>>>>>> # For each row/record, I want to test whether the string in v2 or the
>> >>>>>>> # string in v3 contains any of the strings in alarm.words. And then if so,
>> >>>>>>> # set zz$v5=TRUE for that record.
>> >>>>>>>
>> >>>>>>> # I'm thinking the str_detect function in the stringr package ought to
>> >>>>>>> # be able to help, perhaps with some use of apply over the rows, but I
>> >>>>>>> # obviously misunderstand something about how str_detect works
>> >>>>>>>
>> >>>>>>> library(stringr)
>> >>>>>>>
>> >>>>>>> str_detect(zz[,2:3], alarm.words)    # error: the target of the search
>> >>>>>>>                                      # must be a vector, not multiple
>> >>>>>>>                                      # columns
>> >>>>>>>
>> >>>>>>> str_detect(zz[1:4,2:3], alarm.words) # same error
>> >>>>>>>
>> >>>>>>> str_detect(zz[,2], alarm.words)      # error, length of alarm.words
>> >>>>>>>                                      # is less than the number of
>> >>>>>>>                                      # rows I am using for the
>> >>>>>>>                                      # comparison
>> >>>>>>>
>> >>>>>>> str_detect(zz[1:4,2], alarm.words)   # works as hoped when
>> >>>>>>> length(alarm.words)                  # confining nrows
>> >>>>>>>                                      # to the length of alarm.words
>> >>>>>>>
>> >>>>>>> str_detect(zz, alarm.words)          # obviously not right
>> >>>>>>>
>> >>>>>>> # maybe I need apply() ?
>> >>>>>>> my.f <- function(x){str_detect(x, alarm.words)}
>> >>>>>>>
>> >>>>>>> apply(zz[,2], 1, my.f)     # again, a mismatch in lengths
>> >>>>>>>                            # between alarm.words and that
>> >>>>>>>                            # in which I am searching for
>> >>>>>>>                            # matching strings
>> >>>>>>>
>> >>>>>>> apply(zz, 2, my.f)         # now I'm getting somewhere
>> >>>>>>> apply(zz[1:4,], 2, my.f)   # but still only works with 4
>> >>>>>>>                            # rows of the dataframe
>> >>>>>>>
>> >>>>>>> # perhaps %in% could do the job?
>> >>>>>>>
>> >>>>>>> Appreciate any advice.
>> >>>>>>>
>> >>>>>>> --Chris Ryan
>> >>>>>>>
>> >>>>>>> ______________________________________________
>> >>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> >>>>>>> and provide commented, minimal, self-contained, reproducible code.
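The difference in match counts reported at the top of this message follows from how each approach decides what counts as a word. The sketch below is an illustration only: the four-row data frame is hypothetical (it is not the real data set), and approach 1 is a substring search in the spirit of John's solution rather than his exact code. It puts the three strategies from this thread side by side.

```r
# Hypothetical toy data chosen to expose the differences (not from the thread)
zz <- data.frame(
  v2 = c("barred owl", "this is red.", "red sparks", "blue bird"),
  v3 = c("harry potter", "ginny weasley", "dudley dursley", "hello world"),
  stringsAsFactors = FALSE
)
alarm.words <- c("red", "green", "turtle", "gandalf")

# One string per row, built from every searched column
txt <- do.call(paste, zz[, c("v2", "v3")])

# 1. Plain substring search: "red" also hits "barred"
substring.hit <- grepl(paste0(alarm.words, collapse = "|"), txt)

# 2. Split on whitespace, then %in%: "red." keeps its period and is missed
token.hit <- vapply(strsplit(txt, "[[:space:]]+"),
                    function(x) any(x %in% alarm.words), logical(1))

# 3. Whole-word regex with \\b: skips "barred", still finds "red."
word.hit <- grepl(paste0("\\b(", paste0(alarm.words, collapse = "|"), ")\\b"), txt)

data.frame(txt, substring.hit, token.hit, word.hit)
# substring.hit: TRUE  TRUE  TRUE FALSE
# token.hit:     FALSE FALSE TRUE FALSE
# word.hit:      FALSE TRUE  TRUE FALSE
```

On real free text the substring approach can therefore be expected to flag the most rows and the whitespace-token approach the fewest, which is the ordering seen in the table above.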
Bert Gunter
2015-Jul-10 17:39 UTC
[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe
Yes. This is one of the fundamental challenges in text searching -- defining exactly what text defines a match and what doesn't. So, continuing your example, one might imagine that heroin and heroine might both be matches, but maybe heroines shouldn't be (e.g. if the text contains movie reviews). So what one might want to do is add semantic analysis to searches, à la Google, a topic way beyond the simple capabilities discussed, or likely needed, here.

Incidentally, Jeff Newmiller's (final) regular expression solution is preferable to mine in all respects, I think.

-- Bert

Bert Gunter

"Data is not information. Information is not knowledge. And knowledge is certainly not wisdom."
   -- Clifford Stoll
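Short of semantic analysis, the heroin/heroine/heroines distinction can be approximated with an optional trailing letter inside word boundaries. This is only a sketch: the notes vector is made up for illustration, and the pattern cannot tell the misspelling "heroine" from the literary sense of the word.

```r
# Hypothetical free-text notes (invented for illustration)
notes <- c("pt admits heroin use",
           "possible heroine overdose",
           "review: the film's heroines",
           "Heroin found on scene",
           "no drugs reported")

# "e?" makes the final "e" optional; \\b on both sides rejects "heroines"
pattern <- "\\bheroine?\\b"

hit <- grepl(pattern, notes, ignore.case = TRUE)
hit
# TRUE TRUE FALSE TRUE FALSE
```

The same idea extends to a keyword list by giving each known misspelling its own alternative, or simply by adding the misspelled forms to alarm.words.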
Bert Gunter
2015-Jul-11 14:36 UTC
[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe
Note that John's solution probably includes incorrect partial matches and that mine fails to match "red" in "this is red." If you change my proposal to sapply(strsplit(do.call(paste,zz[,2:3]),"\\W"), function(x)any(x %in% alarm.words)) it should agree with Jeff's. Note, however, that you have missed capital letters: "Red" would not match "This is red". Bert Gunter "Data is not information. Information is not knowledge. And knowledge is certainly not wisdom." -- Clifford Stoll On Fri, Jul 10, 2015 at 10:54 AM, Christopher W Ryan <cryan at binghamton.edu> wrote:> Indeed, the perils of syndromic surveillance with free text. > >> with(dd.2, table(fox)) > fox > FALSE TRUE > 74939 1201 > >> with(dd.2, table(gunter)) > gunter > FALSE TRUE > 75213 927 > >> with(dd.2, table(newmiller)) > newmiller > FALSE TRUE > 75028 1112 > > > Of, course, the simplest thing for me to do would be add "heroine" to > the alarm.words. I'm surprised that the US national organization that > promulgated this list of drug-related terms did not include it. Many > other common misspellings are included. I will have to contact them. > > --Chris > > On Fri, Jul 10, 2015 at 1:39 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote: >> Yes. This is one of the fundamental challenges in text searching -- >> defining exactly what text defines a match and what doesn't. So, >> continuing your example, one might imagine that heroin and heroine >> might both be matches, but maybe heroines shouldn't be (e.g. if the >> text contains movie reviews). So what one might want to do is add >> semantic analysis to searches, ? la google, a topic way beyond the >> simple capabilities discussed, or likely needed, here. >> >> Incidentally, Jeff Newmiller's (final) regular expression solution is >> preferable to mine in all respects, I think. >> >> -- Bert >> >> >> Bert Gunter >> >> "Data is not information. Information is not knowledge. And knowledge >> is certainly not wisdom." 
>> -- Clifford Stoll >> >> >> On Fri, Jul 10, 2015 at 10:30 AM, Christopher W Ryan >> <cryan at binghamton.edu> wrote: >>> Interesting thoughts about the partial-word matches, and speed On >>> another real data set, about 73,000 records and 6 columns to search >>> through for matches (one column of which contains very long character >>> strings--several paragraphs each), I ran both John's and Bert's >>> solutions. John's was noticeably slower, although still quite >>> tolerable. There were a different number of matches, though: >>> >>> oic.2 >>> oic FALSE TRUE Sum >>> FALSE 74939 0 74939 >>> TRUE 274 927 1201 >>> Sum 75213 927 76140 >>> >>> where oic is the logical vector generated by John's solution, and >>> oic.2 is the logical vector generated by Bert's solution. Bert's >>> solution detected about 77% of the cases detected by John's. >>> >>> I'm still exploring why that might be. One possible explanation, for >>> at least part of the difference, is the issue of partial-word matches. >>> Substantively, I am searching ambulance run records for words related >>> to opioid overdose, and I've noticed that the medics often spell >>> heroin as "heroine" So in this context, I like partial-word >>> matches--I want to pick up records that (partially) match "heroin" >>> because it is contained in the word "heroine" . >>> >>> There may be other things going on too. >>> >>> Thanks. >>> >>> --Chris >>> >>> On Thu, Jul 9, 2015 at 3:24 PM, John Fox <jfox at mcmaster.ca> wrote: >>>> Dear Christopher, >>>> >>>> My usual orientation to this kind of one-off problem is that I'm looking for a simple correct solution. Computing time is usually much smaller than programming time. >>>> >>>> That said, Bert Gunter's solution was about 5 times faster in a simple check that I ran with microbenchmark, and Jeff Newmiller's solution was about 10 times faster. 
Both Bert's and Jeff's (eventual) solution protect against partial (rather than full-word) matches, while mine doesn't (though it could easily be modified to do that). >>>> >>>> Best, >>>> John >>>> >>>>> -----Original Message----- >>>>> From: Christopher W Ryan [mailto:cryan at binghamton.edu] >>>>> Sent: July-09-15 2:49 PM >>>>> To: Bert Gunter >>>>> Cc: Jeff Newmiller; R Help; John Fox >>>>> Subject: Re: [R] detecting any element in a vector of strings, appearing >>>>> anywhere in any of several character variables in a dataframe >>>>> >>>>> Thanks everyone. John's original solution worked great. And with >>>>> 27,000 records, 65 alarm.words, and 6 columns to search, it takes only >>>>> about 15 seconds. That is certainly adequate for my needs. But I >>>>> will try out the other strategies too. >>>>> >>>>> And thanks also for lot's of new R things to learn--grep, grepl, >>>>> do.call . . . that's always a bonus! >>>>> >>>>> --Chris Ryan >>>>> >>>>> On Thu, Jul 9, 2015 at 1:52 PM, Bert Gunter <bgunter.4567 at gmail.com> >>>>> wrote: >>>>> > Yup, that does it. Let grep figure out what's a word rather than doing >>>>> > it manually. Forgot about "\b" >>>>> > >>>>> > Cheers, >>>>> > Bert >>>>> > >>>>> > >>>>> > Bert Gunter >>>>> > >>>>> > "Data is not information. Information is not knowledge. And knowledge >>>>> > is certainly not wisdom." >>>>> > -- Clifford Stoll >>>>> > >>>>> > >>>>> > On Thu, Jul 9, 2015 at 10:30 AM, Jeff Newmiller >>>>> > <jdnewmil at dcn.davis.ca.us> wrote: >>>>> >> Just add a word break marker before and after: >>>>> >> >>>>> >> zz$v5 <- grepl( paste0( "\\b(", paste0( alarm.words, collapse="|" ), >>>>> ")\\b" ), do.call( paste, zz[ , 2:3 ] ) ) ) >>>>> >> --------------------------------------------------------------------- >>>>> ------ >>>>> >> Jeff Newmiller The ..... ..... Go >>>>> Live... >>>>> >> DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live >>>>> Go... >>>>> >> Live: OO#.. Dead: OO#.. 
>>>>> Playing >>>>> >> Research Engineer (Solar/Batteries O.O#. #.O#. with >>>>> >> /Software/Embedded Controllers) .OO#. .OO#. >>>>> rocks...1k >>>>> >> --------------------------------------------------------------------- >>>>> ------ >>>>> >> Sent from my phone. Please excuse my brevity. >>>>> >> >>>>> >> On July 9, 2015 10:12:23 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> >>>>> wrote: >>>>> >>>Jeff: >>>>> >>> >>>>> >>>Well, it would be much better (no loops!) except, I think, for one >>>>> >>>issue: "red" would match "barred" and I don't think that this is what >>>>> >>>is wanted: the matches should be on whole "words" not just string >>>>> >>>patterns. >>>>> >>> >>>>> >>>So you would need to fix up the matching pattern to make this work, >>>>> >>>but it may be a little tricky, as arbitrary whitespace characters, >>>>> >>>e.g. " " or "\n" etc. could be in the strings to be matched >>>>> separating >>>>> >>>the words or ending the "sentence." I'm sure it can be done, but >>>>> I'll >>>>> >>>leave it to you or others to figure it out. >>>>> >>> >>>>> >>>Of course, if my diagnosis is wrong or silly, please point this out. >>>>> >>> >>>>> >>>Cheers, >>>>> >>>Bert >>>>> >>> >>>>> >>> >>>>> >>>Bert Gunter >>>>> >>> >>>>> >>>"Data is not information. Information is not knowledge. And knowledge >>>>> >>>is certainly not wisdom." >>>>> >>> -- Clifford Stoll >>>>> >>> >>>>> >>> >>>>> >>>On Thu, Jul 9, 2015 at 9:34 AM, Jeff Newmiller >>>>> >>><jdnewmil at dcn.davis.ca.us> wrote: >>>>> >>>> I think grep is better suited to this: >>>>> >>>> >>>>> >>>> zz$v5 <- grepl( paste0( alarm.words, collapse="|" ), do.call( >>>>> paste, >>>>> >>>zz[ , 2:3 ] ) ) ) >>>>> >>>> >>>>> >>>--------------------------------------------------------------------- >>>>> ------ >>>>> >>>> Jeff Newmiller The ..... ..... Go >>>>> >>>Live... >>>>> >>>> DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. >>>>> Live >>>>> >>>Go... >>>>> >>>> Live: OO#.. Dead: OO#.. 
>>>>> >>>> On July 9, 2015 8:51:10 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>>>> >>>>> Here's a way to do it that uses %in% (i.e. match() ) and uses only a single, not a double, loop. It should be more efficient.
>>>>> >>>>>
>>>>> >>>>> sapply(strsplit(do.call(paste, zz[, 2:3]), "[[:space:]]+"),
>>>>> >>>>>        function(x) any(x %in% alarm.words))
>>>>> >>>>>
>>>>> >>>>> [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
>>>>> >>>>>
>>>>> >>>>> The idea is to paste the strings in each row (do.call allows an arbitrary number of columns) into a single string and then use strsplit to break the string into individual "words" on whitespace. Then the matching is vectorized with the any( %in% ... ) call.
>>>>> >>>>>
>>>>> >>>>> Cheers,
>>>>> >>>>> Bert
>>>>> >>>>>
>>>>> >>>>> On Thu, Jul 9, 2015 at 6:05 AM, John Fox <jfox at mcmaster.ca> wrote:
>>>>> >>>>>> Dear Chris,
>>>>> >>>>>>
>>>>> >>>>>> If I understand correctly what you want, how about the following?
>>>>> >>>>>>
>>>>> >>>>>> rows <- apply(zz[, 2:3], 1, function(x) any(sapply(alarm.words, grepl, x=x)))
>>>>> >>>>>> zz[rows, ]
>>>>> >>>>>>
>>>>> >>>>>>           v1                              v2                v3 v4
>>>>> >>>>>> 3  -1.022329                    green turtle    ronald weasley  2
>>>>> >>>>>> 6   0.336599              waffle the hamster        red sparks  1
>>>>> >>>>>> 9  -1.631874 yellow giraffe with a long neck gandalf the white  1
>>>>> >>>>>> 10  1.130622                      black bear  gandalf the grey  2
>>>>> >>>>>>
>>>>> >>>>>> I hope this helps,
>>>>> >>>>>> John
>>>>> >>>>>>
>>>>> >>>>>> ------------------------------------------------
>>>>> >>>>>> John Fox, Professor
>>>>> >>>>>> McMaster University
>>>>> >>>>>> Hamilton, Ontario, Canada
>>>>> >>>>>> http://socserv.mcmaster.ca/jfox/
>>>>> >>>>>>
>>>>> >>>>>> On Wed, 08 Jul 2015 22:23:37 -0400 "Christopher W. Ryan" <cryan at binghamton.edu> wrote:
>>>>> >>>>>>> Running R 3.1.1 on windows 7
>>>>> >>>>>>>
>>>>> >>>>>>> I want to identify as a case any record in a dataframe that contains any of several keywords in any of several variables.
>>>>> >>>>>>>
>>>>> >>>>>>> Example:
>>>>> >>>>>>>
>>>>> >>>>>>> # create a dataframe with 4 variables and 10 records
>>>>> >>>>>>> v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
>>>>> >>>>>>>         "big black dog", "waffle the hamster", "benny likes food a lot",
>>>>> >>>>>>>         "hello world", "yellow giraffe with a long neck", "black bear")
>>>>> >>>>>>> v3 <- c("harry potter", "hermione grainger", "ronald weasley", "ginny weasley",
>>>>> >>>>>>>         "dudley dursley", "red sparks", "blue sparks", "white dress robes",
>>>>> >>>>>>>         "gandalf the white", "gandalf the grey")
>>>>> >>>>>>> zz <- data.frame(v1=rnorm(10), v2=v2, v3=v3, v4=rpois(10, lambda=2),
>>>>> >>>>>>>                  stringsAsFactors=FALSE)
>>>>> >>>>>>> str(zz)
>>>>> >>>>>>> zz
>>>>> >>>>>>>
>>>>> >>>>>>> # here are the keywords
>>>>> >>>>>>> alarm.words <- c("red", "green", "turtle", "gandalf")
>>>>> >>>>>>>
>>>>> >>>>>>> # For each row/record, I want to test whether the string in v2 or the
>>>>> >>>>>>> # string in v3 contains any of the strings in alarm.words. And then if so,
>>>>> >>>>>>> # set zz$v5=TRUE for that record.
>>>>> >>>>>>>
>>>>> >>>>>>> # I'm thinking the str_detect function in the stringr package ought to
>>>>> >>>>>>> # be able to help, perhaps with some use of apply over the rows, but I
>>>>> >>>>>>> # obviously misunderstand something about how str_detect works
>>>>> >>>>>>>
>>>>> >>>>>>> library(stringr)
>>>>> >>>>>>>
>>>>> >>>>>>> str_detect(zz[,2:3], alarm.words)    # error: the target of the search
>>>>> >>>>>>>                                      # must be a vector, not multiple
>>>>> >>>>>>>                                      # columns
>>>>> >>>>>>>
>>>>> >>>>>>> str_detect(zz[1:4,2:3], alarm.words) # same error
>>>>> >>>>>>>
>>>>> >>>>>>> str_detect(zz[,2], alarm.words)      # error, length of alarm.words
>>>>> >>>>>>>                                      # is less than the number of
>>>>> >>>>>>>                                      # rows I am using for the
>>>>> >>>>>>>                                      # comparison
>>>>> >>>>>>>
>>>>> >>>>>>> str_detect(zz[1:4,2], alarm.words)   # works as hoped when
>>>>> >>>>>>> length(alarm.words)                  # confining nrows
>>>>> >>>>>>>                                      # to the length of alarm.words
>>>>> >>>>>>>
>>>>> >>>>>>> str_detect(zz, alarm.words)          # obviously not right
>>>>> >>>>>>>
>>>>> >>>>>>> # maybe I need apply() ?
>>>>> >>>>>>> my.f <- function(x){str_detect(x, alarm.words)}
>>>>> >>>>>>>
>>>>> >>>>>>> apply(zz[,2], 1, my.f)     # again, a mismatch in lengths
>>>>> >>>>>>>                            # between alarm.words and that
>>>>> >>>>>>>                            # in which I am searching for
>>>>> >>>>>>>                            # matching strings
>>>>> >>>>>>>
>>>>> >>>>>>> apply(zz, 2, my.f)         # now I'm getting somewhere
>>>>> >>>>>>> apply(zz[1:4,], 2, my.f)   # but still only works with 4
>>>>> >>>>>>>                            # rows of the dataframe
>>>>> >>>>>>>
>>>>> >>>>>>> # perhaps %in% could do the job?
>>>>> >>>>>>>
>>>>> >>>>>>> Appreciate any advice.
>>>>> >>>>>>>
>>>>> >>>>>>> --Chris Ryan
>>>>> >>>>>>>
>>>>> >>>>>>> ______________________________________________
>>>>> >>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>> >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> >>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>> >>>>>>> and provide commented, minimal, self-contained, reproducible code.
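
The three matching strategies quoted above (plain alternation, alternation wrapped in "\\b", and whitespace splitting followed by %in%) can give different answers on the same text. The following is only a minimal sketch of that difference, not code posted by anyone in the thread: the vectors `notes` and `words` are hypothetical test data, and the punctuation case is an assumption about one way the counts might diverge, not a confirmed explanation of the figures reported.

```r
# Hypothetical free-text notes and keywords (not from the thread's data)
notes <- c("pt admits heroine use", "barred from entry", "overdose, heroin.")
words <- c("heroin", "red")

# Strategy 1: plain alternation, so a keyword matches anywhere inside a word
partial <- grepl(paste0(words, collapse = "|"), notes)

# Strategy 2: the same alternation anchored at word boundaries with \\b
whole <- grepl(paste0("\\b(", paste0(words, collapse = "|"), ")\\b"), notes)

# Strategy 3: split on whitespace, then look each token up exactly
tokens <- sapply(strsplit(notes, "[[:space:]]+"),
                 function(x) any(x %in% words))

# Side-by-side comparison of the three logical vectors
print(data.frame(notes, partial, whole, tokens))
```

On this toy input the plain alternation flags all three notes, the word-boundary version flags only the third, and the whitespace-token version flags none, because the token "heroin." carries its trailing full stop. If misspellings such as "heroine" should count as hits, the plain alternation is the one that catches them.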