thr3ads.net - R help - [R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe [Jul 2015]


Bert Gunter

2015-Jul-09 17:52 UTC

[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

Yup, that does it. Let grep figure out what's a word rather than doing
it manually. Forgot about "\b"

Cheers,
Bert


Bert Gunter

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
   -- Clifford Stoll
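
The difference the "\b" marker makes can be seen on a small example. This is only a sketch: the vector `x` is made-up toy input, not data from the thread, and the pattern is built the same way as in the quoted reply below.

```r
# Toy inputs (hypothetical; not the thread's data frame)
x <- c("red sparks", "barred owl", "a red\nballoon")
alarm.words <- c("red", "green")

# Plain alternation matches substrings, so "barred" is flagged too
loose <- grepl(paste0(alarm.words, collapse = "|"), x)

# Wrapping the alternation in \\b ... \\b restricts matches to whole words;
# a newline counts as a word break just like a space does
pattern <- paste0("\\b(", paste0(alarm.words, collapse = "|"), ")\\b")
strict <- grepl(pattern, x)

print(loose)   # TRUE TRUE TRUE
print(strict)  # TRUE FALSE TRUE
```

Matching is case-sensitive by default; `ignore.case = TRUE` in `grepl()` would relax that if needed.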


On Thu, Jul 9, 2015 at 10:30 AM, Jeff Newmiller
<jdnewmil at dcn.davis.ca.us> wrote:
> Just add a word break marker before and after:
>
> zz$v5 <- grepl( paste0( "\\b(", paste0( alarm.words, collapse="|" ), ")\\b" ), do.call( paste, zz[ , 2:3 ] ) )
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>                                       Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> ---------------------------------------------------------------------------
> Sent from my phone. Please excuse my brevity.
>
> On July 9, 2015 10:12:23 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>Jeff:
>>
>>Well, it would be much better (no loops!) except, I think, for one
>>issue: "red" would match "barred" and I don't think that this is what
>>is wanted: the matches should be on whole "words" not just string
>>patterns.
>>
>>So you would need to fix up the matching pattern to make this work,
>>but it may be a little tricky, as arbitrary whitespace characters,
>>e.g. " " or "\n" etc. could be in the strings to be matched separating
>>the words or ending the "sentence."  I'm sure it can be done, but I'll
>>leave it to you or others to figure it out.
>>
>>Of course, if my diagnosis is wrong or silly, please point this out.
>>
>>Cheers,
>>Bert
>>
>>
>>On Thu, Jul 9, 2015 at 9:34 AM, Jeff Newmiller
>><jdnewmil at dcn.davis.ca.us> wrote:
>>> I think grep is better suited to this:
>>>
>>> zz$v5 <- grepl( paste0( alarm.words, collapse="|" ), do.call( paste, zz[ , 2:3 ] ) )
>>>
>>>
>>> On July 9, 2015 8:51:10 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
>>>>Here's a way to do it that uses %in% (i.e. match() ) and uses only a
>>>>single, not a double, loop. It should be more efficient.
>>>>
>>>>> sapply(strsplit(do.call(paste,zz[,2:3]),"[[:space:]]+"),
>>>>+       function(x)any(x %in% alarm.words))
>>>>
>>>> [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
>>>>
>>>>The idea is to paste the strings in each row (do.call allows an
>>>>arbitrary number of columns) into a single string and then use
>>>>strsplit to break the string into individual "words" on whitespace.
>>>>Then the matching is vectorized with the any( %in% ... ) call.
>>>>
>>>>Cheers,
>>>>Bert
>>>>
>>>>On Thu, Jul 9, 2015 at 6:05 AM, John Fox <jfox at mcmaster.ca> wrote:
>>>>> Dear Chris,
>>>>>
>>>>> If I understand correctly what you want, how about the following?
>>>>>
>>>>>> rows <- apply(zz[, 2:3], 1, function(x) any(sapply(alarm.words, grepl, x=x)))
>>>>>> zz[rows, ]
>>>>>
>>>>>           v1                              v2                v3 v4
>>>>> 3  -1.022329                    green turtle    ronald weasley  2
>>>>> 6   0.336599              waffle the hamster        red sparks  1
>>>>> 9  -1.631874 yellow giraffe with a long neck gandalf the white  1
>>>>> 10  1.130622                      black bear  gandalf the grey  2
>>>>>
>>>>> I hope this helps,
>>>>>  John
>>>>>
>>>>> ------------------------------------------------
>>>>> John Fox, Professor
>>>>> McMaster University
>>>>> Hamilton, Ontario, Canada
>>>>> http://socserv.mcmaster.ca/jfox/
>>>>>
>>>>>
>>>>> On Wed, 08 Jul 2015 22:23:37 -0400
>>>>>  "Christopher W. Ryan" <cryan at binghamton.edu> wrote:
>>>>>> Running R 3.1.1 on windows 7
>>>>>>
>>>>>> I want to identify as a case any record in a dataframe that contains any
>>>>>> of several keywords in any of several variables.
>>>>>>
>>>>>> Example:
>>>>>>
>>>>>> # create a dataframe with 4 variables and 10 records
>>>>>> v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
>>>>>> "big black dog", "waffle the hamster", "benny likes food a lot",
>>>>>> "hello world", "yellow giraffe with a long neck", "black bear")
>>>>>> v3 <- c("harry potter", "hermione grainger", "ronald weasley",
>>>>>> "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
>>>>>> "white dress robes", "gandalf the white", "gandalf the grey")
>>>>>> zz <- data.frame(v1=rnorm(10), v2=v2, v3=v3, v4=rpois(10, lambda=2),
>>>>>> stringsAsFactors=FALSE)
>>>>>> str(zz)
>>>>>> zz
>>>>>>
>>>>>> # here are the keywords
>>>>>> alarm.words <- c("red", "green", "turtle", "gandalf")
>>>>>>
>>>>>> # For each row/record, I want to test whether the string in v2 or the
>>>>>> # string in v3 contains any of the strings in alarm.words. And then if so,
>>>>>> # set zz$v5=TRUE for that record.
>>>>>>
>>>>>> # I'm thinking the str_detect function in the stringr package ought to
>>>>>> # be able to help, perhaps with some use of apply over the rows, but I
>>>>>> # obviously misunderstand something about how str_detect works
>>>>>>
>>>>>> library(stringr)
>>>>>>
>>>>>> str_detect(zz[,2:3], alarm.words)    # error: the target of the search
>>>>>>                                      # must be a vector, not multiple
>>>>>>                                      # columns
>>>>>>
>>>>>> str_detect(zz[1:4,2:3], alarm.words) # same error
>>>>>>
>>>>>> str_detect(zz[,2], alarm.words)      # error, length of alarm.words
>>>>>>                                      # is less than the number of
>>>>>>                                      # rows I am using for the
>>>>>>                                      # comparison
>>>>>>
>>>>>> str_detect(zz[1:4,2], alarm.words)   # works as hoped when
>>>>>> length(alarm.words)                  # confining nrows
>>>>>>                                      # to the length of alarm.words
>>>>>>
>>>>>> str_detect(zz, alarm.words)          # obviously not right
>>>>>>
>>>>>> # maybe I need apply() ?
>>>>>> my.f <- function(x){str_detect(x, alarm.words)}
>>>>>>
>>>>>> apply(zz[,2], 1, my.f)     # again, a mismatch in lengths
>>>>>>                            # between alarm.words and that
>>>>>>                            # in which I am searching for
>>>>>>                            # matching strings
>>>>>>
>>>>>> apply(zz, 2, my.f)         # now I'm getting somewhere
>>>>>> apply(zz[1:4,], 2, my.f)   # but still only works with 4
>>>>>>                            # rows of the dataframe
>>>>>>
>>>>>>
>>>>>> # perhaps %in% could do the job?
>>>>>>
>>>>>> Appreciate any advice.
>>>>>>
>>>>>> --Chris Ryan
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.

Christopher W Ryan

2015-Jul-09 18:48 UTC


[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

Thanks everyone.  John's original solution worked great.  And with
27,000 records, 65 alarm.words, and 6 columns to search, it takes only
about 15 seconds.  That is certainly adequate for my needs.  But I
will try out the other strategies too.

And thanks also for lots of new R things to learn--grep, grepl,
do.call . . . that's always a bonus!

--Chris Ryan
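
For a larger search like the six-column one described above, the word-boundary `grepl()` approach from the thread could be wrapped in a small helper that takes the columns to search as an argument. The following is a minimal sketch under stated assumptions: the function name `flag_alarm_rows` and the cut-down data frame are hypothetical, and the helper assumes the keywords contain no regular-expression metacharacters (those would need escaping first).

```r
# Hypothetical helper: TRUE for each row in which any of `cols` contains
# any of `words` as a whole word
flag_alarm_rows <- function(df, cols, words) {
  pattern <- paste0("\\b(", paste0(words, collapse = "|"), ")\\b")
  # Paste the chosen columns row-wise into one string per record;
  # unname() keeps column names from being read as paste() arguments
  combined <- do.call(paste, unname(as.list(df[cols])))
  grepl(pattern, combined)
}

# Small made-up data frame in the style of the original example
zz <- data.frame(
  id = 1:4,
  v2 = c("white bird", "green turtle", "barred owl", "black bear"),
  v3 = c("harry potter", "ronald weasley", "blue sparks", "gandalf the grey"),
  stringsAsFactors = FALSE
)
alarm.words <- c("red", "green", "turtle", "gandalf")

zz$v5 <- flag_alarm_rows(zz, c("v2", "v3"), alarm.words)
zz[zz$v5, ]  # rows 2 and 4; "barred owl" is not flagged
```

Passing the column names as a vector means the same call works for two columns or six without changing the function.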


John Fox

2015-Jul-09 19:24 UTC


[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

Dear Christopher,

My usual orientation to this kind of one-off problem is that I'm looking for
a simple correct solution. Computing time is usually much smaller than
programming time.

That said, Bert Gunter's solution was about 5 times faster in a simple check
that I ran with microbenchmark, and Jeff Newmiller's solution was about 10
times faster. Both Bert's and Jeff's (eventual) solutions protect against
partial (rather than full-word) matches, while mine doesn't (though it could
easily be modified to do that).

Best,
 John
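
One possible way to make the apply/sapply solution match whole words only is to wrap each keyword in "\\b" markers before calling `grepl()`. This is a sketch of that idea, not code posted in the thread; the data are the example from the original question (without the random v1 and v4 columns), and the other two approaches are included to check that all three agree.

```r
v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
        "big black dog", "waffle the hamster", "benny likes food a lot",
        "hello world", "yellow giraffe with a long neck", "black bear")
v3 <- c("harry potter", "hermione grainger", "ronald weasley", "ginny weasley",
        "dudley dursley", "red sparks", "blue sparks", "white dress robes",
        "gandalf the white", "gandalf the grey")
zz <- data.frame(v2 = v2, v3 = v3, stringsAsFactors = FALSE)
alarm.words <- c("red", "green", "turtle", "gandalf")

# apply/sapply approach with each keyword wrapped in word-boundary markers
# (the wrapping is an assumed modification, not code from the thread)
bounded <- paste0("\\b", alarm.words, "\\b")
john <- unname(apply(zz, 1, function(x) any(sapply(bounded, grepl, x = x))))

# Split-on-whitespace approach using %in%
bert <- sapply(strsplit(do.call(paste, zz), "[[:space:]]+"),
               function(x) any(x %in% alarm.words))

# Single-regex approach with word boundaries
jeff <- grepl(paste0("\\b(", paste0(alarm.words, collapse = "|"), ")\\b"),
              do.call(paste, zz))

# All three flag the same records on this data
stopifnot(identical(john, bert), identical(bert, jeff))
which(jeff)  # 3 6 9 10
```

The three approaches agree here because the example strings contain no punctuation; with text such as "red," the whitespace split would keep the comma attached to the word, while the "\\b" patterns would still match.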
