Bert Gunter
2021-Jun-11 23:10 UTC
[R] Identifying words from a list and code as 0 or 1 and words NOT on the list code as 1
First, if Rui's solution works for you, I recommend that you stop reading
and discard this email. Why bother wasting time with stuff you don't need?!
If it doesn't work or if you would like another approach -- perhaps as a
check -- then read on.
Warning: I am a dinosaur and just use base R functionality, including
regular expressions, for these sorts of relatively simple tasks. I also
eschew pipes. So my code for your example is simply:
matchpat <- paste("\\b", coreWords, "\\b", sep = "", collapse = "|")
out <- gsub(matchpat, "", Utterance)
Core <- nchar(out) != nchar(Utterance)
Fringe <- nchar(gsub(" +", "", out)) > 0
Note that I have given the results as logical TRUE or FALSE. If you insist
on 1's and 0's, just instead do:
Core <- (nchar(out) != nchar(Utterance)) + 0
Fringe <- sign(nchar(gsub(" +","",out)))
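For anyone who wants to try the lines above without the original data frame, here is a minimal self-contained sketch. The toy coreWords and Utterance vectors are made up for illustration (they are not the poster's data) and deliberately avoid apostrophes, since how "\\b" treats punctuation can depend on the regex implementation.

```r
# Toy data, assumed for illustration only
coreWords <- c("a", "yes", "is", "it", "what", "that", "all done")
Utterance <- c("a baby", "small", "yes", "what is that", "now we are all done")

# One alternation of all core words/phrases, each wrapped in word boundaries
matchpat <- paste("\\b", coreWords, "\\b", sep = "", collapse = "|")

# Remove every core word/phrase; whatever is left is candidate fringe material
out <- gsub(matchpat, "", Utterance)

Core <- nchar(out) != nchar(Utterance)     # something was removed
Fringe <- nchar(gsub(" +", "", out)) > 0   # something other than spaces remains

# Adding 0 converts the logicals to 1/0 codes
data.frame(Utterance, Core = Core + 0, Fringe = Fringe + 0)
```

The same lines should carry over to a data frame column by passing df$Utterance in place of the toy vector.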
Now for an explanation. My approach was simply to create a regular
expression (regex) match pattern that would match any of your words or
phrases. The matchpat assignment does this just by logically "or"ing (with
the "|" symbol) together all your words and phrases, each of which is
surrounded by the edge-of-word symbol, "\\b" (so only whole words or
phrases are matched). This is standard regex stuff, and I could do it
rather handily with R's paste() function. One word of caution, though: R's
?regex says:
"Long regular expression patterns may or may not be accepted: the POSIX
standard only requires up to 256 bytes." So what works for your reprex
might not work for your full list of coreWords. It is possible to work
around this by repeatedly applying subsets of your coreWords **provided**
you make sure that you order these subsets by the number of words in each
coreWord phrase. That is, bigger phrases must be applied first before
applying smaller phrases/words to the results. This is not hard to do, but
adds complexity, and may not be necessary. See below for an explanation.
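A rough sketch of that chunked workaround, under stated assumptions: the chunk size of 2 is arbitrary and only for illustration (a real limit would depend on the pattern length your regex engine accepts), and the word lists are toy data rather than the poster's.

```r
# Toy data, assumed for illustration only
coreWords <- c("night", "good night", "yes", "all done", "a")
Utterance <- c("good night", "yes a night", "all done now")

# Count the words in each core phrase and put multi-word phrases first
nWords <- lengths(strsplit(coreWords, " +"))
ordered <- coreWords[order(nWords, decreasing = TRUE)]

# Split the ordered list into chunks (chunkSize is a hypothetical setting)
chunkSize <- 2
chunks <- split(ordered, ceiling(seq_along(ordered) / chunkSize))

# Apply the chunks in order, each time to the result of the previous pass
out <- Utterance
for (chunk in chunks) {
  pat <- paste("\\b", chunk, "\\b", sep = "", collapse = "|")
  out <- gsub(pat, "", out)
}

Core <- nchar(out) != nchar(Utterance)
Fringe <- nchar(gsub(" +", "", out)) > 0
```

Setting chunkSize to 1 turns this into a one-phrase-at-a-time loop.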
What the second line of code does is to use the gsub() function to remove
all matches to matchpat -- which, via the "|" construction, is anything
in your coreWords list. So this means that if you have any matches in an
utterance, what remains after gsubbing will be shorter -- fewer characters
-- than the original utterance. The Core assignment checks for this using
the nchar() function and returns TRUE or FALSE as appropriate. If all the
words in the utterance matched core words, you would be left with nothing
or a bunch of spaces. The Fringe assignment just first removes all spaces
via the gsub() and then returns TRUE if there are still some characters left
or FALSE if there's nothing (0 characters) left.
Finally, why do you have to start with longer phrases first if you have to
do this sequentially? Suppose you have the phrase "good night" in your
phrase list, and also the word "night". If you have to do things
sequentially instead of as one swell foop, if you applied the gsub() with a
bunch including only "night" first, then "night" will be removed and "good"
will be left. Then when the bunch containing "good night" is gsubbed after,
it won't see the whole phrase any more and "good" will be left in, which is
*not* what you said you wanted.
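The ordering problem can be seen directly with a two-step toy example (the variable names are only illustrative):

```r
x <- "good night"

# Short word first: "night" is stripped, so the phrase can no longer match
wrong <- gsub("\\bgood night\\b", "", gsub("\\bnight\\b", "", x))

# Long phrase first: the whole phrase is removed before "night" is tried
right <- gsub("\\bnight\\b", "", gsub("\\bgood night\\b", "", x))

# A leftover "good" would then be counted as a fringe word
c(wrong = wrong, right = right)
```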
Lastly, it is of course possible to do these things by sequentially
applying one word/phrase at a time in a loop (again, longest phrases first
for the same reason as above), but I believe this might take quite a while
with a big list of coreWords (and Utterances). The above approach using "|"
vectorizes things and takes advantage of the power of the regex engine, so
I think it will be more efficient **if it's accepted.** But if you run
into the problem of pattern length limitations, then sequentially, one at a
time, might be simpler. My judgments of computational efficiency are often
wrong anyway.
Note: I think my approach works, but I would appreciate an on-list response
if I have erred. Also, even if correct, alternative cleverer approaches are
always welcome.
Cheers,
Bert
Bert Gunter
"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Fri, Jun 11, 2021 at 11:54 AM Debbie Hahs-Vaughn <debbie at ucf.edu>
wrote:
> Thank you for noting this. The utterance has to match the exact phrase
> (e.g., "all done") for it to constitute a match in the utterance.
>
>
> ------------------------------
> *From:* Bert Gunter <bgunter.4567 at gmail.com>
> *Sent:* Friday, June 11, 2021 2:42 PM
> *To:* Debbie Hahs-Vaughn <debbie at ucf.edu>
> *Cc:* r-help at R-project.org <r-help at r-project.org>
> *Subject:* Re: [R] Identifying words from a list and code as 0 or 1 and
> words NOT on the list code as 1
>
> Note that your specification is ambiguous. "all done" is not a single word
> -- it's a phrase. So what do you want to do if:
>
> 1) "all" and/or "done" are also among your core words?
> 2) "I'm all done" is another of your core phrases.
>
> The existence of phrases in your core list allows such conflicts to arise.
> Do you claim that phrases would be chosen so that this can never happen? --
> or what is your specification if they can (what constitutes a match and in
> what priority)?
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
>
>
> On Fri, Jun 11, 2021 at 10:06 AM Debbie Hahs-Vaughn <debbie at ucf.edu>
> wrote:
>
> I am working with utterances, statements spoken by children. From each
> utterance, if one or more words in the statement match a predefined list of
> multiple 'core' words (probably 300 words), then I want to input '1' into
> 'Core' (and if none, then input '0' into 'Core').
>
> If there are one or more words in the statement that are NOT core words,
> then I want to input '1' into 'Fringe' (and if there are only core words
> and nothing extra, then input '0' into 'Fringe'). I will not have a list
> of Fringe words.
>
> Basically, right now I have a child ID and only the utterances. Here is a
> snippet of my data.
>
> ID Utterance
> 1 a baby
> 2 small
> 3 yes
> 4 where's his bed
> 5 there's his bed
> 6 where's his pillow
> 7 what is that on his head
> 8 hey he has his arm stuck here
> 9 there there's it
> 10 now you're gonna go night-night
> 11 and that's the thing you can turn on
> 12 yeah where's the music box
> 13 what is this
> 14 small
> 15 there you go baby
>
>
> The following code runs but isn't doing exactly what I need--which is: 1)
> the ability to detect words from the list and define as core; 2) the
> ability to search the utterance and if there are any words in the utterance
> that are NOT core, to identify those as '1' as I will not have a list of
> fringe words.
>
> ```
>
> library(dplyr)
> library(stringr)
> library(tidyr)
>
> coreWords <- c("I", "no", "yes", "my", "the", "want", "is", "it", "that",
>                "a", "go", "mine", "you", "what", "on", "in", "here", "more",
>                "out", "off", "some", "help", "all done", "finished")
>
> str_detect(df,)
>
> dfplus <- df %>%
>   mutate(id = row_number()) %>%
>   separate_rows(Utterance, sep = ' ') %>%
>   mutate(Core = + str_detect(Utterance, str_c(coreWords, collapse = '|')),
>          Fringe = + !Core) %>%
>   group_by(id) %>%
>   mutate(Core = + (sum(Core) > 0),
>          Fringe = + (sum(Fringe) > 0)) %>%
>   slice(1) %>%
>   select(-Utterance) %>%
>   left_join(df) %>%
>   ungroup() %>%
>   select(Utterance, Core, Fringe, ID)
>
> ```
>
> The dput() code is:
>
> ```
> structure(list(Utterance = c("a baby", "small", "yes", "where's his bed",
> "there's his bed", "where's his pillow", "what is that on his head",
> "hey he has his arm stuck here", "there there's it",
> "now you're gonna go night-night",
> "and that's the thing you can turn on", "yeah where's the music box",
> "what is this", "small", "there you go baby ", "what is this for ",
> "a ", "and the go goodnight here ", "and what is this ",
> " what's that sound ",
> "what does she say ", "what she say",
> "should I turn the on so Laura doesn't cry ",
> "what is this ", "what is that ", "where's clothes ",
> " where's the baby's bedroom ",
> "that might be in dad's bed+room ", "yes ", "there you go baby ",
> "you're welcome "), Core = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L, 1L, 1L, 1L, 1L), Fringe = c(0L, 0L, 0L, 1L, 1L, 1L,
> 0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), ID = 1:31), row.names = c(NA,
> -31L), class = c("tbl_df", "tbl", "data.frame"))
>
> ```
>
> The first 10 rows of output look like this:
>
>    Utterance                       Core Fringe ID
> 1  a baby                             1      0  1
> 2  small                              1      0  2
> 3  yes                                1      0  3
> 4  where's his bed                    1      1  4
> 5  there's his bed                    1      1  5
> 6  where's his pillow                 1      1  6
> 7  what is that on his head           1      0  7
> 8  hey he has his arm stuck here      1      1  8
> 9  there there's it                   1      0  9
> 10 now you're gonna go night-night    1      1 10
>
> For example, in line 1 of the output, 'a' is a core word so '1' for core
> is correct. However, 'baby' should be picked up as fringe so there should
> be '1', not '0', for fringe. Lines 7 and 9 also have words that should be
> identified as fringe but are not.
>
> Additionally, it seems like if the utterance has parts of a core word in
> it, it's being counted. For example, 'small' is identified as a core word
> even though it's not (but 'all done' is a core word). 'Where's his bed' is
> identified as core and fringe, although none of the words are core.
>
> Any suggestions on what is happening and how to correct it are greatly
> appreciated.
>
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
>
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
>
> and provide commented, minimal, self-contained, reproducible code.
>
>
Debbie Hahs-Vaughn
2021-Jun-15 00:13 UTC
[R] Identifying words from a list and code as 0 or 1 and words NOT on the list code as 1
This also seems to work beautifully! I'm all for having multiple approaches
so appreciate the time you took to do this and am particularly appreciative of
the annotation on the script. That definitely helps clarify what's
happening and am sure that will be helpful to others working on similar tasks as
well. Thanks again, very much!
Bert Gunter
2021-Jun-15 00:57 UTC
[R] Identifying words from a list and code as 0 or 1 and words NOT on the list code as 1
I was not going to continue this thread, as you had Rui's presumed solution
in hand. But now that you have responded, I need to warn you that actually my
solution -- and perhaps Rui's -- may be incorrect, though you'll have to check
carefully to see if you also have the error. The problem stems from this note
in the regex man page:

"The symbol \b matches the empty string at either edge of a word, and \B
matches the empty string provided it is not at an edge of a word. (The
interpretation of 'word' depends on the locale and implementation: these are
all extensions.)"

This vague "interpretation" bit is, of course, quite annoying, though probably
unavoidable given the vagaries of language. The problem can be seen in this
little example:

> gsub("\\bit\\b", "", c("it's a", "this is it"))
[1] "'s a"     "this is "

What's going on here is that in *my particular implementation* (maybe not
yours or Rui's) of the regex engine, the apostrophe in "it's" is seen as a
word delineator, so the "it" in "it's" is removed, which is *not* what needs
to happen. Defining a word to be text that is preceded by either a space or
line beginning and followed by a space or line end seems to fix the problem:

> gsub("( |^)it( |$)", "", c("it's a", "this is it"))
[1] "it's a"  "this is"

But you'll need to **check this carefully** with good examples.
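
One caveat worth checking when this space-delimited pattern is extended to a whole word list: the pattern consumes the spaces it matches, so when two core words sit next to each other the second one appears to have no leading space left to match and is left behind. A possible variant, not taken from this thread, is to use zero-width lookarounds with perl = TRUE so that the delimiting spaces are tested but not consumed. The sketch below uses toy data, assumes the core words contain no regex metacharacters, and sorts the phrases longest first because PCRE alternation takes the first alternative that matches rather than the longest.

```r
# Toy data, assumed for illustration only
coreWords <- c("what", "is", "that", "it", "all done")
ut <- c("what is that", "it's a", "this is it", "now we are all done", "small")

# Space-consuming pattern applied to two adjacent core words:
# the second word is expected to survive because its leading space is gone
consumed <- gsub("( |^)is( |$)|( |^)it( |$)", "", "is it")

# Lookaround variant: a match must not touch a non-space character on either side
ordered <- coreWords[order(nchar(coreWords), decreasing = TRUE)]
pat <- paste0("(?<![^ ])(", paste(ordered, collapse = "|"), ")(?![^ ])")
out <- gsub(pat, "", ut, perl = TRUE)

Core <- nchar(out) < nchar(ut)   # something was removed
Fringe <- grepl("[^ ]", out)     # anything other than spaces is left over
data.frame(ut, Core, Fringe)
```

As with the patterns in this message, this treats only spaces and the ends of the string as word delimiters, so punctuation attached to a word keeps it from matching.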

Here's how a slightly modified version of what I gave you previously, using
the space/line-edge correction above, works on your example on my R setup
(with an extra "now we are all done" added to test the "all done" core
phrase):

> ut <- c(Utterance, "now we are all done") ## for testing
>
> out <- gsub(paste("( |^)",coreWords,"( |$)", sep = "", collapse = "|"),
+             "", ut)
> Core <- nchar(out) < nchar(ut)
> Fringe <- grepl("[[:alpha:]]", out)
> Core
 [1]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[14] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
[27]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
> Fringe
 [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[14]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[27]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

Sorry for the confusion, but this sort of thing can be tricky. If Rui's
solution does not suffer these problems, it should be preferred.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

On Mon, Jun 14, 2021 at 5:13 PM Debbie Hahs-Vaughn <debbie at ucf.edu> wrote:

> This also seems to work beautifully! I'm all for having multiple
> approaches so appreciate the time you took to do this and am particularly
> appreciative of the annotation on the script. That definitely helps
> clarify what's happening and am sure that will be helpful to others working
> on similar tasks as well. Thanks again, very much!
> > > > > ------------------------------ > *From:* Bert Gunter <bgunter.4567 at gmail.com> > *Sent:* Friday, June 11, 2021 7:10 PM > *To:* Debbie Hahs-Vaughn <debbie at ucf.edu>; Rui Barradas < > ruipbarradas at sapo.pt> > *Cc:* r-help at R-project.org <r-help at r-project.org> > *Subject:* Re: [R] Identifying words from a list and code as 0 or 1 and > words NOT on the list code as 1 > > First, if Rui's solution works for you, I recommend that you stop reading > and discard this email. Why bother wasting time with stuff you don't need?! > > If it doesn't work or if you would like another approach -- perhaps as a > check -- then read on. > > Warning: I am a dinosaur and just use base R functionality , including > regular expressions, for these sorts of relatively simple tasks. I also > eschew pipes. So my code for your example is simply: > > matchpat <- paste("\\b",coreWords, "\\b", sep = "",collapse = "|") > out <- gsub(matchpat,"",Utterance) > Core <- nchar(out) != nchar(Utterance) > Fringe <- nchar(gsub(" +","",out)) > 0 > > Note that I have given the results as logical TRUE or FALSE. If you insist > on 1's and 0's, just instead do: > Core <- (nchar(out) != nchar(Utterance)) + 0 > Fringe <- sign(nchar(gsub(" +","",out))) > > > Now for an explanation. My approach was simply to create a regular > expression (regex) match pattern that would match any of your words or > phrases. The matchpat assignment does this just by logically "or"ing (with > the"|" symbol) together all your words and phrases, each of which is > surrounded by the edge of word symbol, "\\b" (so only whole words or > phrases are matched). This is standard regex stuff, and I could do it > rather handily with r's paste() function. One word of caution, though: R's > ?regex says: > "Long regular expression patterns may or may not be accepted: the POSIX > standard only requires up to 256 bytes." So what works for your reprex > might not work for your full list of coreWords. 
It is possible to work > around this by repeatedly applying subsets of your coreWords **provided** > you make sure that you order these subsets by the number of words in each > coreWord phrase. That is, bigger phrases must be applied first before > applying smaller phrases/words to the results. This is not hard to do, but > adds complexity, and may not be necessary. See below for an explanation. > > What the second line of code does is to use the gsub() function to remove > all matches to matchpat -- which, via the "|" construction -- is anything > in your coreWord list. So this means that if you have any matches in an > utterance, what remains after gsubbing will be shorter -- fewer characters > -- than the original utterance. The Core assignment checks for this using > the nchar() function and returns TRUE or FALSE as appropriate. If all the > words in the utterance matched code words, you would be left with nothing > or a bunch of spaces. The Fringe assignment just first removes all spaces > via the gsub() and then returns TRUE if there's nothing (0 characters) left > or FALSE if there still are some left. > > Finally, why do you have to start with longer phrases first if you have to > do this sequentially? Suppose you have the phrases "good night" in your > phrase list, and also the word "night". If you have to do things > sequentially instead of as one swell foop, if you applied the gsub() with a > bunch including only "night" first, then "night" will be removed and "good" > will be left. Then when the bunch containing "good night" is gsubbed after, > it won't see the whole phrase any more and "good" will be left in, which is > *not* what you said you wanted. > > Finally,it is of course possible to do these things by sequentially > applying one word/phrase at a time in a loop (again, longest phrases first > for the same reason as above), but I believe this might take quite a while > with a big list of coreWords (and Utterances). 
The above approach using "|" > vectorizes things and takes advantage of the power of the regex engine, so > I think it will be more efficient **if it's accepted.** But if you run > into the problem of pattern length limitations, then sequentially, one at a > time, might be simpler. My judgments of computational efficiency are often > wrong anyway. > > Note: I think my approach works, but I would appreciate an on-list > response if I have erred. Also, even if correct, alternative cleverer > approaches are always welcome. > > Cheers, > Bert > > > Bert Gunter > > "The trouble with having an open mind is that people keep coming along and > sticking things into it." > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) > > > On Fri, Jun 11, 2021 at 11:54 AM Debbie Hahs-Vaughn <debbie at ucf.edu> > wrote: > > Thank you for noting this. The utterance has to match the exact phrase > (e.g., "all done") for it to constitute a match in the utterance. > > > ------------------------------ > *From:* Bert Gunter <bgunter.4567 at gmail.com> > *Sent:* Friday, June 11, 2021 2:42 PM > *To:* Debbie Hahs-Vaughn <debbie at ucf.edu> > *Cc:* r-help at R-project.org <r-help at r-project.org> > *Subject:* Re: [R] Identifying words from a list and code as 0 or 1 and > words NOT on the list code as 1 > > Note that your specification is ambiguous. "all done" is not a single word > -- it's a phrase. So what do you want to do if: > > 1) "all" and/or "done" are also among your core words? > 2) "I'm all done" is another of your core phrases. > > The existence of phrases in your core list allows such conflicts to arise. > Do you claim that phrases would be chosen so that this can never happen? -- > or what is your specification if they can (what constitutes a match and in > what priority)? > > Bert Gunter > > "The trouble with having an open mind is that people keep coming along and > sticking things into it." 
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) > > > On Fri, Jun 11, 2021 at 10:06 AM Debbie Hahs-Vaughn <debbie at ucf.edu> > wrote: > > I am working with utterances, statements spoken by children. From each > utterance, if one or more words in the statement match a predefined list of > multiple 'core' words (probably 300 words), then I want to input '1' into > 'Core' (and if none, then input '0' into 'Core'). > > If there are one or more words in the statement that are NOT core words, > then I want to input '1' into 'Fringe' (and if there are only core words > and nothing extra, then input '0' into 'Fringe'). I will not have a list > of Fringe words. > > Basically, right now I have a child ID and only the utterances. Here is a > snippet of my data. > > ID Utterance > 1 a baby > 2 small > 3 yes > 4 where's his bed > 5 there's his bed > 6 where's his pillow > 7 what is that on his head > 8 hey he has his arm stuck here > 9 there there's it > 10 now you're gonna go night-night > 11 and that's the thing you can turn on > 12 yeah where's the music box > 13 what is this > 14 small > 15 there you go baby > > > The following code runs but isn't doing exactly what I need--which is: 1) > the ability to detect words from the list and define as core; 2) the > ability to search the utterance and if there are any words in the utterance > that are NOT core, to identify those as ?1? as I will not have a list of > fringe words. 
>
> ```
> library(dplyr)
> library(stringr)
> library(tidyr)
>
> coreWords <- c("I", "no", "yes", "my", "the", "want", "is", "it", "that",
>                "a", "go", "mine", "you", "what", "on", "in", "here", "more",
>                "out", "off", "some", "help", "all done", "finished")
>
> str_detect(df,)
>
> dfplus <- df %>%
>   mutate(id = row_number()) %>%
>   separate_rows(Utterance, sep = ' ') %>%
>   mutate(Core = + str_detect(Utterance, str_c(coreWords, collapse = '|')),
>          Fringe = + !Core) %>%
>   group_by(id) %>%
>   mutate(Core = + (sum(Core) > 0),
>          Fringe = + (sum(Fringe) > 0)) %>%
>   slice(1) %>%
>   select(-Utterance) %>%
>   left_join(df) %>%
>   ungroup() %>%
>   select(Utterance, Core, Fringe, ID)
> ```
>
> The dput() code is:
>
> ```
> structure(list(Utterance = c("a baby", "small", "yes", "where's his bed",
> "there's his bed", "where's his pillow", "what is that on his head",
> "hey he has his arm stuck here", "there there's it",
> "now you're gonna go night-night",
> "and that's the thing you can turn on", "yeah where's the music box",
> "what is this", "small", "there you go baby ", "what is this for ",
> "a ", "and the go goodnight here ", "and what is this ",
> " what's that sound ",
> "what does she say ", "what she say",
> "should I turn the on so Laura doesn't cry ",
> "what is this ", "what is that ", "where's clothes ",
> " where's the baby's bedroom ",
> "that might be in dad's bed+room ", "yes ", "there you go baby ",
> "you're welcome "), Core = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L, 1L, 1L, 1L, 1L), Fringe = c(0L, 0L, 0L, 1L, 1L, 1L,
> 0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), ID = 1:31), row.names = c(NA,
> -31L), class = c("tbl_df", "tbl", "data.frame"))
> ```
>
> The first 10 rows of output looks like this:
>
>    Utterance                        Core Fringe ID
> 1  a baby                              1      0  1
> 2  small                               1      0  2
> 3  yes                                 1      0  3
> 4  where's his bed                     1      1  4
> 5  there's his bed                     1      1  5
> 6  where's his pillow                  1      1  6
> 7  what is that on his head            1      0  7
> 8  hey he has his arm stuck here       1      1  8
> 9  there there's it                    1      0  9
> 10 now you're gonna go night-night     1      1 10
>
> For example, in line 1 of the output, 'a' is a core word so '1' for core
> is correct. However, 'baby' should be picked up as fringe so there should
> be '1', not '0', for fringe. Lines 7 and 9 also have words that should be
> identified as fringe but are not.
>
> Additionally, it seems like if the utterance has parts of a core word in
> it, it's being counted. For example, 'small' is identified as a core word
> even though it's not (but 'all done' is a core word). 'Where's his bed' is
> identified as core and fringe, although none of the words are core.
>
> Any suggestions on what is happening and how to correct it are greatly
> appreciated.
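The substring matches described in the original question ('small' and 'where's his bed' being flagged as core) can be reproduced in base R with a small sketch. The shortened `coreWords` vector and the names `loose`, `strict`, and `test` are made up for this illustration, and `perl = TRUE` is a choice made here, not something taken from the thread. The sketch compares a pattern built by plain "|" collapsing with one where each word is wrapped in "\\b":

```r
# Shortened, illustrative core list (not the full list from the question)
coreWords <- c("a", "all done", "is", "it")

# Plain alternation: matches the core words anywhere, even inside other words
loose <- paste(coreWords, collapse = "|")

# Alternation with word boundaries: only whole words or phrases match
strict <- paste0("\\b", coreWords, "\\b", collapse = "|")

test <- c("small", "where's his bed", "all done")

loose_hit  <- grepl(loose,  test, perl = TRUE)
strict_hit <- grepl(strict, test, perl = TRUE)

loose_hit   # TRUE TRUE TRUE   ("a" inside "small", "is" inside "his")
strict_hit  # FALSE FALSE TRUE
```

With the boundaries in place, "small" and "where's his bed" no longer count as core, while the exact phrase "all done" still does.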
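The one-at-a-time alternative mentioned at the top of this quoted message might look roughly like the following. This is only a sketch under stated assumptions: `strip_core` is a hypothetical helper name, the short `coreWords` and `Utterance` vectors are invented for the illustration (including "all" alongside "all done" to show why ordering matters), and `perl = TRUE` is a choice made here. Phrases are ordered by their number of words so that longer phrases are removed before shorter ones:

```r
# Hypothetical helper: remove core words/phrases one at a time,
# applying phrases with more words first.
strip_core <- function(utterance, coreWords) {
  n_words <- lengths(strsplit(coreWords, " +"))
  ordered <- coreWords[order(n_words, decreasing = TRUE)]
  out <- utterance
  for (w in ordered) {
    # Each pattern is a single short word or phrase with word boundaries
    out <- gsub(paste0("\\b", w, "\\b"), "", out, perl = TRUE)
  }
  out
}

# Illustrative data, not the poster's full list
coreWords <- c("I", "no", "yes", "all", "all done", "a", "go", "you")
Utterance <- c("a baby", "small", "yes", "all done", "there you go baby ")

out <- strip_core(Utterance, coreWords)

# Same coding as in the vectorized version: something was removed -> Core;
# something other than spaces is left -> Fringe
Core   <- nchar(out) != nchar(Utterance)
Fringe <- nchar(gsub(" +", "", out)) > 0

Core    # TRUE FALSE TRUE TRUE TRUE
Fringe  # TRUE TRUE FALSE FALSE TRUE
```

If "all" were applied before "all done" in this example, "all done" would leave " done" behind and be coded as Fringe, which is why the longer phrases go first.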