Debbie Hahs-Vaughn
2021-Jun-11 17:02 UTC
[R] Identifying words from a list and code as 0 or 1 and words NOT on the list code as 1
I am working with utterances, statements spoken by children. From each utterance, if one or more words in the statement match a predefined list of multiple 'core' words (probably 300 words), then I want to input '1' into 'Core' (and if none, then input '0' into 'Core').

If there are one or more words in the statement that are NOT core words, then I want to input '1' into 'Fringe' (and if there are only core words and nothing extra, then input '0' into 'Fringe'). I will not have a list of Fringe words.

Basically, right now I have a child ID and only the utterances. Here is a snippet of my data.

ID  Utterance
1   a baby
2   small
3   yes
4   where's his bed
5   there's his bed
6   where's his pillow
7   what is that on his head
8   hey he has his arm stuck here
9   there there's it
10  now you're gonna go night-night
11  and that's the thing you can turn on
12  yeah where's the music box
13  what is this
14  small
15  there you go baby

The following code runs but isn't doing exactly what I need, which is: 1) the ability to detect words from the list and define as core; 2) the ability to search the utterance and if there are any words in the utterance that are NOT core, to identify those as '1', as I will not have a list of fringe words.
```
library(dplyr)
library(stringr)
library(tidyr)

coreWords <- c("I", "no", "yes", "my", "the", "want", "is", "it", "that", "a", "go", "mine", "you", "what", "on", "in", "here", "more", "out", "off", "some", "help", "all done", "finished")

str_detect(df,)

dfplus <- df %>%
  mutate(id = row_number()) %>%
  separate_rows(Utterance, sep = ' ') %>%
  mutate(Core = + str_detect(Utterance, str_c(coreWords, collapse = '|')),
         Fringe = + !Core) %>%
  group_by(id) %>%
  mutate(Core = + (sum(Core) > 0),
         Fringe = + (sum(Fringe) > 0)) %>%
  slice(1) %>%
  select(-Utterance) %>%
  left_join(df) %>%
  ungroup() %>%
  select(Utterance, Core, Fringe, ID)
```

The dput() code is:

```
structure(list(Utterance = c("a baby", "small", "yes", "where's his bed",
"there's his bed", "where's his pillow", "what is that on his head",
"hey he has his arm stuck here", "there there's it", "now you're gonna go night-night",
"and that's the thing you can turn on", "yeah where's the music box",
"what is this", "small", "there you go baby ", "what is this for ",
"a ", "and the go goodnight here ", "and what is this ", " what's that sound ",
"what does she say ", "what she say", "should I turn the on so Laura doesn't cry ",
"what is this ", "what is that ", "where's clothes ", " where's the baby's bedroom ",
"that might be in dad's bed+room ", "yes ", "there you go baby ",
"you're welcome "), Core = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), Fringe = c(0L, 0L, 0L, 1L, 1L, 1L,
0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), ID = 1:31), row.names = c(NA,
-31L), class = c("tbl_df", "tbl", "data.frame"))
```

The first 10 rows of output look like this:

    Utterance                        Core  Fringe  ID
1   a baby                           1     0       1
2   small                            1     0       2
3   yes                              1     0       3
4   where's his bed                  1     1       4
5   there's his bed                  1     1       5
6   where's his pillow               1     1       6
7   what is that on his head         1     0       7
8   hey he has his arm stuck here    1     1       8
9   there there's it                 1     0       9
10  now you're gonna go night-night  1     1       10

For example, in line 1 of the output, 'a' is a core word so '1' for core is correct. However, 'baby' should be picked up as fringe so there should be '1', not '0', for fringe. Lines 7 and 9 also have words that should be identified as fringe but are not.

Additionally, it seems like if the utterance has parts of a core word in it, it's being counted. For example, 'small' is identified as a core word even though it's not (but 'all done' is a core word). 'Where's his bed' is identified as core and fringe, although none of the words are core.

Any suggestions on what is happening and how to correct it are greatly appreciated.
Rui Barradas
2021-Jun-11 18:03 UTC
[R] Identifying words from a list and code as 0 or 1 and words NOT on the list code as 1
Hello,

From what I understood of the problem, this might be what you want.

```
library(dplyr)
library(stringr)

# coreWords and df are the objects defined in the original post.
# Wrap every core word in word boundaries so that, e.g., "a" no longer matches inside "small".
coreWordsPat <- paste0("\\b", coreWords, "\\b")
coreWordsPat <- paste(coreWordsPat, collapse = "|")

left_join(
  # Core: 1 if at least one core word occurs as a whole word
  df %>%
    mutate(Core = +str_detect(Utterance, coreWordsPat)) %>%
    select(ID, Utterance, Core),
  # Fringe: 1 if anything is left after removing every core word
  df %>%
    mutate(Fringe = str_remove_all(Utterance, coreWordsPat),
           Fringe = +(nchar(trimws(Fringe)) > 0)) %>%
    select(ID, Fringe),
  by = "ID"
)
```

Hope this helps,

Rui Barradas
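To make the effect of the `\\b` anchors concrete, here is a minimal base-R sketch. It is an illustration only: `core_words` is a shortened, hypothetical copy of the list from the question, `utterances` is a four-row sample, and `grepl()`/`gsub()` with `perl = TRUE` stand in for the stringr calls so the block runs without extra packages.

```r
# Shortened, hypothetical copy of the core list from the question.
core_words <- c("I", "no", "yes", "is", "it", "that", "a", "go", "all done")
utterances <- c("small", "where's his bed", "a baby", "yes")

# Unanchored alternation: any substring counts as a hit
# ("a" inside "small", "is" inside "his").
loose <- paste(core_words, collapse = "|")

# Anchored alternation: each entry must begin and end at a word boundary.
strict <- paste0("\\b", core_words, "\\b", collapse = "|")

loose_hit  <- grepl(loose,  utterances, perl = TRUE)
strict_hit <- grepl(strict, utterances, perl = TRUE)

# Fringe: delete every whole-word core match; anything left over is fringe.
leftover <- trimws(gsub(strict, "", utterances, perl = TRUE))
fringe   <- as.integer(nchar(leftover) > 0)

print(data.frame(utterances, loose_hit, strict_hit, fringe))
```

With the unanchored pattern all four utterances count as core, which matches the symptom reported in the question. With the anchored pattern only "a baby" and "yes" do, and "a baby" is also flagged as fringe because "baby" survives the removal step. Note that matching here is case-sensitive, so "I" only matches an uppercase "I".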
Bert Gunter
2021-Jun-11 18:42 UTC
[R] Identifying words from a list and code as 0 or 1 and words NOT on the list code as 1
Note that your specification is ambiguous. "all done" is not a single word -- it's a phrase. So what do you want to do if:

1) "all" and/or "done" are also among your core words?
2) "I'm all done" is another of your core phrases?

The existence of phrases in your core list allows such conflicts to arise. Do you claim that phrases would be chosen so that this can never happen? Or what is your specification if they can (what constitutes a match and in what priority)?

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
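The priority question can be made concrete with a small sketch. The rule used here, that longer core entries win over the shorter entries they contain, is only one possible answer and is an assumption, not something the thread settles. `classify_utterance()`, its `longest_first` argument, and the three-entry `core` vector are hypothetical names introduced for illustration, and the sketch assumes core entries contain no regex metacharacters.

```r
# Classify one utterance as Core/Fringe against a list of core words and phrases.
# Assumed rule: when longest_first is TRUE, long entries are tried before short ones.
classify_utterance <- function(utterance, core, longest_first = TRUE) {
  if (longest_first) {
    core <- core[order(nchar(core), decreasing = TRUE)]
  }
  # Alternatives are tried left to right at each position, so order sets priority.
  pattern  <- paste0("\\b(", paste(core, collapse = "|"), ")\\b")
  text     <- trimws(utterance)
  has_core <- grepl(pattern, text, perl = TRUE)
  # Whatever survives removal of all core matches is fringe material.
  leftover <- trimws(gsub(pattern, " ", text, perl = TRUE))
  c(Core = as.integer(has_core), Fringe = as.integer(nchar(leftover) > 0))
}

# Hypothetical core list in which a word and a phrase overlap.
core <- c("all", "all done", "a")

print(classify_utterance("all done", core))                         # phrase consumed whole
print(classify_utterance("all done", core, longest_first = FALSE))  # "done" is left over
print(classify_utterance("all done baby", core))
```

Under the longest-first rule "all done" is pure core (Fringe 0). With the list order as given, "all" matches first and the leftover "done" is counted as fringe, which is exactly the kind of conflict raised above. Which behaviour is right depends on the specification.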