Nathan Parsons
2018-Oct-16 21:39 UTC
[R] Matching multiple search criteria (Unlisting a nested dataset, take 2)
Thanks all for your patience. Here's a second go that perhaps better
explains what I am trying to accomplish (and hopefully in plain text
form)...
I'm using the following packages: tidyverse, purrr, tidytext
I have a number of tweets in the following form:
th <- structure(list(
  status_id = c("x1047841705729306624", "x1046966595610927105",
                "x1047094786610552832", "x1046988542818308097",
                "x1046934493553221632", "x1047227442899775488"),
  created_at = c("2018-10-04T13:31:45Z", "2018-10-02T03:34:22Z",
                 "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z",
                 "2018-10-02T01:26:49Z", "2018-10-02T20:50:53Z"),
  text = c("Technique is everything with olympic lifts ! @ Body By John https://t.co/UsfR6DafZt",
           "@Subtronics just went back and rewatched ur FBlice with ur CDJs and let me tell you man. You are the fucking messiah",
           "@ic4rus1 Opportunistic means short-game. As in getting drunk now vs. not being hung over tomorrow vs. not fucking up your life ten years later.",
           "I tend to think about my dreams before I sleep.",
           "@MichaelAvenatti @SenatorCollins So, if your client was in her 20s, attending parties with teenagers, doesn't that make her at the least immature as hell, or at the worst, a pedophile and a person contributing to the delinquency of minors?",
           "i wish i could take credit for this"),
  lat = c(43.6835853, 40.284123, 37.7706565, 40.431389, 31.1688935, 33.9376735),
  lng = c(-70.3284118, -83.078589, -122.4359785, -79.9806895, -100.0768885, -118.130426),
  county_name = c("Cumberland County", "Delaware County", "San Francisco County",
                  "Allegheny County", "Concho County", "Los Angeles County"),
  fips = c(23005L, 39041L, 6075L, 42003L, 48095L, 6037L),
  state_name = c("Maine", "Ohio", "California", "Pennsylvania", "Texas", "California"),
  state_abb = c("ME", "OH", "CA", "PA", "TX", "CA"),
  urban_level = c("Medium Metro", "Large Fringe Metro", "Large Central Metro",
                  "Large Central Metro", "NonCore (Nonmetro)", "Large Central Metro"),
  urban_code = c(3L, 2L, 1L, 1L, 6L, 1L),
  population = c(277308L, 184029L, 830781L, 1160433L, 4160L, 9509611L)
), class = c("data.table", "data.frame"), row.names = c(NA, -6L))
I also have a number of search terms in the following form:
st <- structure(list(
  terms = c("me abused depressed", "me hurt depressed",
            "feel hopeless depressed", "feel alone depressed",
            "i feel helpless", "i feel worthless")
), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
I am trying to isolate the tweets that contain all of the words in each of
the search terms, i.e. "me", "abused", and "depressed" from the first
example search term, but the words do not have to be in order or even next
to one another.
I am familiar with the dplyr suite of tools and have been attempting to
write some sort of filter() to do this. I am not very familiar with purrr,
but perhaps there is a solution using the map functions? I have also
explored the tidytext unnest_tokens() function, which transforms the 'th'
data in the following way:
> tidytext::unnest_tokens(th, word, text, token = "tweets") -> tt
> head(tt)
status_id created_at lat lng
1: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
2: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
3: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
4: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
5: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
6: x1047841705729306624 2018-10-04T13:31:45Z 43.68359 -70.32841
county_name fips state_name state_abb urban_level urban_code
1: Cumberland County 23005 Maine ME Medium Metro 3
2: Cumberland County 23005 Maine ME Medium Metro 3
3: Cumberland County 23005 Maine ME Medium Metro 3
4: Cumberland County 23005 Maine ME Medium Metro 3
5: Cumberland County 23005 Maine ME Medium Metro 3
6: Cumberland County 23005 Maine ME Medium Metro 3
population word
1: 277308 technique
2: 277308 is
3: 277308 everything
4: 277308 with
5: 277308 olympic
6: 277308 lifts
but once I have unnested the tokens, I am unable to recombine them back
into tweets.
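(By "recombine" I mean something like the following -- a small, untested
sketch that pastes the one-token-per-row tt data back into one row per
tweet, though it keeps only status_id and the rebuilt text:)

library(dplyr)

# Collapse the unnested tokens back to one row per tweet by pasting
# the words together in their original order
tt %>%
  group_by(status_id) %>%
  summarise(text = paste(word, collapse = " "))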
Ideally the end result would append a new column to the 'th' data that
flags a tweet containing all of the search words for any of the search
terms; so the workflow would look like this (a rough sketch of what I have
in mind follows the list):
1) look for all of the search words for one search term in a tweet
2) if all of the search words in the search term are found, create a flag
(mutate(flag = 1) or some such)
3) do this for all of the tweets
4) move on to the next search term and repeat
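Here is a rough, untested sketch of that loop using dplyr and stringr (the
helper names has_all_words and matches_any_term are just illustrative, and
splitting the tweet on non-letter characters is a crude stand-in for real
tokenization):

library(dplyr)
library(stringr)

# TRUE if every word of one search term appears somewhere among the
# tweet's words (order and adjacency do not matter)
has_all_words <- function(term_words, tweet) {
  tweet_words <- str_split(str_to_lower(tweet), "[^a-z']+")[[1]]
  all(term_words %in% tweet_words)
}

# one character vector of words per search term
term_word_list <- str_split(st$terms, " ")

# TRUE if the tweet matches at least one of the search terms
matches_any_term <- function(tweet) {
  any(vapply(term_word_list, has_all_words, logical(1), tweet = tweet))
}

th <- mutate(th, flag = vapply(text, matches_any_term, logical(1)))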
Again, my thanks for your patience.
--
Nate Parsons
Pronouns: He, Him, His
Graduate Teaching Assistant
Department of Sociology
Portland State University
Portland, Oregon
503-725-9025
503-725-3957 FAX
Nathan Parsons
2018-Oct-16 21:46 UTC
[R] Matching multiple search criteria (Unlisting a nested dataset, take 2)
Argh! Here are those two example datasets as data frames (not tibbles).
Sorry again. This apparently is just not my day.
th <- structure(list(
  status_id = c("x1047841705729306624", "x1046966595610927105",
                "x1047094786610552832", "x1046988542818308097",
                "x1046934493553221632", "x1047227442899775488"),
  created_at = c("2018-10-04T13:31:45Z", "2018-10-02T03:34:22Z",
                 "2018-10-02T12:03:45Z", "2018-10-02T05:01:35Z",
                 "2018-10-02T01:26:49Z", "2018-10-02T20:50:53Z"),
  text = c("Technique is everything with olympic lifts ! @ Body By John https://t.co/UsfR6DafZt",
           "@Subtronics just went back and rewatched ur FBlice with ur CDJs and let me tell you man. You are the fucking messiah",
           "@ic4rus1 Opportunistic means short-game. As in getting drunk now vs. not being hung over tomorrow vs. not fucking up your life ten years later.",
           "I tend to think about my dreams before I sleep.",
           "@MichaelAvenatti @SenatorCollins So, if your client was in her 20s, attending parties with teenagers, doesn't that make her at the least immature as hell, or at the worst, a pedophile and a person contributing to the delinquency of minors?",
           "i wish i could take credit for this"),
  lat = c(43.6835853, 40.284123, 37.7706565, 40.431389, 31.1688935, 33.9376735),
  lng = c(-70.3284118, -83.078589, -122.4359785, -79.9806895, -100.0768885, -118.130426),
  county_name = c("Cumberland County", "Delaware County", "San Francisco County",
                  "Allegheny County", "Concho County", "Los Angeles County"),
  fips = c(23005L, 39041L, 6075L, 42003L, 48095L, 6037L),
  state_name = c("Maine", "Ohio", "California", "Pennsylvania", "Texas", "California"),
  state_abb = c("ME", "OH", "CA", "PA", "TX", "CA"),
  urban_level = c("Medium Metro", "Large Fringe Metro", "Large Central Metro",
                  "Large Central Metro", "NonCore (Nonmetro)", "Large Central Metro"),
  urban_code = c(3L, 2L, 1L, 1L, 6L, 1L),
  population = c(277308L, 184029L, 830781L, 1160433L, 4160L, 9509611L)
), class = "data.frame", row.names = c(NA, -6L))
st <- structure(list(
  terms = c("me abused depressed", "me hurt depressed",
            "feel hopeless depressed", "feel alone depressed",
            "i feel helpless", "i feel worthless")
), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
On Tue, Oct 16, 2018 at 2:39 PM Nathan Parsons <nathan.f.parsons at gmail.com> wrote:
> [earlier message quoted in full]
Bert Gunter
2018-Oct-16 22:03 UTC
[R] Matching multiple search criteria (Unlisting a nested dataset, take 2)
The problem wasn't the data tibbles. You posted in HTML -- which you were
explicitly warned against -- and that corrupted your text (e.g. some quotes
became "smart quotes", which cannot be properly cut and pasted into R).

Bert

On Tue, Oct 16, 2018 at 2:47 PM Nathan Parsons <nathan.f.parsons at gmail.com> wrote:
> [earlier message quoted in full]
Ista Zahn
2018-Oct-19 13:00 UTC
[R] Matching multiple search criteria (Unlisting a nested dataset, take 2)
Here is another approach, just for fun:
library(tidyverse)
library(tokenizers)

anyall <- function(x,     # a character vector
                   terms  # a list of character vectors
                   ) {
  any(map_lgl(terms, function(term) {
    all(term %in% x)
  }))
}

mutate(th,
       flag = map_lgl(tokenize_tweets(text),
                      anyall,
                      terms = tokenize_words(st$terms)))
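If it helps, here is a quick usage sketch of the same idea (the object name
th_flagged is just illustrative): since flag is logical, you can keep only
the matching tweets with a plain filter():

th_flagged <- mutate(th,
                     flag = map_lgl(tokenize_tweets(text),
                                    anyall,
                                    terms = tokenize_words(st$terms)))

# keep only the tweets that contain every word of at least one search term
filter(th_flagged, flag)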
Best,
Ista
On Tue, Oct 16, 2018 at 5:39 PM Nathan Parsons <nathan.f.parsons at gmail.com> wrote:
> [earlier message quoted in full]