Dear R team,

I was wondering if I could ask for your recommendations on a problem.

I have a data frame with 5 columns and 10,000 tweets recorded as rows. The columns include: numberofatweet (the tweet number), tweet (the actual text of the tweet), locations (where the tweet was sent from), and badwords (words that should not be used on Twitter). The badwords column is independent of the tweet number: only 80 of its rows are filled, with one word per cell.

My question is whether it is possible to select only the rows whose "tweet" column contains at least one of the words from the badwords column. I tried grep and grepl, but nothing seems to work.

Thank you in advance,
Vladimir
Hello,

Please use dput() to post a data example (see ?dput). Use something like the following, where 'dat' is the name of your data.frame.

dput(head(dat, 30))  # paste the output of this in a mail

Hope this helps,

Rui Barradas

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
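As an illustration of the dput() suggestion above, here is a minimal sketch. It assumes a small stand-in data frame named dat whose columns and values are invented for the example and are not the poster's data. It shows that the text printed by dput() can be evaluated again to rebuild the same data, which is why it is convenient to paste into a mail.

```r
# Hypothetical stand-in for the poster's data frame (values are invented).
dat <- data.frame(
  numberofatweet = 1:3,
  tweet = c("My cat is asleep", "My cat is glum", "My cat is flying"),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object; capture it as text here
# instead of copying it from the console.
txt <- paste(capture.output(dput(head(dat, 30))), collapse = "\n")

# Anyone who receives that text can rebuild the data by evaluating it.
dat2 <- eval(parse(text = txt))
print(dat2)
```

In practice the capture step is unnecessary: running dput(head(dat, 30)) at the console and copying the printed output is enough.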
I'm not quite sure if this is what you are looking for:

example.df <- data.frame(words = c("A T", "Z H", "B E", "C P H"),
                         badwords = c("A|I|J|H|K|L"))

# Extract the column with bad words
badwords <- example.df$badwords
badwords <- as.character(badwords[1])

# Subset the data.frame
subset(example.df, grepl(badwords, words))

As I understand your email, the badwords column contains all the bad words in each cell, so I assume they are separated somehow. In my example I use | because it is used to signify OR in grep. Since all elements of the badwords column are equal, I just take the first element, make sure it is a character string, and use grepl to subset the entire data.frame.

HTH
Ulrik
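One consequence of joining words with | is that grepl matches substrings, so a short word can also match inside a longer one. The sketch below is only an illustration with invented words and tweets. It assumes the bad words contain no regular-expression metacharacters, and it uses perl = TRUE with \b word boundaries as one possible way to restrict matching to whole words.

```r
# Invented example data, not taken from the thread.
words  <- c("ill", "sad")
tweets <- c("My cat is ill", "My cat is still here", "No problem")

# Plain alternation: matches the words anywhere, even inside other words.
loose <- paste(words, collapse = "|")
loose_hits <- grepl(loose, tweets)

# Alternation wrapped in word boundaries: matches whole words only.
strict <- paste0("\\b(", loose, ")\\b")
strict_hits <- grepl(strict, tweets, perl = TRUE)

print(data.frame(tweets, loose_hits, strict_hits))
```

Whether whole-word matching is wanted depends on the data; for a list of 80 words it may be worth checking a few matches by eye either way.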
Hi Vladimir,
Do you want something like this?

vdat<-read.table(text="numberoftweet,tweet,locations,badwords
1,My cat is asleep,London,glum
2,My cat is flying,Paris,dashed
3,My cat is dancing,Berlin,mopey
4,My cat is singing,Rome,ill
5,My cat is reading,Budapest,sad
6,My cat is eating,Amsterdam,annoyed
7,My cat is hiding,Copenhagen,crazy
8,My cat is fluffy,Vilnius,terrified
9,My cat is annoyed,Athens,sick
10,My cat is exercising,Ankara,mortified
11,My cat is dreaming,Kracow,irked
12,My cat is mopey,Vienna,uneasy
13,My cat is glum,Brussels,upset",
sep=",",header=TRUE,stringsAsFactors=FALSE)

# join all bad words into one pattern: word1|word2|...
badwords<-paste(vdat$badwords,collapse="|")

# the tweets that contain at least one bad word
names(unlist(sapply(vdat$tweet,grep,pattern=badwords)))

Jim
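Jim's code returns the text of the matching tweets. Since the original question asks for the matching rows, a possible variation is to index the data frame with grepl, which returns one TRUE/FALSE per tweet. This is a sketch on a shortened, made-up version of the example data, not a replacement for the code above.

```r
# Shortened stand-in for the example data (same column names as above).
vdat <- data.frame(
  numberoftweet = 1:4,
  tweet = c("My cat is asleep", "My cat is annoyed",
            "My cat is mopey", "My cat is flying"),
  locations = c("London", "Athens", "Vienna", "Paris"),
  badwords = c("glum", "mopey", "annoyed", "sad"),
  stringsAsFactors = FALSE
)

# One pattern that matches any of the bad words.
pattern <- paste(vdat$badwords, collapse = "|")

# Keep the complete rows whose tweet matches the pattern.
hits <- vdat[grepl(pattern, vdat$tweet), ]
print(hits)
```

The result keeps every column, so the tweet number and location of each match stay available.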
Hi Vladimir,
This may fix the NA problem:

vdat<-read.table(text="numberoftweet,tweet,locations,badwords
1,My cat is asleep,London,glum
2,My cat is flying,Paris,dashed
3,My cat is dancing,Berlin,mopey
4,My cat is singing,Rome,ill
5,My cat is reading,Budapest,sad
6,My cat is eating,Amsterdam,annoyed
7,My cat is hiding,Copenhagen,crazy
8,My cat is fluffy,Vilnius,terrified
9,My cat is annoyed,Athens,sick
10,My cat is exercising,Ankara,mortified
11,My cat is dreaming,Kracow,irked
12,My cat is mopey,Vienna,uneasy
13,My cat is glum,Brussels,upset
14,My cat is swinging,Madrid,
15,My cat is crazy,Ljubljana,",
sep=",",header=TRUE,stringsAsFactors=FALSE)

# turn empty cells into NA
vdat$badwords[!nchar(vdat$badwords)]<-NA

# build the pattern from the non-missing words only
badwords<-paste(vdat$badwords[!is.na(vdat$badwords)],collapse="|")

names(unlist(sapply(vdat$tweet,grep,pattern=badwords)))

Jim

On Sun, Aug 7, 2016 at 6:43 PM, <v.grabarnik at gmail.com> wrote:
> Hi Jim!
>
> That is exactly what I mean. Your example does the job I was looking for.
> Referring to your example, my badwords column is not filled in for all
> rows, as yours is. For example, it has only 10 values, but there are many
> more rows. When I introduce NA for the blanks and write
> badwords<-paste(vdat$badwords,collapse="|")
> it collapses all the values and produces something like: word|word|NA|NA
> and if I don't introduce NAs when reading the data, the outcome is still like:
> word|word|word|word||||||||||||||||
> and when I then run
> names(unlist(sapply(vdat$tweet,grep,pattern=badwords))) there is an error.
> I had this question before, but do you know by any chance how to pick out
> just the words in the badwords column, without including NAs or blanks?
>
> Thank you,
> Vladimir
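The cleaning step in the last reply can also be written so that blanks and NAs are dropped in a single filter, whichever of the two the column happens to contain. The sketch below uses an invented badwords vector; nzchar() is a base R function that is TRUE for non-empty strings, and because it also returns TRUE for NA by default, the NA check is kept explicitly.

```r
# Invented column with words, blanks and NAs mixed together.
badcol <- c("glum", "mopey", "", NA, "crazy", "")

# Keep only cells that are neither NA nor empty.
words <- badcol[!is.na(badcol) & nzchar(badcol)]

# Pattern built from the remaining words only: no "NA", no empty alternatives.
pattern <- paste(words, collapse = "|")

tweets <- c("My cat is glum", "My cat is swinging", "My cat is crazy")
flagged <- grepl(pattern, tweets)
print(data.frame(tweets, flagged))
```

With the pattern built this way there are no stray "|" characters or "NA" entries left to affect the match.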