Renato Medei
2015-Feb-10 14:52 UTC
[R] Problems with tm package, removeWords and transformations
Dear all,

I'm sorry, but like all newbies I have a lot of problems to solve.
I'm using R 3.1.2 under OS X 10.10.2.
I'm working with tm to analyze some tweets, and I get some strange errors when I try to remove stopwords (see Error 1 below), transform content (see Error 2 below) and create a document-term matrix (see Error 3 below).
Could anyone help me?

Error 1

> tweets = searchTwitter("rimini", n=1000)
> tweets = sapply(tweets, function(x) x$getText())
> tweets_corpus = Corpus(VectorSource(tweets))
> toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "RT |via ")
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "@[^\\s]+")
> tweets_corpus <- tm_map(tweets_corpus, removeNumbers)
> tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
> tweets_corpus <- tm_map(tweets_corpus, removeWords, c("rimini", "Rimini", "Riviera", "riviera"))
> tweets_corpus <- tm_map(tweets_corpus, stopwords("italian"))
Warning message:
In mclapply(content(x), FUN, ...) :
  all scheduled cores encountered errors in user code

Error 2

> tweets = searchTwitter("rimini", n=1000)
> tweets = sapply(tweets, function(x) x$getText())
> tweets_corpus = Corpus(VectorSource(tweets))
> tweets_corpus
<<VCorpus (documents: 1000, metadata (corpus/indexed): 0/0)>>
> toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "RT |via ")
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "@[^\\s]+")
> tweets_corpus <- tm_map(tweets_corpus, removeNumbers)
> tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
> tweets_corpus <- tm_map(tweets_corpus, removeWords, c("rimini", "Rimini", "Riviera", "riviera"))
> tweets_corpus <- tm_map(tweets_corpus, content_transformer(tolower))
Warning message:
In mclapply(content(x), FUN, ...) :
  all scheduled cores encountered errors in user code

Error 3

> tweets = searchTwitter("rimini", n=1000)
> tweets = sapply(tweets, function(x) x$getText())
> tweets_corpus = Corpus(VectorSource(tweets))
> tweets_corpus
<<VCorpus (documents: 1000, metadata (corpus/indexed): 0/0)>>
> toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "RT |via ")
> tweets_corpus <- tm_map(tweets_corpus, toSpace, "@[^\\s]+")
> tweets_corpus <- tm_map(tweets_corpus, removeNumbers)
> tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
> tweets_corpus <- tm_map(tweets_corpus, removeWords, c("rimini", "Rimini", "Riviera", "riviera"))
> dtm <- DocumentTermMatrix(tweets_corpus)
Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
  'i, j, v' different lengths
In addition: Warning messages:
1: In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code
2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
  NAs introduced by coercion

Thank you for your help
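One detail worth noting in Error 1: stopwords("italian") returns a plain character vector, not a transformation, so tm_map(tweets_corpus, stopwords("italian")) asks tm to apply a character vector as if it were a function. A minimal sketch of the usual tm 0.6 form, assuming tweets_corpus is the corpus built in the transcript above, passes the vector to removeWords instead:

## Sketch only: remove Italian stopwords by handing the stopword vector to
## removeWords, rather than using the vector itself as the mapping function.
library(tm)
tweets_corpus <- tm_map(tweets_corpus, removeWords, stopwords("italian"))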
Amos B. Elberg
2015-Feb-10 19:08 UTC
[R] Problems with tm package, removeWords and transformations
Trying to use tm to analyze tweets, you're going to run into a long stream of issues like the ones you found, which generally relate to text formatting. I worked through them over the past few months for a project. If you email me offline I'll try to help and share some example code.

> On Feb 10, 2015, at 9:52 AM, Renato Medei <medei.ren at gmail.com> wrote:
> [...]
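Not the example code offered above, just a sketch of two workarounds that are commonly suggested when tm's transformations fail on tweet text: normalise the encoding before building the corpus (emoji and other non-ASCII characters are a frequent trigger of these mclapply errors on OS X), and drop to a single core so the real error message is shown instead of the generic "all scheduled cores encountered errors in user code" warning. It assumes tweets is the character vector extracted with searchTwitter(), as in the original post.

## Sketch under the assumptions above -- adjust the encoding handling to taste.
library(tm)

## Replace characters that cannot be represented in ASCII (emoji, smart quotes,
## etc.) with a space; such characters often break tm's transformations.
tweets <- iconv(tweets, to = "ASCII", sub = " ")

tweets_corpus <- Corpus(VectorSource(tweets))

## Force mclapply onto a single core so any error in a transformation
## propagates with its original message instead of the generic parallel warning.
options(mc.cores = 1)

tweets_corpus <- tm_map(tweets_corpus, content_transformer(tolower))
tweets_corpus <- tm_map(tweets_corpus, removeWords, stopwords("italian"))
dtm <- DocumentTermMatrix(tweets_corpus)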