Hi:

I am trying to construct a Document-Term Matrix from a corpus. The commands I used are:

> library(parallel)
> library(tm)
> library(RWeka)
> library(topicmodels)
> library(RTextTools)
> cl <- makeCluster(detectCores())
> invisible(clusterEvalQ(cl, library(tm)))
> invisible(clusterEvalQ(cl, library(RWeka)))
> invisible(clusterEvalQ(cl, library(topicmodels)))
> invisible(clusterEvalQ(cl, library(RTextTools)))
> myCorpus <- Corpus(DirSource("/home/neeph/Test/DMOZ_Business"), encoding = "UTF-8", readerControl = list(reader = readPlain))
> removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
> myCorpus <- tm_map(myCorpus, removeURL)
> removeAmp <- function(x) gsub("&", "", x)
> myCorpus <- tm_map(myCorpus, removeAmp)
> removeWWW <- function(x) gsub("www[[:alnum:]]*", "", x)
> myCorpus <- tm_map(myCorpus, removeWWW)
> myCorpus <- tm_map(myCorpus, tolower)
> myCorpus <- tm_map(myCorpus, removeNumbers)
> myCorpus <- tm_map(myCorpus, removePunctuation)
> myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
> myCorpus <- tm_map(myCorpus, removeWords, stopwords("SMART"))
> myCorpus <- tm_map(myCorpus, stripWhitespace)
> myDtm <- DocumentTermMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))

Everything works fine up to this stage, as long as I do not include tokenizing. However, when I run the code with the following alteration:

> dictCorpus <- myCorpus
> myDtm <- DocumentTermMatrix(myCorpus, control = list(wordLengths = c(1, Inf), tokenize = NGramTokenizer, dictionary = dictCorpus))

it hangs. I have kept it running overnight, but no results.

Any help would be much appreciated.

Thanks,
Neep Hazarika
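
P.S. For what it is worth, here is a minimal sketch of the kind of call I am aiming for, restricted to bigrams and with the dictionary passed as a character vector of terms (taken here from the unigram matrix built above) rather than as a corpus. The BigramTokenizer helper and the min/max settings are just placeholders of mine, not something I have run to completion:

> # wrap RWeka's NGramTokenizer so it produces bigrams only
> BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
> # use the terms of the unigram DTM as the dictionary (a character vector)
> myDict <- Terms(myDtm)
> myDtm2 <- DocumentTermMatrix(myCorpus, control = list(wordLengths = c(1, Inf), tokenize = BigramTokenizer, dictionary = myDict))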