Abraham Mathew
2014-Oct-07 14:51 UTC
[R] Find common two and three word phrases in a corpus
Let's say I have a corpus and want to find the two-word, three-word, etc. phrases that occur most frequently in the data. I normally do this in the following manner, but I am getting an error message and am having some difficulty diagnosing what is going wrong. Given the following data, I'd just want a count of 2 for the number of 2-word phrases, given that "that sucks" appears twice.

    dat = c("love it", "who goes there", "what is wrong", "that sucks", "that sucks")

    (corpus <- Corpus(VectorSource(dat)))

    matrix <- create_matrix(corpus, ngramLength=2)

    bww_freq = findFreqTerms(matrix, lowfreq=5)

Here is the error message when I attempt to create a matrix:

    > (corpus <- Corpus(VectorSource(dat)))
    <<VCorpus (documents: 5, metadata (corpus/indexed): 0/0)>>
    > matrix <- create_matrix(corpus, ngramLength=2)
    Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x),  :
      dims [product 5] do not match the length of object [3]

Can anyone tell me what could be going wrong? Or suggest a workaround, or another package that could give me the desired result in a more efficient manner?
Hi Abraham,

On Tue, Oct 7, 2014 at 10:51 AM, Abraham Mathew <abmathewks at gmail.com> wrote:
> Let's say I have a corpus and want to find the two, three, etc word phrases
> that occur most frequently in the data. I normally do this in the following
> manner but am getting an error message and am having some difficulty
> diagnosing what is going wrong. Given the following data, I'd just want a
> count of 2 for the number of 2 word phrases given that "that sucks" appears
> twice.
>
> dat = c("love it", "who goes there", "what is wrong", "that sucks", "that
> sucks")
>
> (corpus <- Corpus(VectorSource(dat)))
>
> matrix <- create_matrix(corpus, ngramLength=2)

It would have been nice for you to tell us which package this comes from. If this is from RTextTools, I don't see anything in the documentation that suggests that create_matrix expects a corpus. It asks for a character vector, e.g.,

    matrix <- create_matrix(dat, ngramLength=2)

That may be all you need. When I tried it I got some errors that seem to be related to the version of Weka I had installed.
I never did get create_matrix to work, but by following the instructions on the tm package FAQ page (http://tm.r-forge.r-project.org/faq.html) I was able to get this working as

    library(tm)
    library(RWeka)

    dat = c("love", "who goes there", "what is wrong", "that sucks", "that sucks")

    (corpus <- Corpus(VectorSource(dat)))

    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

    tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

    bww_freq = findFreqTerms(tdm, lowfreq=2)

Best,
Ista
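For anyone who wants the actual counts rather than just the list of frequent terms, or who cannot get Weka running, here is a minimal base-R sketch. It is not part of tm, RWeka, or RTextTools: count_ngrams is a hypothetical helper written for this illustration, and it assumes that splitting on whitespace after lower-casing is good enough tokenization for the data (no punctuation or stop-word handling).

```r
# Hypothetical helper (not from any package): count n-word phrases
# in a character vector using base R only.
count_ngrams <- function(docs, n = 2) {
  # Assumption: lower-casing and splitting on whitespace is adequate tokenization.
  tokens <- strsplit(tolower(trimws(docs)), "[[:space:]]+")
  grams <- unlist(lapply(tokens, function(words) {
    # Documents shorter than n words contribute no n-grams.
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  # Tabulate the phrases and put the most frequent first.
  sort(table(grams), decreasing = TRUE)
}

dat <- c("love it", "who goes there", "what is wrong", "that sucks", "that sucks")

bigram_counts <- count_ngrams(dat, n = 2)
print(bigram_counts)

# Phrases that occur at least twice, similar in spirit to findFreqTerms(..., lowfreq = 2)
frequent <- names(bigram_counts)[bigram_counts >= 2]
print(frequent)

# The same helper handles three-word phrases.
trigram_counts <- count_ngrams(dat, n = 3)
print(trigram_counts)
```

Unlike the tm approach, this counts phrases within each element of dat only, so a phrase never spans two documents. For a large corpus the tm/RWeka route above is likely the better fit.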