thr3ads.net - R help - [R] Find common two and three word phrases in a corpus [Oct 2014]


Abraham Mathew

2014-Oct-07 14:51 UTC

[R] Find common two and three word phrases in a corpus

Let's say I have a corpus and want to find the two, three, etc. word phrases
that occur most frequently in the data. I normally do this in the following
manner but am getting an error message and am having some difficulty
diagnosing what is going wrong. Given the following data, I'd just want a
count of 2 for the number of 2 word phrases given that "that sucks" appears
twice.

dat = c("love it", "who goes there", "what is wrong", "that sucks", "that sucks")

(corpus <- Corpus(VectorSource(dat)))

matrix <- create_matrix(corpus, ngramLength=2)

bww_freq = findFreqTerms(matrix, lowfreq=5)

Here is the error message when I attempt to create a matrix:

> (corpus <- Corpus(VectorSource(dat)))
<<VCorpus (documents: 5, metadata (corpus/indexed): 0/0)>>
> matrix <- create_matrix(corpus, ngramLength=2)
Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x),  :
  dims [product 5] do not match the length of object [3]


Can anyone tell me what could be going wrong, suggest a workaround, or point
me to another package that could give me the desired result more efficiently?


Ista Zahn

2014-Oct-07 19:49 UTC


[R] Find common two and three word phrases in a corpus

Hi Abraham,


On Tue, Oct 7, 2014 at 10:51 AM, Abraham Mathew <abmathewks at gmail.com> wrote:
> Let's say I have a corpus and want to find the two, three, etc. word phrases
> that occur most frequently in the data. I normally do this in the following
> manner but am getting an error message and am having some difficulty
> diagnosing what is going wrong. Given the following data, I'd just want a
> count of 2 for the number of 2 word phrases given that "that sucks" appears
> twice.
>
> dat = c("love it", "who goes there", "what is wrong", "that sucks", "that sucks")
>
> (corpus <- Corpus(VectorSource(dat)))
>
> matrix <- create_matrix(corpus, ngramLength=2)

It would have been nice for you to tell us which package this comes
from. If this is from RTextTools, I don't see anything in the
documentation that suggests that create_matrix expects a corpus. It
asks for a character vector, e.g.,

matrix <- create_matrix(dat, ngramLength=2)

That may be all you need. When I tried it, I got some errors that seem
to be related to the version of Weka I had installed, and I never did get
create_matrix to work. Following the instructions on the tm package
FAQ page (http://tm.r-forge.r-project.org/faq.html), I was able to get
this working as

library(tm)
library(RWeka)

dat = c("love", "who goes there", "what is wrong",
"that sucks", "that sucks")

(corpus <- Corpus(VectorSource(dat)))

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

bww_freq = findFreqTerms(tdm, lowfreq=2)
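
Note that findFreqTerms returns only the terms that meet the threshold, not
their counts. If the counts themselves are wanted and installing Weka is a
problem, the phrases can also be tallied in base R. The following is a minimal
sketch of that workaround, not the tm or RTextTools approach; count_ngrams is
a hypothetical helper name, and it assumes that splitting on whitespace is an
acceptable tokenization for the data.

```r
# Count n-word phrases across a character vector using base R only.
# count_ngrams is a hypothetical helper, not part of tm, RWeka or RTextTools.
count_ngrams <- function(docs, n = 2) {
  # Lower-case each document and split it on runs of whitespace
  tokens <- strsplit(tolower(trimws(docs)), "\\s+")

  grams <- lapply(tokens, function(w) {
    w <- w[nzchar(w)]
    # Documents shorter than n words contribute no phrases
    if (length(w) < n) return(character(0))
    # Slide a window of n words along the document
    starts <- seq_len(length(w) - n + 1)
    vapply(starts,
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  })

  # Tally all phrases and list the most frequent first
  sort(table(as.character(unlist(grams))), decreasing = TRUE)
}

dat <- c("love it", "who goes there", "what is wrong",
         "that sucks", "that sucks")

bigrams  <- count_ngrams(dat, n = 2)
trigrams <- count_ngrams(dat, n = 3)

# Keep only the phrases that occur at least twice
bigrams[bigrams >= 2]
```

With the example data, "that sucks" is the only two-word phrase that occurs
twice. Punctuation is not stripped here, so real text would likely need some
cleaning before the split.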


Best,
Ista
