thr3ads.net - R help - [R] findFreqTerms vs minDocFreq in Package 'tm' [Sep 2011]

vioravis

2011-Sep-12 06:28 UTC

[R] findFreqTerms vs minDocFreq in Package 'tm'

I am using the 'tm' package for text mining and am having trouble finding
the frequently occurring terms. From their definitions it appears that
findFreqTerms and minDocFreq are equivalent and that both try to identify the
documents with terms appearing more often than a specified threshold. However,
I am getting drastically different results from the two. I have given the
results from both below:

findFreqTerms identifies 3140 words that appear at least 5 times, but
minDocFreq identifies only 659 terms. Can someone please explain the reason
for the difference, or whether I have misunderstood their definitions?

> tdm1 <- TermDocumentMatrix(tr1, control = list(weighting = weightBin))
> freq_terms <- findFreqTerms(tdm1, lowfreq = 5, highfreq = Inf)
> str(freq_terms)
 chr [1:3140] "abc" "abil" "abl" "abnorm" "abort" "absenc" ...

> tdm2 <- TermDocumentMatrix(tr1, control = list(minDocFreq = 5, minWordLength = 1))
> str(tdm2)
List of 6
 $ i       : int [1:4703] 173 616 624 241 350 534 563 609 129 333 ...
 $ j       : int [1:4703] 1 2 3 7 7 7 7 8 10 10 ...
 $ v       : num [1:4703] 7 5 6 9 5 7 5 5 5 7 ...
 $ nrow    : int 659
 $ ncol    : int 5677
 $ dimnames:List of 2
  ..$ Terms: chr [1:659] "\024" "\026" "ac" "access" ...
  ..$ Docs : chr [1:5677] "1" "2" "3" "4" ...
 - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
 - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"


Thank you.

Ravi



--
View this message in context:
http://r.789695.n4.nabble.com/findFreqTerms-vs-minDocFreq-in-Package-tm-tp3806644p3806644.html
Sent from the R help mailing list archive at Nabble.com.

Bettina Gruen

2011-Sep-12 13:13 UTC

[R] findFreqTerms vs minDocFreq in Package 'tm'

On 09/12/2011 04:28 PM, vioravis wrote:
> I am using the 'tm' package for text mining and am having trouble finding
> the frequently occurring terms. From their definitions it appears that
> findFreqTerms and minDocFreq are equivalent and that both try to identify the
> documents with terms appearing more often than a specified threshold. However,
> I am getting drastically different results from the two. I have given the
> results from both below:
>
> findFreqTerms identifies 3140 words that appear at least 5 times, but
> minDocFreq identifies only 659 terms. Can someone please explain the reason
> for the difference, or whether I have misunderstood their definitions?

From the help page of termFreq:

'minDocFreq' An integer value. Words that appear less often
               in 'doc' than this number are discarded. Defaults to '1'
               (i.e., every token will be used).

The description for findFreqTerms states:

Find frequent terms in a term-document matrix.

So minDocFreq assesses how often a word appears in a document in order to decide
if it should be included in the frequency vector of words for this document.

By contrast, findFreqTerms works on the term-document matrix and determines how
often the word occurs in the whole matrix. So in fact the whole corpus is used
to decide on the frequency and whether the word should be included or not.

Because one uses the frequency of words in a single document, while the other
uses the frequency of words in the whole term-document matrix, they are not
equivalent. Your results indicate that 3140 words occur at least 5
times in the whole corpus, i.e., when summing over all documents. By contrast,
659 words occur at least 5 times in at least one single document.
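
To make the distinction concrete, here is a minimal sketch in base R that does not use 'tm' at all. The toy documents, the threshold of 3 and all variable names are made up for illustration; the per-document filter only mimics the idea behind minDocFreq, and the row-sum filter only mimics the idea behind findFreqTerms. The last step assumes that, with binary weighting (as in tdm1 above), a row sum counts the documents containing a term rather than its total occurrences.

```r
# Three toy "documents", already tokenised (hypothetical data).
docs <- list(
  d1 = c("apple", "apple", "apple", "pear"),
  d2 = c("apple", "pear", "pear"),
  d3 = c("pear", "plum", "plum", "plum")
)

terms <- sort(unique(unlist(docs)))

# Term-document count matrix: rows are terms, columns are documents.
tdm <- vapply(
  docs,
  function(tokens) as.integer(table(factor(tokens, levels = terms))),
  integer(length(terms))
)
rownames(tdm) <- terms

threshold <- 3

# Per-document filter (the minDocFreq idea): a count survives only in the
# documents where the term occurs at least `threshold` times.
local_tdm <- tdm
local_tdm[local_tdm < threshold] <- 0L
local_terms <- rownames(local_tdm)[rowSums(local_tdm) > 0]

# Corpus-wide filter (the findFreqTerms idea): keep a term if its count,
# summed over all documents, reaches `threshold`.
corpus_terms <- rownames(tdm)[rowSums(tdm) >= threshold]

# Binary weighting: each cell is 0/1, so the row sum is the number of
# documents that contain the term.
doc_counts <- rowSums(tdm > 0)
binary_terms <- names(doc_counts)[doc_counts >= threshold]

print(local_terms)
print(corpus_terms)
print(binary_terms)
```

In this toy corpus "pear" never reaches 3 occurrences inside any single document, so the per-document filter drops it, yet it passes the corpus-wide filter because its counts add up across documents. The same kind of gap would explain 659 versus 3140 terms.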

HTH,
Bettina

-- 
-------------------------------------------------------------------
Bettina Grün
Institut für Angewandte Statistik / IFAS
Johannes Kepler Universität Linz
Altenbergerstraße 69
4040 Linz, Austria

Tel: +43 732 2468-6829
Fax: +43 732 2468-6800
E-Mail: Bettina.Gruen at jku.at
www.ifas.jku.at
