On 09/12/2011 04:28 PM, vioravis wrote:
> I am using 'tm' package for text mining and facing an issue with finding the
> frequently occurring terms. From the definition it appears that findFreqTerms
> and minDocFreq are equivalent commands and both try to identify the
> documents with terms appearing more than a specified threshold. However, I
> am getting drastically different results with both. I have given the results
> from both the commands below:
>
> findFreqTerms identifies 3140 words that appear more than 5 times but
> minDocFreq identifies only 659 terms. Can someone please explain the reason
> for the difference or whether I have misunderstood their definitions?

From the help page of termFreq:
‘minDocFreq’ An integer value. Words that appear less often
in ‘doc’ than this number are discarded. Defaults to ‘1’
(i.e., every token will be used).
The description for findFreqTerms states:
Find frequent terms in a term-document matrix.
So minDocFreq assesses how often a word appears in a document in order to decide
if it should be included in the frequency vector of words for this document.
By contrast, findFreqTerms focuses on the document-term matrix and determines how
often the word occurs in the matrix. So in fact the whole corpus is used to
decide on the frequency and if the word should be included or not.
Because one function uses the frequency of words within a single document, while
the other uses the frequency of words in the whole document-term matrix, they are
not equivalent commands. Your results indicate that 3140 words occur at least 5
times in the whole corpus, i.e., when summing over all documents. By contrast,
only 659 words occur at least 5 times within at least one single document.
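
To make the difference concrete, here is a minimal sketch in base R. It does not
call tm at all; it only mimics the two counting rules on a made-up toy corpus, so
the data and the variable names (docs, tdm, threshold, per_doc, corpus_wide) are
assumptions for illustration and not part of the package.

```r
# Toy corpus: three tiny "documents" (hypothetical data, illustration only).
docs <- list(
  d1 = c("apple", "apple", "apple", "pear"),
  d2 = c("apple", "pear", "pear"),
  d3 = c("pear", "plum", "plum", "plum")
)

# Term-document counts: rows are terms, columns are documents.
terms <- sort(unique(unlist(docs)))
tdm <- sapply(docs, function(d) as.integer(table(factor(d, levels = terms))))
rownames(tdm) <- terms

threshold <- 3

# Per-document rule (the idea behind minDocFreq): a term counts only if it
# occurs at least `threshold` times within some single document.
per_doc <- terms[apply(tdm >= threshold, 1, any)]

# Corpus-wide rule (the idea behind findFreqTerms): sum each term over all
# documents, then compare the total with the threshold.
corpus_wide <- terms[rowSums(tdm) >= threshold]

print(per_doc)
print(corpus_wide)
```

In this toy example "pear" never reaches 3 occurrences in any one document, but
it does when summed over all documents, so it passes only the corpus-wide rule.
The per-document rule can therefore never return more terms than the corpus-wide
rule for the same threshold, which matches the direction of your 659 vs. 3140.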
HTH,
Bettina
--
-------------------------------------------------------------------
Bettina Grün
Institut für Angewandte Statistik / IFAS
Johannes Kepler Universität Linz
Altenbergerstraße 69
4040 Linz, Austria
Tel: +43 732 2468-6829
Fax: +43 732 2468-6800
E-Mail: Bettina.Gruen at jku.at
www.ifas.jku.at