Displaying 10 results from an estimated 10 matches for "mindocfreq".
2011 Sep 12
1
findFreqTerms vs minDocFreq in Package 'tm'
I am using the 'tm' package for text mining and am facing an issue with finding the
frequently occurring terms. From the definitions it appears that findFreqTerms
and minDocFreq are equivalent commands, and both try to identify the
documents with terms appearing more often than a specified threshold. However, I
am getting drastically different results from the two. I have given the results
from both commands below:
findFreqTerms identifies 3140 words that appear more than 5 t...
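A plausible explanation for the discrepancy above, assuming current tm semantics: findFreqTerms filters on a term's *total frequency* across the corpus, while minDocFreq filters on the number of *documents* a term occurs in, so the two are not equivalent. A minimal sketch of the difference (using the bundled "crude" corpus as a stand-in for the poster's data):

```r
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)

# Terms whose total frequency across all documents exceeds 5:
freq_terms <- findFreqTerms(tdm, lowfreq = 6)

# Terms occurring in more than 5 *documents* (document frequency),
# computed by hand for comparison:
m <- as.matrix(tdm)
doc_freq_terms <- rownames(m)[rowSums(m > 0) > 5]

# The first set is usually much larger than the second, which would
# account for the "drastically different" counts reported above.
length(freq_terms)
length(doc_freq_terms)
```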
2006 Oct 03
1
new to R: don't understand errors
...is some sort of
limit on the number of files. Even when I use only the same number as in
previous working collections, I still get the errors. So I am wondering
if it might be something in the files themselves...
At any rate, I routinely get these two errors. The first is generated
when I include minDocFreq=x, and it looks a little like this when I
run it:
> data(stopwords_en)
> CCauto = textmatrix( "CultureMineTXT" , minWordLength=3,
minDocFreq=50, stopwords=stopwords_en)
> Error in data.frame(docs = basename(file), terms = names(tab),
Freq = tab, :
>...
2007 Aug 18
2
Problem with lsa package (data.frame) on Windows XP
...ne on my
mac os x, but when I run the same code on Windows XP, it doesn't work
any more.
### code:
library("lsa")
matrix1 = textmatrix("C:\\Documents and Settings\\tine stalmans.TINE.
000\\LSA\\cuentos\\", stemming=TRUE, language="spanish",
minWordLength=2, minDocFreq=1, stopwords=NULL, vocabulary=NULL)
print(matrix1,bag_lines = 3, bag_cols = 3)
matrix1 = lw_bintf(matrix1) * gw_idf(matrix1)
space = lsa(matrix1, dims = dimcalc_share())
as.textmatrix(space)
### the following line fails on Windows XP
matrix2 = textmatrix("C:\\Documents and Settings\\tine stal...
2010 Mar 31
1
tm package - remove stopwords failing
Hi,
I just noticed, by inspecting the term matrix, that not all stopwords are
removed; does someone know how to fix that?
library(tm)
data("crude")
d<-tm_map(crude, removeWords, stopwords(language='english'))
dt<-DocumentTermMatrix(d,control=list(minWordLength=3, minDocFreq=2))
inspect( dt)
I am using R version 2.10, tm package 0.5-3
cheers
Welma
[[alternative HTML version deleted]]
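A common cause in examples like the one above: removeWords matches stopwords case-sensitively and as whole words, so capitalized or punctuation-attached forms survive. Lower-casing and stripping punctuation before removal usually helps. A minimal sketch, assuming a recent tm version (content_transformer was introduced after the tm 0.5-3 used in the post):

```r
library(tm)
data("crude")
# Lower-case first so "The" matches the stopword "the"; strip
# punctuation so forms like "the," match as well.
d <- tm_map(crude, content_transformer(tolower))
d <- tm_map(d, removePunctuation)
d <- tm_map(d, removeWords, stopwords("english"))
dt <- DocumentTermMatrix(d, control = list(wordLengths = c(3, Inf)))
# "the" should no longer appear among the terms:
"the" %in% Terms(dt)
```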
2008 Mar 25
0
Error "... x must be atomic" when using lsa (latent semantic analysis) package
...t;)
9: sort(unique.default(x), na.last = TRUE)
8: factor(a, exclude = exclude)
7: table(txt)
6: inherits(x, "factor")
5: is.factor(x)
4: sort(table(txt), decreasing = TRUE)
3: FUN(X[[238]], ...)
2: lapply(dir(mydir, full.names = TRUE), textvector, stemming, language,
minWordLength, minDocFreq, stopwords, vocabulary)
1: textmatrix(SnippetsPath, stopwords = stopwords_en)
Alex
2008 Mar 25
0
Solution to: Error "... x must be atomic" when using lsa (latent semantic analysis) package
...t;)
9: sort(unique.default(x), na.last = TRUE)
8: factor(a, exclude = exclude)
7: table(txt)
6: inherits(x, "factor")
5: is.factor(x)
4: sort(table(txt), decreasing = TRUE)
3: FUN(X[[238]], ...)
2: lapply(dir(mydir, full.names = TRUE), textvector, stemming, language,
minWordLength, minDocFreq, stopwords, vocabulary)
1: textmatrix(SnippetsPath, stopwords = stopwords_en)
Alex
2006 Oct 04
0
FW: new to R: don't understand errors
...d you the alpha-release of the updated lsa package
in a separate message, which also includes a parameter called
minGlobFreq that filters out terms appearing fewer
than x times in the whole document collection. I guess that is
what you were looking for.
Considering the sanitizing: if you set minDocFreq to 1
and minWordLength to 1, you should not get an error
with your document collection, as you are then basically
taking everything (even a single character appearing
only once). It is probably not so problematic, as the
LSA step will group these low-frequency terms
into a lower-order factor anyway....
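The "take everything" setting described above can be sketched with lsa's textmatrix on a hypothetical two-document collection (the temp-directory files are invented for illustration):

```r
library(lsa)
# Hypothetical tiny collection written to a temp directory:
td <- tempfile()
dir.create(td)
writeLines("the cat sat on the mat", file.path(td, "d1.txt"))
writeLines("a dog sat", file.path(td, "d2.txt"))

# minWordLength = 1 and minDocFreq = 1 disable sanitizing, keeping
# every term -- even single characters occurring only once:
m <- textmatrix(td, minWordLength = 1, minDocFreq = 1)
dim(m)  # rows = terms, cols = documents
```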
2010 Mar 18
0
error while using "tm" package
I have recently started using the "tm" package by Feinerer, K. Hornik, and D.
Meyer.
While trying to create a term-document matrix from a corpus (approx. 440
docs),
I get the following error:
tdm <- TermDocumentMatrix(tmp, control=list(weighting=weightTfIdf,
minDocFreq=2, minWordLength=3))
*Error in rowSums(m > 0) : 'x' must be an array of at least two dimensions*
This error appears for the option weighting=weightTfIdf and not for
weighting=weightTf.
As IDF requires division by the document frequency, does this have anything to do with the
nature of my data?
Maybe I am doing somethin...
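For context on the error above: weightTfIdf needs document frequencies (hence the rowSums(m > 0) in the traceback), so it can fail if the matrix degenerates to a single dimension. A sketch of the weighting applied to a well-formed multi-document matrix, assuming a recent tm version:

```r
library(tm)
data("crude")
# weightTfIdf divides term frequency by document frequency, so the
# matrix must keep both dimensions; a corpus of several documents
# with more than one surviving term works:
tdm <- TermDocumentMatrix(crude, control = list(weighting = weightTfIdf))
dim(tdm)  # rows = terms, cols = documents
```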
2009 Nov 12
2
package "tm" fails to remove "the" when removing stopwords
I am using code that previously worked to remove stopwords using package
"tm". Even manually adding "the" to the list does not work to remove "the".
This package has undergone extensive redevelopment with changes to the
function syntax, so perhaps I am just missing something.
Please see my simple example, output, and sessionInfo() below.
Thanks!
Mark
require(tm)
2007 Aug 21
2
Partial comparison in string vector
...the same code on Windows XP, it doesn't work
> any more.
>
> ### code:
> library("lsa")
> matrix1 = textmatrix("C:\\Documents and Settings\\tine stalmans.TINE.
> 000\\LSA\\cuentos\\", stemming=TRUE, language="spanish",
> minWordLength=2, minDocFreq=1, stopwords=NULL, vocabulary=NULL)
> print(matrix1,bag_lines = 3, bag_cols = 3)
> matrix1 = lw_bintf(matrix1) * gw_idf(matrix1)
> space = lsa(matrix1, dims = dimcalc_share())
> as.textmatrix(space)
>
> ### the following line fails on Windows XP
> matrix2 = textmatrix("C:...