Displaying 10 results from an estimated 10 matches for "mindocfreq".

2011 Sep 12
1
findFreqTerms vs minDocFreq in Package 'tm'
I am using the 'tm' package for text mining and am having trouble finding frequently occurring terms. From the definitions, it appears that findFreqTerms and minDocFreq are equivalent and that both try to identify the documents with terms appearing more than a specified threshold. However, I am getting drastically different results from the two. I have given the results from both commands below: findFreqTerms identifies 3140 words that appear more than 5 t...
2006 Oct 03
1
new to R: don't understand errors
...is some sort of limit on the number of files. Even when I only use the same number as previous working collections, I still get the errors. So I am wondering if it might be something in the files themselves... At any rate I routinely get these two errors. The first is generated when I include a minDocFreq=x, and it looks a little like this when I run it: > data(stopwords_en) > CCauto = textmatrix( "CultureMineTXT" , minWordLength=3, minDocFreq=50, stopwords=stopwords_en) > Error in data.frame(docs = basename(file), terms = names(tab), Freq = tab, : >...
2007 Aug 18
2
Problem with lsa package (data.frame) on Windows XP
...ne on my mac os x, but when I run the same code on Windows XP, it doesn't work any more. ### code: library("lsa") matrix1 = textmatrix("C:\\Documents and Settings\\tine stalmans.TINE. 000\\LSA\\cuentos\\", stemming=TRUE, language="spanish", minWordLength=2, minDocFreq=1, stopwords=NULL, vocabulary=NULL) print(matrix1,bag_lines = 3, bag_cols = 3) matrix1 = lw_bintf(matrix1) * gw_idf(matrix1) space = lsa(matrix1, dims = dimcalc_share()) as.textmatrix(space) ### the following line fails on windows XP matrix2 = textmatrix("C:\\Documents and Settings\\tine stal...
2010 Mar 31
1
tm package - remove stopwords failing
Hi, I just noticed by inspecting the document-term matrix that not all stopwords are removed. Does someone know how to fix that? library(tm) data("crude") d<-tm_map(crude, removeWords, stopwords(language='english')) dt<-DocumentTermMatrix(d,control=list(minWordLength=3, minDocFreq=2)) inspect(dt) I am using R version 2.10, tm package 0.5-3 cheers Welma
2008 Mar 25
0
Error "... x must be atomic" when using lsa (latent semantic analysis) package
...) 9: sort(unique.default(x), na.last = TRUE) 8: factor(a, exclude = exclude) 7: table(txt) 6: inherits(x, "factor") 5: is.factor(x) 4: sort(table(txt), decreasing = TRUE) 3: FUN(X[[238]], ...) 2: lapply(dir(mydir, full.names = TRUE), textvector, stemming, language, minWordLength, minDocFreq, stopwords, vocabulary) 1: textmatrix(SnippetsPath, stopwords = stopwords_en) Alex
2008 Mar 25
0
Solution to: Error "... x must be atomic" when using lsa (latent semantic analysis) package
...) 9: sort(unique.default(x), na.last = TRUE) 8: factor(a, exclude = exclude) 7: table(txt) 6: inherits(x, "factor") 5: is.factor(x) 4: sort(table(txt), decreasing = TRUE) 3: FUN(X[[238]], ...) 2: lapply(dir(mydir, full.names = TRUE), textvector, stemming, language, minWordLength, minDocFreq, stopwords, vocabulary) 1: textmatrix(SnippetsPath, stopwords = stopwords_en) Alex
2006 Oct 04
0
FW: new to R: don't understand errors
...d you the alpha-release of the updated lsa package in a separate message, which also includes a parameter called minGlobFreq that filters out terms that appear fewer than x times in the whole document collection. I guess that is what you were looking for. Concerning the sanitizing: if you set minDocFreq to 1 and minWordLength to 1, you should not get an error with your document collection, as you are then basically taking everything (even a single character appearing only once). It is probably not so problematic, as the LSA step will group these low-frequency terms in a lower-order factor anyway....
2010 Mar 18
0
error while using "tm" package
I have recently started using the "tm" package by Feinerer, K. Hornik, and D. Meyer. While trying to create a term-document matrix from a corpus (approx. 440 docs) I get the following error: tdm <- TermDocumentMatrix(tmp, control=list(weighting=weightTfIdf, minDocFreq=2, minWordLength=3)) *Error in rowSums(m > 0) : 'x' must be an array of at least two dimensions* This error appears for the option weighting=weightTfIdf and not for weighting=weightTf. As Idf would need division by df, does this have anything to do with the nature of my data? Maybe I am doing somethin...
2009 Nov 12
2
package "tm" fails to remove "the" with remove stopwords
I am using code that previously worked to remove stopwords using package "tm". Even manually adding "the" to the list does not remove "the". This package has undergone extensive redevelopment with changes to the function syntax, so perhaps I am just missing something. Please see my simple example, output, and sessionInfo() below. Thanks! Mark require(tm)
2007 Aug 21
2
Partial comparison in string vector
...the same code on Windows XP, it doesn't work > any more. > > ### code: > library("lsa") > matrix1 = textmatrix("C:\\Documents and Settings\\tine stalmans.TINE. > 000\\LSA\\cuentos\\", stemming=TRUE, language="spanish", > minWordLength=2, minDocFreq=1, stopwords=NULL, vocabulary=NULL) > print(matrix1,bag_lines = 3, bag_cols = 3) > matrix1 = lw_bintf(matrix1) * gw_idf(matrix1) > space = lsa(matrix1, dims = dimcalc_share()) > as.textmatrix(space) > > ### the following line fails on windows XP > matrix2 = textmatrix("C:...