C.H.
2012-May-16 11:22 UTC
[R] tm package: problem of TermDocumentMatrix and minWordLength
Dear All,

The following code illustrates the problem.

[R code]
require(tm)
exampledoc <- c("R is good", "R is really good")
examplecorpus <- Corpus(VectorSource(exampledoc), encoding = "UTF-8")
dtm <- DocumentTermMatrix(examplecorpus, control = list(minWordLength = 1))
as.matrix(dtm)
[/R code]

The terms "R" and "is" were not included in the dtm even though the control parameter minWordLength was set to 1.

    Terms
Docs good really
   1    1      0
   2    1      1

Can you reproduce this problem?

The following is my sessionInfo:

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] tm_0.5-7.1

loaded via a namespace (and not attached):
[1] compiler_2.15.0 slam_0.1-23     tools_2.15.0

Regards,
CH
Baoqiang Cao
2012-May-16 14:14 UTC
[R] tm package: problem of TermDocumentMatrix and minWordLength
Try this:

dtm <- DocumentTermMatrix(examplecorpus, control = list(wordLengths = c(1, 100)))
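
A likely reason the suggestion above works: in the tm 0.5 series the minWordLength control option appears to have been superseded by wordLengths, a length-2 vector of minimum and maximum term length whose documented default is c(3, Inf). An unrecognised minWordLength entry would then be ignored without a warning, which matches the output in the original post. The following is a minimal sketch for comparing the two settings. It assumes a tm version that provides VCorpus and Terms, drops the encoding argument from the original post (newer constructors may not accept it), and uses the made-up names dtm_default and dtm_all.

```r
library(tm)

exampledoc <- c("R is good", "R is really good")

# VCorpus is used here instead of Corpus so the sketch behaves the same
# across tm versions (an assumption, not part of the original post).
examplecorpus <- VCorpus(VectorSource(exampledoc))

# No wordLengths given: the documented default c(3, Inf) drops terms
# shorter than three characters. The original post reports only
# "good" and "really" in this situation.
dtm_default <- DocumentTermMatrix(examplecorpus)

# Lower the minimum length to 1 and leave the maximum unbounded.
dtm_all <- DocumentTermMatrix(examplecorpus,
                              control = list(wordLengths = c(1, Inf)))

# Compare the retained terms. Terms may be lower-cased by default,
# so "R" can show up as "r".
print(Terms(dtm_default))
print(Terms(dtm_all))
```

Using Inf as the upper bound avoids silently dropping very long tokens, which c(1, 100) would do. Either form should keep "is" and the one-letter term.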