C.H.
2012-May-16 11:22 UTC
[R] tm package: problem of TermDocumentMatrix and minWordLength
Dear All,
The following code illustrates the problem.
[R code]
require(tm)
exampledoc <- c("R is good", "R is really good")
examplecorpus <- Corpus(VectorSource(exampledoc), encoding = "UTF-8")
dtm <- DocumentTermMatrix(examplecorpus, control = list(minWordLength = 1))
as.matrix(dtm)
[/R code]
The terms "R" and "is" were not included in the dtm even though the
control parameter minWordLength was set to 1.
    Terms
Docs good really
   1    1      0
   2    1      1
Can you reproduce this problem?
The following is my sessionInfo:
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] tm_0.5-7.1

loaded via a namespace (and not attached):
[1] compiler_2.15.0 slam_0.1-23     tools_2.15.0
Regards,
CH
Baoqiang Cao
2012-May-16 14:14 UTC
[R] tm package: problem of TermDocumentMatrix and minWordLength
Try this:
dtm <- DocumentTermMatrix(examplecorpus, control = list(wordLengths = c(1, 100)))
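For anyone comparing the two calls, here is a minimal sketch of the difference. It assumes a tm version in which the control option is named wordLengths (a length-2 vector of minimum and maximum term length, documented default c(3, Inf)), so that a minWordLength entry is not recognized and the default lower bound of three characters still applies. The names dtm_default and dtm_all are only for illustration, and Inf is used as the upper bound here instead of 100.

```r
library(tm)

exampledoc <- c("R is good", "R is really good")
examplecorpus <- Corpus(VectorSource(exampledoc))

# No wordLengths given: the default lower bound (assumed to be 3) applies,
# so one- and two-character terms are dropped.
dtm_default <- DocumentTermMatrix(examplecorpus)

# Lower bound of 1 and no upper bound: short terms are kept.
dtm_all <- DocumentTermMatrix(examplecorpus,
                              control = list(wordLengths = c(1, Inf)))

# Compare the vocabularies of the two matrices.
Terms(dtm_default)
Terms(dtm_all)
```

Depending on the tm version and its lower-casing default, the term "R" may appear as "r" in the second vocabulary.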