Mark Kimpel
2009-Nov-12 16:29 UTC
[R] package "tm" fails to remove "the" with remove stopwords
I am using code that previously worked to remove stopwords using package "tm". Even manually adding "the" to the list does not work to remove "the". This package has undergone extensive redevelopment with changes to the function syntax, so perhaps I am just missing something. Please see my simple example, output, and sessionInfo() below.

Thanks!
Mark

require(tm)
myDocument <- c("the rain in Spain", "falls mainly on the plain",
                "jack and jill ran up the hill", "to fetch a pail of water")
text.corp <- Corpus(VectorSource(myDocument))
#########################
text.corp <- tm_map(text.corp, stripWhitespace)
text.corp <- tm_map(text.corp, removeNumbers)
text.corp <- tm_map(text.corp, removePunctuation)
## text.corp <- tm_map(text.corp, stemDocument)
text.corp <- tm_map(text.corp, removeWords, c("the", stopwords("english")))
dtm <- DocumentTermMatrix(text.corp)
dtm
dtm.mat <- as.matrix(dtm)
dtm.mat

> dtm.mat
    Terms
Docs falls fetch hill jack jill mainly pail plain rain ran spain the water
   1     0     0    0    0    0      0    0     0    1   0     1   1     0
   2     1     0    0    0    0      1    0     1    0   0     0   0     0
   3     0     0    1    1    1      0    0     0    0   1     0   0     0
   4     0     1    0    0    0      0    1     0    0   0     0   0     1

R version 2.10.0 Patched (2009-10-27 r50222)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices datasets utils methods base

other attached packages:
[1] chron_2.3-33 RWeka_0.3-23 tm_0.5-1

loaded via a namespace (and not attached):
[1] grid_2.10.0 rJava_0.8-1 slam_0.1-6 tools_2.10.0

Mark W. Kimpel MD
Neuroinformatics, Dept. of Psychiatry
Indiana University School of Medicine
Sam Thomas
2009-Nov-12 17:04 UTC
[R] package "tm" fails to remove "the" with remove stopwords
I'm not sure what's wrong with your approach, but this seems to strip "the":

require(tm)
params <- list(minDocFreq = 1,
               removeNumbers = TRUE,
               stemming = TRUE,
               stopwords = TRUE,
               weighting = weightTf)
myDocument <- c("the rain in Spain", "falls mainly on the plain",
                "jack and jill ran up the hill", "to fetch a pail of water")
text.corp <- Corpus(VectorSource(myDocument))
dtm <- DocumentTermMatrix(text.corp, control = params)
dtm
dtm.mat <- as.matrix(dtm)
dtm.mat
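Another possible stopgap, sketched here in base R only, is to drop the unwanted term columns after the matrix has been built. The small matrix below is rebuilt by hand from four of the columns in the original post's output, and `my.stopwords` is a hypothetical stand-in for `c("the", stopwords("english"))`; neither is taken from tm itself.

```r
# Hand-built slice of the document-term matrix from the original post
# (columns plain, rain, the, water for documents 1 to 4).
terms <- c("plain", "rain", "the", "water")
dtm.mat <- matrix(
  c(0, 1, 0, 0,   # plain
    1, 0, 0, 0,   # rain
    1, 0, 0, 0,   # the
    0, 0, 0, 1),  # water
  nrow = 4,       # matrix() fills column by column
  dimnames = list(Docs = as.character(1:4), Terms = terms)
)

# Assumption: stand-in stopword list; with tm loaded one would use
# c("the", stopwords("english")) instead.
my.stopwords <- c("the", "in", "on")

# Keep only the columns whose term is not a stopword.
keep <- !(colnames(dtm.mat) %in% my.stopwords)
dtm.clean <- dtm.mat[, keep, drop = FALSE]
dtm.clean
```

This only hides the stopword columns in the final matrix; the words are still present in the corpus, so anything computed from the corpus directly would still see them.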
Ingo Feinerer
2009-Nov-15 16:05 UTC
[R] package "tm" fails to remove "the" with remove stopwords
On Thu, Nov 12, 2009 at 11:29:50AM -0500, Mark Kimpel wrote:
> I am using code that previously worked to remove stopwords using package "tm".

Thanks for reporting. This is a bug in the removeWords() function in tm version 0.5-1 available from CRAN:

> require(tm)
> myDocument <- c("the rain in Spain", "falls mainly on the plain",
>                 "jack and jill ran up the hill", "to fetch a pail of water")
> text.corp <- Corpus(VectorSource(myDocument))
> #########################
> text.corp <- tm_map(text.corp, stripWhitespace)
> text.corp <- tm_map(text.corp, removeNumbers)
> text.corp <- tm_map(text.corp, removePunctuation)
> ## text.corp <- tm_map(text.corp, stemDocument)
> text.corp <- tm_map(text.corp, removeWords, c("the", stopwords("english")))
> dtm <- DocumentTermMatrix(text.corp)
> dtm
> dtm.mat <- as.matrix(dtm)
> dtm.mat
>
> > dtm.mat
>     Terms
> Docs falls fetch hill jack jill mainly pail plain rain ran spain the water
>    1     0     0    0    0    0      0    0     0    1   0     1   1     0
>    2     1     0    0    0    0      1    0     1    0   0     0   0     0
>    3     0     0    1    1    1      0    0     0    0   1     0   0     0
>    4     0     1    0    0    0      0    1     0    0   0     0   0     1

The function removeWords() fails to remove patterns at the beginning or at the end of a line. This bug is fixed in the latest development version on R-Forge, and the fix will be included in the next CRAN release.

Please see

https://r-forge.r-project.org/plugins/scmsvn/viewcvs.php/pkg/inst/NEWS?root=tm&view=markup

for a list of all bug fixes and changes between each tm version.

Best regards, Ingo Feinerer

--
Ingo Feinerer
Vienna University of Technology
http://www.dbai.tuwien.ac.at/staff/feinerer
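To illustrate the kind of failure described above, here is a minimal base R sketch. It is not tm's actual code: `remove_words_naive` and `remove_words_boundary` are hypothetical helpers, and the assumption that the faulty pattern required a space on both sides of the word is only one plausible way to get this behavior. It does reproduce the pattern in the posted matrix, where "the" survives only in document 1, the one document that starts with it.

```r
# Hypothetical helper: requires a space on BOTH sides of the word,
# so it misses words at the very start or end of a string.
remove_words_naive <- function(x, words) {
  pattern <- paste0(" (", paste(words, collapse = "|"), ") ")
  gsub(pattern, " ", x)
}

# Hypothetical helper: uses word boundaries (\b), which also match
# at the start and end of a string.
remove_words_boundary <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", x, perl = TRUE)
}

docs <- c("the rain in Spain", "falls mainly on the plain")

naive <- remove_words_naive(docs, "the")
fixed <- remove_words_boundary(docs, "the")

# Does each document still contain the word "the"?
grepl("\\bthe\\b", naive)  # document 1 still does, document 2 does not
grepl("\\bthe\\b", fixed)  # neither document does
```

The boundary version leaves extra whitespace where words were removed, so in a tm pipeline it would make sense to run stripWhitespace after word removal rather than before.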