thr3ads.net - R help - [R] package "tm" fails to remove "the" with remove stopwords [Nov 2009]


Mark Kimpel

2009-Nov-12 16:29 UTC

[R] package "tm" fails to remove "the" with remove stopwords

I am using code that previously worked to remove stopwords using package
"tm". Even manually adding "the" to the stopword list does not remove "the".
This package has undergone extensive redevelopment with changes to the
function syntax, so perhaps I am just missing something.

Please see my simple example, output, and sessionInfo() below.

Thanks!
Mark

require(tm)
myDocument <- c("the rain in Spain", "falls mainly on the plain",
                "jack and jill ran up the hill", "to fetch a pail of water")
text.corp <- Corpus(VectorSource(myDocument))
#########################
text.corp <- tm_map(text.corp, stripWhitespace)
text.corp <- tm_map(text.corp, removeNumbers)
text.corp <- tm_map(text.corp, removePunctuation)
## text.corp <- tm_map(text.corp, stemDocument)
text.corp <- tm_map(text.corp, removeWords, c("the",
stopwords("english")))
dtm <- DocumentTermMatrix(text.corp)
dtm
dtm.mat <- as.matrix(dtm)
dtm.mat
> dtm.mat
    Terms
Docs falls fetch hill jack jill mainly pail plain rain ran spain the water
   1     0     0    0    0    0      0    0     0    1   0     1   1     0
   2     1     0    0    0    0      1    0     1    0   0     0   0     0
   3     0     0    1    1    1      0    0     0    0   1     0   0     0
   4     0     1    0    0    0      0    1     0    0   0     0   0     1

R version 2.10.0 Patched (2009-10-27 r50222)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] chron_2.3-33 RWeka_0.3-23 tm_0.5-1

loaded via a namespace (and not attached):
[1] grid_2.10.0  rJava_0.8-1  slam_0.1-6   tools_2.10.0


Mark W. Kimpel MD  ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine

15032 Hunter Court, Westfield, IN  46074

(317) 490-5129 Work, & Mobile & VoiceMail
(317) 399-1219 Skype No Voicemail please


Sam Thomas

2009-Nov-12 17:04 UTC


I'm not sure what's wrong with your approach, but this seems to strip
"the":

 

require(tm)

params <- list(minDocFreq = 1,
               removeNumbers = TRUE,
               stemming = TRUE,
               stopwords = TRUE,
               weighting = weightTf)

myDocument <- c("the rain in Spain", "falls mainly on the plain",
                "jack and jill ran up the hill", "to fetch a pail of water")
text.corp <- Corpus(VectorSource(myDocument))
dtm <- DocumentTermMatrix(text.corp, control = params)
dtm
dtm.mat <- as.matrix(dtm)
dtm.mat


Ingo Feinerer

2009-Nov-15 16:05 UTC


On Thu, Nov 12, 2009 at 11:29:50AM -0500, Mark Kimpel wrote:
> I am using code that previously worked to remove stopwords using package
> "tm".

Thanks for reporting. This is a bug in the removeWords() function in
tm version 0.5-1 available from CRAN:

> require(tm)
> myDocument <- c("the rain in Spain", "falls mainly on the plain",
>                 "jack and jill ran up the hill", "to fetch a pail of water")
> text.corp <- Corpus(VectorSource(myDocument))
> #########################
> text.corp <- tm_map(text.corp, stripWhitespace)
> text.corp <- tm_map(text.corp, removeNumbers)
> text.corp <- tm_map(text.corp, removePunctuation)
> ## text.corp <- tm_map(text.corp, stemDocument)
> text.corp <- tm_map(text.corp, removeWords, c("the", stopwords("english")))
> dtm <- DocumentTermMatrix(text.corp)
> dtm
> dtm.mat <- as.matrix(dtm)
> dtm.mat
> 
> > dtm.mat
>     Terms
> Docs falls fetch hill jack jill mainly pail plain rain ran spain the water
>    1     0     0    0    0    0      0    0     0    1   0     1   1     0
>    2     1     0    0    0    0      1    0     1    0   0     0   0     0
>    3     0     0    1    1    1      0    0     0    0   1     0   0     0
>    4     0     1    0    0    0      0    1     0    0   0     0   0     1
The function removeWords() fails to remove patterns at the beginning or at
the end of a line.
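
For anyone stuck on 0.5-1 in the meantime, here is a rough base-R sketch of
the kind of failure described above and one possible workaround. It is not
tm's actual code: the space-delimited pattern is only my guess at how a
match can miss a word at the start or end of a line, and
remove_words_sketch() is a made-up helper name. It assumes the words contain
no regex metacharacters and that matching is case-sensitive.

```r
# Illustration only: a space-delimited pattern cannot match a word that
# sits at the very start (or end) of a line. This is NOT tm's implementation.
docs <- c("the rain in Spain", "falls mainly on the plain",
          "jack and jill ran up the hill", "to fetch a pail of water")

naive <- gsub(" the ", " ", docs[1], fixed = TRUE)
print(naive)  # "the rain in Spain" (leading "the" is untouched)

# Hypothetical helper (not part of tm): remove whole words using \b
# word-boundary anchors, which also match at line start and line end.
remove_words_sketch <- function(x, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  x <- gsub(pattern, "", x, perl = TRUE)
  x <- gsub("\\s+", " ", x, perl = TRUE)  # collapse runs of whitespace
  trimws(x)                               # drop leading/trailing blanks
}

cleaned <- remove_words_sketch(docs,
                               c("the", "in", "on", "and", "up", "to", "a", "of"))
print(cleaned)
# "rain Spain" "falls mainly plain" "jack jill ran hill" "fetch pail water"
```

Because the anchors are word boundaries rather than spaces, "rain", "pail"
and "water" are left alone even though they contain "in" or "a". The cleaned
strings could then be passed to Corpus(VectorSource(...)) as in the original
example.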

This bug is fixed in the latest development version on R-Forge, and
the fix will be included in the next CRAN release.

Please see
r-forge.r-project.org/plugins/scmsvn/viewcvs.php/pkg/inst/NEWS?root=tm&view=markup
for a list of all bug fixes and changes between each tm version.

Best regards, Ingo Feinerer

-- 
Ingo Feinerer
Vienna University of Technology
dbai.tuwien.ac.at/staff/feinerer
