Yanchang Zhao
2011-Nov-04 01:28 UTC
[R] Help: stemming and stem completion with package tm in R
Hi All,

I came across a problem when doing stemming and stem completion with package tm in R. The word "mining" was stemmed to "mine" by stemDocument(), and then completed to "miners" by stemCompletion(). However, I would prefer to keep "mining" intact.

For stemCompletion(), the default type of completion is "prevalent", which takes the most frequent match as the completion. Although "mining" is more frequent than "miners" in my text, "mine" was still completed to "miners".

An example is shown below.

############################################
library(tm)
(a <- c("mining", "miners", "mining"))
(b <- stemDocument(a))
(d <- stemCompletion(b, dictionary=a))
############################################

Some possible solutions are:
1) to change the options or dictionary in stemDocument(), so that "mining" is not stemmed to "mine", which I think is the best way;
2) to change the options or dictionary in stemCompletion(), so that "mine" is completed to "mining";
3) to manually correct this after stem completion, which is the last option.

I am looking for a solution along the lines of 1) or 2), but cannot find a way to do it with stemDocument() in package tm.

Any help will be appreciated.

Thanks,
Yanchang Zhao
Email: yanchangzhao(at)gmail.com
RDataMining: http://www.rdatamining.com
Twitter: http://twitter.com/RDataMining
Group on Linkedin: http://group2.rdatamining.com
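Option 3) above, correcting the result after completion, could be sketched in base R roughly as follows. This is only an illustration: the stems and completions are hard-coded so that it runs without tm (the stem "miner" for "miners" is an assumption, as is the all-"miners" completion result), and `overrides` and `fix_completion()` are hypothetical names, not part of tm.

```r
# Values hard-coded so the sketch runs without tm.
a <- c("mining", "miners", "mining")   # original words
b <- c("mine", "miner", "mine")        # assumed output of stemDocument(a)
d <- c("miners", "miners", "miners")   # assumed output of stemCompletion(b, dictionary = a)

# Hypothetical override table: stem -> preferred completion.
overrides <- c(mine = "mining")

# Replace the completion wherever the stem has an override; leave the rest alone.
fix_completion <- function(completed, stems, overrides) {
  hit <- stems %in% names(overrides)
  completed[hit] <- unname(overrides[stems[hit]])
  completed
}

d_fixed <- fix_completion(d, b, overrides)
print(d_fixed)
```

Keying the override on the stem rather than on the completed word leaves genuine occurrences of "miners" untouched.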
Felix Andrews
2011-Nov-07 12:38 UTC
[R] Help: stemming and stem completion with package tm in R
Hi Yanchang,

The problem seems to be that stemCompletion() only looks for dictionary words that begin with the stem "mine", and "mining" does not begin with "mine". I don't think there is any easy way to modify stemCompletion() to get around that.

However, you could instead replace each word with the most prevalent original word that shares its stem; then you would not need to use stemCompletion() at all. For example:

topfreq <- function(x) rev(names(sort(table(x))))[1]
(d <- ave(a, b, FUN = topfreq))
# [1] "mining" "miners" "mining"

Cheers
Felix

--
Felix Andrews
http://www.neurofractal.org/felix/
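The prefix-matching behaviour described in this reply can be illustrated with a small base R sketch. It is not tm's actual implementation, only an approximation of the idea: keep the dictionary words that literally start with the stem, then pick the most frequent one. The variable names are chosen here for illustration.

```r
a <- c("mining", "miners", "mining")   # dictionary, as in the example above
stem <- "mine"

# Dictionary entries that literally start with the stem.
candidates <- a[startsWith(a, stem)]
print(candidates)

# "mining" starts with "mini", not "mine", so it is never a candidate,
# however often it occurs in the dictionary.
print("mining" %in% candidates)

# A "prevalent"-style choice among the candidates: the most frequent one.
completion <- names(sort(table(candidates), decreasing = TRUE))[1]
print(completion)
```

Because the frequency count only ever sees words that passed the prefix filter, the higher frequency of "mining" cannot affect the outcome.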