I too have this problem. Everything worked fine last year, but after
updating R and packages I can no longer do word stemming.
Unfortunately, I didn't save the old binaries; otherwise I would just
revert to them.
Hoping someone finds a solution for R on Windows. Thanks!
There is a potential solution for R on Mac OS from Kurt Hornik copied
below, but I cannot get this to work on Windows.
Here's the code I'm running:
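#1) Using package Snowball
library(Snowball)
# Read the Porter test vocabulary shipped with the package.
# (Named "words" rather than "source" to avoid masking base::source.)
words <- readLines(system.file("words", "porter", "voc.txt",
                               package = "Snowball"))
result <- SnowballStemmer(words)
#2) Using package tm
library(tm)
data("crude")
stemDocument(crude[[1]])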
In both instances I got a Java error "Could not initialize the
GenericPropertiesCreator. This exception was produced:
java.lang.NullPointerException". After receiving this error once in
the session, no further error messages are generated. However,
SnowballStemmer() and stemDocument() return the original unstemmed
text.
Possible Solution:
For those on Mac OS, Kurt Hornik wrote...
These issues seem to be specific to Mac OS X. Recent versions of Weka
have added a package management system not unlike R's, to the effect
that now, when external packages (or the Snowball jar) are loaded, their
KnowledgeFlow GUI is started, which in turn requires AWT, and from what
I understand, this does not work on Mac OS X.
Short term, you should be able to Sys.setenv("NOAWT",
"true").
More long term, the Weka maintainers have patched their upstream code so
that it is possible to turn off the dynamic class discovery altogether,
but I have not found the time to test this ...
I realize this solution was for Mac OS, but, not knowing anything about
rJava, I tried it on Windows anyway, which resulted in:
Error in Sys.setenv("NOAWT", "true") : all arguments must be named
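The "all arguments must be named" message comes from R itself rather than from Java: Sys.setenv() takes name = value pairs, so the variable name has to be the argument name. A minimal sketch of the named form follows. Whether setting NOAWT changes anything on Windows is untested here, and the idea that it should be set before the JVM is started (that is, before loading Snowball, RWeka or rJava in a fresh session) is an assumption.

```r
# The unnamed call quoted above fails inside Sys.setenv() itself;
# capture the message instead of stopping the script.
err <- tryCatch(
  Sys.setenv("NOAWT", "true"),
  error = function(e) conditionMessage(e)
)
print(err)

# Named form: the environment variable name is the argument name.
Sys.setenv(NOAWT = "true")

# Confirm the variable is visible to the current R session.
print(Sys.getenv("NOAWT"))

# Assumption: run the Sys.setenv() line in a fresh session before
# library(Snowball) or library(tm), so it is in place when Java starts.
```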
Here's my session info.
R version 2.13.0 Patched (2011-04-21 r55576)
Platform: i386-pc-mingw32/i386 (32-bit) (Windows Vista)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] Snowball_0.0-7 tm_0.5-6 rcom_2.2-3.1 rscproxy_1.3-1
loaded via a namespace (and not attached):
[1] grid_2.13.0 rJava_0.9-0 RWeka_0.4-7 RWekajars_3.7.3-1
[5] slam_0.1-22 tools_2.13.0
(rJava: same error with multiple older versions)