Hello Jerad,
> It was suggested I contact you for possible help with this issue. Well,
> as you can see for the emails below, that is what I was told at R-help.
> Any insight to my lsa problems (also listed below) would be of great
> help.
From what I see, the problem probably does lie in the
text files: for performance reasons, it was not possible to
include any "check" routines that exclude a file if it contains
no words (or only words below the docFrequency threshold) and
thus produces an empty column vector.
I am pretty sure that you do not want to use docFrequency
with a value like 50 (it would mean that a term in a document
is only included if it appears more than 50 times in *that*
document).
I will send you the alpha release of the updated lsa package
in a separate message. It includes a parameter called
minGlobFreq, which filters out terms that appear fewer
than x times in the whole document collection. I guess that is
what you were looking for.
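
To make the idea of a collection-wide frequency filter concrete, here is a rough sketch in Python. It is only an illustration and not the lsa code: the function name is made up, and I assume that "x times in the whole collection" means total occurrences over all documents. If the threshold is meant as the number of documents containing a term, count each term once per document instead.

```python
from collections import Counter


def filter_by_global_freq(docs, min_glob_freq):
    """Drop terms occurring fewer than min_glob_freq times across all docs.

    docs: dict mapping a document name to its list of (already tokenized) terms.
    Hypothetical helper for illustration; not part of the lsa package.
    """
    # Count every occurrence of every term over the whole collection.
    totals = Counter()
    for terms in docs.values():
        totals.update(terms)

    # Keep only the terms that reach the collection-wide threshold.
    keep = {term for term, n in totals.items() if n >= min_glob_freq}
    return {name: [t for t in terms if t in keep] for name, terms in docs.items()}


docs = {"a.txt": ["cat", "dog", "cat"], "b.txt": ["cat", "bird"]}
print(filter_by_global_freq(docs, 2))
```

Note that a document can end up with no terms at all after such a filter, which is exactly the empty-column situation described above.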
Regarding the sanitizing: if you set minDocFreq to 1
and minWordLength to 1, you should not get an error
with your document collection, as you are then basically
taking everything (even a single character appearing
only once). That is probably not much of a problem, as the
LSA step will group these low-frequency terms
in a lower-order factor anyway. Of course, you will still get
an error if you use documents that are completely empty,
so delete all 0-byte documents beforehand.
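
Finding the 0-byte files before building the matrix could look roughly like the sketch below (Python, for illustration only; the function name is hypothetical). It only reports the empty files and leaves it to you whether to delete or move them.

```python
import os


def split_empty_files(directory):
    """Return (non_empty, empty) lists of regular-file paths in directory.

    Hypothetical helper for illustration; it does not delete anything.
    """
    non_empty, empty = [], []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-files
        if os.path.getsize(path) == 0:
            empty.append(path)
        else:
            non_empty.append(path)
    return non_empty, empty
```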
I am thinking about what to do with this sanitizing part.
It is not a good idea to integrate that into the
textmatrix method -- it would slow things down
tremendously.
So what about this idea: does it make sense to provide a
sanitizing collection of methods that help to select the
files you want to work with (copy them to a different
directory or just return a list with the filenames of
the ones that are "good")? What should we do with other
sanitizing options (deleting URLs from texts, deleting
short words, etc.)?
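
To illustrate what I mean by returning a list of "good" filenames, here is a minimal sketch in Python. Everything in it is an assumption made for the example and not an existing lsa function: the names good_files and surviving_terms, the simple URL pattern, the letters-only tokenization, and the default thresholds.

```python
import os
import re
from collections import Counter

# Simplistic patterns, assumed for this sketch only.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters


def surviving_terms(text, min_word_length=2, min_doc_freq=1):
    """Return {term: count} for the terms of one text that pass the filters."""
    text = URL_RE.sub(" ", text.lower())  # strip URLs before tokenizing
    counts = Counter(
        w for w in WORD_RE.findall(text) if len(w) >= min_word_length
    )
    return {w: n for w, n in counts.items() if n >= min_doc_freq}


def good_files(directory, min_word_length=2, min_doc_freq=1):
    """List the filenames in directory that keep at least one term."""
    good = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8", errors="replace") as fh:
            if surviving_terms(fh.read(), min_word_length, min_doc_freq):
                good.append(name)
    return good
```

The point of keeping this separate from the matrix construction is that the extra pass over the files is paid only by those who ask for it.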
Hope I could be of help,
Best,
Fridolin
--
Fridolin Wild, Institute for Information Systems and New Media,
Vienna University of Economics and Business Administration (WUW),
Augasse 2-6, A-1090 Wien, Austria
fon +43-1-31336-4488, fax +43-1-31336-746