On Thu, Jun 28, 2007 at 11:34:06AM +0100, Tom Mortimer wrote:
> I'm using SimpleStopper with TermGenerator in a Python indexing
> script, in an attempt to keep my index size down (currently 30K per
> doc, and I have 200 million docs to index, which I think implies
> 6TB.) However, unprefixed (positional?) terms are not affected by
> the stopper, though Z-prefixed terms are.
>
> I assume this is intentional for phrase queries

Yes, that's exactly the idea.

> but I need to reduce my index size drastically. Is it possible to
> generate positional terms, filtered with a stoplist, and not generate
> the Z terms? Or should I just write my own term generator?

There should probably be more configurability in TermGenerator, but
1.0 was already later than hoped, and more options means more
combinations to test, so the current implementation is probably more
hard-wired than is ideal.

There's an option in the code for "hard stopping", but no exposed API
for it yet. If you edit queryparser/termgenerator_internal.cc and
change stop_mode to STOPWORDS_IGNORE then stop words won't be indexed
at all.

If you don't set a stemmer, you'll only get unstemmed terms, without
a Z prefix. If you only want stemmed terms without a prefix, you'll
need to tweak the code, at least for now.
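
For the "write my own term generator" route, a minimal sketch in plain
Python might look like the following. It is only an illustration, not
what TermGenerator does internally: the stop list, the \w+ tokenisation
and the positional_terms() helper are all assumptions made up for this
example. It produces unstemmed, unprefixed terms with positions and
drops stop words entirely, so no Z terms are generated.

```python
import re

# Illustrative stop list only; a real one would be much longer.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is"})


def positional_terms(text, stopwords=STOPWORDS):
    """Yield (term, position) pairs for unstemmed, unprefixed terms.

    Stop words are skipped, but they still consume a position, so the
    distance between the surviving terms is preserved (a design choice
    made for this sketch).
    """
    position = 0
    for match in re.finditer(r"\w+", text.lower()):
        position += 1
        term = match.group()
        if term in stopwords:
            continue
        yield term, position


postings = list(positional_terms("The index of the cat"))
print(postings)
```

Each (term, position) pair could then be passed to the document's
add_posting() method in place of calling TermGenerator. Whether skipped
stop words should keep their positions depends on how you want phrase
searches to behave.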
Cheers,
Olly