On Thu, Mar 26, 2009 at 12:14 PM, Ben Campbell <ben at scumways.com> wrote:
> I'm looking at adding some stopwords to my indexing procedure, and was
> wondering if anyone had any good rules of thumb on how to pick which
> words to blacklist. It all seems a little... well... vague. Although I
> guess it kind of depends on the sort of documents you're wanting to
> index.
You may want to consider not defining any stop words at all. You
probably don't need them, and in most cases they have downsides
and very little benefit.
See http://xapian.org/docs/stemming.html (near the end), which
discusses stop words.
> My current idea is to write a little script to output the terms with the
> highest frequency in my existing database (just over 1 million
> documents), manually eyeball that list to make sure it's sensible, and
> then use them as my stopwords.
If you do want them though, you could use Xapian itself to give
you the list. Just index everything once without any stop words,
to get a "corpus". Then Xapian can tell you the most frequent terms.
Those would be your candidate stop words; you can then go back
and re-index everything with the stop words in place. If you want to.
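
For what it's worth, here is a rough sketch of how that listing might
look with the Python bindings. The names top_terms and
xapian_term_freqs are just ones I made up for illustration, and it
assumes the bindings expose Database.allterms() items with .term and
.termfreq attributes. Note that termfreq here is the number of
documents containing the term, and that the list may include prefixed
terms, which you would probably want to filter out by eye.

```python
import heapq


def top_terms(term_freqs, n=50):
    """Return the n (term, freq) pairs with the highest frequency,
    most frequent first. term_freqs is any iterable of pairs."""
    return heapq.nlargest(n, term_freqs, key=lambda tf: tf[1])


def xapian_term_freqs(db_path):
    """Yield (term, document frequency) for every term in the database.

    Hypothetical helper: assumes the Xapian Python bindings are installed.
    """
    import xapian  # imported lazily so top_terms works without the bindings

    db = xapian.Database(db_path)
    for item in db.allterms():
        term = item.term
        # Newer bindings return terms as bytes; decode for display.
        if isinstance(term, bytes):
            term = term.decode("utf-8", "replace")
        yield term, item.termfreq


# Example usage (the database path is a placeholder):
# for term, freq in top_terms(xapian_term_freqs("/path/to/db"), n=100):
#     print(freq, term)
```

heapq.nlargest avoids sorting the whole term list, which may matter
a little with a million documents, though a plain sort would
probably do fine too.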
> (I only need to worry about english language for now, which helps
> a little :-)
Certainly. Stop words are very problematic if not impossible to
define reasonably in multi-lingual indexes.
--
Deron Meranda