On Sat, Nov 01, 2008 at 06:49:03PM +0100, Torsten Foertsch
wrote:
> I am writing an indexer using your Search::Xapian module. From what I
> have learned from xapian-omega/omindex.cc I think the following is
> correct. Can you please confirm it?
> [snip]
It looks plausible - I didn't try to run it though.
> Is it correct to pass UTF8 encoded text to index_text() as done above?
I'm not intimately familiar with Perl's UTF-8 handling, so I'm not sure
about the "as done above" part. But you certainly should pass UTF-8
data to TermGenerator::index_text().
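To illustrate the point (in Python rather than Perl, and with a
stand-in function rather than the real Search::Xapian API): the text
handed to a byte-oriented indexing call should be explicitly encoded
as UTF-8 first.

```python
# Sketch only: index_text() here is a hypothetical stand-in that just
# checks it was handed valid UTF-8 bytes - not the Xapian API itself.
def index_text(data: bytes) -> int:
    """Pretend indexer: verify the input is valid UTF-8 bytes."""
    data.decode("utf-8")  # raises UnicodeDecodeError if it isn't
    return len(data)

text = "naïve café"            # Unicode text with non-ASCII characters
utf8 = text.encode("utf-8")    # the explicit encoding step
print(index_text(utf8))        # 12 bytes, though only 10 characters
```

The same principle applies in Perl: make sure the scalar you pass
holds UTF-8 encoded data, not some other encoding.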
> How about stop words? Do I need them for indexing? I have read somewhere
> that stopwords can be created from the database/termlist. How to do
> that?
It's generally best to index stopwords and perform stopword handling at
search time. That allows phrase searching to include stopwords - a
search for "the cat in the hat" would be less precise if you haven't
indexed `the' or `in' at all. And a search for the band "the the" or
the dance the "can can" would be impossible!
Because stopwords are very common, the postlists and positional data for
them compresses very well, so the space saving isn't as great as you
might expect.
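The search-time approach can be sketched like this (plain Python, not
the Xapian query parser; the stopword list and the quoted-phrase
detection are simplified assumptions for illustration):

```python
STOPWORDS = {"the", "in", "a", "of"}  # toy list for illustration

def query_terms(query: str) -> list[str]:
    """Drop stopwords from loose queries, but keep them inside
    quoted phrases, where word positions matter."""
    if query.startswith('"') and query.endswith('"'):
        # Phrase search: keep every word, including stopwords,
        # so "the cat in the hat" stays precise.
        return query.strip('"').lower().split()
    # Loose search: stopwords add little, so drop them.
    return [w for w in query.lower().split() if w not in STOPWORDS]

print(query_terms('"the cat in the hat"'))  # all five words kept
print(query_terms('the cat in the hat'))    # ['cat', 'hat']
```

Because the stopwords were indexed, the phrase query can still be
matched against the positional data; only the loose query skips them.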
You can create a stopword list by just taking the most frequent terms,
but that can be problematic - it's probably better to at least have a
human vet that list, or just use a standard list. Snowball has lists
for most languages it has stemmers for - for example, English is here:
http://svn.tartarus.org/snowball/trunk/website/algorithms/english/stop.txt?rev=431&view=markup
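Deriving a candidate list from term frequencies might look like the
following sketch (in practice you'd read the frequencies from the
database's termlist rather than re-counting documents, and a human
should still vet the output):

```python
from collections import Counter

def candidate_stopwords(docs: list[str], top_n: int = 3) -> list[str]:
    """Return the top_n most frequent terms as stopword candidates."""
    counts = Counter(w for doc in docs for w in doc.lower().split())
    return [term for term, _ in counts.most_common(top_n)]

docs = [
    "the cat sat on the mat",
    "the dog in the house",
    "a cat and the dog",
]
print(candidate_stopwords(docs))  # 'the' tops the list
```

Note how quickly content words like `cat' and `dog' creep into the
top of the list on a small corpus - exactly why the list needs vetting.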
> I have read your theoretical background page. It is said there that the
> probabilistic IR model is the "correct" one. Other IR systems use
the
> TF/IDF (http://en.wikipedia.org/wiki/Tf-idf) algorithm to compute
> ranks. How is it related to the probabilistic IR?
*Some* other IR systems use TF-IDF. Xapian certainly isn't the only
system to use BM25, and Terrier, for example, favours DfR (see below
for more on this).
I'm not sure the "correct" remark there is very helpful (I didn't
write that document), but the probabilistic model has indeed been very
successful for many years, and the BM25 weighting formula has
consistently scored highly in TREC tests.
TF-IDF is another weighting algorithm, used with the vector space model.
The Wikipedia definition is rather narrow - normally it's taken to cover
a family of formulae where some measure of "term frequency" is
multiplied by some measure of "Inverse Document Frequency" - see here:
http://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html
With this definition, BM25 is a close relative of the TF-IDF family.
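In the usual notation, one representative member of that family (the
family covers many choices of TF and IDF measure) weights term t in
document d as:

```latex
% One common TF-IDF variant; N is the number of documents in the
% collection, df_t the number of documents containing term t.
w_{t,d} = \mathrm{tf}_{t,d} \times \log \frac{N}{\mathrm{df}_t}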
If you ignore the "extra" term-independent part of the formula (by
setting k2 to 0) and the query frequency normalisation (by setting k3 to
0) then you're left with a sum of the product of a TF measure and an
IDF measure (actually, you can probably consider the query frequency
normalisation as part of the TF measure).
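Written out (in the form given by Robertson et al.; published variants
differ in details), with tf the within-document frequency, qtf the
within-query frequency, df_t the document frequency of term t, dl the
document length and avdl the average document length:

```latex
% BM25, one common formulation:
\mathrm{score}(d,q) =
  \sum_{t \in q}
    \log\frac{N - \mathrm{df}_t + 0.5}{\mathrm{df}_t + 0.5}
    \cdot \frac{(k_1 + 1)\,\mathrm{tf}_{t,d}}{K + \mathrm{tf}_{t,d}}
    \cdot \frac{(k_3 + 1)\,\mathrm{qtf}_t}{k_3 + \mathrm{qtf}_t}
  \;+\; k_2 \cdot |q| \cdot
    \frac{\mathrm{avdl} - \mathrm{dl}_d}{\mathrm{avdl} + \mathrm{dl}_d}
\qquad\text{where}\quad
K = k_1\bigl((1-b) + b \cdot \mathrm{dl}_d/\mathrm{avdl}\bigr)
```

With k2 = 0 the final document-length term vanishes, and with k3 = 0
the qtf factor reduces to 1, leaving exactly a sum of TF-style times
IDF-style factors.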
I think the SMART TF-IDF formulae could be implemented for Xapian, but
nobody has yet. BM25 performs better in the studies I've seen, and the
theoretical foundation for the Probabilistic model seems more satisfying
to me, but having TF-IDF schemes available would be useful for academic
studies if nothing else.
I find the Divergence from Randomness (DfR) models more interesting than
TF-IDF, and am (very slowly) working towards implementing them for
Xapian (the only real obstacle is that we need to track some more
statistics in the database to allow an efficient implementation - these
will also make BM25 matching more efficient than it currently is). You can
read about those here:
http://ir.dcs.gla.ac.uk/wiki/DivergenceFromRandomness
DfR models seem to do as well or better than BM25, and some of the
models have the big advantage of being parameter-free. To get the best
results from BM25, you may need to tune the constants used, which isn't
very appealing outside of academic studies. A parameter-free model has
no parameters to be tuned.
Cheers,
Olly