Akshay M S
2012-Mar-05 18:09 UTC
[Xapian-devel] Interested in IR, Getting started with Xapian
Hi everyone,

I'm Akshay, an Information Science undergrad from Bangalore. I'm interested in Information Retrieval, and I'd like to contribute to Xapian as part of GSoC and later to pursue that interest further. I liked the idea of adding more weighting schemes (Project #2). Last semester I did a project on document retrieval on Hadoop using TF-IDF and cosine similarity (the query had to be a document). I've read about BM25 from the resources, but I don't yet have a good understanding of DFR, so I'm referring to [1] and [2] for more information on DFR in addition to the resources mentioned on the page.

Also, in the project description I couldn't understand this part:

"Additionally, for faster searching, an upper bound on each component is needed (each database stores a number of summary statistics to help with this - if additional statistics would be useful, you could add them as part of the project)."

I'm thinking "components" refers to the per-document term weights and the DF/IDF weights? Any elaboration would be really helpful.

Can someone please point me to patches/bugs related to this project so I can understand the existing code better, especially anything involving the Xapian::Weight class, or anything else that can get me started with the Xapian codebase?

References:
[1] http://dl.acm.org/citation.cfm?id=582416&dl=ACM&coll=DL&CFID=88428217&CFTOKEN=33171494
[2] http://terrier.org/docs/v3.5/dfr_description.html

--
Regards,
Akshay
Olly Betts
2012-Mar-06 23:44 UTC
[Xapian-devel] Interested in IR, Getting started with Xapian
On Mon, Mar 05, 2012 at 11:39:31PM +0530, Akshay M S wrote:
> And in the project description, I couldn't understand this -
> "Additionally, for faster searching, an upper bound on each component is
> needed (each database stores a number of summary statistics to help with
> this - if additional statistics would be useful, you could add them as
> part of the project)."
> I'm thinking components refer to the per-document term weights and the
> DF/IDF weights? Any elaboration would be really helpful.

Xapian assumes that a document's weight is the sum of a contribution from each term from the query which indexes it, plus an optional per-document contribution which is independent of the query. You can also add in other contributions (e.g. from link analysis, click-through data, how much an advertiser has paid you to boost their pages (yuck), etc.) using a PostingSource.

For each of these contributions, we want an upper bound on what the contribution can be for a particular term (or PostingSource) in a particular query.

A concrete example is probably the best way to show how this can be useful, so consider a query:

    moon OR cheese

Say we want to get the best 10 matches (i.e. highest weights). We consider documents in ascending order of the numeric Xapian document id. Once we have found 10, we can ignore anything which has a weight less than the lowest weight of those 10, and as we gradually improve our candidate set of 10, that minimum weight rises.

If we have bounds on the weights which "moon" and "cheese" can return, then at some point we can spot that both terms will need to match to achieve that minimum weight - for example, if "moon" can return at most 2.0 and "cheese" at most 3.0, and the lowest weight in our current best 10 is 4.0. Once this happens, we can convert that OR to AND, and that allows us to skip through the document ids more quickly.

There are a number of optimisations we have which make use of these bounds - in some cases we can even spot that there's no chance of getting another candidate and stop early. The tighter the bound you can calculate for a weighting scheme, the more effective these optimisations can be. If implementing a new backend, you could just return the maximum value of the type and Xapian will work, but it won't be as fast.

One of the improvements the chert backend has over the flint backend is that it keeps track of bounds on document length and within-document frequency, which allow tighter bounds to be calculated for BM25. If you implement a new weighting scheme, you might find that keeping track of one or more additional statistics would allow a tighter bound to be calculated, so modifying the backend to track these could be worthwhile (and older backends could just return a really high or low value, depending on whether it's an upper or lower bound).

> Can someone please point me to patches/bugs that are related to this
> project so I can understand the existing code better, especially related
> to the Xapian::Weight class or anything else that can get me started with
> the Xapian codebase?

I don't think there are any patches or bugs related to this one.

This page talks a bit more about the optimisations, but isn't currently completely up to date:

http://xapian.org/docs/matcherdesign.html

And you can see an implementation of coordinate weighting (probably the simplest weighting scheme after BoolWeight) here:

http://trac.xapian.org/browser/trunk/xapian-core/tests/api_db.cc#L1755

Cheers,
    Olly
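To make the role of the upper bound concrete, here is a rough sketch of a coordinate-style weighting scheme in the spirit of the example linked above: every matching query term contributes exactly 1.0, so the bound reported by get_maxpart() is trivially tight. The class name CoordinateWeight is only illustrative, and the method signatures follow the Xapian::Weight API as it stood in the 1.2 series (get_sumpart() taking just the wdf and the document length); check include/xapian/weight.h for the exact interface in the version you build against.

    // Illustrative sketch only: a "coordinate" weighting scheme where each
    // matching query term contributes exactly 1.0 to the document weight,
    // so the upper bound reported by get_maxpart() is trivially tight.
    #include <xapian.h>
    #include <string>

    class CoordinateWeight : public Xapian::Weight {
        // No per-query initialisation or database statistics needed.
        void init(double) { }

      public:
        CoordinateWeight() { }

        CoordinateWeight * clone() const { return new CoordinateWeight(); }

        std::string name() const { return "coordinate"; }

        // Contribution of one matching term to one document: always 1.
        double get_sumpart(Xapian::termcount /*wdf*/,
                           Xapian::termcount /*doclen*/) const {
            return 1.0;
        }

        // Upper bound on get_sumpart() - the matcher uses this for the
        // OR-to-AND and early-termination optimisations described above.
        double get_maxpart() const { return 1.0; }

        // No query-independent per-document contribution.
        double get_sumextra(Xapian::termcount /*doclen*/) const { return 0.0; }
        double get_maxextra() const { return 0.0; }

        // serialise()/unserialise() are omitted here; they only matter if
        // you want the scheme to work with the remote backend.
    };

An instance would then be plugged in via Enquire::set_weighting_scheme(), e.g. enquire.set_weighting_scheme(CoordinateWeight()) before calling get_mset().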