Akshay M S
2012-Mar-05 18:09 UTC
[Xapian-devel] Interested in IR, Getting started with Xapian
Hi everyone,

I'm Akshay, an Information Science undergrad from Bangalore. I'm interested in Information Retrieval, and I'd like to contribute to Xapian as part of GSoC and later to pursue that interest further. I liked the idea of adding more weighting schemes (Project #2). Last semester I did a project on document retrieval on Hadoop using TF-IDF and cosine similarity (the query had to be a document). I've read about BM25 from the resources, but I don't yet have a good understanding of DFR, so I'm referring to [1] and [2] for more information on DFR in addition to the resources mentioned on the page.

Also, in the project description I couldn't understand this part:

"Additionally, for faster searching, an upper bound on each component is needed (each database stores a number of summary statistics to help with this - if additional statistics would be useful, you could add them as part of the project)."

I'm thinking "components" refers to the per-document term weights and the DF/IDF weights? Any elaboration would be really helpful.

Can someone please point me to patches/bugs related to this project so I can understand the existing code better, especially anything involving the Xapian::Weight class, or anything else that can get me started with the Xapian codebase?

References:
[1] http://dl.acm.org/citation.cfm?id=582416&dl=ACM&coll=DL&CFID=88428217&CFTOKEN=33171494
[2] http://terrier.org/docs/v3.5/dfr_description.html

--
Regards,
Akshay
Olly Betts
2012-Mar-06 23:44 UTC
[Xapian-devel] Interested in IR, Getting started with Xapian
On Mon, Mar 05, 2012 at 11:39:31PM +0530, Akshay M S wrote:
> And in the project description, I couldn't understand this -
> "Additionally, for faster searching, an upper bound on each component is
> needed (each database stores a number of summary statistics to help with
> this - if additional statistics would be useful, you could add them as
> part of the project)."
> I'm thinking components refer to the per-document term weights and the
> DF/IDF weights? Any elaboration would be really helpful.

Xapian assumes that a document's weight is the sum of a contribution from each term from the query which indexes it, plus an optional per-document contribution which is independent of the query. You can also add in other contributions (e.g. from link analysis, click-through data, how much an advertiser has paid you to boost their pages (yuck), etc.) using a PostingSource.

For each of these contributions, we want an upper bound on what the contribution can be for a particular term (or PostingSource) in a particular query.

A concrete example is probably the best way to show how this can be useful, so consider a query:

    moon OR cheese

Say we want to get the best 10 matches (i.e. highest weights). We consider documents in ascending order of the numeric Xapian document id. Once we have found 10, we can ignore anything which has a weight less than the lowest weight of those 10, and as we gradually improve our candidate set of 10, that minimum weight rises.

If we have bounds on the weights which "moon" and "cheese" can return, then at some point we can spot that both terms will need to match to achieve that minimum weight - for example, if "moon" can return at most 2.0 and "cheese" at most 3.0, and the lowest weight in our current best 10 is 4.0. Once this happens, we can convert that OR to AND, and that allows us to skip through the document ids more quickly.

There are a number of optimisations we have which make use of these bounds - in some cases we can even spot that there's no chance of getting another candidate and stop early. The tighter the bound you can calculate for a weighting scheme, the more effective these optimisations can be. If implementing a new backend, you could just return the maximum value of the type and Xapian will work, but it won't be as fast.

One of the improvements the chert backend has over the flint backend is that it keeps track of bounds on document length and within-document frequency, which allow tighter bounds to be calculated for BM25. If you implement a new weighting scheme, you might find that keeping track of one or more additional statistics would allow a tighter bound to be calculated, so modifying the backend to track these could be worthwhile (and older backends could just return a really high or low value, depending on whether it's an upper or lower bound).

> Can someone please point me to patches/bugs that are related to this
> project so I can understand the existing code better, especially related
> to the Xapian::Weight class or anything else that can get me started with
> the Xapian codebase?

I don't think there are any patches or bugs related to this one.

This page talks a bit more about the optimisations, but isn't currently completely up to date:

http://xapian.org/docs/matcherdesign.html

And you can see an implementation of coordinate weighting (probably the simplest weighting scheme after BoolWeight) here:

http://trac.xapian.org/browser/trunk/xapian-core/tests/api_db.cc#L1755

Cheers,
    Olly
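To make the role of the upper bound concrete, here is a rough sketch of a coordinate-style weighting scheme in the spirit of the example linked above: every matching query term contributes exactly 1.0, so the bound reported by get_maxpart() is trivially tight. The class name CoordinateWeight is only illustrative, and the method signatures follow the Xapian::Weight API as it stood in the 1.2 series (get_sumpart() taking just the wdf and the document length); check include/xapian/weight.h for the exact interface in the version you build against.

    // Illustrative sketch only: a "coordinate" weighting scheme where each
    // matching query term contributes exactly 1.0 to the document weight,
    // so the upper bound reported by get_maxpart() is trivially tight.
    #include <xapian.h>
    #include <string>

    class CoordinateWeight : public Xapian::Weight {
        // No per-query initialisation or database statistics needed.
        void init(double) { }

      public:
        CoordinateWeight() { }

        CoordinateWeight * clone() const { return new CoordinateWeight(); }

        std::string name() const { return "coordinate"; }

        // Contribution of one matching term to one document: always 1.
        double get_sumpart(Xapian::termcount /*wdf*/,
                           Xapian::termcount /*doclen*/) const {
            return 1.0;
        }

        // Upper bound on get_sumpart() - the matcher uses this for the
        // OR-to-AND and early-termination optimisations described above.
        double get_maxpart() const { return 1.0; }

        // No query-independent per-document contribution.
        double get_sumextra(Xapian::termcount /*doclen*/) const { return 0.0; }
        double get_maxextra() const { return 0.0; }

        // serialise()/unserialise() are omitted here; they only matter if
        // you want the scheme to work with the remote backend.
    };

An instance would then be plugged in via Enquire::set_weighting_scheme(), e.g. enquire.set_weighting_scheme(CoordinateWeight()) before calling get_mset().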