Emmanuel Eckard
2007-Sep-03 10:28 UTC
[Xapian-discuss] Xapian and research in IR: a few suggestions from experience
Greetings,

In its present state, Xapian offers Databases and Documents, from which TermIterators and PostingIterators allow access to individual terms and documents, as well as to information such as Term Frequency (called "Word Document Frequency", wdf) and idf (Inverse Document Frequency) terms.

Recent research in information retrieval has focused on retrieval models based entirely or partially on "latent" features of documents and document collections (for instance, sets of probabilities computed from the TF and IDF terms). This comes down to associating "additional data" with documents, terms and document collections -- the nature of the information, and what it is associated with, varies according to the model. An extension to Xapian generically supporting such retrieval models would make Xapian a huge asset for the information retrieval research community (as well as for document classification and related topics).

I propose to create a layer offering generic methods for accessing, modifying, reading and loading such data. As far as I understand, this would imply:

1) Setting generic "slots" on all objects that might be associated with latent data in a latent retrieval model; this would include the major components of Xapian: Databases, Documents and TermIterators (the "TermIterator::get_latent_data()" method could ideally behave like TermIterator::get_wdf(), by "knowing" whether it has been instantiated from a Document or a Database and returning different results depending on this context).

2) Offering generic read/write methods. Coherence of the data should be maintained, for instance by offering virtual methods "documentAdded()", "documentRemoved()", ... to allow users to apply the necessary operations to their data as needed.

As a less important and less fundamental suggestion, I'd like to mention that in research it is often important to have unique, well-defined identifiers (strings) for documents.
I have seen this done by using prefixed terms (which is not very clean) or by using the "data" field of documents (which lacks an iterator: one cannot jump to one particular document easily this way). It might be interesting to do something on this level (maybe simply by wrapping the "prefixed term" way into something cleaner).

Thank you very much for your interest, and most of all for Xapian itself.

-- 
Emmanuel Eckard
Artificial Intelligence Laboratory, EPFL
LIA/IC 1014 Ecublens, Suisse
+41 21 693 66 97

()  ascii ribbon campaign - against html mail
/\                        - against microsoft attachments
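The proposed latent-data layer with coherence callbacks might look something like the following minimal sketch. All class and method names here (LatentDataListener, LatentStore, get_latent_data, ...) are hypothetical illustrations of the proposal, not existing Xapian API:

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Hypothetical listener interface: user code receives coherence
// callbacks so its latent statistics stay in step with the index,
// as suggested with documentAdded()/documentRemoved().
struct LatentDataListener {
    virtual void documentAdded(unsigned docid) = 0;
    virtual void documentRemoved(unsigned docid) = 0;
    virtual ~LatentDataListener() {}
};

// Hypothetical store offering generic per-document "slots" for latent
// values (e.g. probabilities computed from the TF and IDF terms).
class LatentStore {
    std::map<unsigned, std::vector<double>> slots;
    LatentDataListener* listener;
public:
    explicit LatentStore(LatentDataListener* l = nullptr) : listener(l) {}

    void add_document(unsigned docid, std::vector<double> latent) {
        slots[docid] = std::move(latent);
        if (listener) listener->documentAdded(docid);
    }

    void remove_document(unsigned docid) {
        slots.erase(docid);
        if (listener) listener->documentRemoved(docid);
    }

    const std::vector<double>& get_latent_data(unsigned docid) const {
        return slots.at(docid);  // throws if the docid is unknown
    }
};

// A trivial listener that counts callbacks, for illustration.
struct CountingListener : LatentDataListener {
    int added = 0, removed = 0;
    void documentAdded(unsigned) override { ++added; }
    void documentRemoved(unsigned) override { ++removed; }
};
```

Note that (as discussed later in the thread) callbacks interact badly with some language bindings, so a real design might instead pass a plain object whose hooks are invoked.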
Olly Betts
2007-Sep-04 20:55 UTC
[Xapian-discuss] Xapian and research in IR: a few suggestions from experience
On Mon, Sep 03, 2007 at 11:27:27AM +0200, Emmanuel Eckard wrote:
> In its present state, Xapian offers Databases and Documents, from where
> TermIterators and PostingIterators allow accessing individual terms and
> documents, as well as information such as Term Frequency (called "Word
> Document Frequency", wdf)

Close, it's "Within Document Frequency"!

> and idf (Inverse Document Frequency) terms.

To be precise, we store the number of documents which each term occurs in, which can be used to calculate the idf.

> 1) Set generic "slots" on all objects susceptible of being associated with
> latent data in a latent retrieval model; this would include the major
> components of Xapian: Databases, Documents and TermIterators

Do you have some pointers to the models you have in mind so I can get an idea what sort of data we might be talking about? (I can see this could be useful for storing a "reputation" score for each document, derived from link analysis, user clicks, etc.)

I've actually been looking at implementing support for Divergence from Randomness models:

http://ir.dcs.gla.ac.uk/wiki/DivergenceFromRandomness

Studies show that these can give retrieval results as good as or better than BM25, but some of them are parameter free, whereas BM25 has 4 parameters which ideally need tuning for best performance - the best values vary depending on the documents in the collection. In an academic study people are willing to do that, but in the real world it's rarely done. So being parameter free is a great feature for us.

To implement these efficiently, we need to know some more per-database statistics (for example, an upper bound on the document length), so I'm looking at storing these. Note that at least in this case they can't just be opaque data, since the database backend needs to update their values, and we need to know how to derive the combined statistic when searching several databases.
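The point about deriving combined statistics across databases can be illustrated with a small sketch: some statistics combine by summing, others by taking a maximum. The struct and field names here are hypothetical, not Xapian API:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical per-database statistics of the kind discussed above.
struct DbStats {
    unsigned long doc_count;    // number of documents in the database
    unsigned doc_length_upper;  // upper bound on document length
};

// Combining statistics for a multi-database search: document counts
// add up, while a valid upper bound on document length across all
// databases is the maximum of the per-database bounds.
DbStats combine(const std::vector<DbStats>& dbs) {
    DbStats out{0, 0};
    for (const DbStats& s : dbs) {
        out.doc_count += s.doc_count;
        out.doc_length_upper = std::max(out.doc_length_upper,
                                        s.doc_length_upper);
    }
    return out;
}
```

This is why such statistics cannot be fully opaque to the library: the search code must know the combination rule for each one.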
They could be generic data with suitable callbacks (as you suggest), but in this case I think it makes more sense to just code them in directly, since the space required for per-database statistics is a small fixed overhead, and some of the statistics could also be used to help optimise BM25 weighted searches. Also, callbacks aren't available with some of the bindings (e.g. PHP, Tcl).

I don't think the DfR schemes will need extra per-document or per-posting information. An extra issue there is that it's hard to know how best to compactly store generic information, which is especially important for per-posting statistics. Perhaps that can be addressed by just leaving packing and unpacking up to the user.

> 2) Offer generic read/write methods. Coherence of the data should be
> maintained, for instance by offering virtual methods "documentAdded()",
> "documentRemoved()", ... to allow the user to apply necessary operations
> to his data as needed.

I don't think we can usefully support subclassing of the PIMPL classes (i.e. most of the Xapian API classes), but that's a detail really - there are other ways to allow this, such as passing in an object upon which these hooks will be called.

> For a less important and fundamental suggestion, I'd like to mention that in
> research, it is often important to have unique and determined identifiers
> (strings) for documents. I have seen this done by using prefixed terms (which
> is not very clean) or by using the "data" field of documents (which lacks an
> iterator: one cannot jump to one particular document easily this way). It
> might be interesting to do something on this level (maybe simply by wrapping
> the "prefixed term" way into something cleaner).

What do you have in mind? You can already add/replace or delete a document by term.
An overloaded version of get_document() which could retrieve the first document matching a particular term would be fairly easy to add, and might save some internal work over creating a PostingIterator.

Actually using arbitrary strings as docids would be more tricky. Non-sparse numeric ids are easy to compress, and useful externally too (e.g. omindex uses a vector<bool> to track which docids it has updated during a run). Although non-numeric ids would avoid the mapping of docids we currently have to do when searching multiple databases.

Cheers,
    Olly
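As noted above, what is stored per term is the number of documents it occurs in, from which the idf can be derived. A minimal sketch of the classic log formulation (weighting schemes such as BM25 use smoothed variants of this):

```cpp
#include <cassert>
#include <cmath>

// Classic inverse document frequency: log of the collection size N
// divided by the number of documents the term occurs in. Rare terms
// get high idf, a term occurring everywhere gets idf 0.
double idf(double collection_size, double term_doc_freq) {
    return std::log(collection_size / term_doc_freq);
}
```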
Emmanuel Eckard
2007-Sep-05 17:45 UTC
[Xapian-discuss] Re: Xapian and research in IR: a few suggestions from experience
> Do you have some pointers to the models you have in mind so I can get
> an idea what sort of data we might be talking about? (I can see this
> could be useful for storing a "reputation" score for each document,
> derived from link analysis, user clicks, etc.)

I am now working with PLSA (Probabilistic Latent Semantic Analysis), which assumes that documents (d) and terms (w) are associated with categories (z), and represents the data as mixture models of P(z), P(d|z) and P(w|z). I have implemented this model using objects built around Xapian. There is also some work done on naive Bayesian models, and on Latent Dirichlet Allocation. All these models would call for doubles, or vectors of doubles, to be associated with Documents, TermIterators and Databases.

> > For a less important and fundamental suggestion, I'd like to mention that
> > in research, it is often important to have unique and determined
> > identifiers (strings) for documents. I have seen this done by using
> > prefixed terms (which is not very clean) or by using the "data" field of
> > documents (which lacks an iterator: one cannot jump to one particular
> > document easily this way). It might be interesting to do something on
> > this level (maybe simply by wrapping the "prefixed term" way into
> > something cleaner).
>
> What do you have in mind? You can already add/replace or delete a
> document by term. An overloaded version of get_document() which could
> retrieve the first document matching a particular term would be fairly
> easy to add and might save some internal work over creating a
> PostingIterator.

I was thinking of the toolchain of a scientist working on TREC, for instance: documents identified by string docIds are indexed, retrieval is applied, then the programme outputs a codified list of documents (the string docIds) which is used for evaluation with trec_eval.
Presently I store the string docIds in the "data" field, so there would be no elegant way for me to retrieve a document given its string docId. But I have not felt the need for this yet, so it's a nicety, really.

Cheers !

-- 
Emmanuel Eckard
Artificial Intelligence Laboratory, EPFL
LIA/IC 1014 Ecublens, Suisse
+41 21 693 66 97

()  ascii ribbon campaign - against html mail
/\                        - against microsoft attachments
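The PLSA decomposition described in this message can be sketched as follows: the joint probability of a (document, term) pair is a mixture over latent categories, P(d,w) = sum_z P(z) P(d|z) P(w|z). This is an illustrative stand-alone sketch, not Xapian API; the probability tables would be the "vectors of doubles" associated with Databases, Documents and terms:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// PLSA joint probability of document d and term w as a mixture over
// latent categories z:
//   P(d, w) = sum_z P(z) * P(d|z) * P(w|z)
// pd_given_z[z][d] holds P(d|z), pw_given_z[z][w] holds P(w|z).
double plsa_joint(const std::vector<double>& pz,
                  const std::vector<std::vector<double>>& pd_given_z,
                  const std::vector<std::vector<double>>& pw_given_z,
                  std::size_t d, std::size_t w) {
    double p = 0.0;
    for (std::size_t z = 0; z < pz.size(); ++z)
        p += pz[z] * pd_given_z[z][d] * pw_given_z[z][w];
    return p;
}
```

In practice these tables are estimated by EM over the (d, w) co-occurrence counts, which is exactly the per-document and per-term data the proposed latent-data layer would hold.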