Emmanuel Eckard
2007-Sep-03 10:28 UTC
[Xapian-discuss] Xapian and research in IR: a few suggestions from experience
Greetings,

In its present state, Xapian offers Databases and Documents, from which TermIterators and PostingIterators allow access to individual terms and documents, as well as to information such as Term Frequency (called "Word Document Frequency", wdf) and idf (Inverse Document Frequency) terms.

Recent research in information retrieval has focused on retrieval models based entirely or partially on "latent" features of documents and document collections (for instance, sets of probabilities computed from the TF and IDF terms). This comes down to associating "additional data" with documents, terms and document collections -- the nature of the information, and what it is associated with, varies according to the model. An extension to Xapian generically supporting such retrieval models would make Xapian a huge asset for the information retrieval research community (as well as for document classification and related topics).

I propose to create a layer offering generic methods for accessing, modifying, reading and loading such data. As far as I understand, this would imply:

1) Setting generic "slots" on all objects that might be associated with latent data in a latent retrieval model; this would include the major components of Xapian: Databases, Documents and TermIterators (the "TermIterator::get_latent_data()" method could ideally behave like TermIterator::get_wdf(), by "knowing" whether it has been instantiated from a Document or a Database and returning different results depending on this context).

2) Offering generic read/write methods. Coherence of the data should be maintained, for instance by offering virtual methods "documentAdded()", "documentRemoved()", ... to allow users to apply the necessary operations to their data as needed.

As a less important and less fundamental suggestion, I'd like to mention that in research it is often important to have unique, well-defined identifiers (strings) for documents.
I have seen this done by using prefixed terms (which is not very clean) or by using the "data" field of documents (which lacks an iterator: one cannot jump to one particular document easily this way). It might be interesting to do something on this level (maybe simply by wrapping the "prefixed term" way into something cleaner).

Thank you very much for your interest, and most of all for Xapian itself.

-- 
Emmanuel Eckard
Artificial Intelligence Laboratory, EPFL
LIA/IC 1014 Ecublens, Suisse
+41 21 693 66 97

()  ascii ribbon campaign - against html mail
/\                        - against microsoft attachments
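The proposed latent-data layer with coherence callbacks might look something like the following minimal sketch. All class and method names here (LatentDataListener, LatentStore, get_latent_data, ...) are hypothetical illustrations of the proposal, not existing Xapian API:

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Hypothetical listener interface: user code receives coherence
// callbacks so its latent statistics stay in step with the index,
// as suggested with documentAdded()/documentRemoved().
struct LatentDataListener {
    virtual void documentAdded(unsigned docid) = 0;
    virtual void documentRemoved(unsigned docid) = 0;
    virtual ~LatentDataListener() {}
};

// Hypothetical store offering generic per-document "slots" for latent
// values (e.g. probabilities computed from the TF and IDF terms).
class LatentStore {
    std::map<unsigned, std::vector<double>> slots;
    LatentDataListener* listener;
public:
    explicit LatentStore(LatentDataListener* l = nullptr) : listener(l) {}

    void add_document(unsigned docid, std::vector<double> latent) {
        slots[docid] = std::move(latent);
        if (listener) listener->documentAdded(docid);
    }

    void remove_document(unsigned docid) {
        slots.erase(docid);
        if (listener) listener->documentRemoved(docid);
    }

    const std::vector<double>& get_latent_data(unsigned docid) const {
        return slots.at(docid);  // throws if the docid is unknown
    }
};

// A trivial listener that counts callbacks, for illustration.
struct CountingListener : LatentDataListener {
    int added = 0, removed = 0;
    void documentAdded(unsigned) override { ++added; }
    void documentRemoved(unsigned) override { ++removed; }
};
```

Note that (as discussed later in the thread) callbacks interact badly with some language bindings, so a real design might instead pass a plain object whose hooks are invoked.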
Olly Betts
2007-Sep-04 20:55 UTC
[Xapian-discuss] Xapian and research in IR: a few suggestions from experience
On Mon, Sep 03, 2007 at 11:27:27AM +0200, Emmanuel Eckard wrote:
> In its present state, Xapian offers Databases and Documents, from where
> TermIterators and PostingIterators allow accessing individual terms and
> documents, as well as information such as Term Frequency (called "Word
> Document Frequency", wdf)

Close, it's "Within Document Frequency"!

> and idf (Inverse Document Frequency) terms.

To be precise, we store the number of documents which each term occurs in, which can be used to calculate the idf.

> 1) Set generic "slots" on all objects susceptible of being associated with
> latent data in a latent retrieval model; this would include the major
> components of Xapian: Databases, Documents and TermIterators

Do you have some pointers to the models you have in mind so I can get an idea what sort of data we might be talking about? (I can see this could be useful for storing a "reputation" score for each document, derived from link analysis, user clicks, etc.)

I've actually been looking at implementing support for Divergence from Randomness models:

http://ir.dcs.gla.ac.uk/wiki/DivergenceFromRandomness

Studies show that these can give retrieval results as good as or better than BM25, but some of them are parameter free, whereas BM25 has 4 parameters which ideally need tuning for best performance - the best values vary depending on the documents in the collection. In an academic study people are willing to do that, but in the real world it's rarely done. So being parameter free is a great feature for us.

To implement these efficiently, we need to know some more per-database statistics (for example, an upper bound on the document length), so I'm looking at storing these. Note that at least in this case they can't just be opaque data, since the database backend needs to update their values, and we need to know how to derive the combined statistic when searching several databases.
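The point about deriving combined statistics across databases can be illustrated with a small sketch: some statistics combine by summing, others by taking a maximum. The struct and field names here are hypothetical, not Xapian API:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical per-database statistics of the kind discussed above.
struct DbStats {
    unsigned long doc_count;    // number of documents in the database
    unsigned doc_length_upper;  // upper bound on document length
};

// Combining statistics for a multi-database search: document counts
// add up, while a valid upper bound on document length across all
// databases is the maximum of the per-database bounds.
DbStats combine(const std::vector<DbStats>& dbs) {
    DbStats out{0, 0};
    for (const DbStats& s : dbs) {
        out.doc_count += s.doc_count;
        out.doc_length_upper = std::max(out.doc_length_upper,
                                        s.doc_length_upper);
    }
    return out;
}
```

This is why such statistics cannot be fully opaque to the library: the search code must know the combination rule for each one.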
They could be generic data with suitable callbacks (as you suggest), but in this case I think it makes more sense to just code them in directly, since the space required for per-database statistics is a small fixed overhead, and some of the statistics could also be used to help optimise BM25 weighted searches. Also, callbacks aren't available with some of the bindings (e.g. PHP, Tcl).

I don't think the DfR schemes will need extra per-document or per-posting information. An extra issue there is that it's hard to know how best to compactly store generic information, which is especially important for per-posting statistics. Perhaps that can be addressed by just leaving packing and unpacking up to the user.

> 2) Offer generic read/write methods. Coherence of the data should be
> maintained, for instance by offering virtual methods "documentAdded()",
> "documentRemoved()", ... to allow the user to apply necessary operations
> to his data as needed.

I don't think we can usefully support subclassing of the PIMPL classes (i.e. most of the Xapian API classes), but that's a detail really - there are other ways to allow this, such as passing in an object upon which these hooks will be called.

> For a less important and fundamental suggestion, I'd like to mention that in
> research, it is often important to have unique and determined identifiers
> (strings) for documents. I have seen this done by using prefixed terms (which
> is not very clean) or by using the "data" field of documents (which lacks an
> iterator: one cannot jump to one particular document easily this way). It
> might be interesting to do something on this level (maybe simply by wrapping
> the "prefixed term" way into something cleaner).

What do you have in mind? You can already add/replace or delete a document by term.
An overloaded version of get_document() which could retrieve the first document matching a particular term would be fairly easy to add, and might save some internal work over creating a PostingIterator.

Actually using arbitrary strings as docids would be more tricky. Non-sparse numeric ids are easy to compress, and useful externally too (e.g. omindex uses a vector<bool> to track which docids it has updated during a run). Although non-numeric ids would avoid the mapping of docids we currently have to do when searching multiple databases.

Cheers,
    Olly
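As noted above, what is stored per term is the number of documents it occurs in, from which the idf can be derived. A minimal sketch of the classic log formulation (weighting schemes such as BM25 use smoothed variants of this):

```cpp
#include <cassert>
#include <cmath>

// Classic inverse document frequency: log of the collection size N
// divided by the number of documents the term occurs in. Rare terms
// get high idf, a term occurring everywhere gets idf 0.
double idf(double collection_size, double term_doc_freq) {
    return std::log(collection_size / term_doc_freq);
}
```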
Emmanuel Eckard
2007-Sep-05 17:45 UTC
[Xapian-discuss] Re: Xapian and research in IR: a few suggestions from experience
> Do you have some pointers to the models you have in mind so I can get
> an idea what sort of data we might be talking about? (I can see this
> could be useful for storing a "reputation" score for each document,
> derived from link analysis, user clicks, etc.)

I am now working with PLSA (Probabilistic Latent Semantic Analysis), which assumes that documents (d) and terms (w) are associated with categories (z), and represents the data as mixture models of P(z), P(d|z) and P(w|z). I have implemented this model using objects built around Xapian. There is also some work done on naive Bayesian models, and on Latent Dirichlet Allocation. All these models would call for doubles, or vectors of doubles, to be associated with Documents, TermIterators and Databases.

> > For a less important and fundamental suggestion, I'd like to mention that
> > in research, it is often important to have unique and determined
> > identifiers (strings) for documents. I have seen this done by using
> > prefixed terms (which is not very clean) or by using the "data" field of
> > documents (which lacks an iterator: one cannot jump to one particular
> > document easily this way). It might be interesting to do something on
> > this level (maybe simply by wrapping the "prefixed term" way into
> > something cleaner).
>
> What do you have in mind? You can already add/replace or delete a
> document by term. An overloaded version of get_document() which could
> retrieve the first document matching a particular term would be fairly
> easy to add and might save some internal work over creating a
> PostingIterator.

I was thinking of the toolchain of a scientist working on TREC, for instance: documents identified by string docIds are indexed, retrieval is applied, then the programme outputs a codified list of documents (the string docIds) which is used for evaluation with trec_eval.
Presently I store the string docIds in the "data" field, so there would be no elegant way for me to retrieve a document given its string docId. But I have not felt the need for this yet, so it's a nicety, really.

Cheers !

-- 
Emmanuel Eckard
Artificial Intelligence Laboratory, EPFL
LIA/IC 1014 Ecublens, Suisse
+41 21 693 66 97

()  ascii ribbon campaign - against html mail
/\                        - against microsoft attachments
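The PLSA decomposition described in this message can be sketched as follows: the joint probability of a (document, term) pair is a mixture over latent categories, P(d,w) = sum_z P(z) P(d|z) P(w|z). This is an illustrative stand-alone sketch, not Xapian API; the probability tables would be the "vectors of doubles" associated with Databases, Documents and terms:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// PLSA joint probability of document d and term w as a mixture over
// latent categories z:
//   P(d, w) = sum_z P(z) * P(d|z) * P(w|z)
// pd_given_z[z][d] holds P(d|z), pw_given_z[z][w] holds P(w|z).
double plsa_joint(const std::vector<double>& pz,
                  const std::vector<std::vector<double>>& pd_given_z,
                  const std::vector<std::vector<double>>& pw_given_z,
                  std::size_t d, std::size_t w) {
    double p = 0.0;
    for (std::size_t z = 0; z < pz.size(); ++z)
        p += pz[z] * pd_given_z[z][d] * pw_given_z[z][w];
    return p;
}
```

In practice these tables are estimated by EM over the (d, w) co-occurrence counts, which is exactly the per-document and per-term data the proposed latent-data layer would hold.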