thr3ads.net - Xapian devel - [Xapian-devel] GSOC 2015 [Feb 2015]

Richhiey Thomas

2015-Feb-25 18:36 UTC

[Xapian-devel] GSOC 2015

Hello xapian devs,

For GSOC 2015, I would like to work on Hiemstra's language modelling and
LDA-based relevance language modelling for the project idea 'Weighting
schemes for Xapian'.

Hiemstra's LM:

Hiemstra et al. proposed parsimonious language models, which model only the
language use that distinguishes a relevant document from other documents.
For example, adding words which are common in the English language to a
document's language model would only make the model larger and less
effective. A parsimonious LM reduces the number of parameters required to
model the data.
This approach can be used for indexing and ranking documents and is
implemented with the help of a mixture model. The mixture model can use two
or more language model components. In this case, based on the paper linked
below, it uses a background language model and a document model, estimated
with the expectation-maximization (EM) algorithm. At retrieval time, it also
uses a relevance (or request) model, and documents are ranked by the
Kullback-Leibler divergence between this model and the document model.
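
To make the estimation step concrete, here is a minimal Python sketch of EM for a two-component mixture of this kind. It is an illustration under stated assumptions, not the paper's implementation and not Xapian code: the function names, the mixing weight `lam`, the pruning `threshold`, the toy background probabilities, and the choice to smooth the document model with the background model before computing the divergence are all hypothetical choices made for this example.

```python
from collections import Counter
import math


def parsimonious_model(doc_tokens, background, lam=0.1, iterations=20,
                       threshold=1e-4):
    """Estimate a parsimonious P(t|D) by EM against a fixed background P(t|C)."""
    tf = Counter(doc_tokens)
    total = sum(tf.values())
    # Start from the maximum-likelihood document model.
    p_doc = {t: n / total for t, n in tf.items()}
    for _ in range(iterations):
        # E-step: expected number of occurrences of each term that the
        # document model (rather than the background model) explains.
        expected = {}
        for t, p in p_doc.items():
            doc_part = lam * p
            bg_part = (1.0 - lam) * background.get(t, 0.0)
            expected[t] = tf[t] * doc_part / (bg_part + doc_part)
        # M-step: renormalise, prune terms below the threshold, and
        # renormalise again so the remaining probabilities sum to 1.
        norm = sum(expected.values())
        p_doc = {t: e / norm for t, e in expected.items()
                 if e / norm >= threshold}
        norm = sum(p_doc.values())
        p_doc = {t: p / norm for t, p in p_doc.items()}
    return p_doc


def kl_divergence(request, p_doc, background, lam=0.1):
    """KL(request || smoothed document model); lower is a closer match.

    Assumes every request term has a non-zero background probability.
    """
    kl = 0.0
    for t, p_r in request.items():
        p_d = lam * p_doc.get(t, 0.0) + (1.0 - lam) * background[t]
        kl += p_r * math.log(p_r / p_d)
    return kl


# Toy data (made up for illustration).
background = {"the": 0.5, "of": 0.3, "cat": 0.1,
              "xapian": 0.05, "weighting": 0.05}
doc_a = ["the", "the", "the", "of", "of", "xapian", "xapian", "weighting"]
doc_b = ["the", "the", "of", "cat", "cat"]
model_a = parsimonious_model(doc_a, background)
model_b = parsimonious_model(doc_b, background)
request = {"xapian": 0.5, "weighting": 0.5}
kl_a = kl_divergence(request, model_a, background)
kl_b = kl_divergence(request, model_b, background)
```

In this toy run, probability mass moves away from the common words and towards "xapian", and the document that mentions the request terms gets the smaller divergence, so it would rank first.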

Original paper :
http://research.microsoft.com/pubs/66933/hiemstra_sigir04.pdf

LDA-based relevance language modelling:

This approach integrates the advantages of relevance language models with
Latent Dirichlet Allocation (LDA) topic modelling. It is a generative
model and can retrieve relevant documents for a given query.
The model combines three language models: one describing a term in the
query, one describing the background topic, and one describing the other
ideas in the document.
The good part about this approach is that, unlike relevance language models,
which assume that all tokens are generated by the term t, it also accounts
for the background topic and for words which are specific to the given
document. This lets us extract various document-specific features and
identify the non-relevant parts of a document, which we wouldn't be able to
do otherwise.
The model in the paper uses Gibbs sampling for inference, since we will be
dealing with Dirichlet distributions.
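
As a rough illustration of what Gibbs sampling over Dirichlet-multinomial counts looks like, here is a minimal collapsed Gibbs sampler for plain LDA in Python. Note that this is standard LDA, not the three-component model from the paper, and the function name, the hyperparameters `alpha` and `beta`, the number of sweeps, and the toy corpus are all assumptions made for this sketch.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              sweeps=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA.

    docs is a list of documents, each a list of integer word ids in
    range(vocab_size). Returns the document-topic and topic-word counts.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_d)
    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional distribution over topics for this token,
                # with the Dirichlet priors integrated out.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Toy corpus (made up): words 0 and 1 co-occur, as do words 2 and 3.
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3], [0, 1, 1, 0], [3, 2, 2, 3]]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2, vocab_size=4)
```

The per-document topic counts in `doc_topic` are the kind of extra, query-independent data that would have to be computed after indexing and stored somewhere for a weighting scheme to use.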

Original paper : http://dollar.biz.uiowa.edu/~street/airs09.pdf

This is just a rough idea of what I would like to do. I would like to have
a discussion on this, and any constructive advice is welcome. Thanks.
James Aylett

2015-Mar-08 10:43 UTC

[Xapian-devel] GSOC 2015

On 25 Feb 2015, at 18:36, Richhiey Thomas <richhiey.thomas at gmail.com>
wrote:
> For GSOC 2015, I would like to work on Hiemstra's language modelling
> and LDA-based relevance language modelling for the project idea 'Weighting
> schemes for Xapian'.

Richhiey, apologies for not replying to you sooner. Xapian hasn't been accepted
as a mentoring organisation for GSoC this year; however we're still happy to
provide the same mentoring and support for anyone who wants to work on Xapian
this summer (or at any time), so if you're able and still interested, it'd be
great to develop these weighting schemes into a concrete plan so you can work on
them.

The key here is going to be identifying any information that will be needed
while processing the weight of a document for a particular query which we don't
currently track. I haven't had a chance to look more than briefly at the two
papers; for LDA it looks like there's going to be something around the topic
distributions (assuming this approach can be coerced into a suitable shape for
Xapian's weighting mechanism); for the parsimonious LM it looks like there's at
least some post-index iterative calculation, with related storage requirements.
(It may be that going back to the work that introduced parsimonious language
modelling, which I assume is the Sparck-Jones et al. paper [1], will suggest
other specific approaches within Xapian's framework.)

[1] K. Sparck-Jones, S.E. Robertson, D. Hiemstra, and H. Zaragoza: Language
modelling and relevance
http://www.cl.cam.ac.uk/archive/ksj21/ksjdigipapers/langmodbook03.pdf

J

-- 
 James Aylett, occasional trouble-maker
 xapian.org
