Hello Xapian devs,

For GSoC 2015, I would like to work on Hiemstra's language modelling and LDA-based relevance language modelling for the project idea 'Weighting schemes for Xapian'.

Hiemstra's LM: Hiemstra suggested a parsimonious language model, which models the language use that distinguishes a relevant document from other documents. For example, adding words that are common in English to the language model would only make it larger and less effective. A parsimonious LM reduces the number of parameters required to model the data. This approach can be used for indexing and ranking documents and is implemented with a mixture model, which can use two or more language model components. In this case, based on the paper linked below, it uses a background language model and a document model, estimated with the expectation-maximization algorithm. At retrieval time, it also uses a relevance (or request) model, and documents are ranked by the Kullback-Leibler divergence between this model and the document model.

Original paper: http://research.microsoft.com/pubs/66933/hiemstra_sigir04.pdf

LDA-based relevance language modelling: This approach integrates the advantages of relevance language models with Latent Dirichlet Allocation topic modelling. It is a generative model and can retrieve relevant documents for a given query. The model combines a language model describing a term in the query, a language model describing the background topic, and a language model describing other ideas in the document. The good part about this approach is that, unlike relevance language models, which assume that all tokens are generated by term t, it also accounts for the background topic and for words that are specific to the given document.
This gives us insight for extrapolating various document-specific features and identifying the non-relevant parts of the document, which we wouldn't be able to do otherwise. The model in the paper uses Gibbs sampling for inference, since we will be dealing with Dirichlet distributions.

Original paper: http://dollar.biz.uiowa.edu/~street/airs09.pdf

This is just a rough idea of what I would like to do. I would like to have a discussion on this, and any constructive advice is welcome.

Thanks.
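
As a rough illustration of the parsimonious estimation mentioned above, here is a minimal Python sketch of the EM loop for a two-component mixture (background model plus document model), based on one reading of the Hiemstra et al. paper. The function name, the mixing weight `lam`, the pruning `threshold`, the floor used for terms missing from the background model, and the toy data are all assumptions made for illustration; none of this is Xapian API.

```python
from collections import Counter


def parsimonious_lm(doc_terms, background, lam=0.1, threshold=1e-4, iterations=50):
    """Estimate a parsimonious document model P(t|D) with EM (illustrative sketch)."""
    tf = Counter(doc_terms)
    total = sum(tf.values())
    # Start from the maximum-likelihood document model.
    p_doc = {t: count / total for t, count in tf.items()}

    for _ in range(iterations):
        # E-step: expected count of each term attributed to the document
        # model rather than to the background model.
        expected = {}
        for t, p in p_doc.items():
            p_bg = background.get(t, 1e-9)  # assumed floor for unseen terms
            expected[t] = tf[t] * lam * p / ((1 - lam) * p_bg + lam * p)

        # M-step: renormalise the expected counts into probabilities.
        norm = sum(expected.values())
        p_doc = {t: e / norm for t, e in expected.items()}

        # Prune terms that fell below the threshold, then renormalise again.
        kept = {t: p for t, p in p_doc.items() if p >= threshold}
        if not kept:
            break
        kept_mass = sum(kept.values())
        p_doc = {t: p / kept_mass for t, p in kept.items()}

    return p_doc


# Toy data (hypothetical): "the" and "of" are common in the background model.
background = {"the": 0.5, "of": 0.3, "xapian": 0.001, "weighting": 0.001}
doc = "the the the the xapian weighting the of of xapian".split()
model = parsimonious_lm(doc, background)
```

In this toy run, probability mass moves away from terms the background model already explains well ("the", "of") and towards the document-specific terms. Pruning is what makes the resulting model smaller, and the loop is the kind of iterative calculation that would need to run after indexing.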
On 25 Feb 2015, at 18:36, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:

> For GSOC 2015, I would like to work on Hiemstra's language modelling and LDA based relevance language modelling for the project idea 'Weighting schemes for Xapian'.

Richhiey, apologies for not replying to you sooner. Xapian hasn't been accepted as a mentoring organisation for GSoC this year; however we're still happy to provide the same mentoring and support for anyone who wants to work on Xapian this summer (or at any time), so if you're able and still interested, it'd be great to develop these weighting schemes into a concrete plan so you can work on them.

The key here is going to be identifying any information that will be needed while processing the weight of a document for a particular query which we don't currently track. I haven't had a chance to look more than briefly at the two papers; for LDA it looks like there's going to be something around the topic distributions (assuming this approach can be coerced into a suitable shape for Xapian's weighting mechanism); for the parsimonious LM it looks like there's at least some post-index iterative calculation, with related storage requirements. (It may be that going back to the work that introduced parsimonious language modelling, which I assume is the Sparck-Jones et al paper[1], will suggest other specific approaches within Xapian's framework.)

[1] K. Sparck-Jones, S.E. Robertson, D. Hiemstra, and H. Zaragoza: Language modelling and relevance (http://www.cl.cam.ac.uk/archive/ksj21/ksjdigipapers/langmodbook03.pdf)

J

-- 
James Aylett, occasional trouble-maker
xapian.org