Gaurav Arora
2012-Mar-22  01:10 UTC
[Xapian-devel] GSOC : Language Modelling for information retrieval with Diversified Search results
Hello,

I am an undergraduate student at DA-IICT, India, pursuing a BTech in Information and Communication Technology. My major fields of research are Information Retrieval and Natural Language Processing. Xapian, being a powerful information retrieval library, has attracted me towards implementing things I learned in class for this project. I have worked on entity search on RDF data, SMS-based FAQ retrieval and Question Answering in competitions at evaluation forums such as CLEF and FIRE. I want to grab the GSOC opportunity and join the world of FOSS developers.

I would like to work on adding techniques such as Language Modelling and Diversified Search for information retrieval to Xapian.

Brief summary of the idea:

The Language Modelling approach to information retrieval focuses on building probabilistic language models for documents and ranking documents by the probability of each document's model generating the query. The technique is heavier and costlier than traditional information retrieval techniques, but has been reported in the literature to perform better than traditional methods.

The Language Modelling approach performs better as it tries to capture word and phrase associations to capture user context.

Diversified search is a key way to improve user satisfaction in the absence of explicit knowledge of the user's intent. A diversified search algorithm tries to estimate the different possible contexts of a user query and to pull potentially relevant documents from all of those contexts, rather than explicitly assuming one context. Diversification can be done by generating a different rank list for each context, or by adding documents from different contexts to a single rank list.

Resources:
http://nlp.stanford.edu/IR-book/html/htmledition/ponte-and-crofts-experiments-1.html
http://dl.acm.org/citation.cfm?id=291008
http://goo.gl/klqYy
http://dl.acm.org/citation.cfm?id=1860709

I have compiled and installed Xapian and have been playing with it over the past few days. I have a few queries regarding Xapian:

1. Xapian supports relevance feedback (query expansion) through the "Xapian::Enquire::get_eset" function. Which algorithm is used to expand the query in the Enquire class?

Search result diversification in its naive form is performed by expanding the query with different contexts and adding documents from the different contexts to the final rank list, thereby catering to all contexts of the query. I was thinking that I could use the algorithm already implemented for the expand set for query expansion, and implement a new algorithm for search diversification. In this way the query expansion feature of Xapian would also become more powerful.

2. I have read that Xapian supports passage retrieval, proximity-based queries and wildcard queries, but I could not find any documentation or functions providing these facilities. I would be glad if you could point me towards any available documentation describing how to use such options.

I would be glad if mentors from the Xapian community could comment on my idea of implementing the Language Modelling technique and search result diversification as a project for an open source search engine library (Xapian). Will implementing these techniques help Xapian as an open source project?

Wishing to join the Xapian community.

--
with regards
Gaurav A.
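For readers unfamiliar with the query-likelihood idea described in the summary above, here is a minimal sketch in plain Python. It does not use Xapian at all. The function names (`tokenize`, `query_log_likelihood`, `rank`), the whitespace tokenizer, and the choice of Jelinek-Mercer (linear interpolation) smoothing with `lam=0.5` are assumptions made for illustration, not something proposed in this message.

```python
import math
from collections import Counter


def tokenize(text):
    # Illustrative tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def query_log_likelihood(query_terms, doc_counts, doc_len,
                         coll_counts, coll_len, lam=0.5):
    # log P(query | document model), where each term probability is a
    # mixture of the document estimate and the collection estimate:
    #   P(t | d) = lam * tf(t, d) / |d| + (1 - lam) * cf(t) / |C|
    score = 0.0
    for term in query_terms:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = coll_counts[term] / coll_len if coll_len else 0.0
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p == 0.0:
            # The term occurs nowhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score


def rank(query, docs, lam=0.5):
    # docs maps a document id to its text. Returns (doc_id, score)
    # pairs sorted from most to least likely to generate the query.
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    coll_counts = Counter()
    for terms in tokenized.values():
        coll_counts.update(terms)
    coll_len = sum(coll_counts.values())
    query_terms = tokenize(query)
    scored = [
        (doc_id, query_log_likelihood(query_terms, Counter(terms),
                                      len(terms), coll_counts,
                                      coll_len, lam))
        for doc_id, terms in tokenized.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


example_docs = {
    "a": "xapian is a search engine library",
    "b": "language models rank documents",
    "c": "search search search results",
}
# For the query "search" the ids come back in the order c, a, b.
print([doc_id for doc_id, _ in rank("search", example_docs)])
```

The score depends only on per-document term frequencies, document length and collection-wide term frequencies, which is why the question of whether such a model fits a per-term weighting scheme matters for the size of the project.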
Olly Betts
2012-Mar-22  09:34 UTC
[Xapian-devel] GSOC : Language Modelling for information retrieval with Diversified Search results
On Thu, Mar 22, 2012 at 06:40:46AM +0530, Gaurav Arora wrote:
> The Language Modelling approach to information retrieval focuses on
> building probabilistic language models for documents and ranking
> documents by the probability of each document's model generating the
> query. The technique is heavier and costlier than traditional
> information retrieval techniques, but has been reported in the
> literature to perform better than traditional methods.
>
> The Language Modelling approach performs better as it tries to capture
> word and phrase associations to capture user context.

How well does it fit with Xapian's concept of what a weighting scheme is though?

The scale of a project to implement this would be hugely different depending on whether you are essentially implementing a Xapian::Weight subclass, or implementing a whole new matcher, and possibly also new backend data-structures. I've not looked closely enough at LM to know which would be the case.

BTW, the very next page to the one you link to says:

    Nevertheless, there is perhaps still insufficient evidence that its performance so greatly exceeds that of a well-tuned traditional vector space retrieval system as to justify changing an existing implementation.

http://nlp.stanford.edu/IR-book/html/htmledition/language-modeling-versus-other-approaches-in-ir-1.html

Both DfR and Learning to Rank are claimed to outperform BM25 and vector space models too - do you know how LM compares to these? I don't recall seeing any such comparisons myself.

> 1. Xapian supports relevance feedback (query expansion) through the
> "Xapian::Enquire::get_eset" function. Which algorithm is used to expand
> the query in the Enquire class?

The probabilistic one from the Robertson-Spärck Jones paper.

> Search result diversification in its naive form is performed by
> expanding the query with different contexts and adding documents from
> the different contexts to the final rank list, thereby catering to all
> contexts of the query. I was thinking that I could use the algorithm
> already implemented for the expand set for query expansion, and
> implement a new algorithm for search diversification. In this way the
> query expansion feature of Xapian would also become more powerful.

Possibly, but maybe it would be better to use an approach from the literature which has already been tried and evaluated?

> 2. I have read that Xapian supports passage retrieval, proximity-based
> queries and wildcard queries, but I could not find any documentation or
> functions providing these facilities. I would be glad if you could
> point me towards any available documentation describing how to use such
> options.

I don't think we claim to support passage retrieval anywhere (I suppose you could implement it by breaking large documents up into sections in a second database and performing a second search within those).

Proximity and wildcards are both supported - you just aren't looking very hard, for example:

http://xapian.org/search?P=wildcard

> I would be glad if mentors from the Xapian community could comment on
> my idea of implementing the Language Modelling technique and search
> result diversification as a project for an open source search engine
> library (Xapian). Will implementing these techniques help Xapian as an
> open source project?

Diversification of results is certainly something people have asked about before - it would be useful in a lot of applications I think.

Language Modelling is interesting. I think if it can be fitted into the framework we already have it would be worthwhile to implement it. I'm not so sure if it would require a second matcher or even more to be implemented. That would be a lot of extra code to maintain.

Cheers,
Olly
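To make the idea of adding documents from different contexts to a single rank list concrete, here is a minimal sketch of one evaluated approach from the literature, Maximal Marginal Relevance (MMR) re-ranking. Nobody in this thread proposes MMR specifically, so treat the choice as an assumption. The helper names (`jaccard`, `mmr_rerank`), the word-overlap similarity, and the requirement that relevance scores are already scaled to roughly the 0 to 1 range are also illustrative assumptions. The sketch is plain Python and does not touch Xapian.

```python
def jaccard(text_a, text_b):
    # Word-set overlap used as a stand-in for document similarity.
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def mmr_rerank(relevance, texts, k, trade_off=0.7):
    # relevance: doc id -> relevance score from a first-pass search
    #            (assumed to be scaled to roughly 0..1).
    # texts:     doc id -> document text, used only for similarity.
    # trade_off: 1.0 means pure relevance, lower values favour novelty.
    remaining = dict(relevance)
    selected = []
    while remaining and len(selected) < k:

        def marginal_value(doc_id):
            # Penalise a candidate by its similarity to the most
            # similar document that has already been selected.
            redundancy = max(
                (jaccard(texts[doc_id], texts[s]) for s in selected),
                default=0.0,
            )
            return (trade_off * remaining[doc_id]
                    - (1.0 - trade_off) * redundancy)

        best = max(remaining, key=marginal_value)
        selected.append(best)
        del remaining[best]
    return selected


example_texts = {
    "d1": "apple fruit nutrition",
    "d2": "apple fruit nutrition facts",
    "d3": "apple computer company",
}
example_relevance = {"d1": 0.9, "d2": 0.85, "d3": 0.6}
# With trade_off=0.5 the near-duplicate d2 is pushed below d3.
print(mmr_rerank(example_relevance, example_texts, k=3, trade_off=0.5))
```

Because this kind of re-ranking works on the result list of an ordinary search, it could sit on top of the existing matcher rather than requiring a new one, which seems relevant to the maintenance concern raised above.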