Gaurav Arora
2012-Apr-15 01:09 UTC
[Xapian-devel] Patch for Initial Prototype implementation of Unigram Language Modelling in xapian-core.
Hi,

I have implemented an initial prototype of a Xapian::Weight subclass for Unigram Language Modelling, to support UnigramLM weighting in Xapian. Other changes include adding collection_frequency to the TermFreqs struct to store the collection frequency of terms, some changes to support it in the Xapian framework, and changing simplesearch.cc to search using the UnigramLMWeight class.

The following issues have not been addressed in this patch (I am working on them):

1. The log trick for handling multiplication in the LM needs to be made more robust than just adding an arbitrary number to avoid rejecting documents because of the negative value returned by log. Each term's contribution is a probability (between 0 and 1), so taking its log gives a negative value and the document is eventually rejected. Hence an arbitrary linear weight has been added. This needs to be addressed by using different log bases and some other techniques. The discussion about the log trick is here for reference: http://comments.gmane.org/gmane.comp.search.xapian.devel/1857

2. Setting a tighter bound for get_maxpart() to make the matching process more efficient.

3. Adding other smoothing factors to the UnigramLMWeight implementation.

PFA 5 patches for the initial prototype implementation of the Unigram Language Model in Xapian.

Thanks,
--
with regards
Gaurav A.

Attachments:
0001-Added-UnigramLMWeigh-to-the-Xapian-Weight-Subclass.c.patch (19054 bytes)
  <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120415/4f9b3ad9/attachment-0010.obj>
0002-Made-changes-to-remote-backend-class-to-accomodate-c.patch (2393 bytes)
  <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120415/4f9b3ad9/attachment-0011.obj>
0003-Adding-dependency-classunigramlmweight.Plo-for-unigramlmweight.cc.patch (12686 bytes)
  <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120415/4f9b3ad9/attachment-0012.obj>
0004-Removed-a-implementation-bug-of-Collection-Frequency.patch (3273 bytes)
  <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120415/4f9b3ad9/attachment-0013.obj>
0005-Minor-indentation-and-comment-changes-in-the-code.patch (5736 bytes)
  <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120415/4f9b3ad9/attachment-0014.obj>
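[Editorial sketch] The sign problem described in issue 1 can be illustrated with a few lines of C++. This is not code from the patches: log_score and product_score are hypothetical helper names, and the sketch assumes smoothing has already made every per-term probability non-zero. It only shows that a sum of logs equals the log of the product, and that the sum is never positive when each term is a probability.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper, not part of Xapian: score a document by summing the
// logs of its per-term probabilities instead of multiplying them.
double log_score(const std::vector<double>& term_probs) {
    double sum = 0.0;
    for (double p : term_probs) {
        // Assumption: smoothing keeps every probability in (0, 1].
        assert(p > 0.0 && p <= 1.0);
        // std::log is the natural logarithm; log(p) <= 0 for p in (0, 1].
        sum += std::log(p);
    }
    return sum;
}

// For comparison: the direct product, which can underflow to zero when
// many small probabilities are multiplied together.
double product_score(const std::vector<double>& term_probs) {
    double prod = 1.0;
    for (double p : term_probs) {
        prod *= p;
    }
    return prod;
}
```

Calling log_score on probabilities such as 0.5, 0.25 and 0.125 gives a negative value whose exponential matches product_score on the same input. Ranking by the log sum preserves the ordering of the products, so any fix for the negative values has to keep that ordering intact.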
Olly Betts
2012-Apr-17 02:36 UTC
[Xapian-devel] Patch for Initial Prototype implementation of Unigram Language Modelling in xapian-core.
On Sun, Apr 15, 2012 at 06:39:33AM +0530, Gaurav Arora wrote:
> I have implemented initial prototype of the Xapian::Weight subclass for
> Unigram Language Modelling to support UnigramLM weighing in xapian.Other
> changes include adding collection_frequency to TermFreqs struct to store
> collection frequency of terms and some changes to support it xapian
> Framework,Changing simplesearch.cc to search using UnigramLMWeight class.
>
> Following issues have not being addressed in this patch(I am working on
> following issues):
>
> 1. Log trick for handling multiplication for LM need to made more robust
> than just adding some random number to avoid rejecting document due to
> negative value returned by log.

BTW, log() in C/C++ is the natural logarithm (so base e), so 10 seems particularly arbitrary to add. Log to base 10 is log10().

I'm not sure what the best answer is here though.

> PFA 5 patches for the initial prototype implementation of Unigram Language
> Model in Xapian.

Thanks for the patches. They look good, though I didn't try them out yet.

Three minor things:

You shouldn't commit the .Plo files - they're generated during the build.

It's only really meaningful to mark a constructor as "explicit" if it takes (or has optional parameters such that it can take) a single argument. The "explicit" marking means it won't be used to implicitly convert a value. So if you had an array class that could be initialised with a size:

    Array::Array(size_t size);

If you don't mark that as explicit, then the user could pass an integer where an Array was expected, and the compiler would create a temporary array and pass it in, which isn't something you want to happen for this sort of case.

And in the final patch some of the comments aren't actually multi-line but instead are really one long line which looks like a multi-line comment if viewed wrapped at 80 columns. If you look at the diff itself you will probably see what I mean.

Cheers,
    Olly
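[Editorial sketch] The point about "explicit" can be checked with a small C++ example. The classes and functions below (Array, SafeArray, length_of, length_of_safe) are hypothetical and are not taken from Xapian or from the patches. They only contrast a single-argument constructor with and without the "explicit" marking.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical class: the single-argument constructor is NOT explicit, so
// the compiler may use it to convert an integer into a temporary Array.
class Array {
    std::size_t size_;
  public:
    Array(std::size_t size) : size_(size) {}
    std::size_t size() const { return size_; }
};

// Same class, but with the constructor marked explicit: an integer is no
// longer silently converted, and callers must write SafeArray(n).
class SafeArray {
    std::size_t size_;
  public:
    explicit SafeArray(std::size_t size) : size_(size) {}
    std::size_t size() const { return size_; }
};

std::size_t length_of(const Array& a) { return a.size(); }
std::size_t length_of_safe(const SafeArray& a) { return a.size(); }

// Compile-time checks of the difference described above.
static_assert(std::is_convertible<std::size_t, Array>::value,
              "a size implicitly converts to Array");
static_assert(!std::is_convertible<std::size_t, SafeArray>::value,
              "a size does not implicitly convert to SafeArray");
```

With these definitions, length_of(5) compiles and quietly builds a temporary Array of size 5, whereas length_of_safe(5) is rejected by the compiler and has to be written as length_of_safe(SafeArray(5)).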