Hi Olly,

Wouldn't setting the weight of terms in the title back to normal (i.e. 5 to 1) with the line below automatically adjust the wdfs and field lengths?

indexer.index_text(title, 5, "S"); -> indexer.index_text(title, 1, "S");

If it does not, then we should include that part in the patch too. I would like to create a patch for xapian-letor that resolves its need for changes to common Xapian code.

Cheers,
Parth.

On Wed, Mar 12, 2014 at 3:13 AM, Jiarong Wei <vcamx3 at gmail.com> wrote:
> Thank you Parth and Olly! I'll try it :)
>
> Jiarong Wei
>
> On Mar 11, 2014, at 16:57, Olly Betts <olly at survex.com> wrote:
>
> > On Tue, Mar 11, 2014 at 03:20:31PM +0100, Parth Gupta wrote:
> >>>
> >>> On current trunk, we index the title with prefix "S" by default in
> >>> omindex, though with a wdf inc of 5 rather than 1:
> >>>
> >>> indexer.index_text(title, 5, "S");
> >>>
> >>> So I don't think you need that change to omindex now.
> >>
> >> Yes, but please make sure to change 5 to 1 otherwise divide the final
> >> count statistics by 5 . :)
> >
> > We really need to resolve any instances where letor requires code in
> > other parts of Xapian to be patched.
> >
> > In this case, possibly the bias on the title should be done differently,
> > but won't this just mean both the wdfs and the field length for the S
> > prefix are 5 times larger, and it won't matter?
> >
> > Cheers,
> > Olly
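Olly's question above (whether a wdf inc of 5 just makes both the wdfs and the field length 5 times larger) can be checked with a toy model. The sketch below is not Xapian code: `FieldStats`, `index_field` and `relative_frequency` are hypothetical names, tokenisation is a plain whitespace split, and the field length is assumed to be the sum of the wdfs in that field.

```cpp
#include <map>
#include <sstream>
#include <string>

// Toy statistics for one field (e.g. the title). Hypothetical type, not a
// Xapian class.
struct FieldStats {
    std::map<std::string, unsigned> wdf;  // per-term within-document frequency
    unsigned length = 0;                  // assumed: sum of all wdfs in the field
};

// Add wdf_inc to a term's wdf for every occurrence of that term, roughly
// what a call like index_text(title, wdf_inc, "S") is described as doing
// in the thread.
FieldStats index_field(const std::string& text, unsigned wdf_inc) {
    FieldStats stats;
    std::istringstream words(text);
    std::string word;
    while (words >> word) {
        stats.wdf[word] += wdf_inc;
        stats.length += wdf_inc;
    }
    return stats;
}

// wdf divided by field length: unchanged when every wdf is scaled by the
// same factor.
double relative_frequency(const FieldStats& stats, const std::string& term) {
    auto it = stats.wdf.find(term);
    if (it == stats.wdf.end() || stats.length == 0) return 0.0;
    return static_cast<double>(it->second) / stats.length;
}
```

Under these assumptions, indexing the same title with an increment of 1 and of 5 gives raw counts that differ by a factor of 5 but the same relative frequency, so only features built on absolute counts would see the difference.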
On Mon, Mar 17, 2014 at 09:07:29PM +0100, Parth Gupta wrote:
> Wouldn't setting the weight of terms in title back to normal (e.g. 5 to 1)
> by below line, automatically adjust the wdfs and field lengths?
>
> indexer.index_text(title, 5, "S"); -> indexer.index_text(title, 1, "S");
>
> if it does not then we should include that part in the patch too. I like to
> create a patch for xapian-letor for resolving common code of xapian.

I'm not sure I follow.

The reason we use 5 here is that matching terms in the title are usually a good indicator of a page that should be ranked highly for a search (note omindex is not usually working in a domain where evil SEOs are trying to distort the rankings).

If we simply change 5 to 1 here, then the title won't be given any extra consideration, which would be a regression in this area.

Cheers,
Olly
For unsupervised approaches like BM25 this works well, but letor does not need special weighting for the title in this form, as it assigns weights to title features separately.

But I see your concern: changing 5 to 1 would be a problem when BM25 is used on an index built with this setup. Hence it is preferable for xapian-letor to take note of this uplift in title weight and normalize for it everywhere the statistics are calculated.

Cheers,
Parth.

On Thu, Mar 20, 2014 at 2:35 AM, Olly Betts <olly at survex.com> wrote:
> On Mon, Mar 17, 2014 at 09:07:29PM +0100, Parth Gupta wrote:
> > Wouldn't setting the weight of terms in title back to normal (e.g. 5 to 1)
> > by below line, automatically adjust the wdfs and field lengths?
> >
> > indexer.index_text(title, 5, "S"); -> indexer.index_text(title, 1, "S");
> >
> > if it does not then we should include that part in the patch too. I like to
> > create a patch for xapian-letor for resolving common code of xapian.
>
> I'm not sure I follow.
>
> The reason we use 5 here is that the page title is that matching terms
> in the title are usually a good indicator of a page that should be
> ranked highly for a search (note omindex is not usually working in a
> domain where evil SEOs are trying to distort the rankings).
>
> If we simply change 5 to 1 here, then the title won't be given any extra
> consideration, which would be a regression in this area.
>
> Cheers,
> Olly
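The normalization Parth proposes could look roughly like the sketch below. It is only an illustration: `TITLE_WDF_INC`, `TitleStats` and `normalize_title_stats` are hypothetical names, not part of xapian-letor, and it assumes that every title wdf and the title length are exact multiples of the increment omindex uses (5 in this thread).

```cpp
#include <cassert>

// Assumed to match the wdf increment omindex passes to index_text() for
// the title, as quoted in the thread. Hypothetical constant.
constexpr unsigned TITLE_WDF_INC = 5;

// Hypothetical container for the title statistics a letor feature might use.
struct TitleStats {
    unsigned term_wdf;      // wdf of one query term under the "S" prefix
    unsigned field_length;  // total of all title wdfs
};

// Divide the boost back out so the features see raw occurrence counts,
// while the index itself keeps the boost for BM25-style ranking.
TitleStats normalize_title_stats(const TitleStats& boosted,
                                 unsigned wdf_inc = TITLE_WDF_INC) {
    assert(wdf_inc > 0);
    return TitleStats{boosted.term_wdf / wdf_inc,
                      boosted.field_length / wdf_inc};
}
```

With this approach the index keeps the title boost Olly wants for ordinary searches, and only the letor side needs to know the increment. The obvious weakness is that the constant must stay in sync with whatever omindex actually uses.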