jiangwen jiang
2013-Jun-16 04:32 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
Hi all,

I have written a demo patch for a backend that reads Lucene-format indexes. It targets Lucene 3.6.2:
http://lucene.apache.org/core/3_6_2/fileformats.html

For now the patch supports only the basic Lucene features. Compound files (.cfs/.cfe), term vectors (.tvx/.tvd/.tvf) and deleted documents (.del) are not supported, and neither is the skip list in .fdx.

example/quest.cc is used to test the demo, with queries like:
field_name:term, or file_name:term1 AND field_name:term2

So far I have found that some data Xapian needs for BM25 does not exist in Lucene:
1. doclength_lower_bound, doclength_upper_bound
2. wdf_lower_bound, wdf_upper_bound
3. total_length
4. doclength (for each document)

Items 1-3 are statistics that can be calculated while running copydatabase and stored somewhere. But doclength is hard to handle this way.

1. Is there some other data I could use instead of doclength?
2. Does Xapian support other ranking algorithms that do not need doclength?

Are there any suggestions for solving this problem?

The demo patch is here:
https://github.com/white127/xapian-patch/blob/master/xapian_lucene_demo.patch

Regards
Jiang
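The database-wide statistics in items 1-3 above could be gathered in a single pass over the posting lists, roughly as sketched below. This is only an illustration and not code from the patch: the Posting and DbStats types and the collect_stats function are hypothetical, the postings are held in an in-memory map rather than read from Lucene files, and it assumes a document's length is the sum of the wdf values of its terms. It also tracks one database-wide wdf upper bound as a simplification.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One posting: a document id and the term's within-document frequency (wdf).
struct Posting {
    uint32_t docid;
    uint32_t wdf;
};

// Hypothetical container for the statistics Lucene does not store.
struct DbStats {
    uint64_t total_length = 0;
    uint32_t doclength_lower_bound = 0;
    uint32_t doclength_upper_bound = 0;
    uint32_t wdf_upper_bound = 0;
    std::map<uint32_t, uint32_t> doclength;  // docid -> document length
};

// Scan every posting list once, accumulating per-document lengths
// (assumed here to be the sum of wdf), then derive the bounds.
DbStats collect_stats(
    const std::map<std::string, std::vector<Posting>>& postings) {
    DbStats s;
    for (const auto& entry : postings) {
        for (const Posting& p : entry.second) {
            s.doclength[p.docid] += p.wdf;
            s.total_length += p.wdf;
            s.wdf_upper_bound = std::max(s.wdf_upper_bound, p.wdf);
        }
    }
    bool first = true;
    for (const auto& d : s.doclength) {
        if (first) {
            s.doclength_lower_bound = s.doclength_upper_bound = d.second;
            first = false;
        } else {
            s.doclength_lower_bound =
                std::min(s.doclength_lower_bound, d.second);
            s.doclength_upper_bound =
                std::max(s.doclength_upper_bound, d.second);
        }
    }
    // Invariant: the lower bound never exceeds the upper bound.
    assert(s.doclength_lower_bound <= s.doclength_upper_bound);
    return s;
}
```

The per-document map is the expensive part: the other fields are a handful of integers, while doclength grows with the number of documents.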
jiangwen jiang
2013-Jun-16 04:41 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
In addition, I set fixed default values for the data that does not exist in Lucene so that the demo can run. The demo is not fully tested.
Olly Betts
2013-Jun-17 11:39 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
On Sun, Jun 16, 2013 at 12:32:31PM +0800, jiangwen jiang wrote:
> I have wrote a demo patch for Backend for Lucene format indexes, Lucene
> version is 3.6.2.
> http://lucene.apache.org/core/3_6_2/fileformats.html

Sounds cool.

> Until now, I found some data needed for BM25 in Xapian are not existed in
> Lucene:
> 1. doclength_lower_bound, doclength_upper_bound
> 2. wdf_lower_bound, wdf_upper_bound
> 3. total_length
> 4. doclength(for each document)
> 1-3 are statistics data, can be calculated when doing copydatabase, and
> store them in somewhere. But doclength is
> hard to do this way.

Xapian's doclength is defined as sum(wdf), so I think you should be able to calculate it with a tool which scans the database in a copydatabase-like manner.

Or do you mean that it's one number per document whereas the other stats are per database, so it's harder to store it?

> 1. some other data instead of doclength?

I don't know what else you could use instead.

> 2. Xapian support other rank algorithm which does not need doclength?

Yes. With certain parameter settings, BM25Weight and TradWeight don't need doclength. If you look in include/xapian/weight.h, you can see when need_stat(DOC_LENGTH) is called:

BM25Weight:

    if (param_k1 != 0 && param_b != 0) need_stat(DOC_LENGTH);

(so if you set k1=0 or b=0, BM25Weight won't use doclength).

TradWeight:

    if (param_k != 0.0) {
        need_stat(AVERAGE_LENGTH);
        need_stat(DOC_LENGTH);
    }

(so if k=0, TradWeight won't use doclength).

Also, TfIdfWeight (which is on trunk, and 1.3.1) never uses doclength.

> And the demo patch is here:
> https://github.com/white127/xapian-patch/blob/master/xapian_lucene_demo.patch

Thanks - I'll take a look.

Cheers,
Olly
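To see why k1=0 or b=0 removes the dependence on doclength, it may help to look at the wdf-dependent factor of BM25 in its usual textbook form. The sketch below is an assumption-laden illustration, not Xapian's implementation: bm25_tf_factor is a hypothetical name, and the k2 and k3 parts of the full formula are left out.

```cpp
#include <cassert>

// Textbook BM25 term-frequency factor (not Xapian's exact code):
//
//   wdf * (k1 + 1) / (wdf + k1 * ((1 - b) + b * doclen / avg_doclen))
//
// The document length only enters through the b * doclen / avg_doclen
// term, which is itself multiplied by k1.
double bm25_tf_factor(double wdf, double doclen, double avg_doclen,
                      double k1, double b) {
    assert(avg_doclen > 0.0);
    double normlen = doclen / avg_doclen;
    return wdf * (k1 + 1.0) / (wdf + k1 * ((1.0 - b) + b * normlen));
}
```

With b=0 the factor reduces to wdf * (k1 + 1) / (wdf + k1), and with k1=0 it is 1 for any term that occurs in the document, so in both cases a short and a long document with the same wdf score the same.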
jiangwen jiang
2013-Jun-17 13:28 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
> Or do you mean that it's one number per document whereas the other stats
> are per database, so it's harder to store it?

Yes, that is what I mean. It is a huge amount of data. If I add a new doclength list myself (one list containing every document's length, as chert does), I am concerned that:
1. This doclength list may become the bottleneck in this backend; see http://trac.xapian.org/ticket/326
2. It changes too much on top of the Lucene file format, which makes it hard to compare performance between Xapian and Lucene.

Some ideas:
1. Use a ranking algorithm that does not need doclength, such as BM25Weight or TradWeight with doclength disabled, or TfIdfWeight. Will the ranking results be poor without doclength?
2. Store doclength in the .prx payload when building the Lucene index.
https://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/Payload.html
http://searchhub.org/2009/08/05/getting-started-with-payloads/
But this method has an obvious drawback: it does not work for general Lucene index data. If doclength was not stored at indexing time, it cannot be used.

Any suggestions?

Regards
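On the size concern, one possible way to keep a separate doclength list small is to store it with the same variable-length integer scheme Lucene uses for VInt values (7 data bits per byte, low-order bits first, high bit set when more bytes follow). The sketch below is hypothetical: encode_doclengths and decode_doclengths are illustrative names, the list is kept in memory rather than in a file, and nothing like this exists in the Lucene format or in the demo patch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Append one value as a VInt-style sequence of 1 to 5 bytes.
void append_vint(std::vector<uint8_t>& out, uint32_t v) {
    while (v >= 0x80) {
        out.push_back(static_cast<uint8_t>((v & 0x7F) | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<uint8_t>(v));
}

// Encode document lengths in docid order (index 0 is the first docid).
std::vector<uint8_t> encode_doclengths(const std::vector<uint32_t>& lens) {
    std::vector<uint8_t> out;
    for (uint32_t len : lens) append_vint(out, len);
    return out;
}

// Decode the whole list again; the input must be well formed.
std::vector<uint32_t> decode_doclengths(const std::vector<uint8_t>& bytes) {
    std::vector<uint32_t> out;
    uint32_t value = 0;
    int shift = 0;
    for (uint8_t b : bytes) {
        value |= static_cast<uint32_t>(b & 0x7F) << shift;
        if (b & 0x80) {
            shift += 7;
            assert(shift < 35);  // a 32-bit value needs at most 5 bytes
        } else {
            out.push_back(value);
            value = 0;
            shift = 0;
        }
    }
    assert(shift == 0);  // input must not end in the middle of a value
    return out;
}
```

Lengths below 128 take one byte each, so the list is compact, but it can only be read sequentially. Random access by docid would need a fixed-width layout or an extra index, which is a trade-off this sketch does not address.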