jiangwen jiang
2013-Jun-17 13:28 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
*Or do you mean that it's one number per document whereas the other stats are per database, so it's harder to store it?*

Yes, that is what I mean. It is a huge amount of data. If I add a new doclength list myself (one list containing the doclength of every document, as chert does), I am concerned about two things:

1. This doclength list may become the bottleneck in this backend; see http://trac.xapian.org/ticket/326
2. It changes too much on top of the Lucene file format, which makes it hard to compare performance between Xapian and Lucene.

Some ideas:

1. Use a ranking algorithm that does not need doclength, such as BM25Weight or TradWeight without doclength, or TfIdfWeight. Would the ranking results be poor without doclength?
2. Store doclength in the .prx payload when doing Lucene indexing:
   https://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/Payload.html
   http://searchhub.org/2009/08/05/getting-started-with-payloads/
   This method has an obvious drawback, though: it does not work for general Lucene index data. If doclength was not stored at indexing time, the method doesn't work.

Any suggestions?

Regards
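To make the first idea concrete, here is a minimal sketch of how document length enters a BM25-style score. It uses the textbook BM25 formula, not Xapian's exact BM25Weight implementation; the function name and the sample numbers are made up for illustration. Setting b to 0 removes the length normalisation, so the score no longer depends on doclength at all.

```python
import math


def bm25_term_score(tf, df, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Score one term in one document with the textbook BM25 formula.

    Hypothetical helper for illustration only; not Xapian's BM25Weight.
    """
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    # Length normalisation factor: with b == 0 it is always 1.0,
    # so doc_len and avg_doc_len have no effect on the score.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)


# Same term statistics, two documents of very different length.
stats = dict(tf=3, df=10, num_docs=1000, avg_doc_len=200)

short_doc = bm25_term_score(doc_len=50, **stats)
long_doc = bm25_term_score(doc_len=800, **stats)

# With b=0 the two documents become indistinguishable to the scorer.
short_doc_no_len = bm25_term_score(doc_len=50, b=0.0, **stats)
long_doc_no_len = bm25_term_score(doc_len=800, b=0.0, **stats)

print(short_doc, long_doc)
print(short_doc_no_len, long_doc_no_len)
```

With length normalisation on, the short document scores higher for the same term frequency; with b=0 both get the same score. Whether that loss matters for ranking quality depends on how much document lengths vary in the collection.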
Richard Boulton
2013-Jun-17 15:06 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
You might want to look at how Lucene has implemented document length lookup for the BM25Similarity class (added in Lucene 4.0):

http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/BM25Similarity.html

I assumed they're using a document payload for storing the lengths, but haven't looked into it.
Richard Boulton
2013-Jun-17 15:12 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
Ah, a quick follow-on from that: read

http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/Similarity.html

There's a per-document "norm" which can be stored, which BM25Similarity uses to store the document length. Additional factors can be stored in DocValuesFields (which are very similar to document values in Xapian, in that they're stored in separate sequences, though they are a bit more flexible).
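As a rough illustration of why a per-document norm can be cheap to store, here is a sketch that squeezes each document length into a single byte using a log-scale quantisation, giving one byte per document in a flat array indexed by docid. The encoding below is an assumption invented for this example and is not Lucene's actual norm encoding; the function names are hypothetical.

```python
import math


def encode_length(doc_len):
    """Quantise a document length to one byte on a log scale.

    Hypothetical encoding for illustration; code 0 is reserved for
    empty documents, codes 1..255 cover lengths from 1 upwards.
    """
    if doc_len < 1:
        return 0
    code = int(round(math.log2(doc_len) * 8)) + 1
    return min(code, 255)


def decode_length(code):
    """Recover an approximate document length from its one-byte code."""
    if code == 0:
        return 0
    return int(round(2 ** ((code - 1) / 8)))


# One byte per document, indexed by (docid - 1).
lengths = [1, 12, 300, 45000]
norms = bytearray(encode_length(n) for n in lengths)

approx = [decode_length(c) for c in norms]
print(list(norms))
print(approx)
```

The decoded lengths are only approximate (within a few percent for these sample values), which is usually acceptable for length normalisation in a weighting scheme, and the whole table costs one byte per document.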