thr3ads.net - Xapian devel - [Xapian-devel] Backend for Lucene format indexes-How to get doclength [Jun 2013]

jiangwen jiang

2013-Jun-17 13:28 UTC

[Xapian-devel] Backend for Lucene format indexes-How to get doclength

*Or do you mean that it's one number per document whereas the other stats
are per database, so it's harder to store it?*

Yes, that is what I mean. It is a large amount of data. If I add a new
doclength list myself (a single list containing every document's length,
as chert does), I am concerned about two things:
1. This doclength list may become the bottleneck in this backend (see
http://trac.xapian.org/ticket/326).
2. It changes too much on top of the Lucene file format, which would make
it hard to compare performance between Xapian and Lucene.

Some ideas:
1. Use a ranking algorithm that does not need doclength, such as
BM25Weight or TradWeight without doclength, or TfIdfWeight.
    Would the ranking results be poor without doclength?
2. Store doclength in the .prx payload when doing Lucene indexing.
    https://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/Payload.html
    http://searchhub.org/2009/08/05/getting-started-with-payloads/
    But this method has an obvious drawback: it does not work for general
Lucene index data. If doclength was not stored at indexing time, this
method does not work.
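
To make idea 1 concrete, here is a minimal sketch of a textbook BM25 term
weight in Python. It is not Xapian's exact BM25Weight formula; the function
name and the sample statistics are made up for illustration. The point it
shows is that the parameter b controls length normalisation, so with b = 0
the weight no longer depends on doclength at all.

```python
import math

def bm25_term_weight(wdf, doclen, avg_doclen, n_docs, termfreq, k1=1.0, b=0.5):
    """Textbook BM25 weight of one term in one document (illustrative only).

    wdf       -- occurrences of the term in this document
    doclen    -- length of this document
    termfreq  -- number of documents containing the term
    b         -- strength of length normalisation (0 disables it)
    """
    idf = math.log((n_docs - termfreq + 0.5) / (termfreq + 0.5) + 1.0)
    # With b == 0 this factor is exactly 1.0, whatever doclen is.
    length_factor = (1.0 - b) + b * (doclen / avg_doclen)
    return idf * wdf * (k1 + 1.0) / (wdf + k1 * length_factor)

# Hypothetical statistics: same term, same wdf, different document lengths.
stats = dict(wdf=3, avg_doclen=200, n_docs=1000, termfreq=10)

short_doc = bm25_term_weight(doclen=50, **stats)
long_doc = bm25_term_weight(doclen=800, **stats)

short_doc_b0 = bm25_term_weight(doclen=50, b=0.0, **stats)
long_doc_b0 = bm25_term_weight(doclen=800, b=0.0, **stats)

print(short_doc > long_doc)         # length normalisation favours the short document
print(short_doc_b0 == long_doc_b0)  # with b = 0 the two weights are identical
```

So a backend that cannot supply doclength can still rank, but it loses the
preference for short documents over long ones with the same wdf.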
>
> Any suggestions?
Regards

Richard Boulton

2013-Jun-17 15:06 UTC

[Xapian-devel] Backend for Lucene format indexes-How to get doclength

You might want to look at how Lucene has implemented document length lookup
for the BM25Similarity class (added in Lucene 4.0):

http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/BM25Similarity.html

I assumed they're using a document payload for storing the lengths, but
haven't looked into it.



Richard Boulton

2013-Jun-17 15:12 UTC

[Xapian-devel] Backend for Lucene format indexes-How to get doclength

Ah, a quick follow-on from that: read
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/Similarity.html

There's a per-document "norm" which can be stored, which BM25Similarity
uses to store the document length.  Additional factors can be stored in
DocValuesFields (which are very similar to document values in Xapian, in
that they're stored in separate sequences, though are a bit more flexible).
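
As a rough illustration of what storing the length in a norm implies, here
is a Python sketch of a one-byte encoding. It assumes, from my reading of
Lucene 4.0, that BM25Similarity stores 1/sqrt(length) through the 3-bit
mantissa SmallFloat encoding; the Python function names are my own and the
port should be checked against the Lucene source before relying on it. The
thing to notice is that the stored length is lossy.

```python
import math
import struct

# Exponent offset of the assumed "3 mantissa bits, zero exponent 15" format.
F_ZERO = (63 - 15) << 3

def float_to_byte315(f):
    """Pack a positive float into one byte (sketch of SmallFloat.floatToByte315)."""
    bits = struct.unpack("<i", struct.pack("<f", f))[0]  # float32 bits, signed
    small = bits >> 21            # keep exponent plus top 3 mantissa bits
    if small <= F_ZERO:
        return 0 if bits <= 0 else 1   # underflow: zero, or smallest non-zero
    if small >= F_ZERO + 0x100:
        return 255                     # overflow: largest representable value
    return small - F_ZERO

def byte315_to_float(b):
    """Inverse of float_to_byte315 (up to the precision that was dropped)."""
    if b == 0:
        return 0.0
    bits = (b + F_ZERO) << 21
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def encode_doclen(doclen):
    """Store a document length (must be > 0) as a one-byte norm."""
    return float_to_byte315(1.0 / math.sqrt(doclen))

def decode_doclen(norm_byte):
    """Recover an approximate document length from a non-zero norm byte."""
    f = byte315_to_float(norm_byte)
    return 1.0 / (f * f)

for doclen in (1, 100, 10000):
    norm = encode_doclen(doclen)
    print(doclen, norm, decode_doclen(norm))
```

A Xapian backend reading such norms would therefore get approximate
document lengths rather than the exact ones chert stores, which matters if
results are to be compared with another backend.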


