Richard Boulton
2013-Jun-17 15:12 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
Ah, a quick follow-on from that: read
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/Similarity.html

There's a per-document "norm" which can be stored, which BM25Similarity
uses to store the document length. Additional factors can be stored in
DocValuesFields (which are very similar to document values in Xapian, in
that they're stored in separate sequences, though are a bit more flexible).

On 17 June 2013 16:06, Richard Boulton <richard at tartarus.org> wrote:

> You might want to look at how Lucene has implemented document length
> lookup for the BM25Similarity class (added in Lucene 4.0):
>
> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/BM25Similarity.html
>
> I assumed they're using a document payload for storing the lengths, but
> haven't looked into it.
>
> On 17 June 2013 14:28, jiangwen jiang <jiangwen127 at gmail.com> wrote:
>
>> *Or do you mean that it's one number per document whereas the other stats
>> are per database, so it's harder to store it?*
>>
>> Yes, I mean this. It's a huge amount of data. If a new doclength list
>> (containing all the doclengths in a list, like chert) is added by
>> myself, I am concerned about:
>> 1. This doclength list may be the bottleneck in this backend,
>>    http://trac.xapian.org/ticket/326
>> 2. Changing too much of the Lucene file format makes it hard to compare
>>    performance between Xapian and Lucene.
>>
>> Some ideas:
>> 1. Use a ranking algorithm without doclength, such as BM25Weight or
>>    TradWeight without doclength, or TfIdfWeight. But will ranking
>>    results be poor without doclength?
>> 2. Store doclength in the .prx payload when doing Lucene indexing.
>>    https://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/Payload.html
>>    http://searchhub.org/2009/08/05/getting-started-with-payloads/
>>    But this method has an obvious drawback: it doesn't work for general
>>    Lucene index data, where doclength is not stored.
>>
>> Any suggestions?
>>
>> Regards
>>
>> _______________________________________________
>> Xapian-devel mailing list
>> Xapian-devel at lists.xapian.org
>> http://lists.xapian.org/mailman/listinfo/xapian-devel
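The norm-as-document-length idea Richard describes can be sketched in plain Java. In Lucene 3.x, DefaultSimilarity's lengthNorm is 1/sqrt(numTerms), quantized to a single byte; inverting it gives an approximate document length. This is a minimal sketch of the maths only — the quantization step here is a simplified illustration, not Lucene's actual SmallFloat byte encoding, and the class and method names are invented for the example:

```java
// Sketch only: shows how an approximate doc length can be recovered from a
// lengthNorm of 1/sqrt(numTerms), and why the byte quantization makes the
// recovered length lossy. Not Lucene's real SmallFloat encoding.
public class NormDocLength {
    // Encode a document length as a lossy norm, quantized to 1/256 steps
    // to mimic (roughly) storing it in a single byte.
    static double encodeNorm(int numTerms) {
        double norm = 1.0 / Math.sqrt(numTerms);
        return Math.round(norm * 256.0) / 256.0;
    }

    // Invert the norm: length ~= 1 / norm^2.
    static long approxDocLength(double norm) {
        return Math.round(1.0 / (norm * norm));
    }

    public static void main(String[] args) {
        // Short documents round-trip exactly; long ones only approximately,
        // which is why Olly's sum(wdf) caveat matters.
        for (int len : new int[] {1, 16, 100, 1000}) {
            double norm = encodeNorm(len);
            System.out.println(len + " -> norm " + norm
                    + " -> approx length " + approxDocLength(norm));
        }
    }
}
```

Running it shows that lengths like 16 survive the round trip while 100 and 1000 come back as nearby approximations — which is exactly the "similar to doc length, but not sum(wdf)" property discussed later in the thread.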
jiangwen jiang
2013-Aug-20 11:28 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
Hi, guys:

I think norm(t, d) in Lucene can be used to calculate a number which is
similar to doc length (see norm(t,d) in
http://lucene.apache.org/core/3_5_0/api/all/org/apache/lucene/search/Similarity.html#formula_norm).
And this feature is applied in this pull request
(https://github.com/xapian/xapian/pull/25). Here's the information about
the new features and the performance test:

This is a patch for a Lucene 3.6.2 backend. It only supports Lucene 3.6.2
and is not fully tested; I'm sending this patch wondering if it works for
the idea
http://trac.xapian.org/wiki/ProjectIdeas#BackendforLuceneformatindexes.
So far, few features are supported:
1. Single term search.
2. 'AND' search, but the performance needs to be optimized.
3. Multiple segments.
4. Doc length, using .nrm instead.

Additionally:
1. xxx_lower_bound, xxx_upper_bound and total doc length are not
   supported. These data do not exist in the Lucene backend; I've used
   constants instead, so the search results may not be good.
2. Compound files are not supported, so the compound file must be
   disabled when indexing.

I've built a performance test of 1,000,000 documents (actually, I
downloaded a single file from wiki which includes 1,000,000 lines, and
treated each line as a document). When doing single term search, the
performance of the Lucene backend is as fast as Xapian's chert.

Test environment: OS: virtual machine Ubuntu, CPU: 1 core, MEM: 800M.
242 terms, doing a single term search per term, and calculating the
total time used for these 242 searches (results fluctuate, so I give 10
results per backend):
1. backend Lucene
   1540ms, 1587ms, 1516ms, 1706ms, 1690ms, 1597ms, 1376ms, 1570ms,
   1218ms, 1551ms
2. backend Chert
   1286ms, 1626ms, 1575ms, 1771ms, 1661ms, 1662ms, 1808ms, 1341ms,
   1688ms, 1809ms

Code for testing is quest.cc; you can look at that file for details.
Code for Lucene indexing is like this (and Xapian indexing used
example/simpleindex.cc):

    IndexWriter indexWriter = new IndexWriter(directory,
            new EnglishAnalyzer(Version.LUCENE_36),
            IndexWriter.MaxFieldLength.UNLIMITED);
    indexWriter.setUseCompoundFile(false); // compound file must be disabled
    int lineId = 0;
    // Read lines from the input file, treating each line as a document.
    while (br.ready()) {
        lineId++;
        String origLine = br.readLine();
        origLine = origLine.trim();
        Document doc = new Document();
        doc.add(new Field("data", origLine, Field.Store.YES,
                Field.Index.ANALYZED));
        doc.add(new Field("dataorigin", origLine, Field.Store.YES,
                Field.Index.NOT_ANALYZED));
        doc.add(new Field("lid", String.valueOf(lineId), Field.Store.YES,
                Field.Index.NOT_ANALYZED));
        indexWriter.addDocument(doc);
    }

2013/6/17 Richard Boulton <richard at tartarus.org>

> Ah, a quick follow-on from that: read
> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/Similarity.html
>
> There's a per-document "norm" which can be stored, which BM25Similarity
> uses to store the document length. Additional factors can be stored in
> DocValuesFields (which are very similar to document values in Xapian, in
> that they're stored in separate sequences, though are a bit more flexible).
>
> On 17 June 2013 16:06, Richard Boulton <richard at tartarus.org> wrote:
>
>> You might want to look at how Lucene has implemented document length
>> lookup for the BM25Similarity class (added in Lucene 4.0):
>>
>> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/BM25Similarity.html
>>
>> I assumed they're using a document payload for storing the lengths, but
>> haven't looked into it.
>>
>> On 17 June 2013 14:28, jiangwen jiang <jiangwen127 at gmail.com> wrote:
>>
>>> *Or do you mean that it's one number per document whereas the other
>>> stats are per database, so it's harder to store it?*
>>>
>>> Yes, I mean this. It's a huge amount of data.
>>> If a new doclength list (containing all the doclengths in a list,
>>> like chert) is added by myself, I am concerned about:
>>> 1. This doclength list may be the bottleneck in this backend,
>>>    http://trac.xapian.org/ticket/326
>>> 2. Changing too much of the Lucene file format makes it hard to
>>>    compare performance between Xapian and Lucene.
>>>
>>> Some ideas:
>>> 1. Use a ranking algorithm without doclength, such as BM25Weight or
>>>    TradWeight without doclength, or TfIdfWeight. But will ranking
>>>    results be poor without doclength?
>>> 2. Store doclength in the .prx payload when doing Lucene indexing.
>>>    https://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/Payload.html
>>>    http://searchhub.org/2009/08/05/getting-started-with-payloads/
>>>    But this method has an obvious drawback: it doesn't work for
>>>    general Lucene index data, where doclength is not stored.
>>>
>>> Any suggestions?
>>>
>>> Regards
>>>
>>> _______________________________________________
>>> Xapian-devel mailing list
>>> Xapian-devel at lists.xapian.org
>>> http://lists.xapian.org/mailman/listinfo/xapian-devel
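The benchmark described above (total wall-clock time over a batch of single-term searches, with the whole batch repeated to see the fluctuation) can be sketched in plain Java. Everything here is illustrative rather than taken from quest.cc: the in-memory inverted index stands in for the real Lucene/chert backends, and all class and method names are invented for the example:

```java
import java.util.*;

// Illustrative benchmark harness, not the actual quest.cc test: it times a
// batch of single-term lookups against a stand-in in-memory inverted index.
public class SingleTermBench {
    // term -> posting list of document ids (the stand-in "backend").
    static Map<String, List<Integer>> index = new HashMap<>();

    static void indexDoc(int docId, String line) {
        for (String term : line.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        }
    }

    static List<Integer> search(String term) {
        return index.getOrDefault(term, Collections.emptyList());
    }

    // Run every query once and return total elapsed milliseconds.
    static long timeBatch(List<String> queries) {
        long start = System.nanoTime();
        for (String q : queries) {
            search(q);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        indexDoc(1, "the quick brown fox");
        indexDoc(2, "the lazy dog");
        List<String> queries = Arrays.asList("the", "fox", "dog");
        // As in the post: repeat the whole batch several times, since
        // individual runs fluctuate.
        for (int run = 1; run <= 3; run++) {
            System.out.println("run " + run + ": " + timeBatch(queries) + "ms");
        }
        System.out.println("docs matching \"the\": " + search("the"));
    }
}
```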
Olly Betts
2013-Aug-25 01:11 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
On Tue, Aug 20, 2013 at 07:28:42PM +0800, jiangwen jiang wrote:
> I think norm(t, d) in Lucene can be used to calculate a number which is
> similar to doc length (see norm(t,d) in
> http://lucene.apache.org/core/3_5_0/api/all/org/apache/lucene/search/Similarity.html#formula_norm).

It sounds similar (especially if document and field boosts aren't in
use), though some places may rely on the doc_length = sum(wdf)
definition - in particular, some other measure of length may violate
assumptions like wdf <= doc_length.

For now, using weighting schemes which don't use document length is
probably the simplest answer.

> And this feature is applied in this pull request
> (https://github.com/xapian/xapian/pull/25). Here's the information
> about the new features and the performance test:

You've made great progress! I've started to look through the pull
request and made some comments via github.

> This is a patch for a Lucene 3.6.2 backend. It only supports
> Lucene 3.6.2 and is not fully tested; I'm sending this patch wondering
> if it works for the idea
> http://trac.xapian.org/wiki/ProjectIdeas#BackendforLuceneformatindexes.
> So far, few features are supported:
> 1. Single term search.
> 2. 'AND' search, but the performance needs to be optimized.
> 3. Multiple segments.
> 4. Doc length, using .nrm instead.
>
> Additionally:
> 1. xxx_lower_bound, xxx_upper_bound and total doc length are not
>    supported. These data do not exist in the Lucene backend; I've used
>    constants instead, so the search results may not be good.

You should simply not define these methods for your backend - Xapian has
fall-back versions (used for inmemory) which will then be used. If you
return some constant which isn't actually a valid bound, the matcher
will make invalid assumptions while optimising, resulting in incorrect
search results.

> 2. Compound files are not supported, so the compound file must be
>    disabled when indexing.
> I've built a performance test of 1,000,000 documents (actually, I
> downloaded a single file from wiki which includes 1,000,000 lines, and
> treated each line as a document). When doing single term search, the
> performance of the Lucene backend is as fast as Xapian's chert.
> Test environment: OS: virtual machine Ubuntu, CPU: 1 core, MEM: 800M.
> 242 terms, doing a single term search per term, and calculating the
> total time used for these 242 searches (results fluctuate, so I give
> 10 results per backend):
> 1. backend Lucene
>    1540ms, 1587ms, 1516ms, 1706ms, 1690ms, 1597ms, 1376ms, 1570ms,
>    1218ms, 1551ms
> 2. backend Chert
>    1286ms, 1626ms, 1575ms, 1771ms, 1661ms, 1662ms, 1808ms, 1341ms,
>    1688ms, 1809ms

So this benchmark is pretty much meaningless because of the incorrect
constant bounds in use.

Cheers,
    Olly
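Olly's warning about invalid constant bounds can be made concrete with a toy simulation. This is not Xapian's actual matcher code — the class, method, and the pruning rule are simplified inventions — but it shows the failure mode: an optimiser that trusts a claimed per-term weight upper bound to stop early will return wrong results whenever the claimed bound is lower than a real score.

```java
// Toy illustration (not Xapian's matcher): a best-document search that
// trusts `claimedUpperBound` to cut the scan short. A valid bound (>= every
// real score) is safe; an invalid constant bound silently drops the true
// best document.
public class BoundPruning {
    static int bestDoc(double[] scores, double claimedUpperBound) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int doc = 0; doc < scores.length; doc++) {
            // Optimisation: once the current best reaches the claimed
            // bound, no later document can (supposedly) beat it.
            if (bestScore >= claimedUpperBound) {
                break;
            }
            if (scores[doc] > bestScore) {
                bestScore = scores[doc];
                best = doc;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] scores = {0.4, 0.9, 1.7};
        // Valid bound: scans far enough to find the true best, doc 2.
        System.out.println(bestDoc(scores, 2.0));
        // Invalid constant bound 0.5: stops after doc 1, which "reaches"
        // the bound, and wrongly reports doc 1 as the best.
        System.out.println(bestDoc(scores, 0.5));
    }
}
```

The same logic is why the fall-back bounds Olly mentions are safe: they are conservative (always valid), so the optimisation only ever skips work it provably doesn't need.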