thr3ads.net - Xapian devel - [Xapian-devel] Backend for Lucene format indexes-How to get doclength [Aug 2013]


Olly Betts

2013-Aug-25 01:11 UTC

[Xapian-devel] Backend for Lucene format indexes-How to get doclength

On Tue, Aug 20, 2013 at 07:28:42PM +0800, jiangwen jiang wrote:
> I think norm(t, d) in Lucene can be used to calculate a number which is
> similar to doc length (see norm(t,d) in
> http://lucene.apache.org/core/3_5_0/api/all/org/apache/lucene/search/Similarity.html#formula_norm).
It sounds similar (especially if document and field boosts aren't in use),
though some places may rely on the doc_length = sum(wdf) definition - in
particular, some other measure of length may violate assumptions like
wdf <= doc_length.
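
As a rough illustration of the point above (an editorial sketch, not code from the pull request): assume Lucene's default norm is boost * 1/sqrt(numTerms), treat numTerms as equal to sum(wdf), and ignore the lossy single-byte encoding used in .nrm files. A length recovered as 1/norm^2 then matches sum(wdf) only when no boosts are applied. All function names here are hypothetical.

```python
import math

def doc_length(wdfs):
    """Xapian-style document length: the sum of wdf over all terms."""
    return sum(wdfs.values())

def lucene_style_norm(num_terms, boost=1.0):
    """Assumed default norm: boost * 1/sqrt(num_terms), before byte encoding."""
    return boost / math.sqrt(num_terms)

def length_from_norm(norm):
    """Invert the assumed norm to get a length-like number."""
    return round(1.0 / (norm * norm))

wdfs = {"xapian": 4}  # one term occurring four times
true_len = doc_length(wdfs)

unboosted = length_from_norm(lucene_style_norm(true_len))
boosted = length_from_norm(lucene_style_norm(true_len, boost=2.0))

# Without boosts the recovered value equals sum(wdf).
print(unboosted == true_len)              # True
# With a boost of 2 the recovered value (1) is smaller than the wdf of
# "xapian" (4), so wdf <= doc_length no longer holds.
print(max(wdfs.values()) <= boosted)      # False
```

The real .nrm encoding is lossy as well, so even the unboosted case would only be approximate in practice.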

For now, using weighting schemes which don't use document length is
probably the simplest answer.
> And this feature is applied in this pull request
> (https://github.com/xapian/xapian/pull/25). Here's the information about
> the new features and performance test:
You've made great progress!  I've started to look through the pull
request and made some comments via github.
> This is a patch for a Lucene 3.6.2 backend. It only supports Lucene 3.6.2
> and is not fully tested. I'm sending this patch to find out whether it
> works for the idea
> http://trac.xapian.org/wiki/ProjectIdeas#BackendforLuceneformatindexes.
> So far only a few features are supported:
> 1. Single term search.
> 2. 'AND' search is supported, but performance needs to be optimised.
> 3. Multiple segments.
> 4. Doc length, using .nrm instead.
>
> Additionally:
> 1. xxx_lower_bound, xxx_upper_bound and total doc length are not supported.
> This data does not exist in the Lucene backend, so I've used constants
> instead, and the search results may not be good.
You should simply not define these methods for your backend - Xapian has
fall-back versions (used for inmemory) which will then be used.  If you
return some constant which isn't actually a valid bound, the matcher
will make invalid assumptions while optimising, resulting in incorrect
search results.
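
To make the failure mode concrete, here is a toy sketch (not Xapian's matcher, and not its actual fall-back code). It assumes a matcher that skips a posting list whenever the claimed wdf upper bound says no document can qualify. The collection frequency is used as an example of a loose but always valid bound. The function names and the constant 5 are made up for illustration.

```python
def collection_freq(postings):
    """Sum of wdf over a term's posting list; never below the largest wdf."""
    return sum(postings.values())

def match(postings, wdf_upper_bound, min_wdf):
    """Return docids whose wdf is at least min_wdf.

    The function trusts the claimed bound and skips the posting list
    entirely when the bound says no document can qualify.
    """
    if wdf_upper_bound < min_wdf:
        return []
    return sorted(docid for docid, wdf in postings.items() if wdf >= min_wdf)

postings = {1: 2, 2: 7, 3: 4}  # docid -> wdf for one term

# A loose but valid bound (2 + 7 + 4 = 13) still finds document 2.
safe = match(postings, collection_freq(postings), min_wdf=6)
# A made-up constant that is lower than the true maximum wdf (7) makes
# the optimisation discard a document that should have matched.
bogus = match(postings, 5, min_wdf=6)

print(safe)   # [2]
print(bogus)  # []
```

A bound that is too loose only costs speed, whereas a bound that is not actually a bound changes the results.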
> 2. Compound files are not supported, so compound files must be disabled
> when indexing.
>
> I've built a performance test of 1,000,000 documents from wiki (actually,
> I downloaded a single file from wiki containing 1,000,000 lines and
> treated each line as a document). When doing single term search,
> performance of the Lucene backend is as fast as Xapian Chert.
> Test environment: OS: virtual machine Ubuntu, CPU: 1 core, MEM: 800M.
> 242 terms, doing a single term search per term and calculating the total
> time used for these 242 searches (results fluctuate, so I give 10 results
> per backend):
> 1. backend Lucene
> 1540ms, 1587ms, 1516ms, 1706ms, 1690ms, 1597ms, 1376ms, 1570ms, 1218ms,
> 1551ms
> 2. backend Chert
> 1286ms, 1626ms, 1575ms, 1771ms, 1661ms, 1662ms, 1808ms, 1341ms, 1688ms,
> 1809ms
So this benchmark is pretty much meaningless because of the incorrect
constant bounds in use.

Cheers,
    Olly

jiangwen jiang

2013-Aug-26 01:41 UTC


[Xapian-devel] Backend for Lucene format indexes-How to get doclength

> For now, using weighting schemes which don't use document length is
> probably the simplest answer.

There's a tf-idf weighting scheme on svn master; is it suitable for the
Lucene backend?

> You've made great progress!  I've started to look through the pull
> request and made some comments via github.

Thanks for your comments, I will update the code as soon as possible.

Regards

Olly Betts

2013-Aug-26 02:17 UTC


[Xapian-devel] Backend for Lucene format indexes-How to get doclength

On Mon, Aug 26, 2013 at 09:41:07AM +0800, jiangwen jiang wrote:
> > For now, using weighting schemes which don't use document length is
> > probably the simplest answer.
>
> There's a tf-idf weighting scheme on svn master; is it suitable for the
> Lucene backend?
Yes - TfIdfWeight doesn't ever use the document length (at least with
the normalisations currently implemented).

You could also use BM25 with parameter b=0.
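
A small sketch of why b=0 helps, using the textbook Okapi BM25 term weight with one common idf variant. This is an assumption for illustration, not Xapian's exact BM25Weight formula, which has further parameters. With b=0 the normalised length drops out of the denominator, so documents of any length get the same weight for the same wdf.

```python
import math

def bm25_term_weight(wdf, doc_len, avg_len, termfreq, num_docs, k1=1.2, b=0.75):
    """Textbook BM25 weight of one term in one document (illustrative only)."""
    idf = math.log((num_docs - termfreq + 0.5) / (termfreq + 0.5) + 1.0)
    norm_len = doc_len / avg_len
    # b scales how much the normalised length influences the weight.
    return idf * (k1 + 1.0) * wdf / (k1 * ((1.0 - b) + b * norm_len) + wdf)

common = dict(avg_len=100, termfreq=50, num_docs=1000)

# With b=0 the document length has no effect on the weight.
short_b0 = bm25_term_weight(3, doc_len=10, b=0.0, **common)
long_b0 = bm25_term_weight(3, doc_len=1000, b=0.0, **common)
print(short_b0 == long_b0)   # True

# With the default b the shorter document scores higher for the same wdf.
short_b = bm25_term_weight(3, doc_len=10, **common)
long_b = bm25_term_weight(3, doc_len=1000, **common)
print(short_b > long_b)      # True
```

In Xapian's C++ API, b is (as far as I recall) the fourth constructor argument of Xapian::BM25Weight(k1, k2, k3, b, min_normlen); check the API documentation for the version in use.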

Cheers,
    Olly
