Olly Betts
2013-Aug-25  01:11 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
On Tue, Aug 20, 2013 at 07:28:42PM +0800, jiangwen jiang wrote:
> I think norm(t, d) in Lucene can be used to calculate a number which is
> similar to doc length (see norm(t,d) in
> http://lucene.apache.org/core/3_5_0/api/all/org/apache/lucene/search/Similarity.html#formula_norm).

It sounds similar (especially if document and field boosts aren't in use),
though some places may rely on the doc_length = sum(wdf) definition - in
particular, some other measure of length may violate assumptions like
wdf <= doc_length.

For now, using weighting schemes which don't use document length is
probably the simplest answer.

> And this feature is applied in this pull request
> (https://github.com/xapian/xapian/pull/25). Here's the information about
> the new features and performance test:

You've made great progress!  I've started to look through the pull
request and made some comments via github.

> This is a patch for a Lucene 3.6.2 backend. It only supports Lucene 3.6.2
> and is not fully tested. I'm sending this patch to ask whether it works for
> the idea
> http://trac.xapian.org/wiki/ProjectIdeas#BackendforLuceneformatindexes.
> So far only a few features are supported, including:
> 1. Single term search.
> 2. 'AND' search is supported, but performance needs to be optimized.
> 3. Multiple segments.
> 4. Doc length. Using .nrm instead.
>
> Additionally:
> 1. xxx_lower_bound, xxx_upper_bound and total doc length are not supported.
> This data does not exist in the Lucene backend, so I've used constants
> instead, so the search results may not be good.

You should simply not define these methods for your backend - Xapian has
fall-back versions (used for inmemory) which will then be used.  If you
return some constant which isn't actually a valid bound, the matcher
will make invalid assumptions while optimising, resulting in incorrect
search results.

> 2. Compound files are not supported, so compound files must be disabled
> when indexing.
>
> I've built a performance test of 1,000,000 documents (actually, I
> downloaded a single file from wiki, which includes 1,000,000 lines, and
> treated each line as a document). When doing single term search,
> performance of the Lucene backend is as fast as Xapian Chert.
> Test environment: OS: virtual machine Ubuntu, CPU: 1 core, MEM: 800M.
> 242 terms, doing a single term search per term, calculating the total time
> used for these 242 searches (results fluctuate, so I give 10 results per
> backend):
> 1. backend Lucene
> 1540ms, 1587ms, 1516ms, 1706ms, 1690ms, 1597ms, 1376ms, 1570ms, 1218ms,
> 1551ms
> 2. backend Chert
> 1286ms, 1626ms, 1575ms, 1771ms, 1661ms, 1662ms, 1808ms, 1341ms, 1688ms,
> 1809ms

So this benchmark is pretty much meaningless because of the incorrect
constant bounds in use.

Cheers,
    Olly
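To illustrate the point about invalid bounds, here is a minimal toy sketch in Python. It is not Xapian code: the posting list, the "weight equals wdf" scoring, the function name best_match and the early-exit rule are all assumptions made up for illustration. It only shows the general mechanism: an optimisation that trusts an upper bound is safe when the bound is loose but valid, and returns the wrong document when the bound is a constant that is too small.

```python
def best_match(posting_list, wdf_upper_bound):
    """Return (docid, weight) of the top document for a single term.

    Toy weight: the wdf itself. The loop stops early once the best weight
    seen so far reaches the claimed upper bound, because a valid bound
    guarantees that no later posting can score higher.
    """
    best = None
    for docid, wdf in sorted(posting_list.items()):
        if best is None or wdf > best[1]:
            best = (docid, wdf)
        if best[1] >= wdf_upper_bound:
            break  # only safe if wdf_upper_bound really is an upper bound
    return best


# Hypothetical posting list for one term: docid -> wdf.
postings = {1: 2, 2: 9, 3: 1}

# A loose but valid bound: no single wdf can exceed the sum of all wdfs.
loose_bound = sum(postings.values())

# An arbitrary constant that is NOT a valid bound (document 2 has wdf 9).
bogus_bound = 2

correct = best_match(postings, loose_bound)
wrong = best_match(postings, bogus_bound)
print(correct, wrong)
```

With the loose bound the loop never exits early and finds document 2; with the bogus constant it stops at document 1 and never sees the better match. A loose bound costs some speed, while an invalid one costs correctness.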
jiangwen jiang
2013-Aug-26  01:41 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
> For now, using weighting schemes which don't use document length is
> probably the simplest answer.

There's a tf-idf weighting scheme on svn master. Is it suitable for the
Lucene backend?

> You've made great progress!  I've started to look through the pull
> request and made some comments via github.

Thanks for your comments, I will update the code as soon as possible.

Regards
Olly Betts
2013-Aug-26  02:17 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
On Mon, Aug 26, 2013 at 09:41:07AM +0800, jiangwen jiang wrote:
> > For now, using weighting schemes which don't use document length is
> > probably the simplest answer.
>
> There's tf-idf weighting scheme on svn master, is it suitable for lucene
> backend?

Yes - TfIdfWeight doesn't ever use the document length (at least with
the normalisations currently implemented).

You could also use BM25 with parameter b=0.

Cheers,
    Olly
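As a rough illustration of why b=0 removes the dependence on document length, here is a small Python sketch of the term-frequency part of the textbook BM25 formula. This is an assumption-laden sketch, not Xapian's BM25Weight implementation (which has further parameters); the function name bm25_tf_component and the sample numbers are made up for the example.

```python
def bm25_tf_component(wdf, doc_length, avg_doc_length, k1=1.0, b=0.5):
    """Term-frequency part of textbook BM25.

    The document length only enters through the factor multiplied by b,
    so with b=0 the result depends on wdf and k1 alone.
    """
    norm_len = doc_length / avg_doc_length
    return wdf * (k1 + 1) / (wdf + k1 * ((1 - b) + b * norm_len))


# Same wdf, very different (hypothetical) document lengths.
short_b0 = bm25_tf_component(wdf=3, doc_length=10, avg_doc_length=100, b=0.0)
long_b0 = bm25_tf_component(wdf=3, doc_length=1000, avg_doc_length=100, b=0.0)

short_b05 = bm25_tf_component(wdf=3, doc_length=10, avg_doc_length=100, b=0.5)
long_b05 = bm25_tf_component(wdf=3, doc_length=1000, avg_doc_length=100, b=0.5)

print(short_b0, long_b0)    # identical when b=0
print(short_b05, long_b05)  # differ when b>0
```

With b=0 both calls give the same value whatever length is passed in, so a backend that cannot supply a true sum(wdf) document length does not affect the ranking in this formulation.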