Richard Boulton
2013-Jun-17 15:12 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
Ah, a quick follow-on from that: read
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/Similarity.html

There's a per-document "norm" which can be stored, which BM25Similarity
uses to store the document length. Additional factors can be stored in
DocValuesFields (which are very similar to document values in Xapian, in
that they're stored in separate sequences, though are a bit more flexible).

On 17 June 2013 16:06, Richard Boulton <richard at tartarus.org> wrote:

> You might want to look at how Lucene has implemented document length
> lookup for the BM25Similarity class (added in Lucene 4.0):
>
> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/BM25Similarity.html
>
> I assumed they're using a document payload for storing the lengths, but
> haven't looked into it.
>
> On 17 June 2013 14:28, jiangwen jiang <jiangwen127 at gmail.com> wrote:
>
>> *Or do you mean that it's one number per document whereas the other stats
>> are per database, so it's harder to store it?*
>>
>> Yes, I mean this. It's a huge amount of data. If a new doclength list
>> (containing all the doclengths in a list, like chert) is added by
>> myself, I am concerned about:
>> 1. This doclength list may be the bottleneck in this backend,
>>    http://trac.xapian.org/ticket/326
>> 2. Changing too much of the Lucene file format makes it hard to compare
>>    performance between Xapian and Lucene.
>>
>> Some ideas:
>> 1. Use a ranking algorithm without doclength, such as BM25Weight or
>>    TradWeight without doclength, or TfIdfWeight. But will ranking
>>    results be poor without doclength?
>> 2. Store doclength in the .prx payload when doing Lucene indexing.
>>    https://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/Payload.html
>>    http://searchhub.org/2009/08/05/getting-started-with-payloads/
>>    But this method has an obvious drawback: it doesn't work for general
>>    Lucene index data, where doclength is not stored.
>>
>> Any suggestions?
>>
>> Regards
>>
>> _______________________________________________
>> Xapian-devel mailing list
>> Xapian-devel at lists.xapian.org
>> http://lists.xapian.org/mailman/listinfo/xapian-devel
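The norm-as-document-length idea Richard describes can be sketched in plain Java. In Lucene 3.x, DefaultSimilarity's lengthNorm is 1/sqrt(numTerms), quantized to a single byte; inverting it gives an approximate document length. This is a minimal sketch of the maths only — the quantization step here is a simplified illustration, not Lucene's actual SmallFloat byte encoding, and the class and method names are invented for the example:

```java
// Sketch only: shows how an approximate doc length can be recovered from a
// lengthNorm of 1/sqrt(numTerms), and why the byte quantization makes the
// recovered length lossy. Not Lucene's real SmallFloat encoding.
public class NormDocLength {
    // Encode a document length as a lossy norm, quantized to 1/256 steps
    // to mimic (roughly) storing it in a single byte.
    static double encodeNorm(int numTerms) {
        double norm = 1.0 / Math.sqrt(numTerms);
        return Math.round(norm * 256.0) / 256.0;
    }

    // Invert the norm: length ~= 1 / norm^2.
    static long approxDocLength(double norm) {
        return Math.round(1.0 / (norm * norm));
    }

    public static void main(String[] args) {
        // Short documents round-trip exactly; long ones only approximately,
        // which is why Olly's sum(wdf) caveat matters.
        for (int len : new int[] {1, 16, 100, 1000}) {
            double norm = encodeNorm(len);
            System.out.println(len + " -> norm " + norm
                    + " -> approx length " + approxDocLength(norm));
        }
    }
}
```

Running it shows that lengths like 16 survive the round trip while 100 and 1000 come back as nearby approximations — which is exactly the "similar to doc length, but not sum(wdf)" property discussed later in the thread.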
jiangwen jiang
2013-Aug-20 11:28 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
Hi, guys:

I think norm(t, d) in Lucene can be used to calculate a number which is
similar to doc length (see norm(t,d) in
http://lucene.apache.org/core/3_5_0/api/all/org/apache/lucene/search/Similarity.html#formula_norm).
And this feature is applied in this pull request
(https://github.com/xapian/xapian/pull/25). Here's the information about
the new features and the performance test:

This is a patch for a Lucene 3.6.2 backend. It only supports Lucene 3.6.2
and is not fully tested; I'm sending this patch wondering if it works for
the idea
http://trac.xapian.org/wiki/ProjectIdeas#BackendforLuceneformatindexes.
So far, few features are supported:
1. Single term search.
2. 'AND' search, but the performance needs to be optimized.
3. Multiple segments.
4. Doc length, using .nrm instead.

Additionally:
1. xxx_lower_bound, xxx_upper_bound and total doc length are not
   supported. These data do not exist in the Lucene backend; I've used
   constants instead, so the search results may not be good.
2. Compound files are not supported, so the compound file must be
   disabled when indexing.

I've built a performance test of 1,000,000 documents (actually, I
downloaded a single file from wiki which includes 1,000,000 lines, and
treated each line as a document). When doing single term search, the
performance of the Lucene backend is as fast as Xapian's chert.

Test environment: OS: virtual machine Ubuntu, CPU: 1 core, MEM: 800M.
242 terms, doing a single term search per term, and calculating the
total time used for these 242 searches (results fluctuate, so I give 10
results per backend):
1. backend Lucene
   1540ms, 1587ms, 1516ms, 1706ms, 1690ms, 1597ms, 1376ms, 1570ms,
   1218ms, 1551ms
2. backend Chert
   1286ms, 1626ms, 1575ms, 1771ms, 1661ms, 1662ms, 1808ms, 1341ms,
   1688ms, 1809ms

Code for testing is quest.cc; you can look at that file for details.
Code for Lucene indexing is like this (and Xapian indexing used
example/simpleindex.cc):

    IndexWriter indexWriter = new IndexWriter(directory,
            new EnglishAnalyzer(Version.LUCENE_36),
            IndexWriter.MaxFieldLength.UNLIMITED);
    indexWriter.setUseCompoundFile(false); // compound file must be disabled
    int lineId = 0;
    // Read lines from the input file, treating each line as a document.
    while (br.ready()) {
        lineId++;
        String origLine = br.readLine();
        origLine = origLine.trim();
        Document doc = new Document();
        doc.add(new Field("data", origLine, Field.Store.YES,
                Field.Index.ANALYZED));
        doc.add(new Field("dataorigin", origLine, Field.Store.YES,
                Field.Index.NOT_ANALYZED));
        doc.add(new Field("lid", String.valueOf(lineId), Field.Store.YES,
                Field.Index.NOT_ANALYZED));
        indexWriter.addDocument(doc);
    }

2013/6/17 Richard Boulton <richard at tartarus.org>

> Ah, a quick follow-on from that: read
> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/Similarity.html
>
> There's a per-document "norm" which can be stored, which BM25Similarity
> uses to store the document length. Additional factors can be stored in
> DocValuesFields (which are very similar to document values in Xapian, in
> that they're stored in separate sequences, though are a bit more flexible).
>
> On 17 June 2013 16:06, Richard Boulton <richard at tartarus.org> wrote:
>
>> You might want to look at how Lucene has implemented document length
>> lookup for the BM25Similarity class (added in Lucene 4.0):
>>
>> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/BM25Similarity.html
>>
>> I assumed they're using a document payload for storing the lengths, but
>> haven't looked into it.
>>
>> On 17 June 2013 14:28, jiangwen jiang <jiangwen127 at gmail.com> wrote:
>>
>>> *Or do you mean that it's one number per document whereas the other
>>> stats are per database, so it's harder to store it?*
>>>
>>> Yes, I mean this. It's a huge amount of data.
>>> If a new doclength list (containing all the doclengths in a list,
>>> like chert) is added by myself, I am concerned about:
>>> 1. This doclength list may be the bottleneck in this backend,
>>>    http://trac.xapian.org/ticket/326
>>> 2. Changing too much of the Lucene file format makes it hard to
>>>    compare performance between Xapian and Lucene.
>>>
>>> Some ideas:
>>> 1. Use a ranking algorithm without doclength, such as BM25Weight or
>>>    TradWeight without doclength, or TfIdfWeight. But will ranking
>>>    results be poor without doclength?
>>> 2. Store doclength in the .prx payload when doing Lucene indexing.
>>>    https://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/Payload.html
>>>    http://searchhub.org/2009/08/05/getting-started-with-payloads/
>>>    But this method has an obvious drawback: it doesn't work for
>>>    general Lucene index data, where doclength is not stored.
>>>
>>> Any suggestions?
>>>
>>> Regards
>>>
>>> _______________________________________________
>>> Xapian-devel mailing list
>>> Xapian-devel at lists.xapian.org
>>> http://lists.xapian.org/mailman/listinfo/xapian-devel
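The benchmark described above (total wall-clock time over a batch of single-term searches, with the whole batch repeated to see the fluctuation) can be sketched in plain Java. Everything here is illustrative rather than taken from quest.cc: the in-memory inverted index stands in for the real Lucene/chert backends, and all class and method names are invented for the example:

```java
import java.util.*;

// Illustrative benchmark harness, not the actual quest.cc test: it times a
// batch of single-term lookups against a stand-in in-memory inverted index.
public class SingleTermBench {
    // term -> posting list of document ids (the stand-in "backend").
    static Map<String, List<Integer>> index = new HashMap<>();

    static void indexDoc(int docId, String line) {
        for (String term : line.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        }
    }

    static List<Integer> search(String term) {
        return index.getOrDefault(term, Collections.emptyList());
    }

    // Run every query once and return total elapsed milliseconds.
    static long timeBatch(List<String> queries) {
        long start = System.nanoTime();
        for (String q : queries) {
            search(q);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        indexDoc(1, "the quick brown fox");
        indexDoc(2, "the lazy dog");
        List<String> queries = Arrays.asList("the", "fox", "dog");
        // As in the post: repeat the whole batch several times, since
        // individual runs fluctuate.
        for (int run = 1; run <= 3; run++) {
            System.out.println("run " + run + ": " + timeBatch(queries) + "ms");
        }
        System.out.println("docs matching \"the\": " + search("the"));
    }
}
```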
Olly Betts
2013-Aug-25 01:11 UTC
[Xapian-devel] Backend for Lucene format indexes-How to get doclength
On Tue, Aug 20, 2013 at 07:28:42PM +0800, jiangwen jiang wrote:
> I think norm(t, d) in Lucene can be used to calculate a number which is
> similar to doc length (see norm(t,d) in
> http://lucene.apache.org/core/3_5_0/api/all/org/apache/lucene/search/Similarity.html#formula_norm).

It sounds similar (especially if document and field boosts aren't in
use), though some places may rely on the doc_length = sum(wdf)
definition - in particular, some other measure of length may violate
assumptions like wdf <= doc_length.

For now, using weighting schemes which don't use document length is
probably the simplest answer.

> And this feature is applied in this pull request
> (https://github.com/xapian/xapian/pull/25). Here's the information
> about the new features and the performance test:

You've made great progress! I've started to look through the pull
request and made some comments via github.

> This is a patch for a Lucene 3.6.2 backend. It only supports
> Lucene 3.6.2 and is not fully tested; I'm sending this patch wondering
> if it works for the idea
> http://trac.xapian.org/wiki/ProjectIdeas#BackendforLuceneformatindexes.
> So far, few features are supported:
> 1. Single term search.
> 2. 'AND' search, but the performance needs to be optimized.
> 3. Multiple segments.
> 4. Doc length, using .nrm instead.
>
> Additionally:
> 1. xxx_lower_bound, xxx_upper_bound and total doc length are not
>    supported. These data do not exist in the Lucene backend; I've used
>    constants instead, so the search results may not be good.

You should simply not define these methods for your backend - Xapian has
fall-back versions (used for inmemory) which will then be used. If you
return some constant which isn't actually a valid bound, the matcher
will make invalid assumptions while optimising, resulting in incorrect
search results.

> 2. Compound files are not supported, so the compound file must be
>    disabled when indexing.
> I've built a performance test of 1,000,000 documents (actually, I
> downloaded a single file from wiki which includes 1,000,000 lines, and
> treated each line as a document). When doing single term search, the
> performance of the Lucene backend is as fast as Xapian's chert.
> Test environment: OS: virtual machine Ubuntu, CPU: 1 core, MEM: 800M.
> 242 terms, doing a single term search per term, and calculating the
> total time used for these 242 searches (results fluctuate, so I give
> 10 results per backend):
> 1. backend Lucene
>    1540ms, 1587ms, 1516ms, 1706ms, 1690ms, 1597ms, 1376ms, 1570ms,
>    1218ms, 1551ms
> 2. backend Chert
>    1286ms, 1626ms, 1575ms, 1771ms, 1661ms, 1662ms, 1808ms, 1341ms,
>    1688ms, 1809ms

So this benchmark is pretty much meaningless because of the incorrect
constant bounds in use.

Cheers,
    Olly
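Olly's warning about invalid constant bounds can be made concrete with a toy simulation. This is not Xapian's actual matcher code — the class, method, and the pruning rule are simplified inventions — but it shows the failure mode: an optimiser that trusts a claimed per-term weight upper bound to stop early will return wrong results whenever the claimed bound is lower than a real score.

```java
// Toy illustration (not Xapian's matcher): a best-document search that
// trusts `claimedUpperBound` to cut the scan short. A valid bound (>= every
// real score) is safe; an invalid constant bound silently drops the true
// best document.
public class BoundPruning {
    static int bestDoc(double[] scores, double claimedUpperBound) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int doc = 0; doc < scores.length; doc++) {
            // Optimisation: once the current best reaches the claimed
            // bound, no later document can (supposedly) beat it.
            if (bestScore >= claimedUpperBound) {
                break;
            }
            if (scores[doc] > bestScore) {
                bestScore = scores[doc];
                best = doc;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] scores = {0.4, 0.9, 1.7};
        // Valid bound: scans far enough to find the true best, doc 2.
        System.out.println(bestDoc(scores, 2.0));
        // Invalid constant bound 0.5: stops after doc 1, which "reaches"
        // the bound, and wrongly reports doc 1 as the best.
        System.out.println(bestDoc(scores, 0.5));
    }
}
```

The same logic is why the fall-back bounds Olly mentions are safe: they are conservative (always valid), so the optimisation only ever skips work it provably doesn't need.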