[Replying to xapian-devel, as I think a wider audience would be useful]

On Mon, Oct 21, 2013 at 11:24:51PM +0800, jiangwen jiang wrote:
> yes, it's less efficient. Lucene database has multiple segments, each
> segment can treat as a independent database. The same term may exists
> in > 1 segments.

Sorry for taking a while to respond - I've been both busy and mulling
this over.

I think that perhaps the best way to map this into Xapian is for each
Lucene "segment" to be handled as a database in Xapian, and use the
multi-database support to search them together.

That's likely to need some adjustments to the multi-database support,
but I think otherwise we'll end up duplicating a lot of that machinery
in the Lucene backend anyway.

I've not looked at the Lucene file structure with this in mind yet
though - do you see any obvious problems with this approach?

> Xapian::TermIterator it = db_in.allterms_begin();
> This method traverse all terms in the first segment, then the second
> segment, until the last segment.

Iteration over all terms should return the terms in sorted order (by
byte value) and without duplicates, neither of which is achieved by
handling each segment in turn like this. But we already handle merging
allterms lists for multiple databases.

Cheers,
Olly
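To illustrate the ordering requirement above, here is a minimal standalone sketch of merging per-segment term lists into one sorted, duplicate-free list. It is not Xapian's actual multi-database code: `merge_allterms` is a hypothetical helper, and it assumes each segment's terms are already sorted by byte value with no repeats inside a segment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Xapian): merge the term lists of several
// segments into a single list that is sorted and contains each term once.
// std::string comparison orders by unsigned byte value, which matches the
// "sorted by byte value" requirement.
std::vector<std::string>
merge_allterms(const std::vector<std::vector<std::string>>& segments)
{
    // Heap entry: (current term, index of the segment it came from).
    using Entry = std::pair<std::string, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::vector<std::size_t> pos(segments.size(), 0);

    for (std::size_t i = 0; i < segments.size(); ++i) {
        // Assumption: every input list is already sorted.
        assert(std::is_sorted(segments[i].begin(), segments[i].end()));
        if (!segments[i].empty()) heap.emplace(segments[i][0], i);
    }

    std::vector<std::string> merged;
    while (!heap.empty()) {
        Entry top = heap.top();
        heap.pop();
        // Emit the term unless another segment already produced it.
        if (merged.empty() || merged.back() != top.first)
            merged.push_back(top.first);
        // Advance within the segment the term came from.
        std::size_t seg = top.second;
        if (++pos[seg] < segments[seg].size())
            heap.emplace(segments[seg][pos[seg]], seg);
    }
    return merged;
}
```

Walking the segments one after another would instead produce a list that is neither globally sorted nor free of duplicates.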
> I think that perhaps the best way to map this into Xapian is for each
> Lucene "segment" to be handled as a database in Xapian, and use the
> multi-database support to search them together.

Yes, there were two choices at the beginning:
1. Use multi-database.
2. Treat the Lucene database as a single database.

In the end I chose 2. That was a long time ago and I am not quite sure
why the decision was made. Possibly:
1. So that we can handle multiple Lucene databases.
2. I was not sure whether multi-database could meet the requirements.
   For example, when getting the doc_freq (how many documents contain
   the term) of a particular term, I actually want the sum of that
   term's doc_freq over all Lucene segments, and I am not sure Xapian's
   multi-database does it this way.

Do you think multi-database is a better way to handle a Lucene database?

> But we already handle merging allterms lists for multiple databases.

If the term lists are merged, I think that is the most appropriate way
to solve this issue.
On Thu, Oct 31, 2013 at 10:24:26AM +0800, jiangwen jiang wrote:
> Yes, there's two choices at the beginning:
> 1. Using multi-database.
> 2. Treat lucene database as a single database.
> Finally, I choose 2. It's a long time ago, I am not quiet sure why this
> decision is made, maybe:
> 1. We can handle multiple lucene databases.

That should be possible with multi-databases - you'd just end up with a
subdatabase for each segment in all of the lucene databases.

If we make Xapian::Database's constructor create a Database object with
one sub database per segment in the Lucene database, this sort of thing
should just work:

    Xapian::Database db;
    db.add_database(Xapian::Database("/path/to/lucene1"));
    db.add_database(Xapian::Database("/path/to/lucene2"));
    db.add_database(Xapian::Database("/path/to/xapian1"));
    db.add_database(Xapian::Database("/path/to/xapian2"));

> 2. I am not sure if multi-database can meet the requirements, such as:
> Getting a doc_freq(how many documents contains the term) of a particular
> term, actuallly, I want
> get sum of doc_freq of a particular term in all lucene segments, I am
> not sure xapian multi-database do it this way.

A multi-database does sum the "doc_freq" over all subdatabases.

In general, multi-databases act just like a single database with the
same contents. There's one exception - when generating an ESet, you can
ask it to approximate statistics by extrapolating from the first sub
database rather than summing over all of them, but you can also tell it
to calculate the exact statistics instead. This just offers a trade-off
between speed and exactness.

> Do you think multi-database is a better way to handle lucene database?

I think so - it seems a natural fit. Sorry for not thinking of this
earlier.

Cheers,
Olly
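As a rough illustration of the "doc_freq" summing described above, here is a toy model in plain C++. It is not how Xapian implements it: `Segment` and `combined_termfreq` are made-up names, and the sketch assumes each document lives in exactly one segment, so per-segment counts can simply be added. In the real API the combined figure would come from calling Xapian::Database::get_termfreq() on the multi-database.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy stand-in for one Lucene segment (or one Xapian subdatabase):
// maps a term to the number of documents in that segment containing it.
using Segment = std::map<std::string, unsigned>;

// Hypothetical helper: document frequency of `term` over all segments,
// computed as the sum of the per-segment frequencies.
unsigned combined_termfreq(const std::vector<Segment>& segments,
                           const std::string& term)
{
    unsigned total = 0;
    for (const Segment& seg : segments) {
        auto it = seg.find(term);
        // A segment that does not contain the term contributes nothing.
        if (it != seg.end()) total += it->second;
    }
    return total;
}
```

With segments recording 3 and 2 documents for "cat", the combined value is 5, which is what a single database holding the same documents would report.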