Alexandre Dulaunoy
2005-Mar-20 11:43 UTC
[Xapian-discuss] Xapian and quartz scalability - feedback of current users
Hi All,

We would like to run some tests with Xapian and the quartz backend on a large set of sample test documents (around 50 million to start with). The quartz backend seems very flexible, and the scalability document on the web site (http://www.xapian.org/docs/scalability.html) describes a possible way to implement a kind of cluster for concurrent search indexing and asynchronous updating. We were wondering whether there are any users of quartz on the list using a clustering approach.

What is (are) the classical design(s)? On what basis are the quartz databases separated? How is updating handled to provide continuous service? How is the cluster organized? How are you dealing with unresponsive systems that are part of the cluster?

Is there any other free software component available for job allocation (updating, compacting and the like) inside a quartz/xapian cluster?

Is there any technical comparison between Nutch/Lucene and Xapian/Quartz regarding large-scale indexes?

Thanks a lot for any feedback,

adulau
Olly Betts
2005-Mar-22 11:52 UTC
[Xapian-discuss] Xapian and quartz scalability - feedback of current users
On Sun, Mar 20, 2005 at 12:42:42PM +0100, Alexandre Dulaunoy wrote:
> We would like to run some tests with Xapian and the quartz backend on
> a large set of sample test documents (around 50 million to start with).

I'm very interested to hear reports of such tests. I've done some myself, but there's a danger that tuning which helps one situation hinders others.

You should be aware that I'm in the process of overhauling quartz. My plan is to clone the quartz backend once we've moved to SVN (which should be in the next week or two), then replace parts of it. So quartz won't be destabilised, and the new database format can be fluid initially without annoying people trying to actually use Xapian!

I've done much of the design now, though most is on paper or in my head - I need to type it up so others can take a look. A few things are already implemented (e.g. there's a patch for zlib compression) and I've already folded some simple compatible changes into quartz in CVS (so in 0.9.0 databases will be more compact both before and after quartzcompact).

But this actually means benchmarking quartz would be very useful at this point. It gives a baseline, and we can then track how things change (hopefully for the better) as development progresses.

> The quartz backend seems very flexible, and the scalability document
> on the web site (http://www.xapian.org/docs/scalability.html)
> describes a possible way to implement a kind of cluster for concurrent
> search indexing and asynchronous updating. We were wondering whether
> there are any users of quartz on the list using a clustering approach.

Webtop used a system sort of like this, but sadly the source isn't open. They actually used the muscat36 backend (it was either pre-quartz or quartz was still rather experimental - I don't recall which). But the system would look pretty similar anyway.
If you use quartzcompact's new merge facility (in CVS only currently), then you can build many databases of (say) a few million documents in parallel without much need for synchronisation - just partition the job and wait for them all to finish. Then you merge the built databases together - either all at once, or N at a time in parallel until you have just one. I've not benchmarked quartzcompact merging yet - I merged about 43 databases with just under 500,000 documents each for gmane in one pass, and it coped pretty well.

> What is (are) the classical design(s)?

N-way merging to produce the inverted file is textbook stuff. It's really just the old "sorting and merging from external store" approach, which is never really obsoleted by faster computers with larger memory - the dataset size where you start to need it just rises too.

As a general point, you want to try to design things such that the indexing processes don't need to communicate (ideally at all, though one-way async communication is pretty harmless - e.g. a web crawling process generating URLs from links and spooling them to a file).

> On what basis are the quartz databases separated?

If you're searching over several unmerged databases, try to make each of them a representative sample of the whole corpus, as Xapian by default approximates term frequencies by looking at those in one database (the first I think, but check to be sure!). This is for efficiency.

> How is updating handled to provide continuous service?

You can search a database which is being updated, but if updates are being flushed at a frequency such that a search may span more than one flush, searches may be forced to restart (something the quartz overhaul should fix). If you aren't so concerned with new content being searchable right away, it's simpler to build the database, run it through quartzcompact and then add it to those searched.

> How is the cluster organized?
> How are you dealing with unresponsive systems that are part of the
> cluster?

Xapian::ErrorHandler allows some control of this. Webtop used it and seemed reasonably happy with it I think, but there may be a better approach. If a system goes completely unresponsive, you probably don't want to keep waiting for timeouts from it...

> Is there any other free software component available for job
> allocation (updating, compacting and the like) inside a quartz/xapian
> cluster?

I suspect people just roll their own with Python or perl or similar. It would be good to include some sample scripts at least if anyone has some.

> Is there any technical comparison between Nutch/Lucene and
> Xapian/Quartz regarding large-scale indexes?

Not that I know of. Divmod switched from Lucene to Xapian, and the only negative comment was that Xapian databases are larger. *If* the working set is also larger (it's not at all obvious if it would be or not), that means we'll scale less well once everything is I/O bound. But the quartz overhaul should reduce the database sizes quite substantially as well as reducing the working set size.

> Thanks a lot for any feedback,

No problem.

Cheers,
Olly
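
For readers who haven't met the "sorting and merging from external store" idea mentioned in the reply above, here is a minimal Python sketch of merging independently built partial indexes, either all at once or N at a time until one remains. It does not use Xapian or quartzcompact at all: the function names (merge_in_rounds, to_postings) and the (term, docid) tuple layout are assumptions made purely for illustration, and the runs are kept in memory where a real merge would stream them from disk.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_in_rounds(runs, fan_in):
    """Merge sorted runs `fan_in` at a time until a single run remains.

    Each run is a list of (term, docid) pairs already sorted by term and
    then docid. It stands in for one partial index built independently.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    runs = [list(run) for run in runs]
    while len(runs) > 1:
        # Each group of `fan_in` runs is independent of the others, so in
        # a real system these merges could run in parallel.
        runs = [
            list(heapq.merge(*runs[i:i + fan_in]))
            for i in range(0, len(runs), fan_in)
        ]
    return runs[0] if runs else []


def to_postings(merged):
    """Collapse a sorted run of (term, docid) pairs into posting lists."""
    return [
        (term, [docid for _, docid in group])
        for term, group in groupby(merged, key=itemgetter(0))
    ]


# Three hypothetical partial indexes, each built without any coordination.
partial_indexes = [
    [("apple", 1), ("cat", 3)],
    [("apple", 2), ("bat", 5)],
    [("cat", 4)],
]

merged = merge_in_rounds(partial_indexes, fan_in=2)
index = to_postings(merged)
```

Setting fan_in to the number of runs gives the "all at once" case; a smaller fan_in gives the "N at a time" case, at the cost of extra passes over the data.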