Josef Novak
2008-Dec-01 11:11 UTC
[Xapian-discuss] database stubs: practical limitations, rules of thumb?
Hi,

Is there any standing recommendation on the use of database stubs with Xapian? Is there a rule of thumb in terms of a size + number_of_dbs limit for a stub? Aside from disk I/O, how does having the individual dbs located on a remote machine factor into stub usage?

I've been searching the lists a bit, looking for posts on the usage of stubs, but I only found one highly-relevant-looking thread,

http://lists.tartarus.org/pipermail/xapian-discuss/2006-August/002533.html

and the doc overview,

http://xapian.org/docs/overview.html

and it seems, if the rather old thread is still relevant, that there is a fairly low limit to the number of dbs one can corral into a single stub without incurring a fairly stiff performance hit.

In my current scenario, I have several thousand different dbs, each one associated with a specific geographic location, and I'm trying to come up with an optimal way of spreading load over multiple dbs and multiple machines. At present I direct queries at the appropriate location-based db whenever I can confirm the location unequivocally. For queries which I know less about, or nothing about, rather than creating stubs I've opted to create a hierarchy of larger, location-based dbs, following a community < city < county < state < toplevel format, where each city-level db incorporates all community data, each county incorporates all city data, etc. This appears to be considerably faster and, given the thread above, would appear to be the preferred way to proceed.

However, this means that my larger dbs are each 'all in one place', and are effectively less robust. My intuition is that it would make the most sense to shard each larger city, county, etc. db based on overall size (and perhaps access statistics), and distribute the shards over a group of different machines, but I wonder if there is a rule of thumb in terms of shard size and number of shards per stub. If not, I guess I'll just have to experiment!

Cheers
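[For readers following along: per the overview doc linked above, a stub database is just a text file in which each line names a database to open; searches over the stub run over all of them together. A minimal sketch of a county-level stub combining two city-level dbs plus one served from another machine might look like the following (all paths and the hostname are hypothetical):]

```
auto /srv/xapian/city/springfield
auto /srv/xapian/city/shelbyville
remote search2.example.com:33333
```

The `auto` lines let Xapian detect the backend of each local db, while the `remote` line reaches a database served over TCP on another host.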
Olly Betts
2008-Dec-02 06:40 UTC
[Xapian-discuss] database stubs: practical limitations, rules of thumb?
On Mon, Dec 01, 2008 at 08:11:12PM +0900, Josef Novak wrote:
> Is there any standing recommendation on the use of database stubs
> with xapian? Is there a rule of thumb in terms of size+number_of_dbs
> limit for a stub? Aside from disk I/O, how does having the individual
> dbs located on a remote machine factor into stub usage?
>
> I've been searching the lists a bit, looking for posts on the usage
> of stubs, but I only found one highly-relevant-looking thread,
> http://lists.tartarus.org/pipermail/xapian-discuss/2006-August/002533.html

Well, what's there isn't specific to stubs, but a generic point about searching over a large number of databases.

I'm not aware of anyone who has benchmarked opening a large number of local or remote databases. If you want to try, I'd certainly be interested to hear.

I just did a very quick timing test - a loop which just opens and closes the same database 5000 times takes about 0.85 seconds with flint (and 0.7 seconds with chert). That should be a lower bound on how long a search over that many different databases would take. You really want searches to take under a second or they'll "feel slow", so if you try to search over 5000 databases together you'll probably have frustrated users.

There's probably scope for reducing this overhead by profiling to find ways to speed up opening a database, but I suspect it's still going to be a bad idea to try to search thousands of databases together.

> and it seems, if the rather old thread is still relevant, that there
> is a fairly low limit to the number of dbs one can corral into a
> single stub, without incurring a fairly stiff performance hit.

I think you're reading a meaning I didn't intend there. I'm really just saying it is pointless benchmarking a few thousand databases versus one big one, as the big one is clearly going to be significantly faster.

> This appears to be considerably faster, and given the thread above,
> would appear to be the preferred way to proceed.
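[To put the timing numbers just quoted in per-database terms, here is a back-of-the-envelope sketch - pure arithmetic on the reported figures, no Xapian required:]

```python
# Per-database open/close overhead implied by the quick test above:
# 5000 open/close cycles took ~0.85 s with flint and ~0.7 s with chert.
flint_total_s = 0.85
chert_total_s = 0.70
n = 5000

flint_per_open_ms = flint_total_s / n * 1000   # ~0.17 ms per database
chert_per_open_ms = chert_total_s / n * 1000   # ~0.14 ms per database

def open_overhead_s(n_dbs, per_open_ms=flint_per_open_ms):
    """Lower bound on the fixed open-time cost of a search over n_dbs databases."""
    return n_dbs * per_open_ms / 1000

print(f"per open (flint): {flint_per_open_ms:.3f} ms")
print(f"overhead for  100 dbs: {open_overhead_s(100):.4f} s")
print(f"overhead for 5000 dbs: {open_overhead_s(5000):.2f} s")
```

So with a sub-second search budget, a few hundred databases cost well under a tenth of a second just to open, while thousands of them eat most of the budget before any matching happens.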
> However this means that my larger dbs are each 'all in one place',
> and are effectively less robust. My intuition is that it would make
> the most sense to shard each larger city, county, etc. db, based on
> overall size (and perhaps access statistics), and distribute the
> shards over a group of different machines, but I wonder if there is a
> rule of thumb in terms of shard size, and number of shards per stub.
> If not I guess I'll just have to experiment!

I don't know of any previous experiments in this area I'm afraid. Do let us know how you get on...

Cheers,
    Olly
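[As a sketch of the sharding idea discussed above: a per-county stub could point each shard at a different machine via the remote backend, so losing one host only loses part of that index. Hostnames and ports here are hypothetical:]

```
remote shard0.example.com:33000
remote shard1.example.com:33001
remote shard2.example.com:33002
```

Each line would be backed by a Xapian remote server (e.g. xapian-tcpsrv) serving that shard on the named host; how many shards per stub stays acceptable is exactly the open-question the thread leaves to experiment.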