Josef Novak
2008-Dec-01 11:11 UTC
[Xapian-discuss] database stubs: practical limitations, rules of thumb?
Hi,

Is there any standing recommendation on the use of database stubs with Xapian? Is there a rule of thumb in terms of a size + number_of_dbs limit for a stub? Aside from disk I/O, how does having the individual dbs located on a remote machine factor into stub usage?

I've been searching the lists a bit, looking for posts on the usage of stubs, but I only found one highly-relevant-looking thread,

http://lists.tartarus.org/pipermail/xapian-discuss/2006-August/002533.html

and the doc overview,

http://xapian.org/docs/overview.html

and it seems, if the rather old thread is still relevant, that there is a fairly low limit to the number of dbs one can corral into a single stub without incurring a fairly stiff performance hit.

In my current scenario, I have several thousand different dbs, each one associated with a specific geographic location, and I'm trying to come up with an optimal way of spreading load over multiple dbs and multiple machines. At present I direct queries at the appropriate location-based db whenever I can confirm the location unequivocally. For queries which I know less about, or nothing about, rather than creating stubs I've opted to create a hierarchy of larger, location-based dbs, following a community < city < county < state < toplevel format, where each city-level db incorporates all community data, each county incorporates all city data, etc. This appears to be considerably faster and, given the thread above, would appear to be the preferred way to proceed.

However, this means that my larger dbs are each 'all in one place', and are effectively less robust. My intuition is that it would make the most sense to shard each larger city, county, etc. db based on overall size (and perhaps access statistics), and distribute the shards over a group of different machines, but I wonder if there is a rule of thumb in terms of shard size and number of shards per stub. If not, I guess I'll just have to experiment!

Cheers
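[For readers following along: per the overview doc linked above, a stub database is just a text file in which each line names a database to open; searches over the stub run over all of them together. A minimal sketch of a county-level stub combining two city-level dbs plus one served from another machine might look like the following (all paths and the hostname are hypothetical):]

```
auto /srv/xapian/city/springfield
auto /srv/xapian/city/shelbyville
remote search2.example.com:33333
```

The `auto` lines let Xapian detect the backend of each local db, while the `remote` line reaches a database served over TCP on another host.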
Olly Betts
2008-Dec-02 06:40 UTC
[Xapian-discuss] database stubs: practical limitations, rules of thumb?
On Mon, Dec 01, 2008 at 08:11:12PM +0900, Josef Novak wrote:
> Is there any standing recommendation on the use of database stubs
> with xapian? Is there a rule of thumb in terms of size+number_of_dbs
> limit for a stub? Aside from disk I/O, how does having the individual
> dbs located on a remote machine factor into stub usage?
>
> I've been searching the lists a bit, looking for posts on the usage
> of stubs, but I only found one highly-relevant-looking thread,
> http://lists.tartarus.org/pipermail/xapian-discuss/2006-August/002533.html

Well, what's there isn't specific to stubs, but a generic point about searching over a large number of databases.

I'm not aware of anyone who has benchmarked opening a large number of local or remote databases. If you want to try, I'd certainly be interested to hear.

I just did a very quick timing test - a loop which just opens and closes the same database 5000 times takes about 0.85 seconds with flint (and 0.7 seconds with chert). That should be a lower bound on how long a search over that many different databases would take. You really want searches to take under a second or they'll "feel slow", so if you try to search over 5000 databases together you'll probably have frustrated users.

There's probably scope for reducing this overhead by profiling to find ways to speed up opening a database, but I suspect it's still going to be a bad idea to try to search thousands of databases together.

> and it seems, if the rather old thread is still relevant, that there
> is a fairly low limit to the number of dbs one can corral into a
> single stub, without incurring a fairly stiff performance hit.

I think you're reading a meaning I didn't intend there. I'm really just saying it is pointless benchmarking a few thousand databases versus one big one, as the big one is clearly going to be significantly faster.

> This appears to be considerably faster, and given the thread above,
> would appear to be the preferred way to proceed.
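[To put the timing numbers just quoted in per-database terms, here is a back-of-the-envelope sketch - pure arithmetic on the reported figures, no Xapian required:]

```python
# Per-database open/close overhead implied by the quick test above:
# 5000 open/close cycles took ~0.85 s with flint and ~0.7 s with chert.
flint_total_s = 0.85
chert_total_s = 0.70
n = 5000

flint_per_open_ms = flint_total_s / n * 1000   # ~0.17 ms per database
chert_per_open_ms = chert_total_s / n * 1000   # ~0.14 ms per database

def open_overhead_s(n_dbs, per_open_ms=flint_per_open_ms):
    """Lower bound on the fixed open-time cost of a search over n_dbs databases."""
    return n_dbs * per_open_ms / 1000

print(f"per open (flint): {flint_per_open_ms:.3f} ms")
print(f"overhead for  100 dbs: {open_overhead_s(100):.4f} s")
print(f"overhead for 5000 dbs: {open_overhead_s(5000):.2f} s")
```

So with a sub-second search budget, a few hundred databases cost well under a tenth of a second just to open, while thousands of them eat most of the budget before any matching happens.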
> However this means that my larger dbs are each 'all in one place',
> and are effectively less robust. My intuition is that it would make
> the most sense to shard each larger city, county, etc. db, based on
> overall size (and perhaps access statistics), and distribute the
> shards over a group of different machines, but I wonder if there is a
> rule of thumb in terms of shard size, and number of shards per stub.
> If not I guess I'll just have to experiment!

I don't know of any previous experiments in this area I'm afraid. Do let us know how you get on...

Cheers,
    Olly
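[As a sketch of the sharding idea discussed above: a per-county stub could point each shard at a different machine via the remote backend, so losing one host only loses part of that index. Hostnames and ports here are hypothetical:]

```
remote shard0.example.com:33000
remote shard1.example.com:33001
remote shard2.example.com:33002
```

Each line would be backed by a Xapian remote server (e.g. xapian-tcpsrv) serving that shard on the named host; how many shards per stub stays acceptable is exactly the open-question the thread leaves to experiment.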