Philip Neustrom
2006-Mar-26 03:02 UTC
[Xapian-discuss] Spreading a database across multiple machines
Hey all,

I'm working on a project that contains lots of little sub-sites which I want to act autonomously. Right now each site has its own Xapian database, and when a search is performed that site-specific database is queried. I want to add a 'global' search across all of these databases, but I also want the individual site searches to behave, when run individually, as if each site's database were the only one.

It seems like the logical thing to do would be to create a Database object and then call add_database() for each database. However, I'm looking at a situation in which there could be thousands of independent databases, and doing add_database() for each possible site seems like it could be inefficient.

Is there a way to maintain a single database that can be queried on a site-specific basis and act like a site-specific one -- e.g. with the probabilities/results weighted according to some site-specific tag? (Then, if I wanted to divide the master database up due to space concerns, I would do so using add_database(), but it would logically be one large database.)

--Philip Neustrom
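The add_database() approach described above can be sketched roughly as follows, assuming the Python Xapian bindings (the `xapian` module); the helper name and paths are hypothetical, so treat this as an outline rather than tested code:

```python
# Sketch of combining many per-site Xapian databases into one logical
# read-only database, assuming the Python Xapian bindings. The helper
# name and paths are hypothetical.
try:
    import xapian
except ImportError:
    xapian = None  # bindings not installed; the sketch below still parses

def open_global_database(site_db_paths):
    """Combine many per-site databases into one logical Database."""
    if xapian is None:
        raise RuntimeError("the Python Xapian bindings are not installed")
    combined = xapian.Database()
    for path in site_db_paths:
        # Each add_database() call opens that site's on-disk tables, so
        # with thousands of sites this costs file descriptors and time.
        combined.add_database(xapian.Database(path))
    return combined

# Hypothetical usage -- search all sites at once:
# db = open_global_database(["/srv/sites/a/db", "/srv/sites/b/db"])
# enquire = xapian.Enquire(db)
```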
Olly Betts
2006-Mar-30 13:15 UTC
[Xapian-discuss] Spreading a database across multiple machines
On Sat, Mar 25, 2006 at 06:01:59PM -0800, Philip Neustrom wrote:
> It seems like the logical thing to do would be to create a Database
> object and then add_database() for each database. However, I'm
> looking at a situation in which there could be thousands of
> independent databases, and doing add_database() for each possible site
> seems like it could be inefficient in this case.

You'll eventually hit the per-process file descriptor limit too.

> Is there a way to maintain a single database that can be queried on a
> site-specific basis and act like it's a site-specific -- e.g. the
> probability/results are weighted according to some site-specific tag?

No. The problem is that you can't calculate those statistics efficiently from the information stored. Precalculating them as content is added might be possible, but is a big change.

Are the statistics from a combined database different enough to matter? If so, I'd suggest building a merged database for the global search, but keeping the individual databases if you want the stats to be exact for each subcollection.

If you're using flint, then xapian-compact has a "--multipass" option which will cope with merging thousands of databases. I suspect quartzcompact won't cope, but you can always merge in several goes by hand.

Cheers,
Olly
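To illustrate why the combined-database statistics can matter: probabilistic weighting schemes such as BM25 include an inverse-document-frequency component that depends on how many documents are in the database being searched and how many of them contain the term. A toy stdlib-Python sketch (not Xapian code; all numbers invented) shows how a term that is rare on one site but common collection-wide gets weighted very differently:

```python
import math

# Toy illustration (not Xapian code; the numbers are invented) of why
# merged-database statistics differ from per-site ones. Probabilistic
# weighting gives each term an idf-like component, roughly log(N / df),
# where N is the number of documents in the database searched and df is
# how many of them contain the term.

def idf(n_docs, doc_freq):
    """Simplified inverse-document-frequency component: log(N / df)."""
    return math.log(n_docs / doc_freq)

# A term that is rare on one small site but common across the collection:
site_idf = idf(n_docs=500, doc_freq=5)                # one site's database
global_idf = idf(n_docs=2_000_000, doc_freq=800_000)  # merged database

print(f"per-site idf: {site_idf:.2f}")  # ~4.61 -- looks highly selective
print(f"global idf:   {global_idf:.2f}")  # ~0.92 -- looks common

# Because these differ, ranking a site-restricted query against the merged
# database's statistics will not reproduce the site-only ranking exactly,
# hence the suggestion to keep the per-site databases when exact
# per-subcollection stats matter.
```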