I'm planning to add multiple-database support for searches to my "Xappy"
python wrapper (more on this wrapper later, but for now, see
http://code.google.com/p/xappy for details). This is reasonably
straightforward, because Xapian supports this nicely: except that
"Xappy" generates a "fieldname->prefix" mapping automatically. The
prefix which corresponds to a particular field is therefore hidden from
the user, and crucially, it may be different in different databases.

My current plan is to add a "databaseID" term to each document, and
construct a composite query. For example, the search "author:foo"
across databases with ids "db1" and "db2", where the prefix for author
in db1 is "A" and the prefix for author in db2 is "B", would become:

  (Afoo FILTER db1) OR (Bfoo FILTER db2)

This should give the right sort of results, but the statistics for the
terms will be a bit broken. (Actually, I'm not totally convinced
they'll be broken in a harmful way, because if the term is more
frequent in one collection than another, this could correspond to it
being more significant when it occurs in the collection in which it is
less frequent.) At some point it would be nice to add the ability to
have a mapping from "human-readable field name" to "prefix code" inside
Xapian, so the multidatabase stuff could be aware of this issue and
generate the prefixes correctly for each database. However, that's not
urgent, and not what I'm thinking about right now.

It would also be nice to have a "virtual" posting list, which
effectively returned a list of all the document IDs in a particular
database, so I didn't have to explicitly store the "databaseID" terms.
But that's also not what I'm thinking about right now.

I was thinking about how such a query could be processed optimally. If
the sub-databases were remote databases, the search would be executed
against each database separately, and in each database all but one of
the bracketed clauses would quickly degrade to an EmptyPostList
(because the filter term wouldn't be present in that database). The
results of these searches would then be combined, and this would be
pretty much optimal in terms of processing for the combined query.

However, if the sub-databases are local, the search will be performed
across all the sub-databases in parallel, and because the document IDs
for each database are interleaved, none of the bracketed clauses will
be able to degrade, or even do much "skipping" of document IDs based on
the filter term.

One way to fix this would be to add a flag (or similar mechanism)
telling a multiple database to generate composite IDs by sequentially
combining the databases; so DB1 might have IDs from 1 to 13498 and DB2
might have IDs from 13499 onwards. This would result in the bracketed
clauses skipping all document IDs which don't correspond to the
database selected by their filter term, and should therefore be a lot
more efficient.

It could also improve the disk-access pattern when doing a simpler
multiple database search, because each database will be processed in
turn rather than skipping from one database to another.

Of course, this scheme relies on the document IDs used by each database
being relatively compact, and would result in the document IDs in a
multidatabase changing each time the highest document ID in the first
database changed, so isn't a perfect scheme by any means.
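To make the difference between the two numbering schemes concrete,
here's a rough sketch in Python (the interleaved form is what I believe
a multi-database does today; the function names and arguments are just
made up for illustration):

  # Interleaved (current behaviour, as I understand it): the databases'
  # ids alternate, so sub-database db_index of n_databases contributes
  # every n_databases-th merged id.
  def interleaved_docid(local_id, db_index, n_databases):
      return (local_id - 1) * n_databases + db_index + 1

  # Sequential (the proposed flag): offset each database's ids by the
  # highest id used by the databases before it, so each sub-database
  # occupies one contiguous range of merged ids.
  def sequential_docid(local_id, db_index, last_docids):
      return local_id + sum(last_docids[:db_index])

So with last document ids of 13498 and 20000, document 1 of DB2 maps to
merged id 2 when interleaved, but 13499 when sequential - which is what
would let the bracketed clauses skip whole ranges.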
Also, there are probably ways in which using this kind of document ID
merging scheme could harm performance, but we could only really tell by
doing some tests to investigate the performance difference.

Another approach is to allow the remote-database style of multi-database
search to be used for local multi-database searches - ie, compute the
interesting part of the mset for each database separately, and then
merge them together. This can result in a lot more documents being
considered than necessary, though (particularly if the part of the mset
requested is large, or starts at a high index).

Thoughts welcomed - I'm not planning to work on this in the imminent
future, but I thought starting a discussion of it, or just mentioning it
to let the ideas ferment a while in our minds, would be worthwhile.

--
Richard
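P.S. In case it helps to see the shape of the composite query, this is
roughly how it would be built with the Python bindings (just a sketch;
the "XDB" prefix for the databaseID terms is made up for the example):

  import xapian

  # author:foo across db1 (author prefix "A") and db2 (author prefix
  # "B"), each clause filtered to its own database by a databaseID term.
  q_db1 = xapian.Query(xapian.Query.OP_FILTER,
                       xapian.Query("Afoo"), xapian.Query("XDBdb1"))
  q_db2 = xapian.Query(xapian.Query.OP_FILTER,
                       xapian.Query("Bfoo"), xapian.Query("XDBdb2"))
  query = xapian.Query(xapian.Query.OP_OR, q_db1, q_db2)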
On Wed, Oct 10, 2007 at 02:09:51AM +0100, Richard Boulton wrote:
> I'm planning to add multiple-database support for searches to my "Xappy"
> python wrapper (more on this wrapper later, but for now, see
> http://code.google.com/p/xappy for details). This is reasonably
> straightforward, because Xapian supports this nicely: except that
> "Xappy" generates a "fieldname->prefix" mapping automatically. The
> prefix which corresponds to a particular field is therefore hidden from
> the user, and crucially, it may be different in different databases.

I think the simplest solution here would be to just use the user's
fieldname as the prefix. So the "shoe_size" field could be mapped to
"XSHOE_SIZE". You could add special handling for standard prefixes if
you wish. If you want case sensitivity of field names, you could either
just eschew the usual Xapian scheme, or provide some sort of encoding
for the case.

> One way to fix this would be to add a flag (or similar mechanism)
> telling a multiple database to generate composite IDs by sequentially
> combining the databases; so DB1 might have IDs from 1 to 13498 and DB2
> might have IDs from 13499 onwards.
[...]
> Of course, this scheme relies on the document IDs used by each database
> being relatively compact, and would result in the document IDs in a
> multidatabase changing each time the highest document ID in the first
> database changed, so isn't a perfect scheme by any means.

I think it would be useful to support this in some way anyway.
Interleaving isn't a perfect solution either. Really its main benefit is
simply that it does provide stable merged document ids even if the
constituent databases are updated.

> Another approach is to allow the remote-database style of multi-database
> search to be used for local multi-database searches - ie, compute the
> interesting part of the mset for each database separately, and then
> merge them together. This can result in a lot more documents being
> considered than necessary, though (particularly if the part of the mset
> requested is large, or starts at a high index).

If the local results are computed sequentially, you could use the
minimum MSet weight from the first match as the initial min-weight for
the second match. If you merge each new MSet in as you go, this would
allow each match to do progressively less work.

Cheers,
    Olly
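P.S. A very rough, untested sketch of the "merge as you go" idea with
the Python bindings, in case it helps (the helper and its arguments are
invented for illustration, and it ignores the per-database prefixes and
the combined statistics):

  import xapian

  def merged_search(db_paths, query, maxitems):
      merged = []          # (weight, db_index, docid), best first
      weight_floor = 0.0   # smallest weight that could still make the cut
      for db_index, path in enumerate(db_paths):
          enquire = xapian.Enquire(xapian.Database(path))
          enquire.set_query(query)
          # Seed this match with the minimum weight from the merged
          # results so far, so it can reject documents earlier.
          enquire.set_cutoff(0, weight_floor)
          for item in enquire.get_mset(0, maxitems):
              merged.append((item.weight, db_index, item.docid))
          merged.sort(reverse=True)
          del merged[maxitems:]
          if len(merged) == maxitems:
              weight_floor = merged[-1][0]
      return merged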
On 10/9/07, Richard Boulton <richard at lemurconsulting.com> wrote:
[...]
> Of course, this scheme relies on the document IDs used by each database
> being relatively compact, and would result in the document IDs in a
> multidatabase changing each time the highest document ID in the first
> database changed, so isn't a perfect scheme by any means.
>
> Also, there are probably ways in which using this kind of document ID
> merging scheme could harm performance, but we could only really tell by
> doing some tests to investigate the performance difference.

I will definitely measure the performance differences on my 52 million
document engine and will give the true performance numbers, because
indexing and searching performance is my top priority.

Kevin Duraj