On Fri, Feb 26, 2021 at 11:54:44AM +0000, Eric Wong
wrote:> I want to be able to run queries like:
>
> something in a giant databases AND kw:seen
>
> something in a giant databases AND NOT kw:seen
>
> Where "kw:" would only be stored in the small DB and keyed
> against docs in the giant DB by a SHA-<1|256>, or
> $PATHNAME:$DOCID, even.
>
> It's entirely possible I missed something, but there's currently
> no way in Xapian to combine DBs in a way similar to RDBMS JOINs,
> is there?
There's not a "JOIN" feature as such. It's been suggested
before, but
we don't have a clear plan for how it would work (either from the
outside or internally.)
And it really needs to be efficient (at least for typical use cases) or
it's just an annoying trap for developers - they'd build a system with
"JOIN" as part of the design only to find performance is
disappointing.
I'm not clear exactly you want to join on (SHA of what or pathname of
what?), but if you can efficiently map the Xapian::docid in the main
database to a Xapian::docid or unique ID in the auxiliary database then
a custom PostingSource subclass would work here (and the auxiliary
database wouldn't even need to be a Xapian database, or indeed a
database at all.) The `kw:seen` subquery would be a PostingSource.
> I'm using Perl Search::Xapian from Debian stable (buster).
Unfortunately that doesn't wrap PostingSource.
But note that since 1.2.7 the case of adding and/or removing boolean
terms from an existing document is optimised - only the posting lists
of the added/removed boolean terms and the termlists of the affected
documents are rewritten, not the posting lists for all terms in the
documents. This kicks in for code like:
Xapian::Document doc = db.get_document(docid);
doc.add_boolean_term("Kseen");
db.replace_document(did, doc);
It wouldn't be as fast to update as an entirely separate database, but
if you've rejected this approach based on trying it with a much older
Xapian version then it may be worth reconsidering.
Cheers,
Olly