Olly Betts <olly at survex.com> wrote:> On Fri, Feb 26, 2021 at 11:54:44AM +0000, Eric Wong wrote:
> > I want to be able to run queries like:
> >
> > something in a giant databases AND kw:seen
> >
> > something in a giant databases AND NOT kw:seen
> >
> > Where "kw:" would only be stored in the small DB and keyed
> > against docs in the giant DB by a SHA-<1|256>, or
> > $PATHNAME:$DOCID, even.
> >
> > It's entirely possible I missed something, but there's
currently
> > no way in Xapian to combine DBs in a way similar to RDBMS JOINs,
> > is there?
>
> There's not a "JOIN" feature as such. It's been
suggested before, but
> we don't have a clear plan for how it would work (either from the
> outside or internally.)
>
> And it really needs to be efficient (at least for typical use cases) or
> it's just an annoying trap for developers - they'd build a system
with
> "JOIN" as part of the design only to find performance is
disappointing.
Totally agreed :>
> I'm not clear exactly you want to join on (SHA of what or pathname of
> what?), but if you can efficiently map the Xapian::docid in the main
> database to a Xapian::docid or unique ID in the auxiliary database then
> a custom PostingSource subclass would work here (and the auxiliary
> database wouldn't even need to be a Xapian database, or indeed a
> database at all.) The `kw:seen` subquery would be a PostingSource.
Erm, the SHA would be a boolean term or stored in docdata of the
underlying document (a git blob). $PATHNAME would be the
normalized pathname of the Xapian DB (or some stable integer
mapping).
The actual set of giant, read-only Xapian DBs would be volatile
and subject to constant change depending on which DBs a user is
interested in. The docids within each of these giant, read-only
Xapian DBs is stable, however.
> > I'm using Perl Search::Xapian from Debian stable (buster).
>
> Unfortunately that doesn't wrap PostingSource.
Oh well. Writing a small Python daemon using the SWIG bindings
could be an option(*)...
> But note that since 1.2.7 the case of adding and/or removing boolean
> terms from an existing document is optimised - only the posting lists
> of the added/removed boolean terms and the termlists of the affected
> documents are rewritten, not the posting lists for all terms in the
> documents. This kicks in for code like:
>
> Xapian::Document doc = db.get_document(docid);
> doc.add_boolean_term("Kseen");
> db.replace_document(did, doc);
>
> It wouldn't be as fast to update as an entirely separate database, but
> if you've rejected this approach based on trying it with a much older
> Xapian version then it may be worth reconsidering.
Thanks for the info. I'm doing exactly this for one of the
somewhat large DBs, but can't for the majority of them.
The majority of the large DBs are shared by several users and
read-only to each Unix user. Each user will need to write small
bits of private metadata (seen, replied, flagged, etc...) into a
per-user DB.
(*) I don't expect users of my software to use non-distro packages,
nor do I want non-source distributions of my work (copyleft
is easier to enforce when it's all scripts :).
So everything is Perl or some Inline::C (libinline-c-perl), so far;
but forking a Python (or Ruby/Tcl) script to use SWIG bindings
could be acceptable.