Olly Betts <olly at survex.com> wrote:
> On Tue, Aug 26, 2025 at 02:31:15AM +0000, Eric Wong wrote:
> > Olly Betts <olly at survex.com> wrote:
> > > Do you process in ascending NNTP_ARTICLE_NUMBER order?
> > >
> > > If so you should get sequential update mode for tables like the
> > > "data" one.  If not you're probably triggering random access mode
> > > for all tables.
> >
> > Yes, it's ascending, but I use ->replace_document to keep docid
> > matching NNTP_ARTICLE_NUMBER so each shard sees docids
> > incrementing by SHARD_COUNT instead of 1.  Would doing
> > replace_document with ascending docid gaps allow glass to work
> > in sequential mode?
>
> Yes - what matters is that each item added to the table goes immediately
> after the previously added one, which is still true if there are unused
> docids between them.
>
> Are you saying the docids in the shards match NNTP_ARTICLE_NUMBER
> and so one has 1, 4, 7, ...; another 2, 5, 8, ...; the third 3, 6, 9,
> ...?

Yes, exactly.  Good to know.

One caveat is that one of the indexers will avoid creating new Xapian
docs for cross-posted messages and instead add new List-Ids to existing
docs.

For example, if one message gets cross-posted to multiple mailing lists
and we process each mailing list sequentially, the initial message would
be indexed with List-Id:<a.example.com>.

However, somewhere down the line while we're processing
List-Id:<b.example.com>, we'll add the new List-Id value to the original
message we saw (possibly millions of messages ago), so non-sequential
performance ends up being important, too.

IOW, if a message is cross-posted to a dozen lists, we end up doing
replace_document on the same docid a dozen times (ick!)

Thus, for the initial index, I'm working on altering the strategy to
perform all deduplication first, and only then index the message in
Xapian with a single replace_document carrying all known List-Ids.

> I'd have gone for making the docids in the combined database match
> NNTP_ARTICLE_NUMBER, which would mean they're sequential in each shard
> (except if there are ever gaps in NNTP_ARTICLE_NUMBER) and the smaller
> docid values and smaller gaps between them will encode a little more
> efficiently.

Understood; but would it still be possible to do parallel indexing with
that scheme?  Since NNTP article numbers are allocated sequentially,
they round-robin across the shards to allow parallelism during indexing
(I rely on Perl to extract terms and such, so there's a CPU-limited
component).

Changing the way docids are allocated now could be very disruptive to
users with existing DBs and might be a maintainability/support
nightmare.

> (Also means you could grow SHARD_COUNT times larger before you run out of
> docid space, though with 3 shards and your numbering you can still have
> ~1.4 billion documents so it sounds like you're nowhere near that being
> an issue.)

Yes :)

> You could also then use a sharded WritableDatabase object and let Xapian
> do the sharding for you:
>
>     $db->replace_document($nntp_article_number, $doc);
>
> The docids in the sharded database would also then match NNTP article
> numbers at search time.

Yeah, I actually /just/ noticed WritableDatabase supports shards while
rechecking the docs this week.  I see it was added in the 1.3.x days,
but I started with 1.2.x and supported 1.2 for ages due to LTS distros.

And I suppose using the combined WritableDatabase feature would require
using a single process for indexing and lose parallelism.
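For context, what each indexing worker does per shard currently boils
down to roughly the following.  This is a simplified, untested sketch
rather than the actual public-inbox code; the "idx/*" paths and sub
names are made up, and I'm showing the Search::Xapian bindings:

    # sketch only, not the actual public-inbox indexing code
    use strict;
    use warnings;
    use Search::Xapian;

    my $SHARD_COUNT = 3;
    my @shards = map {
        Search::Xapian::WritableDatabase->new("idx/$_",
                Search::Xapian::DB_CREATE_OR_OPEN)
    } (0 .. $SHARD_COUNT - 1);

    # docid == NNTP article number, so shard 0 sees docids 1, 4, 7, ...;
    # shard 1 sees 2, 5, 8, ...; shard 2 sees 3, 6, 9, ...  Within each
    # shard the docids only ever increase (with gaps of SHARD_COUNT),
    # which per the above still keeps glass in sequential update mode.
    sub index_article {
        my ($art_num, $doc) = @_;
        my $shard = $shards[($art_num - 1) % $SHARD_COUNT];
        $shard->replace_document($art_num, $doc);
    }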
The main reason I've been using shards is to parallelize the
CPU-intensive Perl portions, and possibly parallelize I/O on SSDs where
contention is less likely to be a problem.

So, a side question: even on ext4 and ignoring cross-posted messages, I
notice Xapian shard commits taking more time as the shards get bigger.
Committing shards one-at-a-time doesn't seem to help, so it doesn't seem
bound by I/O contention with 3 shards (I capped the default shard count
at 3 back in 2018 due to I/O contention).

Thus I'm considering an option to split the shards into epochs during
the indexing phase, leaving the original set (0, 1, 2) untouched above a
certain interval (say >100K) until the end of indexing.

During indexing, there'd be a (0.1, 1.1, 2.1) set of shards for
100K..199999, a (0.2, 1.2, 2.2) set for 200K..299999, and so forth.
To finalize indexing, `xapian-compact --no-renumber' would combine all
the 0.* into 0, 1.* into 1, and 2.* into 2 to maintain compatibility
with existing readers.

One downside of this approach is needing much more temporary space, so
it can't be the default; but I'm hoping the extra work required by
compact would offset the high commit times for giant shards when adding
a lot of messages to the index.

Small incremental indexing jobs would continue to write directly to
(0, 1, 2); only large jobs would use the epochs.
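The finalisation step would then be roughly this (untested sketch; the
shard paths and the ".new" destination name are made up):

    use strict;
    use warnings;

    # sketch only; relies on the docid ranges of idx/$n and idx/$n.*
    # never overlapping, which --no-renumber requires when merging
    for my $n (0 .. 2) {
        my @epochs = glob("idx/$n.*");
        system('xapian-compact', '--no-renumber',
               "idx/$n", @epochs, "idx/$n.new") == 0
            or die "xapian-compact failed on shard $n: $?\n";
        # ...then swap idx/$n.new into place and delete the epoch shards
    }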
Does that sound reasonable?  Thanks.

On Wed, Aug 27, 2025 at 05:56:49AM +0000, Eric Wong wrote:
> One caveat is that one of the indexers will avoid creating new Xapian
> docs for cross-posted messages and instead add new List-Ids to
> existing docs.
>
> For example, if one message gets cross-posted to multiple mailing
> lists and we process each mailing list sequentially, the initial
> message would be indexed with List-Id:<a.example.com>.
>
> However, somewhere down the line while we're processing
> List-Id:<b.example.com>, we'll add the new List-Id value to the
> original message we saw (possibly millions of messages ago), so
> non-sequential performance ends up being important, too.
>
> IOW, if a message is cross-posted to a dozen lists, we end up doing
> replace_document on the same docid a dozen times (ick!)

If you do replace_document() with a Document object you got from
get_document() and use its existing docid, then the update is optimised
provided you've not modified the database in between.  I'm not clear
how the "List-Id" is stored, but e.g. if it's a boolean term then only
that term's posting list is actually updated.

> > I'd have gone for making the docids in the combined database match
> > NNTP_ARTICLE_NUMBER, which would mean they're sequential in each shard
> > (except if there are ever gaps in NNTP_ARTICLE_NUMBER) and the smaller
> > docid values and smaller gaps between them will encode a little more
> > efficiently.
>
> Understood; but would it still be possible to do parallel indexing
> with that scheme?  Since NNTP article numbers are allocated
> sequentially, they round-robin across the shards to allow parallelism
> during indexing (I rely on Perl to extract terms and such, so there's
> a CPU-limited component).

If the existing approach works, the new one should - it's really just
the same except the docids in the shards are changed by this mapping:

    new_docid = (old_docid + 2) / 3    (using integer division)

> Changing the way docids are allocated now could be very disruptive to
> users with existing DBs and might be a maintainability/support
> nightmare.

Yes.  You could perhaps store a flag in a user metadata entry in the DB
and use that to select the mapping functions to use.

Not sure what the overhead reduction would actually amount to - cases
where the gap between consecutive entries in a posting list for a term
is between 43 and 128 documents will reduce from 2 bytes to 1 byte each
time.  Probably about 2/3 of keys containing docids would reduce in
size by a byte too.  I'd guess it's probably noticeable but not
dramatic.
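To make the metadata-flag idea concrete, something like this should
work (untested; the 'docid_scheme' key and its values are just made-up
examples, and I'm assuming the Perl bindings expose
set_metadata()/get_metadata() like the C++ API does):

    use strict;
    use warnings;
    use Search::Xapian;

    my $db = Search::Xapian::WritableDatabase->new('idx/0',
            Search::Xapian::DB_CREATE_OR_OPEN);

    # A shard created with the new allocation would record the flag
    # once, e.g.:  $db->set_metadata('docid_scheme', 'divided');
    #
    # Old DBs have no such entry, so get_metadata() returns '' and we
    # fall through to the legacy scheme where the per-shard docid is
    # the NNTP article number itself.
    my $to_docid = $db->get_metadata('docid_scheme') eq 'divided'
        ? sub { int(($_[0] + 2) / 3) }  # new_docid = (old_docid + 2) / 3
        : sub { $_[0] };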
> Yeah, I actually /just/ noticed WritableDatabase supports shards while
> rechecking the docs this week.  I see it was added in the 1.3.x days,
> but I started with 1.2.x and supported 1.2 for ages due to LTS
> distros.
>
> And I suppose using the combined WritableDatabase feature would
> require using a single process for indexing and lose parallelism.

Yes, so that's a reason to keep doing the sharding yourself.

> So, a side question: even on ext4 and ignoring cross-posted messages,
> I notice Xapian shard commits taking more time as the shards get
> bigger.  Committing shards one-at-a-time doesn't seem to help, so it
> doesn't seem bound by I/O contention with 3 shards (I capped the
> default shard count at 3 back in 2018 due to I/O contention).

This is with Xapian 1.4.x and the glass backend?

Before that, a commit required writing O(file size of DB) data because
the freelist for a table was stored in a bitmap with one bit per block
in the table.  This was not problematic for smaller databases, but
because we need to ensure this data is actually synced to disc and we
can only write it out just before we sync it, it gradually caused more
I/O contention as the DB grew in size.

Glass instead stores the freelist in blocks which are themselves on the
freelist.  The table data still needs to be synced, though that gets
written out over a period of time so has more chance to get written to
disk before we sync it.  I'd guess that's what you're seeing.

> Thus I'm considering an option to split the shards into epochs during
> the indexing phase, leaving the original set (0, 1, 2) untouched above
> a certain interval (say >100K) until the end of indexing.
>
> During indexing, there'd be a (0.1, 1.1, 2.1) set of shards for
> 100K..199999, a (0.2, 1.2, 2.2) set for 200K..299999, and so forth.
> To finalize indexing, `xapian-compact --no-renumber' would combine all
> the 0.* into 0, 1.* into 1, and 2.* into 2 to maintain compatibility
> with existing readers.
>
> One downside of this approach is needing much more temporary space, so
> it can't be the default; but I'm hoping the extra work required by
> compact would offset the high commit times for giant shards when
> adding a lot of messages to the index.
>
> Small incremental indexing jobs would continue to write directly to
> (0, 1, 2); only large jobs would use the epochs.
>
> Does that sound reasonable?  Thanks.

Yes, that's roughly what I'd do if I wanted to maximise the indexing
rate for an initial build of the DB.

You could try picking the size of each x.y to be indexable as a single
commit so all the merging happens via xapian-compact.

Cheers,
    Olly