Olly Betts <olly at survex.com> wrote:
> On Thu, Aug 21, 2025 at 01:05:10AM +0000, Eric Wong wrote:
> > Hello, I'm trying to get Xapian to work better on btrfs, which is
> > prone to fragmentation regardless of whether or not btrfs CoW is
> > enabled.
> >
> > Thus, I'm wondering if running xapian-compact occasionally during
> > a multi-day indexing run can improve indexing performance.
>
> tl;dr: I'd expect it to harm performance.

OK, thanks.

> Glass database tables have two update modes - sequential and random
> access.  Each table automatically switches to/from sequential based
> on its update pattern.  E.g. if you only add_document() then some
> tables should mostly operate in sequential mode, which is why you'll
> often see significant differences in how compact the different
> tables are.

Ah, interesting to know (more below).

> > I'll be using the BTRFS_IOC_DEFRAG ioctl to periodically defrag
> > glass files after some (probably not all) transaction commits.
>
> Note the format is block-based and as long as individual blocks
> (which are 8KB by default) aren't fragmented I would not expect
> fragmentation at the filesystem level to make much difference to
> search performance.  We sometimes need to step to the next leaf
> block in tree order, but defragmenting at the filesystem level
> won't help that unless the leaf block order in the tree and in the
> file match up, which will generally only be the case right after
> compaction.

OK.  The performance problems could also be related to SQLite
fragmentation; I'm not sure yet.  Disabling btrfs CoW seems to improve
Xapian indexing significantly as shards get larger.  (The defrag call
I plan to use is sketched at the end of this mail.)

> > I've noticed that even on small, fresh imports (with few/minimal
> > deletes) compact can reduce file sizes by 20-60%, so I'm wondering
> > if compact before btrfs defrag is helpful even if I intend to add
> > more docs right after the compact+defrag.
>
> I'd be wary of compaction if you're about to index more unless you
> can benchmark and show it actually helps (in which case I'd be very
> curious how it is helping).

I'm mainly interested in saving space, since new messages typically
arrive at a much slower rate after an initial import.

> > I'm dealing with over 20 million docs across 3 (adjustable) shards
> > in parallel (Perl is probably a bottleneck, too :x).  Document
> > numbers are assigned to shards based on
> > $NNTP_ARTICLE_NUMBER % $SHARD_COUNT, so I rely on --no-renumber.
>
> Do you process in ascending NNTP_ARTICLE_NUMBER order?
>
> If so you should get sequential update mode for tables like the
> "data" one.  If not you're probably triggering random access mode
> for all tables.

Yes, it's ascending, but I use ->replace_document to keep docid
matching NNTP_ARTICLE_NUMBER, so each shard sees docids incrementing
by SHARD_COUNT instead of 1 (a simplified version of my indexing loop
is also at the end of this mail).  Would doing replace_document with
ascending docid gaps allow glass to work in sequential mode?

I am noticing commits to shards slow down as the shards get bigger
(even on ext4 instead of btrfs), so I'm wondering if I could get a
benefit by forcing sequential writes via add_document.  If needed, I
could probably use add_document to put placeholders in each shard and
only delete the placeholders before compact.

Thanks.
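
Here's roughly the defrag call I plan to use (an untested sketch:
0x50009402 is BTRFS_IOC_DEFRAG as derived from
_IOW(0x94, 2, struct btrfs_ioctl_vol_args) on 64-bit Linux, and the
shard path/glob below is made up):

    use strict;
    # BTRFS_IOC_DEFRAG == _IOW(0x94, 2, struct btrfs_ioctl_vol_args)
    use constant BTRFS_IOC_DEFRAG => 0x50009402;

    sub btrfs_defrag {
        my ($path) = @_;
        open(my $fh, '+<', $path) or die "open $path: $!";
        # a numeric 0 third arg passes NULL: defragment the whole file
        ioctl($fh, BTRFS_IOC_DEFRAG, 0) or warn "defrag $path: $!\n";
    }

    # e.g. after a transaction commit (path is hypothetical):
    btrfs_defrag($_) for glob('shard0/*.glass');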
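
And a simplified sketch of the current indexing loop (assuming the
SWIG Xapian.pm bindings; Search::Xapian is similar, and the names
here are made up):

    use Xapian ();

    my $shard_count = 3;
    my @shards = map {
        Xapian::WritableDatabase->new("shard$_", Xapian::DB_CREATE_OR_OPEN)
    } (0 .. $shard_count - 1);

    sub index_article {
        my ($nntp_article_number, $doc) = @_;
        # shard is chosen by article number modulo shard count:
        my $shard = $shards[$nntp_article_number % $shard_count];
        # an explicit docid keeps docid == NNTP article number, so
        # each shard sees ascending docids with gaps of $shard_count
        # (e.g. 2, 5, 8, ... in one shard when $shard_count is 3):
        $shard->replace_document($nntp_article_number, $doc);
    }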
On Tue, Aug 26, 2025 at 02:31:15AM +0000, Eric Wong wrote:
> Olly Betts <olly at survex.com> wrote:
> > Do you process in ascending NNTP_ARTICLE_NUMBER order?
> >
> > If so you should get sequential update mode for tables like the
> > "data" one.  If not you're probably triggering random access mode
> > for all tables.
>
> Yes, it's ascending, but I use ->replace_document to keep docid
> matching NNTP_ARTICLE_NUMBER, so each shard sees docids incrementing
> by SHARD_COUNT instead of 1.  Would doing replace_document with
> ascending docid gaps allow glass to work in sequential mode?

Yes - what matters is that each item added to the table goes
immediately after the previously added one, which is still true if
there are unused docids between them.

Are you saying the docids in the shards match NNTP_ARTICLE_NUMBER,
and so one shard has 1, 4, 7, ...; another 2, 5, 8, ...; and the
third 3, 6, 9, ...?

I'd have gone for making the docids in the combined database match
NNTP_ARTICLE_NUMBER, which would mean they're sequential in each
shard (except if there are ever gaps in NNTP_ARTICLE_NUMBER), and the
smaller docid values and smaller gaps between them will encode a
little more efficiently.

(That also means you could grow SHARD_COUNT times larger before you
run out of docid space, though with 3 shards and your numbering you
can still have ~1.4 billion (2^32 / 3) documents, so it sounds like
you're nowhere near that being an issue.)

You could also then use a sharded WritableDatabase object and let
Xapian do the sharding for you:

    $db->replace_document($nntp_article_number, $doc);

The docids in the sharded database would also then match NNTP article
numbers at search time.

Cheers,
    Olly
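
P.S. For concreteness, a rough sketch of the above (assuming
Xapian >= 1.4.6, which supports writing to a sharded
WritableDatabase; paths and names here are made up):

    use Xapian ();

    my $shard_count = 3;
    my $db = Xapian::WritableDatabase->new(); # starts with no shards
    $db->add_database(
        Xapian::WritableDatabase->new("shard$_", Xapian::DB_CREATE_OR_OPEN)
    ) for (0 .. $shard_count - 1);

    # Combined docid D lives in shard (D - 1) % $shard_count as
    # in-shard docid int((D - 1) / $shard_count) + 1, so making the
    # combined docid match the NNTP article number keeps each shard's
    # docids ascending by 1:
    sub index_article {
        my ($nntp_article_number, $doc) = @_;
        $db->replace_document($nntp_article_number, $doc);
    }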