Olly Betts <olly at survex.com> wrote:
> On Tue, Aug 26, 2025 at 02:31:15AM +0000, Eric Wong wrote:
> > Olly Betts <olly at survex.com> wrote:
> > > Do you process in ascending NNTP_ARTICLE_NUMBER order?
> > >
> > > If so you should get sequential update mode for tables like the
> > > "data" one. If not you're probably triggering
random access mode
> > > for all tables.
> >
> > Yes, it's ascending, but I use ->replace_document to keep docid
> > matching NNTP_ARTICLE_NUMBER so each shard sees docids
> > incrementing by SHARD_COUNT instead of 1. Would doing
> > replace_document with ascending docid gaps allow glass to work
> > in sequential mode?
>
> Yes - what matters is that each item added to the table goes immediately
> after the previously added one, which is still true if there are unused
> docids between them.
>
> Are you saying the docids in the shards match NNTP_ARTICLE_NUMBER
> and so one has 1, 4, 7, ...; another 2, 5, 8, ...; the third 3, 6, 9,
> ...?
Yes, exactly. Good to know.
One caveat: one of the indexers avoids creating new Xapian
docs for cross-posted messages, instead adding new List-Ids to
the existing docs.
For example, if one message gets cross-posted to multiple
mailing lists and we process each mailing list sequentially,
the initial copy would be indexed with List-Id:<a.example.com>.
However, when we get to processing List-Id:<b.example.com>
somewhere down the line, we'll add the new List-Id value to the
original doc we saw (possibly millions of messages ago), so
non-sequential performance does end up being important, too.
IOW, if a message is cross-posted to a dozen lists, we end up
doing replace_document on the same docid a dozen times (ick!).
Thus, for the initial index, I'm working on altering the
strategy to perform all deduplication up front and only then
index each message in Xapian with a single replace_document
carrying all of its known List-Ids.
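Roughly what I'm aiming for, as a sketch only (the `XL' List-Id
term prefix and the helper arguments below are made up for
illustration and aren't the actual public-inbox schema):

  use Xapian; # the xapian-bindings Perl module

  # index a message exactly once, with every List-Id known up front;
  # $artnum is the NNTP article number, reused as the Xapian docid
  sub index_msg_once {
          my ($shard_db, $artnum, $doc_data, @list_ids) = @_;
          my $doc = Xapian::Document->new;
          $doc->set_data($doc_data);
          $doc->add_boolean_term('XL' . lc($_)) for @list_ids;
          # ... add subject/body terms via Xapian::TermGenerator ...
          $shard_db->replace_document($artnum, $doc); # only once
  }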
> I'd have gone for making the docids in the combined database match
> NNTP_ARTICLE_NUMBER, which would mean they're sequential in each shard
> (except if there are ever gaps in NNTP_ARTICLE_NUMBER) and the smaller
> docid values and smaller gaps between them will encode a little more
> efficiently.
Understood; but would it be possible to continue doing parallel
indexing while NNTP article numbers are being allocated
sequentially? Since the article numbers are allocated
sequentially, they round-robin across the shards to allow
parallelism during indexing (I rely on Perl to extract terms
and such, so there's a CPU-limited component).
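Roughly, the current assignment looks like this (a sketch with a
hypothetical make_doc() helper, not the actual code):

  my $SHARD_COUNT = 3;

  # article number => shard index; each worker only handles its
  # own shard, so every shard sees ascending docids:
  # 1,4,7,... / 2,5,8,... / 3,6,9,...
  sub shard_for { ($_[0] - 1) % $SHARD_COUNT }

  sub worker_loop {
          my ($worker_idx, $shard_db, @artnums) = @_;
          for my $artnum (@artnums) {
                  next if shard_for($artnum) != $worker_idx;
                  my $doc = make_doc($artnum); # hypothetical helper
                  $shard_db->replace_document($artnum, $doc);
          }
  }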
Changing the way docids are allocated now could be very
disruptive to users with existing DBs and might be a
maintainability/support nightmare.
> (Also means you could grow SHARD_COUNT times larger before you run out of
> docid space, though with 3 shards and your numbering you can still have
> ~1.4 billion documents so it sounds like you're nowhere near that being
> an issue.)
Yes :)
> You could also then use a sharded WritableDatabase object and let Xapian
> do the sharding for you:
>
> $db->replace_document($nntp_article_number, $doc);
>
> The docids in the sharded database would also then match NNTP article
> numbers at search time.
Yeah, I actually /just/ noticed WritableDatabase supported
shards while rechecking the docs this week. I see it was added
in the 1.3.x days but I started with 1.2.x and supported 1.2
for ages due to LTS distros.
And I suppose using the combined WritableDatabase feature would
require a single indexing process and lose parallelism. The
main reason I've been using shards is to parallelize the
CPU-intensive Perl portions, and possibly to parallelize I/O on
SSDs where contention is less likely to be a problem.
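For reference, I imagine the single-process variant would look
something like the sketch below, assuming the Perl bindings
expose the no-argument WritableDatabase constructor and
add_database() the way the C++ API does (shard paths made up):

  use Xapian;

  my $db = Xapian::WritableDatabase->new; # starts with no shards
  for my $i (0..2) {
          $db->add_database(Xapian::WritableDatabase->new(
                  "xap15/$i", Xapian::DB_CREATE_OR_OPEN));
  }
  # Xapian routes docid N to shard (N - 1) % 3 by itself:
  $db->replace_document($nntp_article_number, $doc);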
So, a side question: even on ext4 and ignoring cross-posted
messages, I notice Xapian shard commits taking more time as the
shards get bigger. Committing shards one-at-a-time doesn't seem
to help, so it doesn't appear to be bound by I/O contention with
3 shards (I capped the default shard count at 3 back in 2018 due
to I/O contention).
Thus I'm considering adding an option to split the shards into
epochs during the indexing phase, leaving the original set
(0, 1, 2) untouched for article numbers above a certain interval
(say >100K) until the end of indexing.
During indexing, there'd be a (0.1, 1.1, 2.1) set of shards for
100K..199999, a (0.2, 1.2, 2.2) set for 200K..299999, and so
forth.
To finalize indexing, `xapian-compact --no-renumber' would
combine all the 0.* into 0, 1.* into 1, and 2.* into 2 to
maintain compatibility with existing readers.
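The finalize step would be roughly (a sketch only; paths are
made up, and the real thing needs locking, error checking and an
atomic swap):

  # merge epoch shards back into shard 0 (likewise for 1 and 2);
  # --no-renumber is fine here since the docid ranges are disjoint
  my @src = ('xap15/0', glob('xap15/0.*'));
  system('xapian-compact', '--no-renumber', @src, 'xap15/0.new') == 0
          or die "xapian-compact failed: $?";
  # ...then rename 0.new into place and remove the 0.* epochs once
  # no readers still have the old shard open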
One downside of this approach is needing much more temporary
space, so it can't be the default; but I'm hoping the extra work
required by compact will be more than offset by avoiding the
high commit times of giant shards when adding a lot of messages
to the index.
Small incremental indexing jobs would continue to write directly
to (0, 1, 2); only large jobs would use the epochs.
Does that sound reasonable? Thanks.