Olly Betts <olly at survex.com> wrote:
> On Tue, Aug 26, 2025 at 02:31:15AM +0000, Eric Wong wrote:
> > Olly Betts <olly at survex.com> wrote:
> > > Do you process in ascending NNTP_ARTICLE_NUMBER order?
> > >
> > > If so you should get sequential update mode for tables like the
> > > "data" one.  If not you're probably triggering random access mode
> > > for all tables.
> >
> > Yes, it's ascending, but I use ->replace_document to keep docid
> > matching NNTP_ARTICLE_NUMBER so each shard sees docids
> > incrementing by SHARD_COUNT instead of 1.  Would doing
> > replace_document with ascending docid gaps allow glass to work
> > in sequential mode?
>
> Yes - what matters is that each item added to the table goes immediately
> after the previously added one, which is still true if there are unused
> docids between them.
>
> Are you saying the docids in the shards match NNTP_ARTICLE_NUMBER
> and so one has 1, 4, 7, ...; another 2, 5, 8, ...; the third 3, 6, 9,
> ...?

Yes, exactly.  Good to know.

One caveat is that one of the indexers will avoid creating new Xapian
docs for cross-posted messages and instead add new List-Ids to existing
docs.

For example, if one message gets cross-posted to multiple mailing lists
and we process each mailing list sequentially, the initial message would
be indexed with List-Id:<a.example.com>.

However, somewhere down the line while we're processing
List-Id:<b.example.com>, we'll add the new List-Id value to the original
message we saw (possibly millions of messages ago), so non-sequential
performance ends up being important, too.

IOW, if a message is cross-posted to a dozen lists, we end up doing
replace_document on the same docid a dozen times (ick!)

Thus, for the initial index, I'm working on altering the strategy to
perform all deduplication first, and only then index the message in
Xapian with a single replace_document carrying all known List-Ids.

> I'd have gone for making the docids in the combined database match
> NNTP_ARTICLE_NUMBER, which would mean they're sequential in each shard
> (except if there are ever gaps in NNTP_ARTICLE_NUMBER) and the smaller
> docid values and smaller gaps between them will encode a little more
> efficiently.

Understood; but would it still be possible to do parallel indexing with
that scheme?  Since NNTP article numbers are allocated sequentially,
they round-robin across the shards to allow parallelism during indexing
(I rely on Perl to extract terms and such, so there's a CPU-limited
component).

Changing the way docids are allocated now could be very disruptive to
users with existing DBs and might be a maintainability/support
nightmare.

> (Also means you could grow SHARD_COUNT times larger before you run out of
> docid space, though with 3 shards and your numbering you can still have
> ~1.4 billion documents so it sounds like you're nowhere near that being
> an issue.)

Yes :)

> You could also then use a sharded WritableDatabase object and let Xapian
> do the sharding for you:
>
>     $db->replace_document($nntp_article_number, $doc);
>
> The docids in the sharded database would also then match NNTP article
> numbers at search time.

Yeah, I actually /just/ noticed WritableDatabase supports shards while
rechecking the docs this week.  I see it was added in the 1.3.x days,
but I started with 1.2.x and supported 1.2 for ages due to LTS distros.

And I suppose using the combined WritableDatabase feature would require
using a single process for indexing and lose parallelism.
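For context, what each indexing worker does per shard currently boils
down to roughly the following.  This is a simplified, untested sketch
rather than the actual public-inbox code; the "idx/*" paths and sub
names are made up, and I'm showing the Search::Xapian bindings:

    # sketch only, not the actual public-inbox indexing code
    use strict;
    use warnings;
    use Search::Xapian;

    my $SHARD_COUNT = 3;
    my @shards = map {
        Search::Xapian::WritableDatabase->new("idx/$_",
                Search::Xapian::DB_CREATE_OR_OPEN)
    } (0 .. $SHARD_COUNT - 1);

    # docid == NNTP article number, so shard 0 sees docids 1, 4, 7, ...;
    # shard 1 sees 2, 5, 8, ...; shard 2 sees 3, 6, 9, ...  Within each
    # shard the docids only ever increase (with gaps of SHARD_COUNT),
    # which per the above still keeps glass in sequential update mode.
    sub index_article {
        my ($art_num, $doc) = @_;
        my $shard = $shards[($art_num - 1) % $SHARD_COUNT];
        $shard->replace_document($art_num, $doc);
    }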
The main reason I've been using shards is to parallelize the
CPU-intensive Perl portions, and possibly parallelize I/O on SSDs where
contention is less likely to be a problem.

So, a side question: even on ext4 and ignoring cross-posted messages, I
notice Xapian shard commits taking more time as the shards get bigger.
Committing shards one-at-a-time doesn't seem to help, so it doesn't seem
bound by I/O contention with 3 shards (I capped the default shard count
at 3 back in 2018 due to I/O contention).

Thus I'm considering an option to split the shards into epochs during
the indexing phase, leaving the original set (0, 1, 2) untouched above a
certain interval (say >100K) until the end of indexing.

During indexing, there'd be a (0.1, 1.1, 2.1) set of shards for
100K..199999, a (0.2, 1.2, 2.2) set for 200K..299999, and so forth.
To finalize indexing, `xapian-compact --no-renumber' would combine all
the 0.* into 0, 1.* into 1, and 2.* into 2 to maintain compatibility
with existing readers.

One downside of this approach is needing much more temporary space, so
it can't be the default; but I'm hoping the extra work required by
compact would offset the high commit times for giant shards when adding
a lot of messages to the index.

Small incremental indexing jobs would continue to write directly to
(0, 1, 2); only large jobs would use the epochs.
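The finalisation step would then be roughly this (untested sketch; the
shard paths and the ".new" destination name are made up):

    use strict;
    use warnings;

    # sketch only; relies on the docid ranges of idx/$n and idx/$n.*
    # never overlapping, which --no-renumber requires when merging
    for my $n (0 .. 2) {
        my @epochs = glob("idx/$n.*");
        system('xapian-compact', '--no-renumber',
               "idx/$n", @epochs, "idx/$n.new") == 0
            or die "xapian-compact failed on shard $n: $?\n";
        # ...then swap idx/$n.new into place and delete the epoch shards
    }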
Does that sound reasonable?  Thanks.

On Wed, Aug 27, 2025 at 05:56:49AM +0000, Eric Wong wrote:
> One caveat is that one of the indexers will avoid creating new Xapian
> docs for cross-posted messages and instead add new List-Ids to
> existing docs.
>
> For example, if one message gets cross-posted to multiple mailing
> lists and we process each mailing list sequentially, the initial
> message would be indexed with List-Id:<a.example.com>.
>
> However, somewhere down the line while we're processing
> List-Id:<b.example.com>, we'll add the new List-Id value to the
> original message we saw (possibly millions of messages ago), so
> non-sequential performance ends up being important, too.
>
> IOW, if a message is cross-posted to a dozen lists, we end up doing
> replace_document on the same docid a dozen times (ick!)

If you do replace_document() with a Document object you got from
get_document() and use its existing docid, then the update is optimised
provided you've not modified the database in between.  I'm not clear
how the "List-Id" is stored, but e.g. if it's a boolean term then only
that term's posting list is actually updated.

> > I'd have gone for making the docids in the combined database match
> > NNTP_ARTICLE_NUMBER, which would mean they're sequential in each shard
> > (except if there are ever gaps in NNTP_ARTICLE_NUMBER) and the smaller
> > docid values and smaller gaps between them will encode a little more
> > efficiently.
>
> Understood; but would it still be possible to do parallel indexing
> with that scheme?  Since NNTP article numbers are allocated
> sequentially, they round-robin across the shards to allow parallelism
> during indexing (I rely on Perl to extract terms and such, so there's
> a CPU-limited component).

If the existing approach works, the new one should - it's really just
the same except the docids in the shards are changed by this mapping:

    new_docid = (old_docid + 2) / 3    (using integer division)

> Changing the way docids are allocated now could be very disruptive to
> users with existing DBs and might be a maintainability/support
> nightmare.

Yes.  You could perhaps store a flag in a user metadata entry in the DB
and use that to select the mapping functions to use.

Not sure what the overhead reduction would actually amount to - cases
where the gap between consecutive entries in a posting list for a term
is between 43 and 128 documents will reduce from 2 bytes to 1 byte each
time.  Probably about 2/3 of keys containing docids would reduce in
size by a byte too.  I'd guess it's probably noticeable but not
dramatic.
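To make the metadata-flag idea concrete, something like this should
work (untested; the 'docid_scheme' key and its values are just made-up
examples, and I'm assuming the Perl bindings expose
set_metadata()/get_metadata() like the C++ API does):

    use strict;
    use warnings;
    use Search::Xapian;

    my $db = Search::Xapian::WritableDatabase->new('idx/0',
            Search::Xapian::DB_CREATE_OR_OPEN);

    # A shard created with the new allocation would record the flag
    # once, e.g.:  $db->set_metadata('docid_scheme', 'divided');
    #
    # Old DBs have no such entry, so get_metadata() returns '' and we
    # fall through to the legacy scheme where the per-shard docid is
    # the NNTP article number itself.
    my $to_docid = $db->get_metadata('docid_scheme') eq 'divided'
        ? sub { int(($_[0] + 2) / 3) }  # new_docid = (old_docid + 2) / 3
        : sub { $_[0] };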
> Yeah, I actually /just/ noticed WritableDatabase supports shards while
> rechecking the docs this week.  I see it was added in the 1.3.x days,
> but I started with 1.2.x and supported 1.2 for ages due to LTS
> distros.
>
> And I suppose using the combined WritableDatabase feature would
> require using a single process for indexing and lose parallelism.

Yes, so that's a reason to keep doing the sharding yourself.

> So, a side question: even on ext4 and ignoring cross-posted messages,
> I notice Xapian shard commits taking more time as the shards get
> bigger.  Committing shards one-at-a-time doesn't seem to help, so it
> doesn't seem bound by I/O contention with 3 shards (I capped the
> default shard count at 3 back in 2018 due to I/O contention).

This is with Xapian 1.4.x and the glass backend?

Before that, a commit required writing O(file size of DB) data because
the freelist for a table was stored in a bitmap with one bit per block
in the table.  This was not problematic for smaller databases, but
because we need to ensure this data is actually synced to disc and we
can only write it out just before we sync it, it gradually caused more
I/O contention as the DB grew in size.

Glass instead stores the freelist in blocks which are themselves on the
freelist.  The table data still needs to be synced, though that gets
written out over a period of time so has more chance to get written to
disk before we sync it.  I'd guess that's what you're seeing.

> Thus I'm considering an option to split the shards into epochs during
> the indexing phase, leaving the original set (0, 1, 2) untouched above
> a certain interval (say >100K) until the end of indexing.
>
> During indexing, there'd be a (0.1, 1.1, 2.1) set of shards for
> 100K..199999, a (0.2, 1.2, 2.2) set for 200K..299999, and so forth.
> To finalize indexing, `xapian-compact --no-renumber' would combine all
> the 0.* into 0, 1.* into 1, and 2.* into 2 to maintain compatibility
> with existing readers.
>
> One downside of this approach is needing much more temporary space, so
> it can't be the default; but I'm hoping the extra work required by
> compact would offset the high commit times for giant shards when
> adding a lot of messages to the index.
>
> Small incremental indexing jobs would continue to write directly to
> (0, 1, 2); only large jobs would use the epochs.
>
> Does that sound reasonable?  Thanks.

Yes, that's roughly what I'd do if I wanted to maximise the indexing
rate for an initial build of the DB.

You could try picking the size of each x.y to be indexable as a single
commit so all the merging happens via xapian-compact.

Cheers,
    Olly