Thanks for the fast answer! I certainly have no plan to store such big
objects in Xapian. It just means that there is a missing sanity check
somewhere.
The user managed to pin the problem down to a 900 MByte mbox file.
A possible cause is that a really bad mbox gets misparsed, producing
e.g. an enormous Subject: or From: field, which would then be stored as
an attribute in the data record. I see that I have no size checks on this
at the moment. I'll investigate in this direction.
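
To make the missing check concrete, here is a minimal sketch of the kind of size cap I have in mind, written in Python for brevity. Everything in it is an assumption for illustration: the names clamp_field and build_data_record, the 8 KB cap and the "name=value" record layout are hypothetical, not the actual Recoll code.

```python
# Hypothetical sketch: cap each header-derived field before it is stored
# in the document data record. Names and the 8 KB limit are assumptions.
MAX_FIELD_BYTES = 8 * 1024


def clamp_field(value: str, max_bytes: int = MAX_FIELD_BYTES) -> str:
    """Return value truncated so its UTF-8 encoding fits in max_bytes."""
    raw = value.encode("utf-8")
    if len(raw) <= max_bytes:
        return value
    # errors="ignore" drops a multi-byte character cut in half by the slice.
    return raw[:max_bytes].decode("utf-8", errors="ignore")


def build_data_record(fields: dict) -> str:
    """Serialise fields as 'name=value' lines, clamping every value first."""
    return "".join(f"{name}={clamp_field(value)}\n"
                   for name, value in fields.items())


if __name__ == "__main__":
    record = build_data_record({"subject": "x" * 1_000_000, "author": "jf"})
    print(len(record.encode("utf-8")))
```

With a cap like this, a misparsed mbox could still produce garbage fields, but no single field could push the data record anywhere near the limit discussed below.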
Can this come from anything other than the data record?
I'll post the result of the inquiry if we get a good explanation...
Cheers,
jf
Olly Betts writes:
 > On Wed, Mar 12, 2025 at 11:47:29AM +0100, Jean-Francois Dockes wrote:
 > > I am getting a "Can't handle insanely large tags" exception from a
 > > replace_document() call (for a new document).
 > > 
 > > This happens on a user's very big file system, it's remote and not
 > > very easy to test.
 > > 
 > > This is quite probably a Recoll bug, but, to help with my
 > > investigation, would someone have any idea of the potential causes ?
 > 
 > In the glass backend, at the B-tree level each table can be thought of
 > as a key->value store.  Internally in the code, this "value" is called
 > "tag" (for historical reasons really, but it helps to avoid confusion
 > with document value slots so the terminology has been kept).
 > 
 > Each entry in the table is limited in size - approximately:
 > 
 >   size(key)+size(tag)+per_entry_overhead <= (block_size-per_block_overhead)/4
 > 
 > That works out at a maximum tag size of a bit under 2K for the default
 > 8K block size - longer tags are supported but get split over multiple
 > entries.  There's a counter for these which is 2 bytes, so that limits
 > the total tag size to very roughly 2K * 65536 which is 128MB.  That's
 > an overestimate as it ignores the overheads and the key size - if the
 > key is long this limit will be a bit lower (from a quick rough
 > calculation you should be able to store 109MB with 8K blocks).
 > 
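
The arithmetic above can be reproduced with a short Python sketch. It only encodes the approximate formula from the message and the 65536 pieces implied by the 2-byte counter. The overhead parameters default to zero because the real values are not given here, so the result is the same overestimate (128MB for 8K blocks), not the more careful 109MB figure. The function names are mine.

```python
MIB = 1024 * 1024


def max_entry_size(block_size: int, per_block_overhead: int = 0) -> int:
    """Approximate upper bound on key + tag + per-entry overhead, in bytes."""
    return (block_size - per_block_overhead) // 4


def max_tag_size(block_size: int, key_size: int = 0,
                 per_entry_overhead: int = 0, per_block_overhead: int = 0,
                 max_pieces: int = 65536) -> int:
    """Rough bound on the total tag size when split over max_pieces entries."""
    piece = (max_entry_size(block_size, per_block_overhead)
             - key_size - per_entry_overhead)
    return max(piece, 0) * max_pieces


if __name__ == "__main__":
    for block_size in (8192, 16384, 65536):
        print(block_size, max_tag_size(block_size) / MIB, "MiB")
```

Going the other way, 109MB spread over 65536 pieces is 1744 bytes per piece, so the key and overheads would account for roughly 300 of the 2048 bytes available per entry with 8K blocks.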
 > The entries in some tables are deflate-compressed - for those tables
 > these limits are on the compressed data size.
 > 
 > It seems most likely this is triggered by storing a very large document
 > data but it would need to be over 109MB after compression.  It's
 > probably theoretically possible to hit for other tables but I'd be much
 > more surprised.  It's really a Xapian size limit, but if it is the
 > document data and you aren't intending to store something that large it
 > could be a Recoll bug too.
 > 
 > There is a simple workaround which is to increase the block size.  That
 > needs to be done when you create the database, or you can convert an
 > existing database to a different block size with xapian-compact (also
 > available via the API).
 > 
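
As a sketch of the workaround just described, the Python helper below builds an xapian-compact command line that rewrites a database with a different block size. The paths are placeholders and the helper names are mine. It assumes the --blocksize option of xapian-compact and that the block size must be a power of two between 2K and 64K, which is worth checking against the tool's --help output for the installed version.

```python
def is_valid_block_size(block_size: int) -> bool:
    """True for a power of two between 2K and 64K (assumed glass constraint)."""
    return 2048 <= block_size <= 65536 and block_size & (block_size - 1) == 0


def compact_command(src: str, dst: str, block_size: int) -> list:
    """Build the argument list for converting src into dst with block_size."""
    if not is_valid_block_size(block_size):
        raise ValueError(f"unsupported block size: {block_size}")
    return ["xapian-compact", f"--blocksize={block_size}", src, dst]


if __name__ == "__main__":
    # Placeholder paths: adjust to the real index location.
    print(" ".join(compact_command("xapiandb", "xapiandb.64k", 65536)))
```

The returned list can be passed to subprocess.run(..., check=True). By the rough formula above, 64K blocks would raise the ceiling by about a factor of eight compared with the 8K default.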
 > Honey isn't block based and won't need to split up entries like this.
 > It doesn't yet support update though, but once it's actually finished
 > it won't have this problem.
 > 
 > I'll improve the exception message to (a) report the tag size
 > encountered and (b) suggest using a larger block size.
 > 
 > Cheers,
 >     Olly