Thanks for the fast answer! I certainly have no plan to store such big
objects in Xapian. It just means that there is a missing sanity check
somewhere.

The user managed to narrow the problem down to a 900 MB mbox file.

A possible reason is that a really bad mbox gets misparsed, producing
e.g. an enormous Subject: or From: field, which would then end up as an
attribute in the data record. I see that I have no size checks on this at
the moment. I'll investigate in this direction.
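
The kind of size check I have in mind could look roughly like the sketch
below. It is written in Python for brevity (Recoll itself is C++), and the
8 KB cap as well as the names MAX_FIELD_BYTES, clamp_field and clamp_fields
are hypothetical, chosen only for illustration; they are not existing
Recoll or Xapian identifiers.

```python
# Hypothetical cap on a single metadata field; not a Recoll or Xapian constant.
MAX_FIELD_BYTES = 8 * 1024


def clamp_field(value: str, max_bytes: int = MAX_FIELD_BYTES) -> str:
    """Return value truncated so that its UTF-8 encoding fits in max_bytes."""
    raw = value.encode("utf-8")
    if len(raw) <= max_bytes:
        return value
    # errors="ignore" drops a multi-byte character cut in half at the boundary.
    return raw[:max_bytes].decode("utf-8", errors="ignore")


def clamp_fields(fields: dict, max_bytes: int = MAX_FIELD_BYTES) -> dict:
    """Apply clamp_field to every attribute destined for the data record."""
    return {name: clamp_field(text, max_bytes) for name, text in fields.items()}
```

Applied to the parsed headers just before the data record is built, this
would keep a misparsed Subject: or From: from growing without bound,
whatever the parser produced.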
Can this come from anything other than the data record?
I'll post the result of the inquiry if we get a good explanation...
Cheers,
jf
Olly Betts writes:
> On Wed, Mar 12, 2025 at 11:47:29AM +0100, Jean-Francois Dockes wrote:
> > I am getting a "Can't handle insanely large tags" exception from a
> > replace_document() call (for a new document).
> >
> > This happens on a user's very big file system, it's remote and not
> > very easy to test.
> >
> > This is quite probably a Recoll bug, but, to help with my
> > investigation, would someone have any idea of the potential causes ?
>
> In the glass backend, at the B-tree level each table can be thought of
> as a key->value store. Internally in the code, this "value" is called
> "tag" (for historical reasons really, but it helps to avoid confusion
> with document value slots so the terminology has been kept).
>
> Each entry in the table is limited in size - approximately:
>
> size(key)+size(tag)+per_entry_overhead <= (block_size-per_block_overhead)/4
>
> That works out at a maximum tag size of a bit under 2K for the default
> 8K block size - longer tags are supported but get split over multiple
> entries. There's a counter for these which is 2 bytes, so that limits
> the total tag size to very roughly 2K * 65536 which is 128MB. That's
> an overestimate as it ignores the overheads and the key size - if the
> key is long this limit will be a bit lower (from a quick rough
> calculation you should be able to store 109MB with 8K blocks).
>
> The entries in some tables are deflate-compressed - for those tables
> these limits are on the compressed data size.
>
> It seems most likely this is triggered by storing a very large document
> data but it would need to be over 109MB after compression. It's
> probably theoretically possible to hit for other tables but I'd be much
> more surprised. It's really a Xapian size limit, but if it is the
> document data and you aren't intending to store something that large it
> could be a Recoll bug too.
>
> There is a simple workaround which is to increase the block size. That
> needs to be done when you create the database, or you can convert an
> existing database to a different block size with xapian-compact (also
> available via the API).
>
> Honey isn't block based and won't need to split up entries like this.
> It doesn't yet support update though, but once it's actually finished
> it won't have this problem.
>
> I'll improve the exception message to (a) report the tag size
> encountered and (b) suggest using a larger block size.
>
> Cheers,
> Olly
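
For reference, the "very rough" upper bound quoted above can be reproduced
with a few lines of arithmetic. This is only a sketch of the overestimate:
it ignores the key size and the per-entry and per-block overheads, whose
exact values are not given in the thread, so it does not reproduce the
109MB figure. The function name rough_max_tag_bytes is made up for the
example.

```python
def rough_max_tag_bytes(block_size: int = 8192, counter_bytes: int = 2) -> int:
    """Overestimate of the largest tag: (block_size / 4) bytes per entry
    times the number of entries a counter of counter_bytes can address."""
    per_entry = block_size // 4              # about 2K for 8K blocks
    max_entries = 2 ** (8 * counter_bytes)   # 65536 for a 2-byte counter
    return per_entry * max_entries


MIB = 1024 * 1024
print(rough_max_tag_bytes() // MIB)        # 128
print(rough_max_tag_bytes(65536) // MIB)   # 1024
```

The second line shows why a larger block size works as a workaround: the
bound scales linearly with the block size.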