Olly Betts writes:
> On Wed, Mar 12, 2025 at 10:01:50PM +0100, Jean-Francois Dockes wrote:
> >
> > Thanks for the fast answer! I've certainly no plan to store such big
> > objects in Xapian. It just means that there is a missing sanity check
> > somewhere.
> >
> > The user succeeded in pinpointing the problem to a 900 MBytes mbox file.
> >
> > A possible reason would be that a really bad mbox would be misparsed,
> > producing e.g. an enormous Subject: or From: field which would get
> > stored as an attribute in the data record. I see that I have no size
> > checks on this at the moment. I'll investigate in this direction.
> >
> > Can this come from anything other than the data record ?
>
> Probably - the document data is the simplest to reason about (because
> it gets compressed with zlib and we have a reasonable idea how well
> zlib will compress typical data).
>
> Postlists are chunked at a higher level to support efficient
> skipping forwards so postlist table entries shouldn't be more than about
> 2000 bytes, but I'd think it's probably possible for at least some other
> tables.
>
> Some other tables might be possible - for example, if you indexed a
> document by enough distinct terms you'd probably end up with a termlist
> entry that's too big to store, but the encoding used tends to become
> more compact the more terms there are so it's hard to say at what point
> this would happen without testing.

Thanks, this is very helpful. I had been able to eliminate the two obvious
candidates, the data record and the stored document text (as metadata), so
I was wondering whether something more mysterious was going on.

The file was some kind of mail archive, not in mbox format. It was detected
as a single message (like a Maildir file), which resulted in a few headers
and a 900 MB body. I had a safeguard test on mbox member size, but none on
email body size...
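
A guard of this kind could look roughly like the sketch below. It is not
Recoll's actual code: the names clamp_utf8 and check_message, and both
limits, are hypothetical and chosen only for illustration.

```python
# Hypothetical limits, chosen for illustration only.
MAX_HEADER_BYTES = 1024               # per header kept in the data record
MAX_BODY_BYTES = 100 * 1024 * 1024    # bodies above this are not indexed


def clamp_utf8(text: str, max_bytes: int) -> str:
    """Truncate text so that its UTF-8 encoding fits in max_bytes."""
    raw = text.encode("utf-8")
    if len(raw) <= max_bytes:
        return text
    # errors="ignore" drops a multi-byte character cut in half at the end.
    return raw[:max_bytes].decode("utf-8", errors="ignore")


def check_message(headers: dict, body_size: int):
    """Return (clamped_headers, index_body) for one parsed message."""
    clamped = {name: clamp_utf8(value, MAX_HEADER_BYTES)
               for name, value in headers.items()}
    return clamped, body_size <= MAX_BODY_BYTES
```

The check is applied per message rather than per mbox member, so a file
misdetected as a single huge message is caught as well.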

I could not reproduce the exact error with a similar document whose body
was made of concatenated bibles: this caused an error on set_metadata() for
the text, instead of on replace_document(). I guess that the vocabulary was
too small.
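
The vocabulary guess can be illustrated with a small sketch: repeating the
same text makes the document bigger in bytes but adds no distinct terms.
The function distinct_terms is hypothetical, and its crude regex tokenizer
is only a stand-in, not what Xapian's term generator does.

```python
import re


def distinct_terms(text: str) -> int:
    """Count distinct lowercase word tokens (crude stand-in tokenizer)."""
    return len(set(re.findall(r"\w+", text.lower())))


sample = "In the beginning God created the heaven and the earth. "
repeated = sample * 1000   # 1000 times the bytes, same vocabulary

same_vocabulary = distinct_terms(sample) == distinct_terms(repeated)
```

A real mail archive with many senders, addresses and attachments would be
expected to have far more distinct terms for the same size.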

Nicely enough, nothing crashed, and I now know that Xapian is somehow
limited in its ability to index gigabyte-sized single documents :) Which
I'll try to avoid in the future as it is not that useful...

Regards,
jf