On Fri, Mar 24, 2023 at 10:37:41AM +0000, Eric Wong wrote:
> Realizing I had documents of hugely varying sizes (0.5KB..20MB)
> and little RAM, I instead tracked the number of raw bytes in the
> text being indexed and flushed whenever I'd seen a configurable
> byte count. Not the most scientific way, but it seems to work
> well enough on low-end systems.
>
> Now, I'm dealing with many deletes at once and hitting OOM
> again. Since the raw text is no longer available and I didn't
> store its original size anywhere, would calculating something
> based on get_doclength be a reasonable approximation?
>
> I'm wondering if something like:
>
> get_doclength * (1 + 3) * mean_term_length
>
> where 1x is for the mean term length itself,
> and 3x for the position overhead
If I follow, you want an approximation to the number of raw bytes in
the text to match the non-delete case, so I think you want something
like:

    get_doclength() / 2 * (mean_word_length + 1)
The /2 is assuming you're indexing both stemmed and unstemmed terms
since with the default indexing strategy one word in the document
generates one of each.
The +1 is for the spaces between words in the text. This is
likely to underestimate due to punctuation and runs of whitespace,
so perhaps +1.<something> is better (and it's probably better to
overestimate slightly and flush a little more often than to risk OOM).
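As a rough Perl sketch (untested; the sub name and the $extra knob
are just for illustration, and get_doclength() is the only Xapian
call here):

    # Estimate the raw text size in bytes from the doclength alone.
    # Assumes one stemmed + one unstemmed term per word (the /2) and
    # a frequency-weighted mean word length supplied by the caller.
    sub estimate_raw_bytes {
        my ($db, $docid, $mean_word_length, $extra) = @_;
        return $db->get_doclength($docid) / 2
             * ($mean_word_length + $extra);
    }
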
> And perhaps assume mean_term_length is 10 bytes, so maybe:
>
> get_doclength * 40
10 seems too long. You want the mean word length weighted by frequency
of occurrence. For English that's typically around 5 characters, which
is 5 bytes. If we go for +1, that's (5 + 1) / 2 = 3 per term, i.e.:

    get_doclength() * 3

With +2 the factor is (5 + 2) / 2 = 3.5, so somewhere between 3 and
3.5 is probably about right.
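Putting that together with the deletes, an untested sketch (the
database path, threshold and @docids_to_delete are placeholders;
flush() is what Search::Xapian calls commit()):

    use strict;
    use warnings;
    use Search::Xapian;

    my $db = Search::Xapian::WritableDatabase->new(
        '/path/to/db', Search::Xapian::DB_OPEN);

    my @docids_to_delete = @ARGV;      # however you collect them
    my $threshold = 64 * 1024 * 1024;  # flush per ~64MB estimated text
    my $pending = 0;

    for my $docid (@docids_to_delete) {
        # Read the doclength before deleting, while it still exists.
        $pending += $db->get_doclength($docid) * 3;
        $db->delete_document($docid);
        if ($pending >= $threshold) {
            $db->flush;
            $pending = 0;
        }
    }
    $db->flush;
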
> I'm using Search::Xapian XS since it's in Debian stable;
> and don't think there's a standard way to show the amount
> of memory uncommitted changes are taking up.
We don't have an easy way to calculate this in any version. It would
need us to take more control of the allocation of the memory used to
store these changes. We probably need to do that, as the threshold
here really should be in terms of the memory used to store the pending
changes, but it's not a trivial change.
Cheers,
Olly