On Fri, Mar 24, 2023 at 10:37:41AM +0000, Eric Wong wrote:
> Realizing I had documents of hugely varying sizes (0.5KB..20MB)
> and little RAM, I instead tracked the number of raw bytes in the
> text being indexed and flushed whenever I'd seen a configurable
> byte count. Not the most scientific way, but it seems to work
> well enough on low-end systems.
>
> Now, I'm dealing with many deletes at once and hitting OOM
> again. Since the raw text is no longer available and I didn't
> store its original size anywhere, would calculating something
> based on get_doclength be a reasonable approximation?
>
> I'm wondering if something like:
>
> get_doclength * (1 + 3) * mean_term_length
>
> where 1x is for the mean term length itself,
> and 3x for the position overhead
If I follow, you want an approximation to the number of raw bytes in
the text to match the non-delete case, so I think you want something
like:

    get_doclength() / 2 * (mean_word_length + 1)
The /2 is assuming you're indexing both stemmed and unstemmed terms
since with the default indexing strategy one word in the document
generates one of each.
The +1 is for the spaces between words in the text. This is
likely to underestimate due to punctuation and runs of whitespace,
so perhaps +1.<something> is better (and it's probably better to
overestimate slightly and flush a little more often than to risk OOM).
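As a rough Perl sketch (untested; the sub name and the $extra knob
are just for illustration, and get_doclength() is the only Xapian
call here):

    # Estimate the raw text size in bytes from the doclength alone.
    # Assumes one stemmed + one unstemmed term per word (the /2) and
    # a frequency-weighted mean word length supplied by the caller.
    sub estimate_raw_bytes {
        my ($db, $docid, $mean_word_length, $extra) = @_;
        return $db->get_doclength($docid) / 2
             * ($mean_word_length + $extra);
    }
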
> And perhaps assume mean_term_length is 10 bytes, so maybe:
>
> get_doclength * 40
10 seems too long. You want the mean word length weighted by frequency
of occurrence. For English that's typically around 5 characters, which
is 5 bytes. If we go for +1, that's (5 + 1) / 2 = 3 per term, i.e.:

    get_doclength() * 3

With +2 the factor is (5 + 2) / 2 = 3.5, so somewhere between 3 and
3.5 is probably about right.
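Putting that together with the deletes, an untested sketch (the
database path, threshold and @docids_to_delete are placeholders;
flush() is what Search::Xapian calls commit()):

    use strict;
    use warnings;
    use Search::Xapian;

    my $db = Search::Xapian::WritableDatabase->new(
        '/path/to/db', Search::Xapian::DB_OPEN);

    my @docids_to_delete = @ARGV;      # however you collect them
    my $threshold = 64 * 1024 * 1024;  # flush per ~64MB estimated text
    my $pending = 0;

    for my $docid (@docids_to_delete) {
        # Read the doclength before deleting, while it still exists.
        $pending += $db->get_doclength($docid) * 3;
        $db->delete_document($docid);
        if ($pending >= $threshold) {
            $db->flush;
            $pending = 0;
        }
    }
    $db->flush;
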
> I'm using Search::Xapian XS since it's in Debian stable;
> and don't think there's a standard way to show the amount
> of memory uncommitted changes are taking up.
We don't have an easy way to calculate this in any version. It would
need us to take more control of the allocation of the memory used to
store these changes. We probably need to do that, as the threshold
here really should be in terms of the memory used to store the pending
changes, but it's not a trivial change.
Cheers,
Olly