Olly Betts <olly at survex.com> wrote:
> On Fri, Mar 24, 2023 at 10:37:41AM +0000, Eric Wong wrote:
> > Realizing I had documents of hugely varying sizes (0.5KB..20MB)
> > and little RAM, I instead tracked the number of raw bytes in the
> > text being indexed and flushed whenever I'd seen a configurable
> > byte count. Not the most scientific way, but it seems to work
> > well enough on low-end systems.
> >
> > Now, I'm dealing with many deletes at once and hitting OOM
> > again. Since the raw text is no longer available and I didn't
> > store its original size anywhere, would calculating something
> > based on get_doclength be a reasonable approximation?
> >
> > I'm wondering if something like:
> >
> > get_doclength * (1 + 3) * mean_term_length
> >
> > where 1x is for the mean term length itself,
> > and 3x for the position overhead
>
> If I follow you want an approximation to the number of raw bytes in the
> text to match the non-delete case, so I think you want something like:
>
> get_doclength() / 2 * (mean_word_length + 1)
>
> The /2 is assuming you're indexing both stemmed and unstemmed terms
> since with the default indexing strategy one word in the document
> generates one of each.
>
> The +1 is for the spaces between words in the text. This is
> likely to underestimate due to punctuation and runs of whitespace,
> so perhaps +1.<something> is better (and perhaps better to overestimate
> slightly and flush a little more often rather than risk OOM).
Thanks for the response.
> > And perhaps assume mean_term_length is 10 bytes, so maybe:
> >
> > get_doclength * 40
>
> 10 seems too long. You want the mean word length weighted by frequency
> of occurrence. For English that's typically around 5 characters, which
> is 5 bytes. If we go for +1 that's:
>
>     get_doclength() * 3
Actually, 10 may be too short in my case, since there's a lot of
40-byte SHA-1 hex (and likely SHA-256 in the future) from git,
as well as long function names, etc.
Without capitalized prefixes, I get a mean term length from delve of 15.3571:

  xapian-delve -a -1 . | tr -d A-Z | \
    awk '{ d = length - mean; mean += d/NR } END { print mean }'

(that awk bit should be overflow-free since it only keeps a running mean)
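If I plug that into your formula, it works out to roughly:

  get_doclength() / 2 * (15.3571 + 1) = get_doclength() * 8.18

i.e. about an 8x multiplier on the document length.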
> > I'm using Search::Xapian XS since it's in Debian stable;
> > and don't think there's a standard way to show the amount
> > of memory uncommitted changes are taking up.
>
> We don't have an easy way to calculate this in any version. It would
> need us to take more control of the allocation of the memory used to
> store these changes. We probably need to do that as the threshold
> here really should be in terms of memory used to store the pending
> changes, but it's not a trivial change.
Understood.
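For now I'll approximate it on my side.  Roughly what I have in mind
(untested sketch; the 8x multiplier comes from the arithmetic above,
and $FLUSH_BYTES, $dir and @ids_to_delete are just placeholders):

  use Search::Xapian;

  my $FLUSH_BYTES = 8 * 1024 * 1024; # configurable flush threshold
  my $db = Search::Xapian::WritableDatabase->new($dir,
                                          Search::Xapian::DB_OPEN);
  my $pending = 0; # estimated bytes of uncommitted changes
  for my $docid (@ids_to_delete) {
          # approximate original text size from the stored doclength
          $pending += 8 * $db->get_doclength($docid);
          $db->delete_document($docid);
          if ($pending >= $FLUSH_BYTES) {
                  $db->flush; # commit early to cap memory use
                  $pending = 0;
          }
  }
  $db->flush;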