Displaying 3 results from an estimated 3 matches for "mean_word_length".
2023 Mar 26
manual flushing thresholds for deletes?
...ngth * (1 + 3) * mean_term_length
>
> where 1x is for the mean term length itself,
> and 3x for the position overhead
If I follow you want an approximation to the number of raw bytes in the
text to match the non-delete case, so I think you want something like:
get_doclength() / 2 * (mean_word_length + 1)
The /2 is assuming you're indexing both stemmed and unstemmed terms
since with the default indexing strategy one word in the document
generates one of each.
The +1 is for the spaces between words in the text. This is
likely to underestimate due to punctuation and runs of whitespace,
So...
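
To make the arithmetic in that estimate concrete, here is a minimal sketch in Python. It is not part of Xapian's API: the function name and the `stemmed_and_unstemmed` flag are made up for illustration, `doclength` stands in for whatever get_doclength() returns for the document, and `mean_word_length` is a figure you would have to measure or guess for your own corpus.

```python
def estimate_raw_bytes(doclength, mean_word_length, stemmed_and_unstemmed=True):
    """Approximate the raw text size (in bytes) behind a document length.

    doclength: the document length as reported by the index (assumed to
        count both the stemmed and the unstemmed term for each word when
        stemmed_and_unstemmed is True).
    mean_word_length: assumed average word length in bytes (hypothetical
        input, not something the index reports).
    """
    # Halve the length when each word produced two terms.
    words = doclength / 2 if stemmed_and_unstemmed else doclength
    # The +1 accounts for one separating space per word.
    return words * (mean_word_length + 1)


if __name__ == "__main__":
    # 200 terms -> 100 words, each 5 bytes plus one space.
    print(estimate_raw_bytes(200, 5.0))
```

As the post notes, this will tend to underestimate, since punctuation and runs of whitespace are not counted.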
2023 Mar 24
manual flushing thresholds for deletes?
Years ago, I ran into OOM problems with the default flush
threshold of 10000 documents while indexing (add/replace).
Realizing I had documents of hugely varying sizes (0.5KB..20MB)
and little RAM, I instead tracked the number of raw bytes in the
text being indexed and flushed whenever I'd seen a configurable
byte count. Not the most scientific way, but it seems to work
well enough on low-end
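
The byte-count approach described in that post might look roughly like the following Python sketch. This is an illustration under assumptions, not the poster's actual code: the class name, the `note()` method, and the `flush` callback are all hypothetical. With a Xapian WritableDatabase the callback would presumably be its commit() method, but the sketch deliberately takes any callable so it runs without Xapian installed.

```python
class ByteThresholdFlusher:
    """Trigger a flush once a configurable number of raw bytes has been seen."""

    def __init__(self, flush, threshold_bytes):
        self.flush = flush                  # callable that commits pending changes
        self.threshold_bytes = threshold_bytes
        self.pending_bytes = 0              # raw bytes seen since the last flush

    def note(self, nbytes):
        """Record nbytes of indexed text; flush if the threshold is reached.

        Returns True if a flush was triggered by this call.
        """
        self.pending_bytes += nbytes
        if self.pending_bytes >= self.threshold_bytes:
            self.flush()
            self.pending_bytes = 0
            return True
        return False


if __name__ == "__main__":
    flushes = []
    flusher = ByteThresholdFlusher(lambda: flushes.append("flushed"), 1000)
    for text in ["a" * 400, "b" * 400, "c" * 400]:
        # In real use the document would be added/replaced here.
        flusher.note(len(text.encode("utf-8")))
    print(len(flushes))
```

Tracking bytes rather than document count means a single 20MB document triggers a flush on its own, while thousands of 0.5KB documents can share one.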
2023 Mar 27
manual flushing thresholds for deletes?
...>
> > where 1x is for the mean term length itself,
> > and 3x for the position overhead
>
> If I follow you want an approximation to the number of raw bytes in the
> text to match the non-delete case, so I think you want something like:
>
> get_doclength() / 2 * (mean_word_length + 1)
>
> The /2 is assuming you're indexing both stemmed and unstemmed terms
> since with the default indexing strategy one word in the document
> generates one of each.
>
> The +1 is for the spaces between words in the text. This is
> likely to underestimate due to punc...