search for: mean_word_length

Displaying 3 results from an estimated 3 matches for "mean_word_length".

2023 Mar 26
manual flushing thresholds for deletes?
...ngth * (1 + 3) * mean_term_length
>
> where 1x is for the mean term length itself,
> and 3x for the position overhead

If I follow, you want an approximation to the number of raw bytes in the text to match the non-delete case, so I think you want something like:

get_doclength() / 2 * (mean_word_length + 1)

The /2 assumes you're indexing both stemmed and unstemmed terms, since with the default indexing strategy one word in the document generates one of each. The +1 is for the spaces between words in the text. This is likely to underestimate due to punctuation and runs of whitespace, So...
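As a rough illustration of the estimate quoted in this result, here is a minimal Python sketch. The function name `estimate_raw_bytes`, its parameters, and the sample mean word length of 5 are assumptions made for the example. They are not part of any library API, and `doclength` simply stands for whatever `get_doclength()` reports.

```python
def estimate_raw_bytes(doclength: int, mean_word_length: float,
                       stemmed_and_unstemmed: bool = True) -> float:
    """Approximate the raw text size of a document from its length in terms.

    Hypothetical helper: mirrors get_doclength() / 2 * (mean_word_length + 1).
    """
    # With both stemmed and unstemmed terms indexed, each word in the text
    # contributes two terms, so halve the term count to get a word count.
    words = doclength / 2 if stemmed_and_unstemmed else doclength
    # The +1 stands for one separating space per word.
    return words * (mean_word_length + 1)


# Example: 200 terms, assumed mean word length of 5 bytes.
print(estimate_raw_bytes(200, 5))
```

As the snippet notes, this probably underestimates the real size, because punctuation and runs of whitespace are not counted.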
2023 Mar 24
manual flushing thresholds for deletes?
Years ago, I ran into OOM problems with the default flush threshold of 10000 documents while indexing (add/replace). Realizing I had documents of hugely varying sizes (0.5KB..20MB) and little RAM, I instead tracked the number of raw bytes in the text being indexed and flushed whenever I'd seen a configurable byte count. Not the most scientific way, but it seems to work well enough on low-end
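The byte-count approach described in this result might look something like the sketch below. The class name `ByteBudgetFlusher`, the `flush` callback, and the `max_bytes` parameter are all hypothetical. The poster's actual code is not shown in the snippet, so this is only one way the idea could be arranged.

```python
from typing import Callable


class ByteBudgetFlusher:
    """Call a flush callback once enough raw text bytes have been seen.

    Hypothetical helper: `flush` is any zero-argument callable, for example
    a bound method that commits pending changes to the index.
    """

    def __init__(self, flush: Callable[[], None], max_bytes: int) -> None:
        self._flush = flush
        self._max_bytes = max_bytes
        self._pending = 0  # raw bytes seen since the last flush

    def add(self, nbytes: int) -> bool:
        """Record nbytes of indexed text; return True if a flush happened."""
        self._pending += nbytes
        if self._pending >= self._max_bytes:
            self._flush()
            self._pending = 0
            return True
        return False


# Example with a stand-in callback that just records each flush.
flushes = []
budget = ByteBudgetFlusher(lambda: flushes.append("flushed"), max_bytes=100)
for size in (40, 70, 10):
    budget.add(size)
print(len(flushes))
```

Counting bytes rather than documents keeps memory use roughly bounded when document sizes vary as widely as the 0.5KB..20MB range mentioned.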
2023 Mar 27
manual flushing thresholds for deletes?
...>
> > where 1x is for the mean term length itself,
> > and 3x for the position overhead
>
> If I follow you want an approximation to the number of raw bytes in the
> text to match the non-delete case, so I think you want something like:
>
> get_doclength() / 2 * (mean_word_length + 1)
>
> The /2 is assuming you're indexing both stemmed and unstemmed terms
> since with the default indexing strategy one word in the document
> generates one of each.
>
> The +1 is for the spaces between words in the text. This is
> likely to underestimate due to punc...