thr3ads.net - search: "mean_word

Displaying 3 results from an estimated 3 matches for "mean_word_length".

2023 Mar 26

manual flushing thresholds for deletes?

> ...ngth * (1 + 3) * mean_term_length
>
> where 1x is for the mean term length itself,
> and 3x for the position overhead

If I follow, you want an approximation to the number of raw bytes in the text to match the non-delete case, so I think you want something like:

get_doclength() / 2 * (mean_word_length + 1)

The /2 is assuming you're indexing both stemmed and unstemmed terms, since with the default indexing strategy one word in the document generates one of each.

The +1 is for the spaces between words in the text. This is likely to underestimate due to punctuation and runs of whitespace, So...
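The estimate suggested in this snippet can be sketched as a small helper. This is only an illustration of the arithmetic: the function name, its parameters, and the sample value of 5 for `mean_word_length` are assumptions, and `doc_length` stands in for whatever `get_doclength()` returns for the document being deleted.

```python
def estimate_raw_bytes(doc_length, mean_word_length, stemmed_and_unstemmed=True):
    """Approximate the raw text size in bytes from a document length in terms.

    Hypothetical helper; doc_length is assumed to be the value that
    get_doclength() reports for the document.
    """
    # If both stemmed and unstemmed forms are indexed, each word in the
    # text produced two terms, so halve the term count to get words.
    words = doc_length / 2 if stemmed_and_unstemmed else doc_length
    # The +1 accounts for a single space between consecutive words.
    return words * (mean_word_length + 1)


# Example: 2000 terms at an assumed mean word length of 5 bytes.
estimate = estimate_raw_bytes(2000, 5)
```

As the snippet notes, this is likely to underestimate, because punctuation and runs of whitespace are not counted.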


2023 Mar 24

manual flushing thresholds for deletes?

Years ago, I ran into OOM problems with the default flush threshold of 10000 documents while indexing (add/replace). Realizing I had documents of hugely varying sizes (0.5KB..20MB) and little RAM, I instead tracked the number of raw bytes in the text being indexed and flushed whenever I'd seen a configurable byte count. Not the most scientific way, but it seems to work well enough on low-end
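A byte-count flush threshold of the kind described here might look like the sketch below. The class name, the `flush` callback, and the `max_bytes` parameter are all hypothetical; the snippet does not show the poster's actual code, and the callback stands in for whatever call commits pending changes in the indexing library.

```python
class ByteBudgetFlusher:
    """Trigger a flush once a configurable number of raw bytes has been seen."""

    def __init__(self, flush, max_bytes):
        self.flush = flush          # hypothetical callback that commits pending changes
        self.max_bytes = max_bytes  # configurable byte budget between flushes
        self.pending = 0            # raw bytes seen since the last flush

    def add(self, nbytes):
        """Record nbytes of indexed text; return True if this call flushed."""
        self.pending += nbytes
        if self.pending >= self.max_bytes:
            self.flush()
            self.pending = 0
            return True
        return False
```

Counting bytes rather than documents keeps the amount of buffered data roughly bounded when document sizes vary widely, which is the situation the poster describes.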


2023 Mar 27

manual flushing thresholds for deletes?

...> > where 1x is for the mean term length itself,
> > and 3x for the position overhead
>
> If I follow you want an approximation to the number of raw bytes in the
> text to match the non-delete case, so I think you want something like:
>
> get_doclength() / 2 * (mean_word_length + 1)
>
> The /2 is assuming you're indexing both stemmed and unstemmed terms
> since with the default indexing strategy one word in the document
> generates one of each.
>
> The +1 is for the spaces between words in the text. This is
> likely to underestimate due to punc...
