search for: est_len

Displaying 3 results from an estimated 3 matches for "est_len".

2023 May 03
manual flushing thresholds for deletes?
...If so, you want to use the collection frequency for non-boolean terms
and the term frequency for boolean terms, so that's:

    xapian-delve -avv1 .|tr -d A-Z|awk '{f = $3 ? $3 : $2; t += length($1)*f; n += f} END {print t/n}'

> My Perl deletion code is something like:
>
>   my $EST_LEN = 6;
>   ...
>   for my $docid (@docids) {
>       $TXN_BYTES -= $xdb->get_doclength($docid) * $EST_LEN;

However you're using that estimate here, and the document length doesn't
include boolean terms (it's sum(wdf) over the terms in the document), so
including them in $EST_LEN se...
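The frequency weighting in the awk one-liner above can be tried on a small made-up input. This is only a sketch: the three sample lines are hypothetical, and they assume the column layout the one-liner implies (term, term frequency, collection frequency, with a collection frequency of 0 standing in for a boolean term). The function name `weighted_mean_len` is invented for illustration.

```sh
# Frequency-weighted mean term length, same awk body as in the snippet.
# Input columns (assumed): term, term frequency, collection frequency.
weighted_mean_len() {
    awk '{f = $3 ? $3 : $2; t += length($1)*f; n += f} END {print t/n}'
}

# Hypothetical sample: "abc" has collection frequency 0, so its term
# frequency (3) is used as the weight instead.
printf '%s\n' 'hello 2 4' 'hi 1 2' 'abc 3 0' | weighted_mean_len
# (5*4 + 2*2 + 3*3) / (4 + 2 + 3) = 33/9, prints 3.66667
```

Rare long terms barely move the result, because each term contributes in proportion to how often it occurs.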
2023 May 03
manual flushing thresholds for deletes?
...n terms, I'd use ($3 + 1)? e.g.:

    awk 'NR > 1 {t += length($1)*($3+1); n += ($3+1)} END {print t/n}'
    # (also added "NR > 1" to ignore the delve header line)

Which gives me 6.00067, so rounding to 6 seems fine either way.

My Perl deletion code is something like:

    my $EST_LEN = 6;
    ...
    for my $docid (@docids) {
        $TXN_BYTES -= $xdb->get_doclength($docid) * $EST_LEN;
        $xdb->delete_document($docid);
        if ($TXN_BYTES < 0) { # flush within txn
            $xdb->commit_transaction;
            $TXN_BYTES = 8000000;
            $xdb->begin_transaction;
        }
    }

> > (that awk bit s...
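To see how often a byte budget like the one above would trigger a flush, the arithmetic can be simulated without a database. This is a rough sketch, not the poster's code: the document lengths are made up, `count_flushes` is a hypothetical helper, and it assumes the budget starts full (the snippet elides the initial value of $TXN_BYTES).

```sh
# Simulate the delete loop's byte budget.
# stdin: one (hypothetical) document length per line.
# $1: estimated bytes per term occurrence, $2: byte budget between flushes.
count_flushes() {
    awk -v est_len="$1" -v budget="$2" '
        BEGIN { left = budget }      # assumption: budget starts full
        {
            left -= $1 * est_len     # estimated bytes touched by this delete
            if (left < 0) {          # budget spent: commit and reset
                flushes++
                left = budget
            }
        }
        END { print flushes + 0 }'
}

# Four documents of length 500000 at 6 bytes each cost 3000000 apiece,
# so the 8000000 budget goes negative on the third delete.
printf '%s\n' 500000 500000 500000 500000 | count_flushes 6 8000000
# prints 1
```

A larger per-occurrence estimate makes the budget run out sooner and so flushes more often; a smaller one flushes less often.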
2023 Mar 27
manual flushing thresholds for deletes?
On Mon, Mar 27, 2023 at 11:22:09AM +0000, Eric Wong wrote:
> Olly Betts <olly at survex.com> wrote:
> > 10 seems too long. You want the mean word length weighted by frequency
> > of occurrence. For English that's typically around 5 characters, which
> > is 5 bytes. If we go for +1 that's:
>
> Actually, 10 may be too short in my case since there's a