Displaying 3 results from an estimated 3 matches for "est_len".
2023 May 03
manual flushing thresholds for deletes?
...If so, you want to use the collection frequency
for non-boolean terms and the term frequency for boolean terms,
so that's:
xapian-delve -avv1 .|tr -d A-Z|awk '{f = $3 ? $3 : $2; t += length($1)*f; n += f} END {print t/n}'
> My Perl deletion code is something like:
>
>     my $EST_LEN = 6;
>     ...
>     for my $docid (@docids) {
>         $TXN_BYTES -= $xdb->get_doclength($docid) * $EST_LEN;
However you're using that estimate here, and the document length
doesn't include boolean terms (it's sum(wdf) over the terms in the
document), so including them in $EST_LEN se...
2023 May 03
manual flushing thresholds for deletes?
...n terms, I'd use ($3 + 1)? E.g.:
awk 'NR > 1 {t += length($1)*($3+1); n += ($3+1)} END {print t/n}'
# (also added "NR > 1" to ignore the delve header line)
Which gives me 6.00067, so rounding to 6 seems fine either way.
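
To make the arithmetic concrete, here is a minimal sketch of the same weighted mean run on a tiny made-up sample instead of real xapian-delve output. The function name weighted_mean_len and the sample rows are hypothetical; the column layout (term, term frequency, collection frequency) is assumed from how the awk above uses $1 and $3. The sample terms are already lowercase, so the "tr -d A-Z" prefix-stripping step is left out.

```shell
# Hypothetical helper: frequency-weighted mean term length.
# Assumed input columns: term, term frequency, collection frequency.
# The first line is treated as a header and skipped (NR > 1).
weighted_mean_len() {
    awk 'NR > 1 { w = $3 + 1; t += length($1) * w; n += w }
         END { if (n) print t / n }'
}

# Made-up sample, not real xapian-delve output:
#   the    -> length 3, weight 10 + 1 = 11
#   xapian -> length 6, weight  4 + 1 =  5
#   flush  -> length 5, weight  0 + 1 =  1 (boolean-style term, collfreq 0)
printf '%s\n' 'term termfreq collfreq' 'the 3 10' 'xapian 2 4' 'flush 1 0' |
    weighted_mean_len
# prints 4, i.e. (3*11 + 6*5 + 5*1) / (11 + 5 + 1) = 68 / 17
```

The "+ 1" weight means a term with collection frequency 0 still counts once rather than dropping out of the mean.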
My Perl deletion code is something like:

    my $EST_LEN = 6;
    ...
    for my $docid (@docids) {
        $TXN_BYTES -= $xdb->get_doclength($docid) * $EST_LEN;
        $xdb->delete_document($docid);
        if ($TXN_BYTES < 0) { # flush within txn
            $xdb->commit_transaction;
            $TXN_BYTES = 8000000;
            $xdb->begin_transaction;
        }
    }
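
As a rough illustration of how that byte budget behaves, the sketch below simulates only the bookkeeping, with no Xapian involved. The function name count_flushes and the document lengths are made up, and it assumes the budget starts at 8000000 (the loop above only shows the value it is reset to) with an estimate of 6 bytes per unit of document length.

```shell
# Hypothetical simulation of the byte-budget loop, no Xapian calls.
# Reads one document length per line and prints how many times the
# budget would run out, i.e. how many mid-run commits would happen.
# Assumptions: budget starts at 8000000, est_len is 6.
count_flushes() {
    awk -v est_len=6 -v budget=8000000 '
        BEGIN { left = budget }
        {
            left -= $1 * est_len            # charge the estimated bytes
            if (left < 0) { flushes++; left = budget }
        }
        END { print flushes + 0 }'
}

# Made-up document lengths:
#   1000000 -> 8000000 - 6000000 = 2000000 left
#    400000 -> 2000000 - 2400000 < 0, so one flush and the budget resets
#       100 -> 8000000 - 600 left
printf '%s\n' 1000000 400000 100 | count_flushes
# prints 1
```

With these numbers a flush happens roughly once per 1.33 million units of summed document length (8000000 / 6).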
> > (that awk bit s...
2023 Mar 27
manual flushing thresholds for deletes?
On Mon, Mar 27, 2023 at 11:22:09AM +0000, Eric Wong wrote:
> Olly Betts <olly at survex.com> wrote:
> > 10 seems too long. You want the mean word length weighted by frequency
> > of occurrence. For English that's typically around 5 characters, which
> > is 5 bytes. If we go for +1 that's:
>
> Actually, 10 may be too short in my case since there's a