Displaying 3 results from an estimated 3 matches for "est_len".
2023 May 03
manual flushing thresholds for deletes?
...If so, you want to use the collection frequency
for non-boolean terms and the term frequency for boolean terms,
so that's:
xapian-delve -avv1 .|tr -d A-Z|awk '{f = $3 ? $3 : $2; t += length($1)*f; n += f} END {print t/n}'
> My Perl deletion code is something like:
> 
> 	my $EST_LEN = 6;
> 	...
> 	for my $docid (@docids) {
> 		$TXN_BYTES -= $xdb->get_doclength($docid) * $EST_LEN;
However you're using that estimate here, and the document length
doesn't include boolean terms (it's sum(wdf) over the terms in the
document), so including them in $EST_LEN se...
2023 May 03
manual flushing thresholds for deletes?
...n terms, I'd use ($3 + 1)? e.g.:
	awk 'NR > 1 {t += length($1)*($3+1); n += ($3+1)} END {print t/n}'
	# (also added "NR > 1" to ignore the delve header line)
Which gives me 6.00067, so rounding to 6 seems fine either way.
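To make the weighting concrete, here is a minimal sketch that runs the same kind of awk over a few made-up lines instead of a real database. The sample terms and counts are invented for illustration, and the three-column layout (term, term frequency, collection frequency) is assumed from how the awk above reads $1, $2 and $3. It combines the "NR > 1" header skip with the fallback to term frequency when the collection frequency is 0:

```shell
#!/bin/sh
# Made-up sample in the assumed three-column layout:
# term, term frequency, collection frequency.
# The first line stands in for the header line that delve prints.
sample='header line (skipped)
hello 3 10
to 4 9
Qabc123 1 0'

# Strip upper-case prefix letters, then weight each term's length by its
# collection frequency, falling back to the term frequency when the
# collection frequency is 0 (as it is for the boolean term here).
mean=$(printf '%s\n' "$sample" | tr -d A-Z |
	awk 'NR > 1 {f = $3 ? $3 : $2; t += length($1)*f; n += f} END {print t/n}')

echo "$mean"
```

With these numbers the weighted total is 5*10 + 2*9 + 6*1 = 74 over 20 occurrences, so it prints 3.7.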
My Perl deletion code is something like:
	my $EST_LEN = 6;
	...
	for my $docid (@docids) {
		$TXN_BYTES -= $xdb->get_doclength($docid) * $EST_LEN;
		$xdb->delete_document($docid);
		if ($TXN_BYTES < 0) { # flush within txn
			$xdb->commit_transaction;
			$TXN_BYTES = 8000000;
			$xdb->begin_transaction;
		}
	}
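As a rough worked example of how that budget behaves, the following sketch replays the same arithmetic in awk without touching a database. The document lengths are made up, and the variable names est_len and budget are just stand-ins for $EST_LEN and the 8000000 reset value in the Perl above:

```shell
#!/bin/sh
# Seven hypothetical documents, each with a document length of 500000.
# Every delete charges doclength * est_len bytes against the budget;
# going below zero counts as a flush and resets the budget.
flushes=$(printf '%s\n' 500000 500000 500000 500000 500000 500000 500000 |
	awk -v est_len=6 -v budget=8000000 '
	BEGIN {left = budget}
	{
		left -= $1 * est_len
		if (left < 0) {	# same test as the Perl loop
			flushes++
			left = budget
		}
	}
	END {print flushes + 0}')

echo "$flushes"
```

Each delete costs 3000000 estimated bytes here, so the budget goes negative on every third document and the run prints 2.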
> > (that awk bit s...
2023 Mar 27
manual flushing thresholds for deletes?
On Mon, Mar 27, 2023 at 11:22:09AM +0000, Eric Wong wrote:
> Olly Betts <olly at survex.com> wrote:
> > 10 seems too long.  You want the mean word length weighted by frequency
> > of occurrence.  For English that's typically around 5 characters, which
> > is 5 bytes.  If we go for +1 that's:
> 
> Actually, 10 may be too short in my case since there's a