Olly Betts <olly at survex.com> wrote:
> On Mon, Mar 27, 2023 at 11:22:09AM +0000, Eric Wong wrote:
> > Olly Betts <olly at survex.com> wrote:
> > > 10 seems too long.  You want the mean word length weighted by
> > > frequency of occurrence.  For English that's typically around 5
> > > characters, which is 5 bytes.  If we go for +1 that's:
> >
> > Actually, 10 may be too short in my case since there's a lot of
> > 40-byte SHA-1 hex (and likely SHA-256 in the future) from git; and
> > also long function names, etc...
> >
> > Without capitalized prefixes, I get a mean length from delve as
> > 15.3571:
> >
> >   xapian-delve -a -1 . | tr -d A-Z | \
> >     awk '{ d = length - mean; mean += d/NR } END { print mean }'
>
> That's not weighted by frequency though, and short words tend to be
> more frequent, so you're likely skewing the answer.  Also it'll
> include boolean terms which didn't come from the document text.
Ah, OK.
> You can take frequency into account with something like this:
>
>   xapian-delve -avv1 . | tr -d A-Z | \
>     awk '{t += length($1)*$3; n += $3} END {print t/n}'
>
> This will also effectively ignore boolean terms, assuming you're giving
> them wdf of 0 (because $3 here is the collection frequency, which is
> sum(wdf(term)) over all documents).
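For illustration, the weighting can be checked with made-up
(term, collection-frequency) triples in the same format delve
emits (term, wdf-ignored column, collection frequency) -- the
sample terms and counts below are invented, not from a real
database:

```shell
# frequency-weighted mean term length, mirroring the awk one-liner;
# real input would come from `xapian-delve -avv1 .`
printf '%s\n' \
  'the 0 100' \
  'commit 0 10' \
  'deadbeefdeadbeefdeadbeefdeadbeefdeadbeef 0 1' |
awk '{t += length($1)*$3; n += $3} END {printf "%.4f\n", t/n}'
```

The one rare 40-char SHA-1 barely moves the mean because the
short, frequent term dominates the weighting.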
Should boolean terms be ignored when estimating flushing
thresholds? They do have a wdf of 0 in my case. I'm indexing
git commit SHA-1 hex (and soon SHA-256), so that's a lot of
40-64 char terms. Every commit has the commit OID itself, and
the parent OID(s); commit OIDs are unique to each document,
but parents are not always unique, though many are...
Surely boolean terms are not free when accounting for memory use
on deletes, right?  I account for them when indexing, since
I extract boolean terms from raw text and rely on the length of
the raw text (including whitespace) to account for flushing.
Anyways, your above awk snippet gave me 5.82497.  Though, if I
wanted to account for boolean terms, I'd use ($3 + 1), e.g.:
awk 'NR > 1 {t += length($1)*($3+1); n += ($3+1)} END {print t/n}'
# (also added "NR > 1" to ignore the delve header line)
Which gives me 6.00067, so rounding to 6 seems fine either way.
My Perl deletion code is something like:

    my $EST_LEN = 6; # mean bytes-per-term estimate from above
    ...
    for my $docid (@docids) {
        # estimate the memory cost of the delete from the doc length:
        $TXN_BYTES -= $xdb->get_doclength($docid) * $EST_LEN;
        $xdb->delete_document($docid);
        if ($TXN_BYTES < 0) { # flush within txn
            $xdb->commit_transaction;
            $TXN_BYTES = 8000000;
            $xdb->begin_transaction;
        }
    }
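The countdown logic above can be simulated without Xapian; the
doclength values here are hypothetical, while the $EST_LEN of 6
and the 8MB threshold match the snippet:

```shell
# simulate the flush countdown from the Perl deletion loop
printf '%s\n' 200000 900000 500000 |
awk -v est=6 -v limit=8000000 '
BEGIN { bytes = limit }
{
    bytes -= $1 * est   # estimated memory cost of deleting this doc
    if (bytes < 0) {    # would commit + restart the transaction here
        flushes++
        bytes = limit
    }
}
END { print "flushes: " flushes }'
```

With those three hypothetical doclengths, only the third delete
pushes the estimate past the threshold, so a single flush happens
mid-transaction.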
> > (that awk bit should be overflow-free)
<snip>
> Or use a language which supports arbitrary precision
> numbers.
Actually, I just used gawk instead of mawk for GMP support :>