search for: avv1

Displaying 4 results from an estimated 4 matches for "avv1".

2023 Mar 27
1
manual flushing thresholds for deletes?
...n }' That's not weighted by frequency though, and short words tend to be more frequent, so you're likely skewing the answer. Also it'll include boolean terms which didn't come from the document text. You can take frequency into account with something like this: xapian-delve -avv1 .|tr -d A-Z|awk '{t += length($1)*$3; n += $3} END {print t/n}' This will also effectively ignore boolean terms, assuming you're giving them wdf of 0 (because $3 here is the collection frequency, which is sum(wdf(term)) over all documents). > (that awk bit should be overflow-free)...
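For readers who find the one-liner dense, here is a minimal Python sketch of the same frequency-weighted average. It assumes each input line has the form "term termfreq collfreq", which is what the awk fields $1, $2 and $3 in the snippet imply; the function name and the sample data are hypothetical, not part of Xapian.

```python
def mean_term_length(lines):
    """Collection-frequency-weighted mean term length.

    Assumption: each line is "term termfreq collfreq", whitespace
    separated, matching the $1/$2/$3 fields used by the awk snippet.
    """
    total_chars = 0
    total_freq = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        term, collfreq = fields[0], int(fields[2])
        # Mirror `tr -d A-Z`: drop ASCII uppercase letters from the term.
        stripped = "".join(c for c in term if not ("A" <= c <= "Z"))
        total_chars += len(stripped) * collfreq
        total_freq += collfreq
    # Avoid dividing by zero on empty input.
    return total_chars / total_freq if total_freq else 0.0


# Hypothetical sample: "Kfoo" stands in for a boolean term with wdf 0.
sample = ["the 3 10", "xapian 2 4", "Kfoo 2 0"]
print(mean_term_length(sample))
```

With this sample the boolean term contributes nothing because its collection frequency is 0, so the result is 54/14, roughly 3.86. Python's len() counts code points, whereas awk's length() may count bytes depending on locale, so the two can differ on non-ASCII terms.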
2023 May 03
1
manual flushing thresholds for deletes?
...ding one to each collection frequency makes little sense to me. I'm guessing the idea is to count each boolean term once for each document it's in? If so, you want to use the collection frequency for non-boolean terms and the term frequency for boolean terms, so that's: xapian-delve -avv1 .|tr -d A-Z|awk '{f = $3 ? $3 : $2; t += length($1)*f; n += f} END {print t/n}' > My Perl deletion code is something like: > > my $EST_LEN = 6; > ... > for my $docid (@docids) { > $TXN_BYTES -= $xdb->get_doclength($docid) * $EST_LEN; However you're using tha...
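The fallback in the second one-liner (use the collection frequency when it is non-zero, otherwise the term frequency) can be sketched in Python as below. As before, the "term termfreq collfreq" line layout is inferred from the awk fields, and the function name and sample data are hypothetical.

```python
def mean_term_length_counting_booleans(lines):
    """Weighted mean term length, counting boolean terms once per document.

    Assumption: each line is "term termfreq collfreq". A collfreq of 0 is
    treated as marking a boolean term (wdf 0), as the thread suggests.
    """
    total_chars = 0
    total_weight = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        term = fields[0]
        termfreq, collfreq = int(fields[1]), int(fields[2])
        # Same choice as awk's `f = $3 ? $3 : $2`.
        weight = collfreq if collfreq else termfreq
        # Mirror `tr -d A-Z`: drop ASCII uppercase letters from the term.
        stripped = "".join(c for c in term if not ("A" <= c <= "Z"))
        total_chars += len(stripped) * weight
        total_weight += weight
    return total_chars / total_weight if total_weight else 0.0


# Hypothetical sample: "Kfoo" is a boolean term present in 2 documents.
sample = ["the 3 10", "xapian 2 4", "Kfoo 2 0"]
print(mean_term_length_counting_booleans(sample))
```

Here the boolean term is weighted by its term frequency of 2, so the sample gives 60/16 = 3.75.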
2023 May 03
1
manual flushing thresholds for deletes?
...by frequency though, and short words tend to be more > frequent, so you're likely skewing the answer. Also it'll include > boolean terms which didn't come from the document text. Ah, OK. > You can take frequency into account with something like this: > > xapian-delve -avv1 .|tr -d A-Z|awk '{t += length($1)*$3; n += $3} END {print t/n}' > > This will also effectively ignore boolean terms, assuming you're giving > them wdf of 0 (because $3 here is the collection frequency, which is > sum(wdf(term)) over all documents). Should boolean terms be...
2023 Mar 27
1
manual flushing thresholds for deletes?
Olly Betts <olly at survex.com> wrote: > On Fri, Mar 24, 2023 at 10:37:41AM +0000, Eric Wong wrote: > > Realizing I had documents of hugely varying sizes (0.5KB..20MB) > > and little RAM, I instead tracked the number of raw bytes in the > > text being indexed and flushed whenever I'd seen a configurable > > byte count. Not the most scientific way, but it seems