Displaying 14 results from an estimated 14 matches for "get_doclength".
2023 Mar 24
1
manual flushing thresholds for deletes?
...e
byte count. Not the most scientific way, but it seems to work
well enough on low-end systems.
Now, I'm dealing with many deletes at once and hitting OOM
again. Since the raw text is no longer available and I didn't
store its original size anywhere, would calculating something
based on get_doclength be a reasonable approximation?
I'm wondering if something like:
get_doclength * (1 + 3) * mean_term_length
where 1x is for the mean term length itself,
and 3x for the position overhead
And perhaps assume mean_term_length is 10 bytes, so maybe:
get_doclength * 40
?
I'm using S...
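As a rough worked version of the estimate proposed in this post, the arithmetic can be sketched in Python. The function name `estimate_bytes` and the sample document length are hypothetical. The factors (1x for the term text, 3x for position overhead, a 10-byte mean term length) are the poster's own guesses, not measured values.

```python
def estimate_bytes(doclength, mean_term_length=10, position_factor=3):
    """Approximate the bytes touched by deleting one document.

    doclength is the value get_doclength reports for the document.
    The default factors are the guesses from the post, not measurements.
    """
    return doclength * (1 + position_factor) * mean_term_length


# A hypothetical document of length 250:
print(estimate_bytes(250))  # 250 * 4 * 10 = 10000
```

With the defaults this reduces to get_doclength * 40, the shorthand given in the post.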
2023 Mar 26
1
manual flushing thresholds for deletes?
...cientific way, but it seems to work
> well enough on low-end systems.
>
> Now, I'm dealing with many deletes at once and hitting OOM
> again. Since the raw text is no longer available and I didn't
> store its original size anywhere, would calculating something
> based on get_doclength be a reasonable approximation?
>
> I'm wondering if something like:
>
> get_doclength * (1 + 3) * mean_term_length
>
> where 1x is for the mean term length itself,
> and 3x for the position overhead
If I follow you want an approximation to the number of raw bytes...
2023 Mar 27
1
manual flushing thresholds for deletes?
...work
> > well enough on low-end systems.
> >
> > Now, I'm dealing with many deletes at once and hitting OOM
> > again. Since the raw text is no longer available and I didn't
> > store its original size anywhere, would calculating something
> > based on get_doclength be a reasonable approximation?
> >
> > I'm wondering if something like:
> >
> > get_doclength * (1 + 3) * mean_term_length
> >
> > where 1x is for the mean term length itself,
> > and 3x for the position overhead
>
> If I follow you want...
2023 May 03
1
manual flushing thresholds for deletes?
...frequency for boolean terms,
so that's:
xapian-delve -avv1 .|tr -d A-Z|awk '{f = $3 ? $3 : $2; t += length($1)*f; n += f} END {print t/n}'
> My Perl deletion code is something like:
>
> my $EST_LEN = 6;
> ...
> for my $docid (@docids) {
> $TXN_BYTES -= $xdb->get_doclength($docid) * $EST_LEN;
However you're using that estimate here, and the document length
doesn't include boolean terms (it's sum(wdf) over the terms in the
document), so including them in $EST_LEN seems wrong. For you doing
so increases $EST_LEN, so you'll tend to overestimate for lon...
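The awk one-liner in this reply computes a frequency-weighted mean, sum(length * f) / sum(f). The same calculation can be sketched in Python for readers less familiar with awk. The function name and the sample (term, frequency) pairs are made up for illustration, and measuring length in UTF-8 bytes is an assumption (awk's length() may count characters depending on locale).

```python
def weighted_mean_term_length(term_freqs):
    """Mean term length weighted by how often each term occurs.

    term_freqs is an iterable of (term, frequency) pairs.
    """
    total_bytes = 0
    total_occurrences = 0
    for term, freq in term_freqs:
        total_bytes += len(term.encode("utf-8")) * freq
        total_occurrences += freq
    return total_bytes / total_occurrences


# Hypothetical sample: one short, very common term and two rarer, longer ones.
sample = [("the", 50), ("xapian", 5), ("database", 5)]
print(weighted_mean_term_length(sample))
```

Because the common short term dominates, the weighted mean comes out well below the unweighted mean of the three term lengths, which is why the weighting matters for an estimate multiplied by document length.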
2023 Aug 27
1
DatabaseModifiedError while iterating on mset
...abaseModifiedError on the call to create the TermIterator since the
list of terms and wdfs is stored in a single entry per document which
is fetched when the iterator is created (it is conceivable this might
be different for a new database backend in the future I suppose).
If you call methods like get_doclength() which need to consult the
database those could throw.
Cheers,
Olly
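A common way to cope with calls that may raise DatabaseModifiedError is to reopen the database and retry. The following is only a sketch of that pattern in Python: `DatabaseModifiedError` and `FlakyDB` are stand-ins defined here so the example runs without Xapian installed, and `with_reopen_retry` is a hypothetical helper name.

```python
class DatabaseModifiedError(Exception):
    """Stand-in for the Xapian exception of the same name."""


class FlakyDB:
    """Hypothetical reader that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.reopens = 0

    def reopen(self):
        self.reopens += 1

    def get_doclength(self, docid):
        if self.failures > 0:
            self.failures -= 1
            raise DatabaseModifiedError("revision no longer available")
        return 42


def with_reopen_retry(db, operation, max_attempts=3):
    """Run operation(db), reopening and retrying if the database changed."""
    for attempt in range(max_attempts):
        try:
            return operation(db)
        except DatabaseModifiedError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            db.reopen()


db = FlakyDB(failures=1)
print(with_reopen_retry(db, lambda d: d.get_doclength(1)))  # prints 42
```

After a reopen, results fetched earlier (such as an MSet) may no longer match the database contents, so a real caller may need to rerun the query rather than only the failing call.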
2013 Jan 17
1
FASTER Search
...t::process_next_or_skip_to(double,
Xapian::PostingIterator::Internal*)
17803 6.5989 OrPostList::next(double)
12481 4.6262 AndMaybePostList::get_weight() const
10729 3.9768 OrPostList::get_weight() const
10096 3.7422 AndMaybePostList::next(double)
8743 3.2407 ChertDatabase::get_doclength(unsigned int) const
7527 2.7900 LeafPostList::get_weight() const
7504 2.7814 ChertPostListTable::get_doclength(unsigned int,
Xapian::Internal::RefCntPtr<ChertDatabase const>) const
5402 2.0023 ChertPostList::jump_to(unsigned int)
4518 1.6746 ChertPostList::skip_to(unsi...
2023 May 03
1
manual flushing thresholds for deletes?
...($1)*($3+1); n += ($3+1)} END {print t/n}'
# (also added "NR > 1" to ignore the delve header line)
Which gives me 6.00067, so rounding to 6 seems fine either way.
My Perl deletion code is something like:
my $EST_LEN = 6;
...
for my $docid (@docids) {
    $TXN_BYTES -= $xdb->get_doclength($docid) * $EST_LEN;
    $xdb->delete_document($docid);
    if ($TXN_BYTES < 0) { # flush within txn
        $xdb->commit_transaction;
        $TXN_BYTES = 8000000;
        $xdb->begin_transaction;
    }
}
> > (that awk bit should be overflow-free)
<snip>
> Or use a language which supports...
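The byte-budget loop from the Perl snippet in this message can be simulated without a real database. In the Python sketch below, `FakeDB` is a hypothetical stand-in that only records calls, and the opening begin_transaction and final commit_transaction are assumptions, since the snippet does not show what surrounds the loop.

```python
class FakeDB:
    """Hypothetical stand-in for a writable database; it only records calls."""

    def __init__(self, doclengths):
        self.doclengths = dict(doclengths)
        self.commits = 0

    def get_doclength(self, docid):
        return self.doclengths[docid]

    def delete_document(self, docid):
        del self.doclengths[docid]

    def begin_transaction(self):
        pass

    def commit_transaction(self):
        self.commits += 1


def delete_with_budget(db, docids, est_len=6, budget=8_000_000):
    """Delete documents, committing whenever the estimated byte budget runs out."""
    remaining = budget
    db.begin_transaction()  # assumed: not shown in the snippet
    for docid in docids:
        # Read the length before deleting; it is unavailable afterwards.
        remaining -= db.get_doclength(docid) * est_len
        db.delete_document(docid)
        if remaining < 0:  # flush within the transaction
            db.commit_transaction()
            remaining = budget
            db.begin_transaction()
    db.commit_transaction()  # assumed: final commit after the loop
    return db.commits


db = FakeDB({1: 100, 2: 100, 3: 100})
print(delete_with_budget(db, [1, 2, 3], est_len=6, budget=1000))  # prints 2
```

With a budget of 1000 and 600 estimated bytes per document, the second delete pushes the budget below zero and triggers one mid-loop commit, and the final commit makes two.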
2023 Aug 23
1
DatabaseModifiedError while iterating on mset
I'm already retrying the ->get_mset operations; but now I'm
wondering where I'd hit DatabaseModifiedErrors while inside a
Xapian::MSetIterator loop.
I assume ->get_document is a place where it gets thrown;
but once a document is retrieved, can iterating through
terms in one document (using TermIterator) also throw DB modified?
I'm dumping multiple terms per-document to a
2010 Jan 16
1
PHP XapianTermIterator/XapianPositionIterator usage
Hello again,
/thanks to Peter for previous response.
I've been digging around trying to find sample usage of
XapianTermIterator/XapianPositionIterator in PHP. The idea is to code up a
test case in PHP to perform snippet extraction (with a possible view to
coding a pecl extension in C). I found a C++ sample, but that wasn't much
help.
I must be dense this morning though, since I
2023 Aug 28
1
DatabaseModifiedError while iterating on mset
...her hand, modifications to existing documents are not
common in my use cases (but possible) so I've never noticed
errors while iterating through an MSet.
I suppose DocumentNotFound errors can also happen while
iterating an MSet if a writer is deleting documents, too, right?
> If you call methods like get_doclength() which need to consult the
> database those could throw.
OK, will do. Thanks.
2023 Mar 27
1
manual flushing thresholds for deletes?
On Mon, Mar 27, 2023 at 11:22:09AM +0000, Eric Wong wrote:
> Olly Betts <olly at survex.com> wrote:
> > 10 seems too long. You want the mean word length weighted by frequency
> > of occurrence. For English that's typically around 5 characters, which
> > is 5 bytes. If we go for +1 that's:
>
> Actually, 10 may be too short in my case since there's a
2009 Feb 12
1
problem when using xapian's static libs in windows
...bnet.lib(tcpclient.obj) : error LNK2001: unresolved external symbol "public: virtual double __thiscall RemoteDatabase::get_avlength(void)const " (?get_avlength@RemoteDatabase@@UBENXZ)
libbackend.lib(dbfactory_remote.obj) : error LNK2001: unresolved external symbol "public: virtual double __thiscall RemoteDatabase::get_doclength(unsigned int)const " (?get_doclength@RemoteDatabase@@UBENI@Z)
libnet.lib(progclient.obj) : error LNK2001: unresolved external symbol "public: virtual double __thiscall RemoteDatabase::get_doclength(unsigned int)const " (?get_doclength@RemoteDatabase@@UBENI@Z)
libnet.lib(tcpclient.obj) : e...
2005 Aug 12
1
error building xapian
...oryPostList(Xapian::Internal::RefCntPtr<const
InMemoryDatabase>, const InMemoryTerm&)':
inmemory_database.cc:84: error: class 'InMemoryPostList' does not have
any field named 'db'
inmemory_database.cc: In member function 'virtual Xapian::doclength
InMemoryPostList::get_doclength() const':
inmemory_database.cc:153: error: 'db' was not declared in this scope
inmemory_database.cc: At global scope:
inmemory_database.cc:182: error: prototype for
'InMemoryTermList::InMemoryTermList(Xapian::Internal::RefCntPtr<const
InMemoryDatabase>, Xapian::docid, const In...
2009 Jan 27
1
Segmentation fault in MSetIterator get_weight
Hi,
I'm using Xapian with C# and Mono and I'm getting a segfault in get_weight.
When I print the index variable, the value is clearly too high.
I think something is writing over it. Do you have any idea how I could
trace where the segmentation fault begins?
Thanks,
--
Yann