search for: get_doclength

Displaying 14 results from an estimated 14 matches for "get_doclength".

2023 Mar 24
1
manual flushing thresholds for deletes?
...e byte count. Not the most scientific way, but it seems to work well enough on low-end systems. Now, I'm dealing with many deletes at once and hitting OOM again. Since the raw text is no longer available and I didn't store its original size anywhere, would calculating something based on get_doclength be a reasonable approximation? I'm wondering if something like: get_doclength * (1 + 3) * mean_term_length where 1x is for the mean term length itself, and 3x for the position overhead And perhaps assume mean_term_length is 10 bytes, so maybe: get_doclength * 40 ? I'm using S...
2023 Mar 26
1
manual flushing thresholds for deletes?
...cientific way, but it seems to work > well enough on low-end systems. > > Now, I'm dealing with many deletes at once and hitting OOM > again. Since the raw text is no longer available and I didn't > store its original size anywhere, would calculating something > based on get_doclength be a reasonable approximation? > > I'm wondering if something like: > > get_doclength * (1 + 3) * mean_term_length > > where 1x is for the mean term length itself, > and 3x for the position overhead If I follow you want an approximation to the number of raw bytes...
2023 Mar 27
1
manual flushing thresholds for deletes?
...work > > well enough on low-end systems. > > > > Now, I'm dealing with many deletes at once and hitting OOM > > again. Since the raw text is no longer available and I didn't > > store its original size anywhere, would calculating something > > based on get_doclength be a reasonable approximation? > > > > I'm wondering if something like: > > > > get_doclength * (1 + 3) * mean_term_length > > > > where 1x is for the mean term length itself, > > and 3x for the position overhead > > If I follow you want...
2023 May 03
1
manual flushing thresholds for deletes?
...frequency for boolean terms, so that's: xapian-delve -avv1 .|tr -d A-Z|awk '{f = $3 ? $3 : $2; t += length($1)*f; n += f} END {print t/n}' > My Perl deletion code is something like: > > my $EST_LEN = 6; > ... > for my $docid (@docids) { > $TXN_BYTES -= $xdb->get_doclength($docid) * $EST_LEN; However you're using that estimate here, and the document length doesn't include boolean terms (it's sum(wdf) over the terms in the document), so including them in $EST_LEN seems wrong. For you doing so increases $EST_LEN, so you'll tend to overestimate for lon...
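As a rough illustration of the frequency-weighted mean term length that the awk one-liner in this snippet computes, here is a minimal Python sketch. The `(term, frequency)` pairs are made-up sample data rather than real xapian-delve output, and `weighted_mean_term_length` is a hypothetical helper name, not part of Xapian.

```python
def weighted_mean_term_length(term_freqs):
    """Return the mean term length in bytes, weighted by term frequency.

    term_freqs is an iterable of (term, frequency) pairs; this input shape
    is an assumption made for the sketch.
    """
    total_bytes = 0
    total_occurrences = 0
    for term, freq in term_freqs:
        # Count bytes rather than characters, since the estimate is in bytes.
        total_bytes += len(term.encode("utf-8")) * freq
        total_occurrences += freq
    if total_occurrences == 0:
        raise ValueError("no term occurrences to average over")
    return total_bytes / total_occurrences


# Made-up sample data for illustration only.
sample = [("the", 10), ("xapian", 2), ("index", 3)]
print(weighted_mean_term_length(sample))
```

Weighting by frequency matters because short, common terms dominate the total, which is why the thread arrives at a figure close to 6 rather than 10.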
2023 Aug 27
1
DatabaseModifiedError while iterating on mset
...abaseModifiedError on the call to create the TermIterator since the list of terms and wdfs is stored in a single entry per document which is fetched when the iterator is created (it is conceivable this might be different for a new database backend in the future I suppose). If you call methods like get_doclength() which need to consult the database those could throw. Cheers, Olly
2013 Jan 17
1
FASTER Search
...t::process_next_or_skip_to(double, Xapian::PostingIterator::Internal*)
17803 6.5989 OrPostList::next(double)
12481 4.6262 AndMaybePostList::get_weight() const
10729 3.9768 OrPostList::get_weight() const
10096 3.7422 AndMaybePostList::next(double)
8743 3.2407 ChertDatabase::get_doclength(unsigned int) const
7527 2.7900 LeafPostList::get_weight() const
7504 2.7814 ChertPostListTable::get_doclength(unsigned int, Xapian::Internal::RefCntPtr<ChertDatabase const>) const
5402 2.0023 ChertPostList::jump_to(unsigned int)
4518 1.6746 ChertPostList::skip_to(unsi...
2023 May 03
1
manual flushing thresholds for deletes?
...($1)*($3+1); n += ($3+1)} END {print t/n}' # (also added "NR > 1" to ignore the delve header line)
Which gives me 6.00067, so rounding to 6 seems fine either way. My Perl deletion code is something like:

    my $EST_LEN = 6;
    ...
    for my $docid (@docids) {
        $TXN_BYTES -= $xdb->get_doclength($docid) * $EST_LEN;
        $xdb->delete_document($docid);
        if ($TXN_BYTES < 0) { # flush within txn
            $xdb->commit_transaction;
            $TXN_BYTES = 8000000;
            $xdb->begin_transaction;
        }
    }

> > (that awk bit should be overflow-free) <snip> > Or use a language which supports...
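The byte-budget pattern in this snippet can be sketched in Python without depending on Xapian. Everything here is illustrative: `delete_with_budget` is a hypothetical helper, and the `get_doclength`, `delete_document`, and `flush` callbacks stand in for the database calls shown in the Perl code. The defaults of 6 bytes per term and an 8,000,000-byte budget are taken from the snippet.

```python
def delete_with_budget(docids, get_doclength, delete_document, flush,
                       est_len=6, budget=8_000_000):
    """Delete documents, flushing whenever the estimated byte budget is spent.

    The callbacks are stand-ins for the real database operations; flush()
    is assumed to commit the current transaction and begin a new one.
    Returns the number of flushes performed.
    """
    remaining = budget
    flushes = 0
    for docid in docids:
        # Charge the estimated size before deleting, since the length
        # cannot be looked up once the document is gone.
        remaining -= get_doclength(docid) * est_len
        delete_document(docid)
        if remaining < 0:
            flush()
            flushes += 1
            remaining = budget
    return flushes


# Usage with in-memory stand-ins (made-up document lengths).
lengths = {1: 500_000, 2: 500_000, 3: 500_000, 4: 100}
deleted = []
flush_count = delete_with_budget(
    docids=[1, 2, 3, 4],
    get_doclength=lengths.__getitem__,
    delete_document=deleted.append,
    flush=lambda: None,
)
```

Passing the database operations in as callbacks keeps the budget logic testable on its own; in real code they would wrap the corresponding database methods.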
2023 Aug 23
1
DatabaseModifiedError while iterating on mset
I'm already retrying the ->get_mset operations; but now I'm wondering where I'd hit DatabaseModifiedErrors while inside a Xapian::MSetIterator loop. I assume ->get_document is a place where it gets thrown; but once a document is retrieved, can iterating through terms in one document (using TermIterator) also throw DB modified? I'm dumping multiple terms per-document to a
2010 Jan 16
1
PHP XapianTermIterator/XapianPositionIterator usage
Hello again, thanks to Peter for the previous response. I've been digging around trying to find sample usage of XapianTermIterator/XapianPositionIterator in PHP. The idea is to code up a test case in PHP to perform snippet extraction (with a possible view to coding a PECL extension in C). I found a C++ sample, but that wasn't much help. I must be dense this morning though, since I
2023 Aug 28
1
DatabaseModifiedError while iterating on mset
...her hand, modifications to existing documents are not common in my use cases (but possible) so I've never noticed errors while iterating through an MSet. I suppose DocumentNotFound errors can also happen while iterating an MSet if a writer is deleting documents, too, right? > If you call methods like get_doclength() which need to consult the > database those could throw. OK, will do. Thanks.
2023 Mar 27
1
manual flushing thresholds for deletes?
On Mon, Mar 27, 2023 at 11:22:09AM +0000, Eric Wong wrote: > Olly Betts <olly at survex.com> wrote: > > 10 seems too long. You want the mean word length weighted by frequency > > of occurrence. For English that's typically around 5 characters, which > > is 5 bytes. If we go for +1 that's: > > Actually, 10 may be too short in my case since there's a
2009 Feb 12
1
problem when using xapian's static libs in windows
...bnet.lib(tcpclient.obj) : error LNK2001: unresolved external symbol "public: virtual double __thiscall RemoteDatabase::get_avlength(void)const " (?get_avlength@RemoteDatabase@@UBENXZ)
libbackend.lib(dbfactory_remote.obj) : error LNK2001: unresolved external symbol "public: virtual double __thiscall RemoteDatabase::get_doclength(unsigned int)const " (?get_doclength@RemoteDatabase@@UBENI@Z)
libnet.lib(progclient.obj) : error LNK2001: unresolved external symbol "public: virtual double __thiscall RemoteDatabase::get_doclength(unsigned int)const " (?get_doclength@RemoteDatabase@@UBENI@Z)
libnet.lib(tcpclient.obj) : e...
2005 Aug 12
1
error building xapian
...oryPostList(Xapian::Internal::RefCntPtr<const InMemoryDatabase>, const InMemoryTerm&)':
inmemory_database.cc:84: error: class 'InMemoryPostList' does not have any field named 'db'
inmemory_database.cc: In member function 'virtual Xapian::doclength InMemoryPostList::get_doclength() const':
inmemory_database.cc:153: error: 'db' was not declared in this scope
inmemory_database.cc: At global scope:
inmemory_database.cc:182: error: prototype for 'InMemoryTermList::InMemoryTermList(Xapian::Internal::RefCntPtr<const InMemoryDatabase>, Xapian::docid, const In...
2009 Jan 27
1
Segmentation fault in MSetIterator get_weight
Hi, I'm using Xapian with C# and Mono and I'm getting a segfault in get_weight. When I print the index variable, the value is clearly too high. I think something is writing over it. Do you have any idea how I could trace where the segmentation fault originates? Thanks, -- Yann