thr3ads.net - similar to: "XAPIAN_FLUSH

Displaying 20 results from an estimated 1000 matches similar to: "XAPIAN_FLUSH_THRESHOLD"

2009 May 19

omindex options

Hi. I am writing a Python equivalent of omindex (we are using scriptindex currently, but I wanted to use omindex instead and extend it to work with our internal file format, and I did not want to compile code if possible). I have tried to keep the code as close as possible to the native omindex code, but am facing a bit of confusion: what exactly is the reason for omindex to take

BUG IN XAPIAN_FLUSH_THRESHOLD

2007 Jul 17

There is a bug when setting XAPIAN_FLUSH_THRESHOLD=20000000. When trying to force Xapian to flush after 20 million documents, Xapian ignores the value and flushes after only 10,000 documents. Data was captured from delve at 60-second intervals, with the threshold set as follows:
XAPIAN_FLUSH_THRESHOLD=20000000 perl -e ' while(1) { system("delve ."); sleep(60); } '
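The polling loop in the Perl one-liner can also be sketched in Python. This is only an illustration: the helper name `poll_command` and its parameters are made up here, and it assumes a `delve` binary on the PATH with the current directory being the database, as in the post.

```python
import os
import subprocess
import sys
import time


def poll_command(cmd, interval_seconds, iterations, extra_env=None):
    """Run cmd `iterations` times, yielding its stdout after each run.

    extra_env is merged into the child's environment, which is how the
    XAPIAN_FLUSH_THRESHOLD=... prefix in the shell one-liner behaves.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    for i in range(iterations):
        result = subprocess.run(cmd, env=env, capture_output=True, text=True)
        yield result.stdout
        if i < iterations - 1:
            time.sleep(interval_seconds)


# Usage mirroring the post (not executed here, it needs delve and a database):
#
#   for report in poll_command(["delve", "."], 60, 10,
#                              {"XAPIAN_FLUSH_THRESHOLD": "20000000"}):
#       sys.stdout.write(report)
```

Unlike the original `while(1)` loop, the sketch stops after a fixed number of samples so that it can be run unattended.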

omindex => Unknown extension

2009 Apr 06

Hi all, I'm having a recurrent problem with Omega's indexing. When I run omindex, it sometimes fails to recognize the extension of some files (.doc, .pdf) and skips them. In the same run, omindex is otherwise perfectly able to index other files with the same extensions. The reason is not clear, but it should occur before it selects a content converter since, for example, if I manually run

[GSOC 2014] Indexing INEX dataset

2014 Mar 11

On Tue, Mar 11, 2014 at 03:20:31PM +0100, Parth Gupta wrote:
> >
> > On current trunk, we index the title with prefix "S" by default in
> > omindex, though with a wdf inc of 5 rather than 1:
> >
> >     indexer.index_text(title, 5, "S");
> >
> > So I don't think you need that change to omindex now.
>
> Yes, but please

Dealing with image PDF's

2008 Jul 30

Guys, I was just playing around and added a bit of code to omindex.cc so I could OCR tiff and tif with gocr, which seems to work. Here's what it looks like:

    // Tiff:
    } else if (startswith(mimetype, "image/tif")) {
        // Inspired by http://mjr.towers.org.uk/comp/sxw2text
        string safefile = shell_protect(file);
        string cmd = "tifftopnm " + safefile + "

indexing performance

2004 Oct 08

I have some trouble with my indexer, which builds on simpleindex.cc. The problem is that the indexing process becomes very slow after we have indexed 2000k docs (though the indexer works quite well for the first 2000k docs). It took almost three weeks to index 8 million docs. However, we need to index about 20 million docs. I had to stop the indexer because of its performance. I think my question is

Ticket #282: omindex-assorted-enhancements.patch woes

2009 Feb 02

I would really like to try out the features in the patch above, but I can never seem to get the resulting omindex.cc to "make". I tried updating to rev 10801 from SVN and then running ./bootstrap, but then I seem to get errors compiling everything when I try to do "make" (I'm using Ubuntu 8.10). So I thought I'd try to apply the patch to the latest stable version

about index speed of xapian

2012 Nov 21

Hi, I use Xapian to index a txt file whose size is 268M. I take each line as a document, and each line has two fields, like 13445511 | 111115151. There are 10000000 records. XAPIAN_FLUSH_THRESHOLD is set to 1000000. It takes 1026544ms to index the file, which is much slower than Lucene. The Lucene speed is about 40000 records per second. code: try { Xapian::WritableDatabase
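To put the reported figures on the same scale, the elapsed time can be converted to records per second. The sketch below only does that arithmetic and shows one plausible way to split a line into its two fields; the function names are illustrative and none of this is the poster's code.

```python
def records_per_second(record_count, elapsed_ms):
    """Convert a record count and an elapsed time in milliseconds to a rate."""
    return record_count / (elapsed_ms / 1000.0)


def parse_record(line):
    """Split a line such as '13445511 | 111115151' into its two fields."""
    left, right = (field.strip() for field in line.split("|"))
    return left, right


xapian_rate = records_per_second(10_000_000, 1_026_544)
lucene_rate = 40_000  # figure quoted in the post

print(f"Xapian: about {xapian_rate:.0f} records/s")
print(f"Lucene is about {lucene_rate / xapian_rate:.1f}x faster")
```

With the numbers from the post this works out to roughly 9,700 records per second for Xapian, so the quoted Lucene rate is about four times higher.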

[GSOC 2014] Indexing INEX dataset

2014 Mar 17

Hi Olly, Wouldn't setting the weight of terms in the title back to normal (e.g. 5 to 1) with the line below automatically adjust the wdfs and field lengths? indexer.index_text(title, 5, "S"); -> indexer.index_text(title, 1, "S"); If it does not, then we should include that part in the patch too. I would like to create a patch for xapian-letor for resolving common code of xapian.
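The effect of the wdf increment can be illustrated with a toy model. This is not Xapian's TermGenerator: `toy_index_text` is a made-up stand-in that ignores stemming, positions and stopwords, and only assumes that each occurrence of a word adds the increment to that term's wdf and that a length is the sum of the wdfs.

```python
import re
from collections import Counter


def toy_index_text(text, wdf_inc=1, prefix=""):
    """Toy model: every word occurrence adds wdf_inc to the prefixed term's wdf."""
    wdfs = Counter()
    for word in re.findall(r"\w+", text.lower()):
        wdfs[prefix + word] += wdf_inc
    return wdfs


title = "xapian and more xapian"  # hypothetical title text

weighted = toy_index_text(title, 5, "S")  # analogue of index_text(title, 5, "S")
normal = toy_index_text(title, 1, "S")    # analogue of index_text(title, 1, "S")

for term in sorted(weighted):
    print(term, weighted[term], normal[term])
print("title length:", sum(weighted.values()), "vs", sum(normal.values()))
```

Under these assumptions, every title wdf and the summed title length shrink by the same factor of 5 when the increment goes from 5 to 1.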

"DatabaseCorruptError: Cannot open tables at consistent revisions"

2009 Apr 29

Occasionally when I'm searching using Omega I get: "DatabaseCorruptError: Cannot open tables at consistent revisions". If I click reload it's all OK. Is this the database being updated? Is there a way to avoid the message? Frank

notmuch: Xapian exception during database creation

2017 Dec 29

Running notmuch from git on Debian testing[1] with the mail and database sitting on a ZFS filesystem, adding mail to a new database:
> agrajag-testing ~/s/notmuch % ./notmuch new
> Found 605510 total files (that's not much mail).
> add_file: A Xapian exception occurred36m 37s remaining).
> A Xapian exception occurred adding message: Unexpected end of posting list for

wildcard support (left truncation)

2009 Feb 04

Does Xapian support wildcards (left truncation)? E.g. *ildcard.doc or *.doc or Wild*.doc. I read a post from Olly in 2005 that said it wasn't supported yet, and I was wondering if there had been any progress or an easy workaround since. I mainly need it when users want to search by filename extension. Thanks, Frank

[GSOC 2014] Indexing INEX dataset

2014 Mar 11

On Tue, Mar 11, 2014 at 12:02:15PM +0100, Parth Gupta wrote:
> During the indexing with omindex, only you need to make sure is indexing
> with prefix 'S' for title as explained here in Letor documentation:
> xapian-letor/docs/letor.rst
>
> Previously when I edited omindex.cc it was modified as can be seen
>

Proposed changes to omindex

2006 Aug 11

Proposed changes to omindex

Currently Available Items
=========================

1) Have the Q prefix contain the 16 byte MD5 of the full file name used for document lookup during indexing.
2) Add the document's last modified time to the value table (ID 0). This would allow incremental indexing based on the timestamp and also sorting by date in omega (SORT=0)
   a. Currently I store the timestamp
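The two items above can be sketched with the Python standard library. This is a guess at the idea rather than the proposed patch: the helper names are invented, the raw 16-byte digest is assumed from the "16 byte MD5" wording, and the way the timestamp would be encoded into value slot 0 is not given in the excerpt, so the sketch only reads it.

```python
import hashlib
import os


def q_term_for(path):
    """Build a unique-ID term: 'Q' followed by the raw 16-byte MD5 of the file name."""
    digest = hashlib.md5(os.fsencode(path)).digest()  # always 16 bytes
    return b"Q" + digest


def last_modified(path):
    """Return the file's modification time as whole seconds since the epoch."""
    return int(os.path.getmtime(path))


term = q_term_for("/srv/docs/report.pdf")  # hypothetical path
print(len(term), term[:1])
```

An incremental indexer along these lines would look a document up by its Q term and reindex only when the stored timestamp is older than `last_modified(path)`.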

"Value in posting list too large" error with 1.1.4 (chert and brass, not flint)

2010 Mar 07

Hi, I've a program which:
1. Sets XAPIAN_FLUSH_THRESHOLD=1000
2. Opens a (new) database for write
3. Indexes a few thousand documents
4. Periodically also does queries on the database

With 1.1.4, with certain document sets (basically a particular mail folder of mine), Enquire.get_mset() sometimes (but not always) triggers a "RangeError: Value in posting list too

Compact databases and removing stale records at the same time

2013 Jun 19

On Wed, Jun 19, 2013, at 03:49 PM, Olly Betts wrote:
> On Wed, Jun 19, 2013 at 01:29:16PM +1000, Bron Gondwana wrote:
> > The advantage of compact - it runs approximately 8 times as fast (we
> > are CPU limited in each case - writing to tmpfs first, then rsyncing
> > to the destination) and it takes approximately 75% of the space of a
> > fresh database with maximum

omindex killed

2012 Dec 29

I'm finding that omindex is consistently ending prematurely when indexing certain files. The last output looks like this:
[Entering directory /compounds/Acetic_acid]
Indexing "/MATLAB/compounds/Acetic_acid/AACID_50T.TXT" as text/plain ... added.
Indexing "/MATLAB/compounds/Acetic_acid/AACID_50T.pdf" as application/pdf ... "pdftotext -enc UTF-8

Storing the documents text: data record or value ?

2018 Jan 03

Hi, Following the Recoll snippets generation performance problem caused by the new positions list storage scheme in Xapian 1.4, I am experimenting with generating snippets from the complete document text stored in the index. This increases the index size much less than I would have expected (around 10-15% apparently with my home directory data), which is good news obviously. I have tried

Flint failed to deliver indexing performance to Quartz.

2007 Jun 17

Flint failed to deliver indexing performance to Quartz. I am proposing to remove Flint as the default database and make Quartz the default again. The catch is not that the Flint database is smaller and faster during searches than the Quartz database, which is what the developers were concerned with when measuring, while neglecting to measure performance when creating large indexes. The truth is that Flint
