On Mon, Aug 07, 2006 at 07:10:00PM +0200, Reini Urban wrote:
> omindex expands each and every document without checking the database
> if the document changed in the meantime.
> This costs a lot of IO and cpu.
> Why not store last_mod fully also in the database and check it before
> doing the parsing?
> "L" + my_itoa(unixtime)
Because it's a little awkward, and no one has done it yet :-)
The place to do it is in index_file(), immediately after the duplicates
test. You'll need to fetch the document by unique id (which is the
only fiddly bit, really), then look up the last mod time recorded at
the previous indexing run and compare it against the mtime from stat().
(Assuming stat works on your operating system; if not, best to disable
this entirely.)
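
Roughly, the check described above might look like the sketch below.
This is not omindex code: a std::map stands in for the "fetch the
document by unique id" database lookup, and LastModStore,
current_mtime() and can_skip() are made-up names for illustration.
Only stat() itself is real.

```cpp
#include <sys/stat.h>

#include <ctime>
#include <map>
#include <string>

// Hypothetical stand-in for the database: unique id -> mtime recorded
// when the document was last indexed.
typedef std::map<std::string, std::time_t> LastModStore;

// Fetch the file's current mtime via stat(); returns false if stat() fails.
bool current_mtime(const std::string& path, std::time_t& mtime) {
    struct stat st;
    if (stat(path.c_str(), &st) != 0) return false;
    mtime = st.st_mtime;
    return true;
}

// True if the file can be skipped: it was indexed before and its mtime
// matches what was recorded then. Anything doubtful means "reindex".
bool can_skip(const LastModStore& store, const std::string& unique_id,
              const std::string& path) {
    std::time_t mtime_now;
    if (!current_mtime(path, mtime_now)) return false;  // stat failed
    LastModStore::const_iterator it = store.find(unique_id);
    if (it == store.end()) return false;                // never indexed
    return it->second == mtime_now;
}
```

Comparing for equality (rather than "not newer than") is a choice made
in this sketch: a file whose mtime has gone backwards gets reindexed
too.
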
I'd be inclined to put last mod in the document data as a new data
field, rather than in a prefixed term, although that might not be the
most efficient way of doing it. I suspect that a unique term lookup
won't grab the L-part of the termlist anyway (in general), so you have
the choice between looking up the doc data and grabbing a page or two
of the termlist for the document (with skip_to()). Olly would be
better placed to say which was more efficient in terms of IO.
Downside to the data field approach is that it's probably more
code. The code is clearer, however. Hmm.
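
For a feel of the data field approach, here is a minimal sketch. It
assumes the document data is a block of newline-separated name=value
lines, and the field name "lastmod" is invented for the example, as
are add_lastmod_field() and get_lastmod_field(). It works on plain
strings, so the Xapian side (getting the data out of the document and
putting it back) is left out.

```cpp
#include <cstdlib>
#include <ctime>
#include <string>

// Append a "lastmod=<unixtime>" line to the document data.
std::string add_lastmod_field(std::string data, std::time_t mtime) {
    if (!data.empty() && data[data.size() - 1] != '\n') data += '\n';
    data += "lastmod=" + std::to_string(static_cast<long long>(mtime));
    return data;
}

// Look for a "lastmod=" line; on success store its value in mtime and
// return true. Returns false if the field is absent.
bool get_lastmod_field(const std::string& data, std::time_t& mtime) {
    const std::string key = "lastmod=";
    std::string::size_type pos = 0;
    while (pos < data.size()) {
        std::string::size_type end = data.find('\n', pos);
        if (end == std::string::npos) end = data.size();
        if (data.compare(pos, key.size(), key) == 0) {
            const char* value = data.c_str() + pos + key.size();
            mtime = static_cast<std::time_t>(std::strtoll(value, NULL, 10));
            return true;
        }
        pos = end + 1;  // move to the start of the next line
    }
    return false;
}
```

A document indexed before the field existed simply has no such line,
so get_lastmod_field() returns false and the caller can treat that as
"reindex".
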
It probably needs an option to override this, in case mtime gets
mangled for some reason (restore from backup, for instance).
James
--
/--------------------------------------------------------------------------\
James Aylett xapian.org
james@tartarus.org uncertaintydivision.org