thr3ads.net - Xapian discuss - [Xapian-discuss] Ticket #342: Omega: Add option to avoid reindexing unchanged files [May 2009]


Srijon Biswas

2009-May-20 10:42 UTC

[Xapian-discuss] Ticket #342: Omega: Add option to avoid reindexing unchanged files

Hi.

I was searching around for some documentation on Omega (a query that I
posted just yesterday) and I came across this ticket.

I _think_ that the implementation may be incorrect here... please correct me
if I am wrong (I have just seen the final patch as linked in the ticket, not
really tried it out):

dir A:
- file A1 [content C1] [last modified M1]
- file A2 [content C2] [last modified M2]
- file A3 [content C3] [last modified M3]

Index dir A.

Then move A1 -> A2, create a new A1 with new content. So we get:

dir A:
- file A1 [content C4] [last modified M4]
- file A2 [content C1] [last modified M1]
- file A3 [content C3] [last modified M3]

Index dir A.

In the above scenario, as per the fix, am I correct in assuming that A2 will
not get updated (which it should), but A1 will? Please correct me if I am
wrong.
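
The scenario above can be reproduced with a minimal sketch. It assumes the skip test is "the file's mtime is not newer than the value recorded for that path at the last index run", and that M1 is older than M2; both are assumptions for illustration, not a reading of the actual patch. The `index` dict and all function names are hypothetical stand-ins for the database.

```python
import os
import tempfile

def needs_reindex(path, index):
    # Assumed timestamp-only test: reindex if the path is unknown or its
    # mtime is newer than the one recorded at the last index run.
    stored = index.get(path)
    return stored is None or os.stat(path).st_mtime_ns > stored

def index_dir(dirpath, index):
    # Returns the names that were (re)indexed on this run.
    updated = []
    for name in sorted(os.listdir(dirpath)):
        path = os.path.join(dirpath, name)
        if needs_reindex(path, index):
            index[path] = os.stat(path).st_mtime_ns
            updated.append(name)
    return updated

def write_file(path, content, mtime_s):
    with open(path, "w") as f:
        f.write(content)
    ns = mtime_s * 10**9
    os.utime(path, ns=(ns, ns))  # force a known last-modified time

def demo():
    with tempfile.TemporaryDirectory() as d:
        a1, a2, a3 = (os.path.join(d, n) for n in ("A1", "A2", "A3"))
        write_file(a1, "C1", 1_000_000_000)  # M1
        write_file(a2, "C2", 1_000_000_100)  # M2
        write_file(a3, "C3", 1_000_000_200)  # M3
        index = {}
        first = index_dir(d, index)
        os.replace(a1, a2)                   # A2 now has content C1, mtime M1
        write_file(a1, "C4", 1_000_000_300)  # new A1 with mtime M4
        second = index_dir(d, index)
        return first, second
```

Under these assumptions the second run picks up A1 only: A2 carries the older timestamp M1, so its changed content goes unnoticed.
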
Maybe the test for changed content should depend on the md5sum and not on
the date (even though this does add more burden than just checking the last
mod date). Something roughly like this:

- Get the url for the file.
- Read the corresponding md5 value from the db if present.
- Create the md5 for this file (I know this does not work for text files,
at least as per the current code, but it need not be that way - see comment
below).
- If md5 matches, then no need to do anything, else continue as normal.
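
The steps above can be sketched as follows. This is only an illustration: the `stored_md5` dict stands in for the db lookup, and `file_md5` and `should_skip` are hypothetical names, not functions from Omega.

```python
import hashlib

def file_md5(path, chunk_size=65536):
    # md5 of the raw file bytes, read in chunks to keep memory use flat.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def should_skip(url, path, stored_md5):
    # stored_md5 maps url -> md5 hex digest (a stand-in for the db).
    old = stored_md5.get(url)    # read the stored md5 for this url, if any
    new = file_md5(path)         # create the md5 for the file as it is now
    if old == new:
        return True              # unchanged content: nothing to do
    stored_md5[url] = new        # changed or new: record it and index as normal
    return False
```

Unlike a timestamp test, this reads every file on every run, which is the extra burden mentioned above.
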

Also, right now the md5 is being taken for the raw file in all cases, and
for the "processed" text only for text files (where the md5 is for content
that has been changed a bit). It does not seem that taking the md5 of the
processed text is of any use at this point (and where it does become
useful, maybe store two values - one md5 for the raw file and another one
for the content of the file after passing through the mime type handler).
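
A rough sketch of the two-value idea, assuming the "processing" step is a conversion of the text to UTF-8 (the function name and the default source encoding are hypothetical, chosen only to make the example concrete):

```python
import hashlib
from typing import Tuple

def raw_and_processed_md5(raw: bytes, source_encoding: str = "iso-8859-1") -> Tuple[str, str]:
    # First value: md5 of the file exactly as it is on disk.
    raw_md5 = hashlib.md5(raw).hexdigest()
    # Second value: md5 of the content after processing. Here the processing
    # is assumed to be a re-encoding to UTF-8; a real handler may do more.
    processed = raw.decode(source_encoding).encode("utf-8")
    processed_md5 = hashlib.md5(processed).hexdigest()
    return raw_md5, processed_md5
```

For pure ASCII input the two digests coincide; they differ as soon as the conversion changes any byte, which is why only the raw digest is a reliable test for "the file on disk has changed".
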

Thanks,
Srijon.

Olly Betts

2009-May-20 11:16 UTC

[Xapian-discuss] Ticket #342: Omega: Add option to avoid reindexing unchanged files

On Wed, May 20, 2009 at 11:42:00AM +0100, Srijon Biswas wrote:
> I _think_ that the implementation may be incorrect here... please correct me
> if I am wrong (I have just seen the final patch as linked in the ticket, not
> really tried it out):
> 
> dir A:
> - file A1 [content C1] [last modified M1]
> - file A2 [content C2] [last modified M2]
> - file A3 [content C3] [last modified M3]
> 
> Index dir A.
> 
> Then move A1 -> A2, create a new A1 with new content. So we get:
> 
> dir A:
> - file A1 [content C4] [last modified M4]
> - file A2 [content C1] [last modified M1]
> - file A3 [content C3] [last modified M3]
> 
> Index dir A.
> 
> In the above scenario, as per the fix, am I correct in assuming that A2 will
> not get updated (which it should), but A1 will? Please correct me if I am
> wrong.

Yes, this is true.  Similarly if you restore an older file from backup,
or directly mess with timestamps after updating a file (e.g. with touch
--reference).

But then omindex is aimed at indexing web sites, and webservers will
also suffer from similar issues with "If-Modified-Since:" requests if
you do these things, so it's prudent to avoid doing these things in
web-served document trees anyway.

I bet for most users, the large speed gain outweighs these corner cases,
but they ought to be documented.

> Maybe the test for changed content should depend on the md5sum and not on
> the date (even though this does add more burden than just checking the last
> mod date). Something roughly like this:

Yes, it's quite a lot more work, but it would save some work.  A fuller
solution to ticket #250 would reduce the gain here, but there would
probably still be some:

http://trac.xapian.org/ticket/250

> Also, right now the md5 is being taken for the raw file in all cases, and
> for the "processed" text only for text files (where the md5 is for content
> that has been changed a bit). It does not seem that taking the md5 of the
> processed text is of any use at this point (and where it does become
> useful, maybe store two values - one md5 for the raw file and another one
> for the content of the file after passing through the mime type handler).

That's a bug - the patch for handling non-UTF-8 text came after the md5
one, and before that this was calculating the md5 sum of the "raw"
file.  I'll fix that in a moment.

Cheers,
    Olly
