Srijon Biswas
2009-May-20 10:42 UTC
[Xapian-discuss] Ticket #342: Omega: Add option to avoid reindexing unchanged files
Hi.

I was searching around for some documentation on Omega (a query that I posted just yesterday) and I came across this ticket.

I _think_ that the implementation may be incorrect here... please correct me if I am wrong (I have only looked at the final patch linked in the ticket, not actually tried it out):

dir A:
- file A1 [content C1] [last modified M1]
- file A2 [content C2] [last modified M2]
- file A3 [content C3] [last modified M3]

Index dir A.

Then move A1 -> A2, create a new A1 with new content. So we get:

dir A:
- file A1 [content C4] [last modified M4]
- file A2 [content C1] [last modified M1]
- file A3 [content C3] [last modified M3]

Index dir A.

In the above scenario, as per the fix, am I correct in assuming that A2 will not get updated (which it should), but A1 will? Please correct me if I am wrong.

Maybe the test for changed content should depend on the md5sum and not on the date (even though this does add more burden than just checking the last mod date). Something roughly like this:

- Get the url for the file.
- Read the corresponding md5 value from the db if present.
- Create the md5 for this file (I know this does not work for text files, at least as per the current code, but it need not be that way - see comment below).
- If the md5 matches, then there is no need to do anything, else continue as normal.

Also, right now the md5 is being taken of the raw file in all cases except text files, where it is taken of the "processed" text (content that has been changed a bit). It does not seem that taking the md5 of the processed text is of any use at this point (and where it does become useful, maybe store two values - one md5 for the raw file and another one for the content of the file after passing through the mime type handler).

Thanks,
Srijon.
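
The md5-based check proposed in the list above could be sketched roughly as follows. This is only an illustration and not Omega's code: the names file_md5 and needs_reindex are hypothetical, and a plain dict keyed by url stands in for the md5 values stored in the database.

```python
import hashlib


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of the raw bytes of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_reindex(url, path, stored_md5):
    """Decide whether the file behind `url` must be (re)indexed.

    `stored_md5` is a dict mapping url -> md5 hex digest from the previous
    run (a hypothetical stand-in for the value kept in the database).
    """
    current = file_md5(path)
    if stored_md5.get(url) == current:
        return False  # md5 matches: nothing to do
    stored_md5[url] = current  # remember the new digest, continue as normal
    return True
```

Note that, unlike the timestamp test, this has to read every file on every run, which is the extra burden mentioned above.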
Olly Betts
2009-May-20 11:16 UTC
[Xapian-discuss] Ticket #342: Omega: Add option to avoid reindexing unchanged files
On Wed, May 20, 2009 at 11:42:00AM +0100, Srijon Biswas wrote:
> I _think_ that the implementation may be incorrect here... please correct me
> if I am wrong (I have just seen the final patch as linked in the ticket, not
> really tried it out):
>
> dir A:
> - file A1 [content C1] [last modified M1]
> - file A2 [content C2] [last modified M2]
> - file A3 [content C3] [last modified M3]
>
> Index dir A.
>
> Then move A1 -> A2, create a new A1 with new content. So we get:
>
> dir A:
> - file A1 [content C4] [last modified M4]
> - file A2 [content C1] [last modified M1]
> - file A3 [content C3] [last modified M3]
>
> Index dir A.
>
> In the above scenario, as per the fix, am I correct in assuming that A2 will
> not get updated (which it should), but A1 will? Please correct me if I am
> wrong.

Yes, this is true. Similarly if you restore an older file from backup, or directly mess with timestamps after updating a file (e.g. with touch --reference).

But then omindex is aimed at indexing web sites, and webservers will also suffer from similar issues with "If-Modified-Since:" requests if you do these things, so it's prudent to avoid doing these things in web-served document trees anyway.

I bet for most users, the large speed gain outweighs these corner cases, but they ought to be documented.

> Maybe the test for changed content should depend on the md5sum and not on
> the date (even though this does add more burden than just checking the last
> mod date). Something roughly like this:

Yes, it's quite a lot more work, but it would save some work. A fuller solution to ticket #250 would reduce the gain here, but there would probably still be some:

http://trac.xapian.org/ticket/250

> Also, right now the md5 is being taken for the raw file in all cases, and
> "processed" text in only for text files (where the md5 is for content that
> has been changed a bit). It does not seem that taking the md5 of the
> processed text is of any use at this point ( and where it does become
> useful, maybe store two values - one md5 for raw file and another one for
> the content of the file after passing through the mime type handler).

That's a bug - the handling of non-UTF-8 text patch came after the md5 one, and before that this was calculating the md5 sum of the "raw" file. I'll fix that in a moment.

Cheers,
Olly
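
The rename corner case discussed in this thread can be replayed with a small script. This is a sketch under an assumption, not the logic of the actual patch: it supposes the check skips any file whose modification time is not newer than the previous indexing run. The names skip_by_mtime and simulate_rename_scenario, and the integer timestamps, are made up for illustration.

```python
import os
import tempfile


def skip_by_mtime(path, last_index_time):
    """Assumed check: the file counts as unchanged if its mtime is not
    newer than the time of the previous indexing run."""
    return os.stat(path).st_mtime <= last_index_time


def simulate_rename_scenario():
    """Replay 'move A1 -> A2, create a new A1' and report which files
    the assumed mtime check would skip."""
    with tempfile.TemporaryDirectory() as d:
        a1, a2 = os.path.join(d, "A1"), os.path.join(d, "A2")
        # Initial state: A1 has content C1 at time M1, A2 has C2 at time M2.
        for path, content, mtime in ((a1, b"C1", 1000), (a2, b"C2", 2000)):
            with open(path, "wb") as f:
                f.write(content)
            os.utime(path, (mtime, mtime))
        last_index_time = 3000  # first indexing run, after M1 and M2

        os.replace(a1, a2)  # A2 now holds content C1 and keeps mtime M1
        with open(a1, "wb") as f:
            f.write(b"C4")
        os.utime(a1, (4000, 4000))  # new A1, modified after the index run

        return {name: skip_by_mtime(path, last_index_time)
                for name, path in (("A1", a1), ("A2", a2))}
```

Under that assumption, A1 is reindexed while A2 is skipped even though its content changed from C2 to C1, because the rename carries the old timestamp M1 along with the file.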