Bill Hutten
2009-Apr-01 00:36 UTC
[Xapian-discuss] Newbie question: How to extract 'date modified' from path when indexing?
Hi all:

I've successfully set up Xapian/Omega as the search engine on a client website. So far, so good. :)

However, the client would like to be able to search by date. This is not a problem - the START and END cgi parameters work fine, except that omindex is using (of course) the datestamp of the HTML files as the "Date Modified". This datestamp is not accurate, as the files have been moved between servers, backed up and restored, etc. over time.

The files are stored in a consistent structure, for instance file "foo.html" might be in "archives/2006/07/foo.html". In this example, I would like to be able to extract the 2006/07 value from the path during indexing and use that as the date that Xapian/Omega uses to search on.

Can anyone give me a few pointers as to how I would accomplish this? Right now my indexing is simply done with omindex - I assume this will not be sufficient.

Thanks for any help you can offer.

- bill

--
Bill Hutten
bill at hutten.org
Deron Meranda
2009-Apr-01 03:55 UTC
[Xapian-discuss] Newbie question: How to extract 'date modified' from path when indexing?
On Tue, Mar 31, 2009 at 8:36 PM, Bill Hutten <bill at hutten.org> wrote:
> I've successfully set up Xapian/Omega as the search engine on a client
> website. ...
>
> The files are stored in a consistent structure, for instance file
> "foo.html" might be in "archives/2006/07/foo.html" In this example, I
> would like to be able to extract the 2006/07 value from the path during
> indexing and use that as the date that Xapian/Omega uses to search on.

Do you have access to the webserver files at all? If so, the best solution is simply to change the timestamp of the underlying files. That would benefit not only your Xapian indexing, but also all the other HTTP goodness, such as working with whatever other types of spiders or indexers may be crawling the site, HTTP proxies and caches, etc.

If it's Unix/Linux, changing the file timestamps would be quite easy. You want to look at the "touch" command. Or I could provide you a little script to do that.

As a second choice, if say this is an Apache webserver and you can add some configuration (either the main config file or the per-directory .htaccess files), then you can force Apache to lie about the file's date. This is easiest, though, if you only have a few directories (which, if it's one directory per month, is doable). Again, since the webserver would be sending out the correct date, it also benefits other spiders, indexers, HTTP caches, etc.

As a last resort, you're going to have to modify the indexer itself to overrule what it learns from the HTTP date, and instead extract a date out of the URL pattern.

--
Deron Meranda
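
For illustration, the first suggestion (resetting the file timestamps from the directory layout) could be sketched in Python roughly as follows. This is only a sketch under assumptions, not a script from the thread: it assumes every file of interest lives under a ".../YYYY/MM/" directory as in the "archives/2006/07/foo.html" example, it uses the first day of that month at 00:00 UTC because the path carries no day, and the names `date_from_path` and `retouch_tree` are hypothetical. Only the standard-library calls `os.walk` and `os.utime` are relied upon.

```python
import os
import re
from datetime import datetime, timezone

# Matches ".../<4-digit year>/<2-digit month>/<file name>",
# e.g. "archives/2006/07/foo.html". The layout is an assumption
# taken from the example path in the original question.
PATH_DATE = re.compile(r"(?:^|/)(\d{4})/(\d{2})/[^/]+$")


def date_from_path(path):
    """Return a UTC datetime for the first day of the month named in
    the path, or None if the path does not fit the assumed layout."""
    match = PATH_DATE.search(path.replace(os.sep, "/"))
    if not match:
        return None
    year, month = int(match.group(1)), int(match.group(2))
    if not (1 <= year <= 9999 and 1 <= month <= 12):
        return None
    return datetime(year, month, 1, tzinfo=timezone.utc)


def retouch_tree(root):
    """Set the access and modification times of every matching file
    under root to the date in its path. Returns the number changed."""
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            when = date_from_path(full)
            if when is None:
                continue  # leave files outside the YYYY/MM layout alone
            stamp = when.timestamp()
            os.utime(full, (stamp, stamp))
            changed += 1
    return changed
```

Calling `retouch_tree("archives")` on a copy of the site first would be a sensible precaution. Since omindex takes the date from the file's modification time, re-running it afterwards should, in principle, pick up the corrected dates, though that is worth verifying against a small test database.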