I propose making a few changes to the way omega (and omindex) operate.
I'm posting these to the list before doing so to check whether they'll
cause obvious problems for anyone.

1) Configuration handling for omega.

Omega has a configuration file, which specifies where databases,
templates and logfiles are to be found. It currently looks for this
configuration file in its current working directory (which will usually
be the directory in which the binary is located). If the configuration
file is not found there (or is unreadable), it looks at /etc/omega.conf,
and if this doesn't exist, it uses default values.

Reading a configuration file from the current working directory seems
bad practice to me, and could be a potential, albeit small, security
risk; care needs to be taken to avoid serving the file to clients.

I propose changing the configuration file search to read an environment
variable "OMEGA_CONFIG_FILE". If this is set, the configuration will be
read from the file whose path is in the environment variable. If this
is not set, the configuration will be read from $sysconfdir/omega.conf
(where $sysconfdir defaults to /etc, but can be set by parameters
to ./configure). If the configuration file specified cannot be read,
default values will be used.

2) Updating of omindex databases.

Currently, when run with the "replace" duplicates option, omindex will
index from scratch each document found, even if it is already in the
index. In most situations, it would be more desirable to reindex only
those files whose modification time has changed.

I propose to implement this as a new duplicates option (call it
"timestamp"), and make it the default duplicates option. The omega
templates already support documents containing a field named "modtime",
holding a time_t timestamp, but omindex doesn't produce such a field.
The only change to the data stored in the database would be to add this
field to the document contents.
With the default templates, this would cause the last-modified time of
the document to be displayed in search results, but this could easily
be suppressed if desired.

Actually, Olly suggested that it might be sensible to remove the
duplicates options entirely, and simply default to the behaviour
specified above. Does anyone actually use omindex with a --duplicates
option other than "replace"?

3) Add database specific configuration files to omindex, which are used
to specify how a database has been indexed. These configuration files
could consist simply of the command line options used, or possibly
equivalent information in an easy-to-parse format. The configuration
file could be used by omega to configure the query parser, and other
search options, appropriately to the database being searched.

In addition to current options, these configuration files could specify
which information to store in the

4) Finally, I propose changing the way in which omega and omindex map
file locations to urls. Currently, the URL at which a document is
displayed is stored in each document in the Xapian database. This has
the obvious drawback that the index needs to be regenerated if a server
is reconfigured (for example, change of hostname, or change of path
within the server).

Instead, omindex would store the local path of the document in the
database, and would store no information about the URLs at which
documents are available externally. Omega would be provided with a
translation table in each database from local file prefix to external
file prefix, and would use this to generate the external URLs. I've
used this scheme with other systems, so I know it can be made to work,
but it would require some changes to applications currently using
omindex.
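
As a rough illustration of the translation table described in item 4,
the sketch below maps a stored local path to an external URL by longest
matching prefix. It is written in Python for brevity, not taken from
omega. The table contents, the name path_to_url and the longest-prefix
rule are all assumptions made for this example.

```python
from typing import Dict, Optional

# Hypothetical translation table: local file prefix -> external URL prefix.
# In the proposal, a table like this would live alongside each database.
PREFIX_MAP: Dict[str, str] = {
    "/var/www/site/": "http://www.example.org/",
    "/var/www/site/docs/": "http://docs.example.org/",
}


def path_to_url(path: str, prefix_map: Dict[str, str]) -> Optional[str]:
    """Translate a stored local path into an external URL.

    The longest matching local prefix wins, so a more specific entry
    overrides a more general one. Returns None if no prefix matches.
    """
    for local_prefix in sorted(prefix_map, key=len, reverse=True):
        if path.startswith(local_prefix):
            return prefix_map[local_prefix] + path[len(local_prefix):]
    return None


if __name__ == "__main__":
    print(path_to_url("/var/www/site/index.html", PREFIX_MAP))
    print(path_to_url("/var/www/site/docs/intro.html", PREFIX_MAP))
```

With a table like this, a change of hostname only needs the table
edited; the stored paths, and hence the index, stay as they are. The
sketch does not URL-escape the path, which a real implementation would
have to consider.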

Finally, is there a problem with making any of these changes whilst
we're within the 0.8.x version cycle, or is the expectation that the
workings of omega and related tools will be reasonably stable within
this cycle, as the API of libxapian is?

-- 
Richard Boulton <richard@tartarus.org>
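
The lookup order proposed in item 1 could be sketched roughly as
follows. This is a Python illustration, not omega's own code; the
function names, the placeholder default values and the simple
"key value" file format are assumptions made for the example.

```python
import os

# Assumed build-time value; the proposal says it defaults to /etc and
# can be changed via ./configure.
SYSCONFDIR = "/etc"

# Placeholder defaults, hypothetical and used only for this sketch.
DEFAULTS = {
    "database_dir": "/var/lib/omega/data",
    "template_dir": "/var/lib/omega/templates",
    "log_dir": "/var/log/omega",
}


def find_config_path(environ=os.environ, sysconfdir=SYSCONFDIR):
    """Return the one path the proposal would try to read."""
    path = environ.get("OMEGA_CONFIG_FILE")
    if path is not None:
        return path
    return os.path.join(sysconfdir, "omega.conf")


def load_config(environ=os.environ, sysconfdir=SYSCONFDIR):
    """Read the chosen file; fall back to defaults if it cannot be read."""
    config = dict(DEFAULTS)
    try:
        with open(find_config_path(environ, sysconfdir)) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue  # skip blank lines and comments
                parts = line.split(None, 1)
                if len(parts) == 2:
                    config[parts[0]] = parts[1]
    except OSError:
        pass  # missing or unreadable file: keep the defaults
    return config
```

Note that the current working directory is never consulted, and that an
unreadable file named by OMEGA_CONFIG_FILE leads straight to the
defaults rather than to a second lookup in $sysconfdir.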

On Fri, Dec 17, 2004 at 02:15:34PM +0000, Richard Boulton wrote:
> 1) Configuration handling for omega.

+1

> 2) I propose to implement this as a new duplicates option (call it
> "timestamp"), and make it the default duplicates option.

+1

> Actually, Olly suggested that it might be sensible to remove the
> duplicates options entirely, and simply default to the behaviour
> specified above. Does anyone actually use omindex with a --duplicates
> option other than "replace"?

I doubt it very much. They're only there for some measure of backwards
compatibility in case anyone actually liked the old way of working.

--duplicates=ignore was designed to save time when you only add
documents to the corpus. Shouldn't be needed with
--duplicates=timestamp, and I can't think of a good reason to use
replace instead of timestamp.

--duplicates=duplicate is daft, but it was easy to add :-)

I'd be happy to lose this option. It'd make the quickstart instructions
a lot more obvious, too :)

> 3) Add database specific configuration files to omindex, which are used
> to specify how a database has been indexed. These configuration files
> could consist simply of the command line options used, or possibly
> equivalent information in an easy-to-parse format. The configuration
> file could be used by omega to configure the query parser, and other
> search options, appropriately to the database being searched.

+1. At least. However: how would you cope with this with two databases
with different indexing options? Specifically, is there anything sane we
can do with different stemmers in use?

> In addition to current options, these configuration files could specify
> which information to store in the

... ? :)

> 4) Finally, I propose changing the way in which omega and omindex map
> file locations to urls. Currently, the URL at which a document is
> displayed is stored in each document in the Xapian database.
> This has the obvious drawback that the index needs to be regenerated
> if a server is reconfigured (for example, change of hostname, or
> change of path within the server).
>
> Instead, omindex would store the local path of the document in the
> database, and would store no information about the URLs at which
> documents are available externally. Omega would be provided with a
> translation table in each database from local file prefix to external
> file prefix, and would use this to generate the external URLs. I've
> used this scheme with other systems, so I know it can be made to work,
> but it would require some changes to applications currently using
> omindex.

Hmm. What I think you're saying is that we do the following:

  index option: file-path url-path filename{file-suffix}
  indexes file: file-path/filename{file-suffix}
  mapping:      file-suffix -> url-suffix [may have several of these]
  config:       url-prefix
  final url:    url-prefix/url-path/filename{url-suffix}
  stored in db: url-path/filename{url-suffix}

So if you have (Apache terms) a DocumentRoot for http://example.com/ of
/sites/example.com we might have (assuming that, with no mappings given,
file-suffix is simply used as url-suffix in every case):

  global config:
  --------------
  url-prefix: http://example.com

  index config:
  -------------
  file-path: /sites/example.com
  url-path:

which will index the whole thing, no problems.

  index config:
  -------------
  file-path: /sites/example.com/company
  url-path: company

  index config:
  -------------
  file-path: /sites/press-area/
  url-path: press

to index two subparts. You can then do the root with --no-recurse.

That's all fine. With some finesse, we can avoid having to specify lots
of mappings when you don't have suffixes in the URLs (which you
shouldn't).

What we're talking about is shifting the [BASEDIRECTORY] DIRECTORY split
into a [URLPATH] DIRECTORY split.
I can't think of any problems with that, and indeed it probably makes a
lot more sense to people that aren't me (more accurately, me three years
ago :-) than the current way of doing it. Better, URLPATH should be
mandatory, and you can just put / in if you're doing the whole site.

> Finally, is there a problem with making any of these changes whilst
> we're within the 0.8.x version cycle, or is the expectation that the
> workings of omega and related tools will be reasonably stable within
> this cycle, as the API of libxapian is?

I'd be inclined to hold off the db-specific config until 0.9.x,
personally. The other changes - configuration location, which has always
been broken, and duplicates, which will make life better without
(hopefully) any drawbacks - I'd say go ahead now.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james@tartarus.org                               uncertaintydivision.org
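
One possible reading of the scheme James outlines above (store
url-path/filename{url-suffix} in the database, prepend url-prefix at
search time) is sketched below in Python. The function names and the
dictionary used for suffix mappings are assumptions made for this
example, and files outside file-path are not handled.

```python
import posixpath


def stored_url_path(full_path, file_path, url_path, suffix_map=None):
    """Compute what would be stored in the db for one indexed file.

    full_path  -- the file on disk, i.e. file-path/filename{file-suffix}
    file_path  -- the directory being indexed
    url_path   -- the URL path that directory is served under
    suffix_map -- optional file-suffix -> url-suffix mappings; with no
                  mapping for a suffix, it is kept unchanged
    """
    rel = posixpath.relpath(full_path, file_path)
    stem, file_suffix = posixpath.splitext(rel)
    url_suffix = (suffix_map or {}).get(file_suffix, file_suffix)
    # Strip slashes so that both "" and "/" mean "the whole site".
    return posixpath.join(url_path.strip("/"), stem + url_suffix)


def final_url(url_prefix, stored):
    """Combine the global url-prefix with the value stored in the db."""
    return url_prefix.rstrip("/") + "/" + stored


if __name__ == "__main__":
    stored = stored_url_path("/sites/press-area/2004/news.html",
                             "/sites/press-area/", "press")
    print(final_url("http://example.com", stored))
```

Because only url-path and the filename are stored, changing the
hostname means changing url-prefix alone, with no reindexing.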

On Fri, Dec 17, 2004 at 02:15:34PM +0000, Richard Boulton wrote:
> I propose changing the configuration file search to read an environment
> variable "OMEGA_CONFIG_FILE". If this is set, the configuration will be
> read from the file whose path is in the environment variable. If this
> is not set, the configuration will be read from $sysconfdir/omega.conf
> (where $sysconfdir defaults to /etc, but can be set by parameters
> to ./configure). If the configuration file specified cannot be read,
> default values will be used.

I'm not totally sold on this. Setting an environment variable (at
least for Apache) requires admin access to the webserver configuration,
or (assuming it's configured to allow you to) the creation of a
.htaccess file, or some sort of wrapper around the CGI (e.g. a shell
script which exports the variable and execs omega).

If .htaccess exists, the server has to read it for anything served from
that directory, which is potentially quite an overhead.

If the only other option is to build from source and set sysconfdir in
configure, then if I want to use omega on a server where it's *already
installed*, I'm forced to use a wrapper, .htaccess (if I'm able to), or
to compile my own separate version, which then means I need to worry
about any security patches. It also wastes my disk quota (or shared
disk space). Heck, I may not even have access to a compiler on a box
intended for hosting!

I'm not convinced that looking for omega.conf where omega was run from
is worse than this situation.

> 4) Finally, I propose changing the way in which omega and omindex map
> file locations to urls. Currently, the URL at which a document is
> displayed is stored in each document in the Xapian database. This has
> the obvious drawback that the index needs to be regenerated if a server
> is reconfigured (for example, change of hostname, or change of path
> within the server).

Although omindex doesn't build the hostname in unless you tell it to by
specifying it on the command line.
On the flip-side, with the current scheme, I can move files around on
disk (changing the pathnames) and the index will continue to work
provided I reconfigure the http server to serve them with the same
paths, with alias or using mod_rewrite. If the pathnames are built into
the index, I have to rebuild in this situation.

Also, the work of translating paths is currently done at index time
rather than at search time. Usually it's minor, but if you have a lot
of mappings it may not be. And the pathnames will almost inevitably be
longer than the URLs, which means a bigger index.

I'm also not quite sure how this would work with content from
scriptindex which came from a database and provides URLs through a CGI
gateway or similar - there are no pathnames to specify. Similarly for
crawled content. Similarly for indexing a newsfeed to produce nntp: or
news: URLs. Would omega's default template look for both a url field
and pathnames?

In fact, as I think about this more - pathnames are just an artifact of
how omindex gets the data to index (i.e. reading it from separate files
on local disk), so it feels kind of wrong that omega would need to care
about them...

> Instead, omindex would store the local path of the document in the
> database, and would store no information about the URLs at which
> documents are available externally. Omega would be provided with a
> translation table in each database from local file prefix to external
> file prefix, and would use this to generate the external URLs. I've
> used this scheme with other systems, so I know it can be made to work,
> but it would require some changes to applications currently using
> omindex.

Both schemes can be made to work, but it's not really clear to me that
either scheme is inherently better than the other. There are minor
benefits either way.
And the current scheme has the enormous advantage that it's already
implemented and debugged!

> Finally, is there a problem with making any of these changes whilst
> we're within the 0.8.x version cycle, or is the expectation that the
> workings of omega and related tools will be reasonably stable within
> this cycle, as the API of libxapian is?

I think we should try to constrain incompatible changes to x.x.0
versions across the board. Ditto for major reworkings which have an
increased risk of introducing bugs. But our resources are limited so we
have to be reasonably pragmatic, and at least try to fix breakage
quickly. I'd suggest seeing where we are with releases when this stuff
is ready to go in.

Cheers,
    Olly