thr3ads.net - Xapian devel - [Xapian-devel] Omega changes [Dec 2004]


Richard Boulton

2004-Dec-17 14:15 UTC

[Xapian-devel] Omega changes

I propose making a few changes to the way omega (and omindex) operate.
I'm posting these to the list before doing so to check if they'll cause
obvious problems for anyone.

1) Configuration handling for omega.  Omega has a configuration file,
which specifies where databases, templates and logfiles are to be found.
It currently looks for this configuration file in its current working
directory (which will usually be the directory in which the binary is
located).
If the configuration file is not found in this location (or is
unreadable), it looks at /etc/omega.conf, and if this doesn't exist, it
uses default values.

Reading a configuration file from the current working directory seems
bad practice to me, and could be a potential, albeit small, security
risk; care needs to be taken to avoid serving the file to clients.

I propose changing the configuration file search to read an environment
variable "OMEGA_CONFIG_FILE".  If this is set, the configuration will be
read from the file whose path is in the environment variable.  If this
is not set, the configuration will be read from $sysconfdir/omega.conf
(where $sysconfdir defaults to /etc, but can be set by parameters
to ./configure).  If the configuration file specified cannot be read,
default values will be used.
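
As a rough illustration of the lookup order proposed above (not omega's
actual code), the search could be sketched in Python as follows.  The
function names, the treatment of an empty variable as unset, and
returning None to mean "use default values" are all assumptions made for
this sketch.

```python
import os

# Assumption: stands in for the $sysconfdir chosen at ./configure time.
SYSCONFDIR = "/etc"


def config_path(environ=os.environ, sysconfdir=SYSCONFDIR):
    """Return the path the proposed search would read the configuration from."""
    override = environ.get("OMEGA_CONFIG_FILE")
    if override:  # assumption: an empty value is treated as "not set"
        return override
    return sysconfdir.rstrip("/") + "/omega.conf"


def read_config(environ=os.environ, sysconfdir=SYSCONFDIR):
    """Return the file's text, or None to signal that defaults should be used."""
    try:
        with open(config_path(environ, sysconfdir), encoding="utf-8") as f:
            return f.read()
    except OSError:
        # Unreadable or missing file: the caller falls back to default values.
        return None
```

Note that the current working directory is never consulted in this
sketch, which is the point of the proposal.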

2) Updating of omindex databases.  Currently, when run with the
"replace" duplicates option, omindex will index from scratch each
document found, even if it is already in the index.  In most situations,
it would be more desirable to reindex only those files whose
modification time has changed.  I propose to implement this as a new
duplicates option (call it "timestamp"), and make it the default
duplicates option.

The omega templates already support documents containing a field named
"modtime", holding a time_t timestamp, but omindex doesn't produce such
a field.  The only change to the data stored in the database would be to
add this field to the document contents.  With the default templates,
this would cause the last-modified time of the document to be displayed
in search results, but this could easily be suppressed if desired.
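
A minimal sketch of the "timestamp" check described above, in Python for
illustration only.  It assumes the stored "modtime" value is available
as an integer time_t, or None when the file is not yet in the index; the
function names are hypothetical, and how omindex would look up the
stored value is not shown.

```python
import os


def file_mtime(path):
    """Return the file's modification time as a whole-second time_t value."""
    return int(os.stat(path).st_mtime)


def needs_reindex(current_mtime, stored_modtime):
    """Decide whether a file should be (re)indexed.

    current_mtime:  the file's modification time now (see file_mtime()).
    stored_modtime: the "modtime" field from the index, or None if the
                    file has not been indexed before.
    """
    if stored_modtime is None:
        return True  # new file: always index
    # Reindex on any change, not only when the file looks newer, so that
    # a file restored from an older backup is picked up as well.
    return current_mtime != stored_modtime
```

Comparing for inequality rather than "newer than" is a design choice in
this sketch; the proposal itself only says "modification time has
changed".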

Actually, Olly suggested that it might be sensible to remove the
duplicates options entirely, and simply default to the behaviour
specified above.  Does anyone actually use omindex with a --duplicates
option other than "replace"?

3) Add database specific configuration files to omindex, which are used
to specify how a database has been indexed.  These configuration files
could consist simply of the command line options used, or possibly
equivalent information in an easy-to-parse format.  The configuration
file could be used by omega to configure the query parser, and other
search options, appropriately to the database being searched.

In addition to current options, these configuration files could specify
which information to store in the 

4) Finally, I propose changing the way in which omega and omindex map
file locations to urls.  Currently, the URL at which a document is
displayed is stored in each document in the Xapian database.  This has
the obvious drawback that the index needs to be regenerated if a server
is reconfigured (for example, change of hostname, or change of path
within the server).

Instead, omindex would store the local path of the document in the
database, and would store no information about the URLs at which
documents are available externally.  Omega would be provided with a
translation table in each database from local file prefix to external
file prefix, and would use this to generate the external URLs.  I've
used this scheme with other systems, so I know it can be made to work,
but it would require some changes to applications currently using
omindex.
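
To make the translation step concrete, here is a hedged Python sketch of
mapping a stored local path to an external URL.  The table contents are
made-up examples, the function name is hypothetical, and choosing the
longest matching prefix is an assumption; the proposal does not say how
overlapping prefixes would be resolved.

```python
def external_url(local_path, prefix_table):
    """Translate a local path to an external URL, or None if nothing matches.

    prefix_table maps local file prefixes to external URL prefixes.
    Prefixes should end in "/" so that "/sites/a/" cannot match
    "/sites/ab/...".
    """
    matches = [p for p in prefix_table if local_path.startswith(p)]
    if not matches:
        return None
    best = max(matches, key=len)  # assumption: longest prefix wins
    return prefix_table[best] + local_path[len(best):]


# Example table (hypothetical values).
TABLE = {
    "/sites/example.com/": "http://example.com/",
    "/sites/press-area/": "http://example.com/press/",
}
```

With this scheme, a change of hostname only means editing the table,
not regenerating the index.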


Finally, is there a problem with making any of these changes whilst
we're within the 0.8.x version cycle, or is the expectation that the
workings of omega and related tools will be reasonably stable within
this cycle, as the API of libxapian is?

-- 
Richard Boulton <richard@tartarus.org>

James Aylett

2004-Dec-17 15:05 UTC


[Xapian-devel] Omega changes

On Fri, Dec 17, 2004 at 02:15:34PM +0000, Richard Boulton wrote:
> 1) Configuration handling for omega.
+1
> 2) I propose to implement this as a new duplicates option (call it
> "timestamp"), and make it the default duplicates option.
+1
> Actually, Olly suggested that it might be sensible to remove the
> duplicates options entirely, and simply default to the behaviour
> specified above.  Does anyone actually use omindex with a --duplicates
> option other than "replace"?
I doubt it very much. They're only there for some measure of backwards
compatibility in case anyone actually liked the old way of working.

--duplicates=ignore was designed to save time when you only add
documents to the corpus. Shouldn't be needed with
--duplicates=timestamp, and I can't think of a good reason to use
replace instead of timestamp.

--duplicates=duplicate is daft, but it was easy to add :-)

I'd be happy to lose this option. It'd make the quickstart
instructions a lot more obvious, too :)
> 3) Add database specific configuration files to omindex, which are used
> to specify how a database has been indexed.  These configuration files
> could consist simply of the command line options used, or possibly
> equivalent information in an easy-to-parse format.  The configuration
> file could be used by omega to configure the query parser, and other
> search options, appropriately to the database being searched.
+1. At least.

However: how would you cope with this with two databases with
different indexing options? Specifically, is there anything sane we
can do with different stemmers in use?
> In addition to current options, these configuration files could specify
> which information to store in the 
... ? :)
> 4) Finally, I propose changing the way in which omega and omindex map
> file locations to urls.  Currently, the URL at which a document is
> displayed is stored in each document in the Xapian database.  This has
> the obvious drawback that the index needs to be regenerated if a server
> is reconfigured (for example, change of hostname, or change of path
> within the server).
> 
> Instead, omindex would store the local path of the document in the
> database, and would store no information about the URLs at which
> documents are available externally.  Omega would be provided with a
> translation table in each database from local file prefix to external
> file prefix, and would use this to generate the external URLs.  I've
> used this scheme with other systems, so I know it can be made to work,
> but it would require some changes to applications currently using
> omindex.
Hmm. What I think you're saying is that we do the following:

index option: file-path url-path filename{file-suffix}
indexes file: file-path/filename{file-suffix}
mapping:      file-suffix -> url-suffix [may have several of these]
config:       url-prefix

final url:    url-prefix/url-path/filename{url-suffix}
stored in db: url-path/filename{url-suffix}
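
The composition rules just listed can be sketched in Python as below.
This is only an illustration under stated assumptions: the function
names are hypothetical, and a file suffix with no mapping is assumed to
pass through unchanged as the URL suffix.

```python
import posixpath


def stored_path(url_path, filename, file_suffix, suffix_map=None):
    """Build the value stored in the db: url-path/filename{url-suffix}."""
    # Assumption: with no mapping, file-suffix is used as url-suffix.
    url_suffix = (suffix_map or {}).get(file_suffix, file_suffix)
    return posixpath.join(url_path.strip("/"), filename + url_suffix)


def final_url(url_prefix, stored):
    """Build the final url: url-prefix/url-path/filename{url-suffix}."""
    return url_prefix.rstrip("/") + "/" + stored
```

For example, final_url("http://example.com", stored_path("company",
"about", ".html")) combines the global url-prefix with what the index
holds, so only the prefix needs changing if the hostname changes.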

So if you have (Apache terms) a DocumentRoot for http://example.com/
of /sites/example.com we might have (assuming that, with no mappings
given, file-suffix is simply used as url-suffix in every case):

global config:
--------------
url-prefix: http://example.com

index config:
-------------
file-path: /sites/example.com
url-path:

which will index the whole thing, no problems.

index config:
-------------
file-path: /sites/example.com/company
url-path: company

index config:
-------------
file-path: /sites/press-area/
url-path: press

to index two subparts. You can then do the root with --no-recurse.

That's all fine. With some finesse, we can avoid having to specify
lots of mappings when you don't have suffixes in the URLs (which you
shouldn't).

What we're talking about is shifting the [BASEDIRECTORY] DIRECTORY
split into a [URLPATH] DIRECTORY split. I can't think of any
problems with that, and indeed it probably makes a lot more sense to
people that aren't me (more accurately, me three years ago :-) than
the current way of doing it. Better, URLPATH should be mandatory, and
you can just put / in if you're doing the whole site.
> Finally, is there a problem with making any of these changes whilst
> we're within the 0.8.x version cycle, or is the expectation that the
> workings of omega and related tools will be reasonably stable within
> this cycle, as the API of libxapian is.
I'd be inclined to hold off the db-specific config until 0.9.x,
personally. The other changes - configuration location, which has
always been broken, and duplicates, which will make life better
without (hopefully) any drawbacks - I'd say go ahead now.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james@tartarus.org                               uncertaintydivision.org

Olly Betts

2004-Dec-17 16:12 UTC


[Xapian-devel] Omega changes

On Fri, Dec 17, 2004 at 02:15:34PM +0000, Richard Boulton wrote:
> I propose changing the configuration file search to read an environment
> variable "OMEGA_CONFIG_FILE".  If this is set, the configuration will be
> read from the file whose path is in the environment variable.  If this
> is not set, the configuration will be read from $sysconfdir/omega.conf
> (where $sysconfdir defaults to /etc, but can be set by parameters
> to ./configure).  If the configuration file specified cannot be read,
> default values will be used.
I'm not totally sold on this.

Setting an environment variable (at least for apache) requires admin
access to the webserver configuration, or (assuming it's configured to
allow you to) the creation of a .htaccess file, or some sort of wrapper
around the CGI (e.g. a shell script which exports the variable and execs
omega).  If .htaccess exists, the server has to read it for anything
served from that directory, which is potentially quite an overhead.

If the only other option is to build from source and set sysconfdir in
configure, then to use omega on a server where it's *already installed*
I'm forced to use a wrapper, .htaccess (if I'm able to), or
to compile my own separate version, which then means I need to worry
about any security patches.  It also wastes my disk quota (or shared
disk space).  Heck, I may not even have access to a compiler on a box
intended for hosting!

I'm not convinced that looking for omega.conf where omega was run from
is worse than this situation.
> 4) Finally, I propose changing the way in which omega and omindex map
> file locations to urls.  Currently, the URL at which a document is
> displayed is stored in each document in the Xapian database.  This has
> the obvious drawback that the index needs to be regenerated if a server
> is reconfigured (for example, change of hostname, or change of path
> within the server).
Although omindex doesn't build the hostname in unless you tell it to
by specifying it on the command line.

On the flip-side, with the current scheme, I can move files around on
disk (changing the pathnames) and the index will continue to work
provided I reconfigure the http server to serve them with the same
paths, with alias or using mod_rewrite.  If the pathnames are built
into the index, I have to rebuild in this situation.

Also the work of translating paths is done at index time.  Usually
it's minor, but if you have a lot of mappings it may not be.  And
the pathnames will almost inevitably be longer than the URLs, which
means a bigger index.

I'm also not quite sure how this would work with content from
scriptindex which came from a database and provides URLs through
a CGI gateway or similar - there are no pathnames to specify.
Similarly for crawled content.  Similarly for indexing a newsfeed
to produce nntp: or news: URLs.  Would omega's default template
look for both a url field and pathnames?

In fact, as I think about this more - pathnames are just an artifact of
how omindex gets the data to index (i.e. reading it from separate files
on local disk), so it feels kind of wrong that omega would need to care
about them...
> Instead, omindex would store the local path of the document in the
> database, and would store no information about the URLs at which
> documents are available externally.  Omega would be provided with a
> translation table in each database from local file prefix to external
> file prefix, and would use this to generate the external URLs.  I've
> used this scheme with other systems, so I know it can be made to work,
> but it would require some changes to applications currently using
> omindex.
Both schemes can be made to work, but it's not really clear to me that
either scheme is inherently better than the other.  There are minor
benefits either way.  And the current scheme has the enormous advantage
that it's already implemented and debugged!
> Finally, is there a problem with making any of these changes whilst
> we're within the 0.8.x version cycle, or is the expectation that the
> workings of omega and related tools will be reasonably stable within
> this cycle, as the API of libxapian is.
I think we should try to constrain incompatible changes to x.x.0
versions across the board.  Ditto for major reworkings which have
an increased risk of introducing bugs.  But our resources are
limited so we have to be reasonably pragmatic, and at least try to fix
breakage quickly.

I'd suggest seeing where we are with releases when this stuff is
ready to go in.

Cheers,
    Olly
