On Wed, Mar 28, 2007 at 05:53:05PM +0100, Richard Boulton wrote:
> [Actually, I'm not sure that "text splitter" is the right name for what
> the code in indextext.cc does - it doesn't just split text, but also
> does stemming, creates "R" terms, and possibly a few other things I've
> missed. I'd call it a "TextProcessor" class, but someone else might
> have a better name.]
Yeah, TextSplitter is a rubbish name (that's why I put "" around it!)
> A cleaner separation and code organisation, to my mind, would be to make
> a new intermediate library which sits on top of Xapian, and provides
> language specific processing features. The stemming algorithm stuff
> would also be moved into this library. So, we would end up with:
>
> Xapian-Core: lowest level code - doesn't care about what the documents
> and terms it handles are.
>
> Xapian-Text: text handling code - contains routines to generate terms
> and documents from pieces of text, both for searching and for indexing.
>
> We would then move omega to use Xapian-Text instead of having its own
> text processing code, and then all applications built on Xapian could
> use this code if they want it, and just link directly to Xapian-Core if
> they only need the core library.
The conceptual split is useful to convey to users, but I'm dubious about
changing to building two separate shared libraries.
If you want an analogy, the ISO C stdio functions are conceptually a
layer above the POSIX open/read/write/lseek/tell/close functions, but
the implementations of both layers live in the same shared object on
most platforms.
There's an overhead for each shared library an application needs to open,
so, simplifying somewhat, more libraries => slower. KDE have had problems
with this I believe - there are tricks like prelinking which can help,
but I think it's better to tackle the root of the problem and consider
carefully whether it's worthwhile splitting up libraries.
While some Xapian-using applications would only need "Xapian-Core",
probably the majority will want both, especially as the functionality in
the secondary library grows. Remember that the VM system on a modern OS
will avoid paging in most parts of the library which are unused.
Historically, we used to have a separate library for the queryparser,
but it just didn't make sense overall. The housekeeping required for
tracking different library versions doesn't give enough benefits to
justify the effort.
The main benefit I see from such a split would be a reduced number of
relocations if you only need one library, but that's better addressed by
reducing the number of relocations required by the library, which
benefits everyone. For example, the snowball generated code could
easily be improved in this regard from what I've read.
But now isn't a good time to be making such major changes anyway. We
need to focus on releasing 1.0, not destabilising SVN HEAD!
> Anyway - I probably have a little time over the next couple of days to
> dedicate to this, so comments would be welcomed. If nothing else, I can
> implement a patch for the "Rework Omega's indextext.cc as a xapian-core
> "TextSplitter" class." task (so if anyone else is already working on
> this, shout now!).
I've already got much of the design sketched out for this and related
QueryParser changes. I've not started coding it yet, but I am working
on the issue, so it would probably be better to pick something else on
the TODO list to tackle.
For example, the issue of character encoding in the Python (and other)
bindings needs resolving. Your Python knowledge is better than mine,
but I have a feeling that Python uses wide characters for unicode
internally, so we probably want to perform conversion to/from utf-8 when
calling Xapian - that will make it hard to put binary data in terms,
values, document data, etc., but getting Unicode strings handled is more
important for most users I believe. I'm hoping the conversions can be
achieved with suitable typemaps.
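To illustrate the trade-off, here is a rough sketch in Python 3 syntax of
what converting to/from utf-8 at the binding boundary might look like. The
ByteStore and Utf8Wrapper classes are hypothetical stand-ins invented for
this example; they are not the real Xapian bindings or an actual typemap.

```python
class ByteStore:
    """Hypothetical stand-in for a byte-oriented core API (not real Xapian)."""

    def __init__(self):
        self._terms = []

    def add_term(self, term):
        # The core layer only understands raw bytes.
        if not isinstance(term, bytes):
            raise TypeError("core layer expects bytes")
        self._terms.append(term)

    def terms(self):
        return list(self._terms)


class Utf8Wrapper:
    """Roughly what a typemap might do: encode on input, decode on output."""

    def __init__(self, store):
        self._store = store

    def add_term(self, term):
        # Unicode string in, utf-8 bytes passed down to the core.
        self._store.add_term(term.encode("utf-8"))

    def terms(self):
        # utf-8 bytes from the core, Unicode strings handed back.
        return [t.decode("utf-8") for t in self._store.terms()]


def demo():
    store = ByteStore()
    wrapped = Utf8Wrapper(store)

    wrapped.add_term("caf\u00e9")
    raw = store.terms()[0]          # what the core actually stored
    roundtrip = wrapped.terms()[0]  # what the caller gets back

    # Arbitrary binary data is generally not valid utf-8, so a wrapper
    # that always decodes cannot hand it back to the caller.
    store.add_term(b"\xff\xfe\x00")
    try:
        wrapped.terms()
        binary_ok = True
    except UnicodeDecodeError:
        binary_ok = False

    return raw, roundtrip, binary_ok
```

Unicode text round-trips cleanly, but a term holding arbitrary bytes makes
the decoding step fail, which is the binary data difficulty mentioned above.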
Cheers,
Olly