Hi guys,

I just wanted to take a moment to give some positive feedback about my recent experiences with Xapian. I've been doing a fair amount of research into search engines lately, as we have some fairly specific requirements for what we're attempting to do with them. Long story short, after a few weeks of playing around with just about everything under the sun (or at least everything off the shelf: Sphinx, Lucene, Solr, MySQL/Postgres fulltext, etc.), we settled on Xapian because of its specific design characteristics, and because it's really, really easy to use and alter.

The main reason we struggled to find something suitable was our large data requirements: we're looking at indexing about 1TB of raw data (i.e. excluding the size of any indexes or other metadata) for about 30,000 individual users. We're "top heavy" in terms of database design: a small number of users, but a large amount of data. This gives rise to a few different issues.

Before we even get to the search aspects, one thing that's important to us is data separation. We're not running a blog or forum where you can mix everyone's data together and accept that there might be some errors from time to time; it's highly critical that nobody ever sees anyone else's data. Lucene can deal with this in a general sense, since it fits the same niche Xapian does in terms of how it integrates with everything else: it's essentially just a library you can use to create search indexes. However, 'off the shelf' engines like Solr and Sphinx fundamentally fail to handle the situation where you want physical (as in, filesystem-level) separation of data. Ever tried creating 30,000 individual indexes in Sphinx or Solr? I can tell you first hand that they don't even come close to working.
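To make the separation concrete, here's a rough sketch of the kind of per-user layout we have in mind. The helper name and sharding scheme are my own invention for illustration, not anything Xapian prescribes:

```python
import os

# Hypothetical layout: one physically separate index directory per user,
# sharded so no single directory holds all 30,000 entries.  A search for
# one user only ever opens that user's own directory, so another user's
# data is never even mapped into the process.
def user_index_path(base_dir, user_id):
    shard = "%02d" % (user_id // 1000)  # e.g. user 12345 -> shard "12"
    return os.path.join(base_dir, shard, str(user_id))

print(user_index_path("/srv/indexes", 12345))  # -> /srv/indexes/12/12345
```

Each of those paths can then hold a completely independent database, which is exactly the model Xapian makes easy and the monolithic-index engines fight against.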
(Please note, I'm fully aware of the argument that this could be considered "designing it wrong"; however, I've been designing these sorts of systems for a long, long time, and I like to think I know what I'm getting myself into.)

Lucene can handle this sort of thing in theory, but given that we're a PHP/C shop, having to build and support Java apps would just be a nightmare for us, not to mention that without prototyping such a system there's no guarantee it would even work. Hypothetically, even if you could get such systems to work in Lucene/Solr/Sphinx, the other significant design flaw, as far as I'm concerned, is the fact that they're designed to run "in memory". That just flat out does not work for us at all. The /raw/ data set we're dealing with is about 1TB; after you've cooked it, by indexing and whatever other processes take place, it'll end up being a multiple of that. Were we to try to shove everything in memory for performance reasons, we'd have to have stupidly massive amounts of RAM dedicated to searchd. Xapian, on the other hand, does what I consider to be the "right" thing, and actually uses the OS to cache its file accesses. This approach is totally superior as far as we're concerned. It allows us to throw as much memory at a box as we require for performance reasons, without having to get into the insanity of managing an individual service that needs to consume 99% of the available memory.

Another quick note on the database-based fulltext indexes: MyISAM fulltext is just fundamentally unable to handle what we want to do from a performance standpoint, end of story. I think we calculated it'd take us something like three months to build the indexes on a single development server. I'm aware Postgres is a different story, but at the end of the day it's really not suitable either, for the same reasons: they're designed as databases, not as search engines.

In summary, Xapian ticks all of the boxes for us.
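A minimal sketch of what "let the OS do the caching" means in practice (in Python for brevity, with an entirely made-up function name, file, and offsets): the index file is mapped rather than read wholly into process memory, so repeated accesses are served from the shared kernel page cache while the search process itself stays small.

```python
import mmap

# Sketch of the page-cache approach: map the index file read-only and
# slice out the bytes we need.  Hot pages live in the kernel's cache,
# shared across processes, instead of in a dedicated in-process cache
# that has to be sized and managed separately.
def read_span(path, offset, length):
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[offset:offset + length]
```

Throwing more RAM at the box then speeds up every reader automatically, with no per-service cache tuning.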
It can integrate with just about any modern language, it's easy to use, it "just works" and is generally bug free, and it's made a great foundation for us to build our own search services on. There are a hundred other design aspects I haven't touched on here (the general feature set, stemming, and out-of-the-box search accuracy all come to mind), but for the most part we haven't been let down yet.

Leaving one negative bit for last, and it's not a huge one by any means: as someone who's been building large-scale web apps since the dawn of time, I find the SWIG PHP classes fairly awful. Don't get me wrong, it's not a huge issue, and I fully understand why they are the way they are (it's a C++ library and not PHP-specific, so SWIG is a good fit). I'd like to try to do something about it in the future, so if I come up with anything worthwhile you'll be the first to know.

We haven't put the system into production yet, but at this stage I'm really looking forward to finishing off development and seeing what happens.

Regards,
Peter
On 03/08/11 01:00, Peter Van Dijk wrote:

> Hypothetically, even if you could get such systems to work in
> Lucene/Solr/Sphinx, the other significant design flaw as far as I'm
> concerned is the fact that they're designed to run "in memory".

I'm sorry, but I seriously doubt that Lucene was "designed to run in memory", presumably meaning that you have to load the index into memory to get it to work: the characteristics of the data format are specifically designed to work efficiently with disk I/O.

[...]

> searchd. Xapian, on the other hand, does what I consider to be the "right"
> thing, and actually uses the OS to cache its file accesses.

Filesystem caching works effectively enough with Lucene. In fact, when I tested the "load it into RAM" approach using a "RAM directory" or whatever it may be called, it offered no real benefit over letting the OS do the caching. This was with a large number of searches spread across an index.

> This approach is totally superior as far as we're concerned. It allows us
> to throw as much memory at a box as we require for performance reasons,
> without having to get into the insanity of managing an individual service
> that needs to consume 99% of the available memory.

You can argue that Java itself imposes ridiculous memory management limitations - I was using PyLucene in the era when they supported a GCJ-compiled library - but that's a separate issue.

I'm not using either Lucene or Xapian actively at the moment, and I can't really call myself a Lucene enthusiast either - I switched from Lucene to Xapian for various reasons, some of which I probably share with you - but no-one benefits from inaccurate information about supposed "competitors" when having accurate information about them could actually inform Xapian development.

> Another quick note on the database-based fulltext indexes - MyISAM fulltext
> is just fundamentally unable to handle what we want to do from a
> performance standpoint, end of story. I think we calculated it'd take us
> something like three months to build the indexes on a single development
> server. I'm aware Postgres is a different story, but at the end of the day
> it's really not suitable either, for the same reasons. They're designed as
> databases, not as search engines.

There's one thing that database systems are very good at, if configured appropriately, and that's determining the most optimal querying approach. I perform huge numbers of searches on indexed text in batches, and in such situations a database system would probably employ more efficient techniques transparently, mostly because they provide such facilities generally. Indeed, the general data management functions offered by systems like PostgreSQL have a lot more to bring to the table than people would have you believe.

The only reason why I'm not playing with PostgreSQL's full-text support is that they omitted support for general regular-expression-based tokenisation in favour of a handful of hand-coded tokenisers, and I don't yet have the inclination to write one which provides such an obviously useful feature.

Paul
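To sketch the sort of general regular-expression-based tokenisation I mean - the pattern and function below are purely illustrative, not anything PostgreSQL actually ships:

```python
import re

# Illustrative regex-driven tokeniser: the caller supplies the pattern
# that defines a token, instead of picking from a handful of hand-coded
# tokenisers.  Default pattern and lowercasing are my own assumptions.
def tokenise(text, pattern=r"[A-Za-z0-9_]+"):
    return [t.lower() for t in re.findall(pattern, text)]

print(tokenise("Xapian's full-text search, 2011"))
# -> ['xapian', 's', 'full', 'text', 'search', '2011']
```

The point being that a single configurable pattern covers a lot of tokenisation needs that hand-coded parsers have to anticipate one by one.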
Peter Van Dijk
2011-Aug-09 00:56 UTC
[Xapian-discuss] Fwd: Positive experiences with Xapian
On 8 August 2011 19:38, Paul Boddie <paul.boddie at biotek.uio.no> wrote:

> On 03/08/11 01:00, Peter Van Dijk wrote:
>
>> Hypothetically, even if you could get such systems to work in
>> Lucene/Solr/Sphinx, the other significant design flaw as far as I'm
>> concerned is the fact that they're designed to run "in memory".
>
> I'm sorry, but I seriously doubt that Lucene was "designed to run in
> memory", presumably meaning that you have to load the index into memory to
> get it to work: the characteristics of the data format are specifically
> designed to work efficiently with disk I/O.

Let me start by saying thanks for the feedback :) You're right, they aren't specifically designed that way, but to get the levels of performance we require out of them we needed significantly more memory than with Xapian (which could be due to other factors, I admit), and I probably shouldn't have included Lucene in that statement at all.

Regarding Sphinx and Solr, though, my explanation was a bit flawed. I wasn't trying to imply that they're designed to be in-memory in the same way that some database engines are; what I was really referring to is that they require a second layer of cache that's separate from the OS/FS cache. Using Sphinx as an example, it really is designed to use significant amounts of memory for caching in its searchd process, and I believe Solr does the same sort of thing (even though I'm not intimately familiar with it). With Xapian we don't need to worry about that (i.e. memory management for individual processes and such), since it simply relies on the OS, which is the optimal approach for what we want.

Don't get me wrong, though: they all run just fine off of disk for a large majority of use cases, and I'm not trying to scare anyone away from them. It's just that when your data requirements get big enough and your performance requirements are high, some of the cracks really start to show in terms of how it all fits together.
(And even then, I'm fairly sure our data requirements aren't "that big" compared to a lot of other stuff out there.) For what it's worth, we've been using Sphinx in other systems for years now (and will continue using it), and it's great at what it does.

>> This approach is totally superior as far as we're concerned. It allows us
>> to throw as much memory at a box as we require for performance reasons,
>> without having to get into the insanity of managing an individual service
>> that needs to consume 99% of the available memory.
>
> You can argue that Java itself imposes ridiculous memory management
> limitations - I was using PyLucene in the era when they supported a
> GCJ-compiled library - but that's a separate issue.
>
> I'm not using either Lucene or Xapian actively at the moment, and I can't
> really call myself a Lucene enthusiast either - I switched from Lucene to
> Xapian for various reasons, some of which I probably share with you - but
> no-one benefits from inaccurate information about supposed "competitors"
> when having accurate information about them could actually inform Xapian
> development.

My post was mainly intended as a somewhat technical "thank you" to anyone involved with Xapian who might see it; I don't see anything as a competitor, as you put it. I don't advocate anything, and I'm happy to let people make up their own minds and use whatever tool is right for the job. Not to mention that we don't even have Xapian in production yet, so my comments should all be taken with a grain of salt :) Anyway, I'm far from an expert, but I just wanted to try to explain why it works so well for us, and I figured some people might appreciate the positive feedback.

>> Another quick note on the database-based fulltext indexes - MyISAM
>> fulltext is just fundamentally unable to handle what we want to do from a
>> performance standpoint, end of story.
>> I think we calculated it'd take us something like three months to build
>> the indexes on a single development server. I'm aware Postgres is a
>> different story, but at the end of the day it's really not suitable
>> either, for the same reasons. They're designed as databases, not as
>> search engines.
>
> There's one thing that database systems are very good at, if configured
> appropriately, and that's determining the most optimal querying approach.
> I perform huge numbers of searches on indexed text in batches, and in such
> situations a database system would probably employ more efficient
> techniques transparently, mostly because they provide such facilities
> generally. Indeed, the general data management functions offered by
> systems like PostgreSQL have a lot more to bring to the table than people
> would have you believe.
>
> The only reason why I'm not playing with PostgreSQL's full-text support is
> that they omitted support for general regular-expression-based
> tokenisation in favour of a handful of hand-coded tokenisers, and I don't
> yet have the inclination to write one which provides such an obviously
> useful feature.

Well, I'm a MySQL nut from way back, so I would have loved to use an RDBMS of any kind to solve my problems; it's just a shame it "doesn't work" for us. I never got as far as playing with Postgres' tokenisers, but I can see why that'd be an issue for a lot of people. That said, I think one of the other notable things about Postgres is that it has a lot more work being done on it in the fulltext search realm than MySQL. Going back a few years, Sphinx held a lot of initial appeal for me; the MySQL integration is a nice touch if you're working with a dev team that uses MySQL daily.

Peter