Hi guys,

I just wanted to take a moment to give some positive feedback about my recent experiences with Xapian. I've been doing a fair amount of research into search engines lately, as we have some fairly specific requirements for what we're attempting to do with them. Long story short, after a few weeks of playing around with just about everything under the sun (or at least everything off the shelf: Sphinx, Lucene, Solr, MySQL/Postgres fulltext, etc.), we settled on Xapian because of its specific design characteristics, and because it's really, really easy to use and alter.

The main reason we struggled to find something suitable was our large data requirements: we're looking at indexing about 1TB of raw data (i.e. excluding the size of any indexes or other metadata) for about 30,000 individual users. We're "top heavy" in terms of database design: a small number of users, but a large amount of data. This gives rise to a few different issues.

Before we even get to the search aspects, one thing that's important to us is data separation. We're not running a blog or forum where you can mix everyone's data together and accept that there might be some errors from time to time; it's highly critical that nobody ever sees anyone else's data. Lucene can deal with this in a general sense, since it fits the same niche Xapian does in terms of how it integrates with everything else: it's essentially just a library you can use to create search indexes. However, 'off the shelf' engines like Solr and Sphinx fundamentally fail to handle the situation where you want physical (as in, filesystem-level) separation of data. Ever tried creating 30,000 individual indexes in Sphinx or Solr? I can tell you first hand that they don't even come close to working.
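To make the separation concrete, here's a rough sketch of the kind of per-user layout we have in mind. The helper name and sharding scheme are my own invention for illustration, not anything Xapian prescribes:

```python
import os

# Hypothetical layout: one physically separate index directory per user,
# sharded so no single directory holds all 30,000 entries.  A search for
# one user only ever opens that user's own directory, so another user's
# data is never even mapped into the process.
def user_index_path(base_dir, user_id):
    shard = "%02d" % (user_id // 1000)  # e.g. user 12345 -> shard "12"
    return os.path.join(base_dir, shard, str(user_id))

print(user_index_path("/srv/indexes", 12345))  # -> /srv/indexes/12/12345
```

Each of those paths can then hold a completely independent database, which is exactly the model Xapian makes easy and the monolithic-index engines fight against.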
(Please note, I'm fully aware of the argument that this could be considered "designing it wrong"; however, I've been designing these sorts of systems for a long, long time, and I like to think I know what I'm getting myself into.)

Lucene can handle this sort of thing in theory, but given that we're a PHP/C shop, having to build and support Java apps would just be a nightmare for us, not to mention that without prototyping such a system there's no guarantee it would even work. Hypothetically, even if you could get such systems to work in Lucene/Solr/Sphinx, the other significant design flaw, as far as I'm concerned, is the fact that they're designed to run "in memory". That just flat out does not work for us at all. The /raw/ data set we're dealing with is about 1TB; after you've cooked it, by indexing and whatever other processes take place, it'll end up being a multiple of that. Were we to try to shove everything in memory for performance reasons, we'd have to have stupidly massive amounts of RAM dedicated to searchd. Xapian, on the other hand, does what I consider to be the "right" thing, and actually uses the OS to cache its file accesses. This approach is totally superior as far as we're concerned. It allows us to throw as much memory at a box as we require for performance reasons, without having to get into the insanity of managing an individual service that needs to consume 99% of the available memory.

Another quick note on the database-based fulltext indexes: MyISAM fulltext is just fundamentally unable to handle what we want to do from a performance standpoint, end of story. I think we calculated it'd take us something like three months to build the indexes on a single development server. I'm aware Postgres is a different story, but at the end of the day it's really not suitable either, for the same reasons: they're designed as databases, not as search engines.

In summary, Xapian ticks all of the boxes for us.
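A minimal sketch of what "let the OS do the caching" means in practice (in Python for brevity, with an entirely made-up function name, file, and offsets): the index file is mapped rather than read wholly into process memory, so repeated accesses are served from the shared kernel page cache while the search process itself stays small.

```python
import mmap

# Sketch of the page-cache approach: map the index file read-only and
# slice out the bytes we need.  Hot pages live in the kernel's cache,
# shared across processes, instead of in a dedicated in-process cache
# that has to be sized and managed separately.
def read_span(path, offset, length):
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[offset:offset + length]
```

Throwing more RAM at the box then speeds up every reader automatically, with no per-service cache tuning.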
It can integrate with just about any modern language, it's easy to use, it "just works" and is generally bug free, and it's made a great foundation for us to build our own search services on. There are a hundred other design aspects I haven't touched on here (the general feature set, stemming, and out-of-the-box search accuracy all come to mind), but for the most part we haven't been let down yet.

Leaving one negative bit for last, and it's not a huge one by any means: as someone who's been building large-scale web apps since the dawn of time, I find the SWIG PHP classes fairly awful. Don't get me wrong, it's not a huge issue, and I fully understand why they are the way they are (it's a C++ library and not PHP-specific, so SWIG is a good fit). I'd like to try to do something about it in the future, so if I come up with anything worthwhile you'll be the first to know.

We haven't put the system into production yet, but at this stage I'm really looking forward to finishing off development and seeing what happens.

Regards,
Peter
On 03/08/11 01:00, Peter Van Dijk wrote:

> Hypothetically, even if you could get such systems to work in
> Lucene/Solr/Sphinx, the other significant design flaw as far as I'm
> concerned is the fact that they're designed to run "in memory".

I'm sorry, but I seriously doubt that Lucene was "designed to run in memory", presumably meaning that you have to load the index into memory to get it to work: the characteristics of the data format are specifically designed to work efficiently with disk I/O.

[...]

> searchd. Xapian, on the other hand, does what I consider to be the "right"
> thing, and actually uses the OS to cache its file accesses.

Filesystem caching works effectively enough with Lucene. In fact, when I tested the "load it into RAM" approach using a "RAM directory" or whatever it may be called, it offered no real benefit over letting the OS do the caching. This was with a large number of searches spread across an index.

> This approach is totally superior as far as we're concerned. It allows us
> to throw as much memory at a box as we require for performance reasons,
> without having to get into the insanity of managing an individual service
> that needs to consume 99% of the available memory.

You can argue that Java itself imposes ridiculous memory management limitations - I was using PyLucene in the era when they supported a GCJ-compiled library - but that's a separate issue.

I'm not using either Lucene or Xapian actively at the moment, and I can't really call myself a Lucene enthusiast either - I switched from Lucene to Xapian for various reasons, some of which I probably share with you - but no-one benefits from inaccurate information about supposed "competitors" when having accurate information about them could actually inform Xapian development.

> Another quick note on the database-based fulltext indexes - MyISAM fulltext
> is just fundamentally unable to handle what we want to do from a
> performance standpoint, end of story. I think we calculated it'd take us
> something like three months to build the indexes on a single development
> server. I'm aware Postgres is a different story, but at the end of the day
> it's really not suitable either, for the same reasons. They're designed as
> databases, not as search engines.

There's one thing that database systems are very good at, if configured appropriately, and that's determining the most optimal querying approach. I perform huge numbers of searches on indexed text in batches, and in such situations a database system would probably employ more efficient techniques transparently, mostly because they provide such facilities generally. Indeed, the general data management functions offered by systems like PostgreSQL have a lot more to bring to the table than people would have you believe.

The only reason why I'm not playing with PostgreSQL's full-text support is that they omitted support for general regular-expression-based tokenisation in favour of a handful of hand-coded tokenisers, and I don't yet have the inclination to write one which provides such an obviously useful feature.

Paul
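To sketch the sort of general regular-expression-based tokenisation I mean - the pattern and function below are purely illustrative, not anything PostgreSQL actually ships:

```python
import re

# Illustrative regex-driven tokeniser: the caller supplies the pattern
# that defines a token, instead of picking from a handful of hand-coded
# tokenisers.  Default pattern and lowercasing are my own assumptions.
def tokenise(text, pattern=r"[A-Za-z0-9_]+"):
    return [t.lower() for t in re.findall(pattern, text)]

print(tokenise("Xapian's full-text search, 2011"))
# -> ['xapian', 's', 'full', 'text', 'search', '2011']
```

The point being that a single configurable pattern covers a lot of tokenisation needs that hand-coded parsers have to anticipate one by one.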
Peter Van Dijk
2011-Aug-09 00:56 UTC
[Xapian-discuss] Fwd: Positive experiences with Xapian
On 8 August 2011 19:38, Paul Boddie <paul.boddie at biotek.uio.no> wrote:

> On 03/08/11 01:00, Peter Van Dijk wrote:
>
>> Hypothetically, even if you could get such systems to work in
>> Lucene/Solr/Sphinx, the other significant design flaw as far as I'm
>> concerned is the fact that they're designed to run "in memory".
>
> I'm sorry, but I seriously doubt that Lucene was "designed to run in
> memory", presumably meaning that you have to load the index into memory to
> get it to work: the characteristics of the data format are specifically
> designed to work efficiently with disk I/O.

Let me start by saying thanks for the feedback :) You're right, they aren't specifically designed that way, but to get the levels of performance we require out of them we needed significantly more memory than with Xapian (which could be due to other factors, I admit), and I probably shouldn't have included Lucene in that statement at all.

Regarding Sphinx and Solr, though, my explanation was a bit flawed. I wasn't trying to imply that they're designed to be in-memory in the same way that some database engines are; what I was really referring to is that they require a second layer of cache that's separate from the OS/FS cache. Using Sphinx as an example, it really is designed to use significant amounts of memory for caching in its searchd process, and I believe Solr does the same sort of thing (even though I'm not intimately familiar with it). With Xapian we don't need to worry about that (i.e. memory management for individual processes and such), since it simply relies on the OS, which is the optimal approach for what we want.

Don't get me wrong, though: they all run just fine off of disk for a large majority of use cases, and I'm not trying to scare anyone away from them. It's just that when your data requirements get big enough and your performance requirements are high, some of the cracks really start to show in terms of how it all fits together.
(And even then, I'm fairly sure our data requirements aren't "that big" compared to a lot of other stuff out there.) For what it's worth, we've been using Sphinx in other systems for years now (and will continue using it), and it's great at what it does.

>> This approach is totally superior as far as we're concerned. It allows us
>> to throw as much memory at a box as we require for performance reasons,
>> without having to get into the insanity of managing an individual service
>> that needs to consume 99% of the available memory.
>
> You can argue that Java itself imposes ridiculous memory management
> limitations - I was using PyLucene in the era when they supported a
> GCJ-compiled library - but that's a separate issue.
>
> I'm not using either Lucene or Xapian actively at the moment, and I can't
> really call myself a Lucene enthusiast either - I switched from Lucene to
> Xapian for various reasons, some of which I probably share with you - but
> no-one benefits from inaccurate information about supposed "competitors"
> when having accurate information about them could actually inform Xapian
> development.

My post was mainly intended as a somewhat technical "thank you" to anyone involved with Xapian who might see it; I don't see anything as a competitor, as you put it. I don't advocate anything, and I'm happy to let people make up their own minds and use whatever tool is right for the job. Not to mention that we don't even have Xapian in production yet, so my comments should all be taken with a grain of salt :) Anyway, I'm far from an expert, but I just wanted to try to explain why it works so well for us, and I figured some people might appreciate the positive feedback.

>> Another quick note on the database-based fulltext indexes - MyISAM
>> fulltext is just fundamentally unable to handle what we want to do from a
>> performance standpoint, end of story.
>> I think we calculated it'd take us something like three months to build
>> the indexes on a single development server. I'm aware Postgres is a
>> different story, but at the end of the day it's really not suitable
>> either, for the same reasons. They're designed as databases, not as
>> search engines.
>
> There's one thing that database systems are very good at, if configured
> appropriately, and that's determining the most optimal querying approach.
> I perform huge numbers of searches on indexed text in batches, and in such
> situations a database system would probably employ more efficient
> techniques transparently, mostly because they provide such facilities
> generally. Indeed, the general data management functions offered by
> systems like PostgreSQL have a lot more to bring to the table than people
> would have you believe.
>
> The only reason why I'm not playing with PostgreSQL's full-text support is
> that they omitted support for general regular-expression-based
> tokenisation in favour of a handful of hand-coded tokenisers, and I don't
> yet have the inclination to write one which provides such an obviously
> useful feature.

Well, I'm a MySQL nut from way back, so I would have loved to use an RDBMS of any kind to solve my problems; it's just a shame it "doesn't work" for us. I never got as far as playing with Postgres' tokenisers, but I can see why that'd be an issue for a lot of people. That said, I think one of the other notable things about Postgres is that it has a lot more work being done on it in the fulltext search realm than MySQL. Going back a few years, Sphinx held a lot of initial appeal for me; the MySQL integration is a nice touch if you're working with a dev team that uses MySQL daily.

Peter