On Wed, Apr 04, 2012 at 10:26:19AM +0800, Jaguar Xiong wrote:
> I'm looking for a full-text-search engine/library and find xapian after
> googling. The application I'm working on get some specific
> characteristic:
> 1.The current data-set are already quite huge: multi-tera bytes, even
> after compression.
The size isn't an insurmountable issue, but you'll need to think
about how to handle it. If there's enough RAM to cache a few percent
of the database, you can get good query performance unless your
query load is high (since there are a lot of terms which are never or
rarely searched for, and the odd disk read isn't a problem).
If your query load becomes high, then caching results outside of Xapian
can help a lot. You can often cache the chunk of rendered HTML
containing the results, ready to slot into the page, or cache JSON or
XML ready to send asynchronously. If you design your system with this
in mind, it can probably be implemented later when you need it.
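
As a rough illustration of caching outside Xapian, here is a minimal Python sketch. The `ResultCache` class, its `render` callback and the 60 second lifetime are all hypothetical names and values chosen for the example, not part of Xapian. The callback stands in for whatever code runs the search and renders the HTML or JSON fragment.

```python
import time
from typing import Callable, Dict, Tuple


class ResultCache:
    """Tiny in-process cache of rendered result fragments, keyed by query."""

    def __init__(self, render: Callable[[str], str], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._render = render      # hypothetical: runs the search and renders it
        self._ttl = ttl_seconds    # how long a cached fragment stays valid
        self._clock = clock        # injectable so expiry is easy to test
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, query: str) -> str:
        # Collapse whitespace only; case is kept because operators such as
        # AND/OR are case-sensitive in many query syntaxes.
        key = " ".join(query.split())
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        fragment = self._render(key)
        self._entries[key] = (now, fragment)
        return fragment
```

Since you want new data searchable within a few minutes, a lifetime shorter than that keeps cached results from going noticeably stale. A real deployment would more likely use a shared cache in front of several web servers, but the shape is the same.
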
You can also split the database over multiple machines (sharding
by document essentially) and search over multiple remote databases
together.
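
With the Python bindings, searching several remote shards together might look roughly like the sketch below. It assumes each shard is served by xapian-tcpsrv on its machine; the `host:port` address format and the helper names `parse_shard` and `search_shards` are made up for the example.

```python
from typing import List, Tuple


def parse_shard(address: str) -> Tuple[str, int]:
    """Split a 'host:port' string (an assumed format) into (host, port)."""
    host, _, port = address.rpartition(":")
    if not host:
        raise ValueError("expected host:port, got %r" % address)
    return host, int(port)


def search_shards(addresses: List[str], query_text: str, limit: int = 10):
    """Search all shards as one logical database; return (docid, data) pairs."""
    import xapian  # imported here so parse_shard works without the bindings

    combined = xapian.Database()
    for address in addresses:
        host, port = parse_shard(address)
        # Each remote database becomes a sub-database of the combined one.
        combined.add_database(xapian.remote_open(host, port))

    enquire = xapian.Enquire(combined)
    enquire.set_query(xapian.QueryParser().parse_query(query_text))
    return [(m.docid, m.document.get_data())
            for m in enquire.get_mset(0, limit)]
```

The same `add_database()` call works for local databases too, so you can start with everything on one machine and move shards out later.
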
You can also use replication to update multiple copies of the
databases and spread the query load across these using a load balancer.
> 2.Existing data are mostly read-only.
> 3.New data is coming every minute. The daily total could be several
> gigabytes (before compressing).
> 4.Query rate are not huge, yet. But I do expect a real-time search,
> that's, new data is expected to be available for searching after a few
> minutes.
>
> Could you share some thought about xapian with regard to above aspects?
> Especially on incremental index update.
The best approach for this sort of archive system is typically to
have a small database which new documents get added to, and which is
searched along with the older database. Then periodically (a quieter
period overnight can be a good time, if you have one) you can
merge this with the older database using xapian-compact.
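
The periodic merge could be driven from a small script along these lines. This is only a sketch: the function names and the directory names in the usage note are placeholders, and it relies on xapian-compact taking one or more source databases followed by a destination.

```python
import subprocess
from typing import List, Sequence


def compact_command(sources: Sequence[str], destination: str) -> List[str]:
    """Build the argv for: xapian-compact SOURCE... DESTINATION."""
    if not sources:
        raise ValueError("need at least one source database")
    if destination in sources:
        raise ValueError("destination must differ from the sources")
    return ["xapian-compact", *sources, destination]


def merge_live_into_archive(archive: str, live: str, merged: str) -> None:
    """Merge the small live database and the archive into a new database."""
    # check=True raises CalledProcessError if xapian-compact fails.
    subprocess.run(compact_command([archive, live], merged), check=True)
```

For example, `merge_live_into_archive("archive", "live", "archive.new")` would write the merged database to `archive.new`. Switching searches over to the merged database and starting a fresh small database for new documents is left to your own deployment.
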
Cheers,
Olly