thr3ads.net - Xapian discuss - [Xapian-discuss] Incremental Index update [Apr 2012]

Jaguar Xiong

2012-Apr-04 02:26 UTC

[Xapian-discuss] Incremental Index update

Hi there,
I'm looking for a full-text search engine/library and found Xapian after
googling. The application I'm working on has some specific characteristics:
1. The current data set is already quite large: multiple terabytes, even
after compression.
2. Existing data is mostly read-only.
3. New data arrives every minute. The daily total could be several
gigabytes (before compression).
4. The query rate is not high yet, but I do expect near-real-time search;
that is, new data should be searchable within a few minutes.

Could you share some thoughts about Xapian with regard to the above aspects,
especially incremental index updates?

Best Regards!
Jaguar

Olly Betts

2012-Apr-05 00:30 UTC

[Xapian-discuss] Incremental Index update

On Wed, Apr 04, 2012 at 10:26:19AM +0800, Jaguar Xiong wrote:
> I'm looking for a full-text search engine/library and found Xapian after
> googling. The application I'm working on has some specific characteristics:
> 1. The current data set is already quite large: multiple terabytes, even
> after compression.
The size isn't an insurmountable issue, but you'll need to think
about how to handle it.  If there's enough RAM to cache a few percent
of the database, you can get good query performance, since there are
a lot of terms which are never or rarely searched for, and the odd
disk read isn't a problem.  That holds unless your query load is high.

If your query load becomes high, then caching results outside of Xapian
can help a lot.  You can often cache the chunk of rendered HTML
containing the results, ready to slot into the page, or cache JSON or
XML ready to send asynchronously.  If you design your system with this
in mind, it can probably be implemented later when you need it.
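
As a rough illustration of caching results outside of Xapian, here is a
minimal Python sketch.  The class name TTLResultCache, its methods and
the 60 second lifetime are all assumptions made for this example, not
anything Xapian provides.  Entries expire after a short time so that
newly indexed documents still show up within a few minutes.

```python
import time


class TTLResultCache:
    """Hypothetical helper: cache rendered search results for a limited time."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the cache can be tested
        self._entries = {}          # key -> (expiry_time, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute() on a miss or expiry."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()           # e.g. run the search and render HTML or JSON
        self._entries[key] = (now + self.ttl, value)
        return value
```

The key would typically combine the query string and the result page,
and compute would run the real search and render the chunk of HTML or
JSON.  This sketch never evicts old entries, so a real deployment would
want a size bound or an external cache.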

You can also split the database over multiple machines (sharding
by document essentially) and search over multiple remote databases
together.
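
A sketch of what that could look like with the Python bindings, assuming
each shard machine runs xapian-tcpsrv.  The helper names shard_for and
open_sharded are made up for this example, and routing documents by a
hash of their key is only one possible policy (an archive like this
might equally shard by date).

```python
import hashlib


def shard_for(doc_key, shard_count):
    """Pick a shard for a document using a stable hash of its key (assumed policy)."""
    digest = hashlib.sha1(doc_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def open_sharded(remote_shards):
    """Combine several remote databases so they can be searched as one.

    remote_shards is a list of (host, port) pairs, one per xapian-tcpsrv.
    """
    import xapian  # needs the Xapian Python bindings installed

    combined = xapian.Database()
    for host, port in remote_shards:
        combined.add_database(xapian.remote_open(host, port))
    return combined
```

The indexer would call shard_for to decide which machine gets a new
document, and the search front end would pass the database returned by
open_sharded to xapian.Enquire as usual.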

You can also use replication to update multiple copies of the
databases and spread the query load across these using a load balancer.
> 2. Existing data is mostly read-only.
> 3. New data arrives every minute. The daily total could be several
> gigabytes (before compression).
> 4. The query rate is not high yet, but I do expect near-real-time search;
> that is, new data should be searchable within a few minutes.
>
> Could you share some thoughts about Xapian with regard to the above aspects,
> especially incremental index updates?
The best approach for this sort of archive system is typically to
have a small database which new documents get added to and which is
searched along with the database of older documents.  Then periodically
(if you have a quieter period overnight that can be a good time) you can
merge this with the older database using xapian-compact.
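
To make that concrete, here is a small Python sketch of the two halves:
searching the archive and the small "recent" database together, and
merging them with xapian-compact.  The function names and any database
paths passed in are assumptions for illustration only; xapian-compact
is invoked as "xapian-compact SOURCE... DESTINATION".

```python
import subprocess


def compact_command(sources, destination):
    """Build the argv for: xapian-compact SOURCE... DESTINATION."""
    if not sources:
        raise ValueError("need at least one source database")
    return ["xapian-compact", *sources, destination]


def merge_databases(sources, destination):
    """Run xapian-compact to merge the source databases into a new one."""
    subprocess.run(compact_command(sources, destination), check=True)


def search_both(archive_path, recent_path, query_string, limit=10):
    """Search the big archive and the small recent database as one."""
    import xapian  # needs the Xapian Python bindings installed

    db = xapian.Database(archive_path)
    db.add_database(xapian.Database(recent_path))
    parser = xapian.QueryParser()
    parser.set_database(db)
    enquire = xapian.Enquire(db)
    enquire.set_query(parser.parse_query(query_string))
    return [(match.docid, match.document.get_data())
            for match in enquire.get_mset(0, limit)]
```

After a merge, one plausible routine (again an assumption, not something
spelled out above) is to switch searches over to the merged database and
start a fresh, empty database for new documents.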

Cheers,
    Olly
