Marinos Yannikos
2009-Jan-19 12:46 UTC
[Xapian-discuss] storing documents in Xapian vs. external store (when other indexes are needed)
Hello,

for a set of documents that are indexed with Xapian for fast search and also with external (hash/B-Tree etc., like tokyocabinet) indexes for fast access by value, is it a good idea to store the whole document in Xapian's DB and fetch it by Xapian's doc_id after searching in the external index, or the other way round, i.e. store the document somewhere else and use some external oid as the Xapian "document"?

In other words/short version: is Xapian/Flint good for storing documents even if they are often fetched by doc_id?

I can think of the following advantages/disadvantages for storing documents in Flint:

+ faster retrieval by doc_id and by query since no external index operation is needed
- possibly slower retrieval by some other indexed value if fetching from Flint by doc_id is slower than the external storage solution (tokyocabinet etc.)
- bigger DB, perhaps slower access
- document changes are probably slower even if the indexed text is not changed

Any opinions/suggestions? Am I on the wrong track for storing documents with several indexed values + fast text search? (I know that the problem fits an RDBMS well, but Xapian is so much faster)

Regards,
Marinos
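To make the two layouts in the question concrete, here is a minimal toy sketch. Plain Python dicts stand in for the Xapian database and for the external by-value index; the class and method names (LayoutA, LayoutB, fetch_by_doc_id, fetch_by_value) are hypothetical and are not Xapian or tokyocabinet API. It only shows how many lookups each access path needs, not how fast either store is.

```python
# Toy model only: dicts stand in for the Xapian DB and the external index.
# All names here are made up for illustration.

class LayoutA:
    """Full document stored in the search DB; external index maps key -> doc_id."""

    def __init__(self):
        self.search_db = {}   # doc_id -> full document text
        self.by_value = {}    # external key -> doc_id
        self._next_id = 1

    def add(self, key, text):
        doc_id = self._next_id
        self._next_id += 1
        self.search_db[doc_id] = text
        self.by_value[key] = doc_id
        return doc_id

    def fetch_by_doc_id(self, doc_id):
        # One lookup: the search result already carries the doc_id.
        return self.search_db[doc_id]

    def fetch_by_value(self, key):
        # Two lookups: external index first, then the search DB.
        return self.search_db[self.by_value[key]]


class LayoutB:
    """Search DB stores only an external oid; the document lives elsewhere."""

    def __init__(self):
        self.search_db = {}   # doc_id -> external oid
        self.store = {}       # external oid -> full document text
        self._next_id = 1

    def add(self, oid, text):
        doc_id = self._next_id
        self._next_id += 1
        self.search_db[doc_id] = oid
        self.store[oid] = text
        return doc_id

    def fetch_by_doc_id(self, doc_id):
        # Two lookups: search DB gives the oid, external store gives the text.
        return self.store[self.search_db[doc_id]]

    def fetch_by_value(self, oid):
        # One lookup: straight into the external store.
        return self.store[oid]


a = LayoutA()
a_id = a.add("sku-1", "first document")
b = LayoutB()
b_id = b.add("sku-1", "first document")
```

In both layouts one access path costs a single lookup and the other costs two; which path is the hot one in a given application is what decides between them.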
Olly Betts
2009-Jan-20 00:23 UTC
[Xapian-discuss] storing documents in Xapian vs. external store (when other indexes are needed)
On Mon, Jan 19, 2009 at 01:46:58PM +0100, Marinos Yannikos wrote:
> for a set of documents that are indexed with Xapian for fast search and
> also with external (hash/B-Tree etc., like tokyocabinet) indexes for fast
> access by value, is it a good idea to store the whole document in Xapian's
> DB and fetch it by Xapian's doc_id after searching in the external index,
> or the other way round, i.e. store the document somewhere else and use
> some external oid as the Xapian "document"?

My usual advice is to store the document externally if you need to access it externally (e.g. if you have an existing SQL-based system it's likely easier not to have to change it to pull the data out of Xapian instead). Otherwise you might as well put it in Xapian.

> In other words/short version: is Xapian/Flint good for storing documents
> even if they are often fetched by doc_id?

Yes, the document data is stored in a Btree keyed by the document id.

> - possibly slower retrieval by some other indexed value if fetching from
> Flint by doc_id is slower than the external storage solution (tokyocabinet
> etc.)

I've not compared, but I'd expect it to be competitive. If you benchmark I'd be interested to see results.

> - bigger DB, perhaps slower access

It's a separate table, so shouldn't make a difference to matching. The OS will have more Xapian data to consider caching, but that's probably equivalent to the data from the external store it would have to consider caching if you used one (if that's on the same machine at least).

> - document changes are probably slower even if the indexed text is not
> changed

This is an issue for flint. Chert is already better at not rewriting unchanged data in this case. There's scope for further work - see this ticket: http://trac.xapian.org/ticket/250

> Any opinions/suggestions? Am I on the wrong track for storing documents
> with several indexed values + fast text search? (I know that the problem
> fits an RDBMS well, but Xapian is so much faster)

I think it's a sane option to consider for many uses. But if you need the relational aspects of an RDBMS or advanced SQL queries, you probably aren't going to be satisfied with using Xapian alone.

Cheers,
Olly
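Since no comparison figures exist in this thread, a small timing harness is one way to get them. The sketch below is an assumption-laden outline: time_fetches is a hypothetical helper, and the two dicts are placeholders for whatever callables would fetch a document by id from Xapian and from the external store in a real test. It produces timings only for the placeholders, so it says nothing about Flint or tokyocabinet themselves.

```python
import random
import timeit


def time_fetches(fetch, doc_ids, repeats=3):
    """Return the best wall-clock time (seconds) to call fetch() on every id."""
    def run():
        for doc_id in doc_ids:
            fetch(doc_id)
    # Take the minimum over several runs to reduce noise from other processes.
    return min(timeit.repeat(run, number=1, repeat=repeats))


# Placeholder stores: swap in the real fetch-by-id callables when benchmarking.
store_a = {i: "document %d" % i for i in range(1, 1001)}
store_b = {i: "document %d" % i for i in range(1, 1001)}

# Random access order, since fetches by doc_id are unlikely to be sequential.
ids = random.sample(range(1, 1001), 200)

t_a = time_fetches(store_a.__getitem__, ids)
t_b = time_fetches(store_b.__getitem__, ids)
print("store A: %.6f s, store B: %.6f s" % (t_a, t_b))
```

Given the point above about the OS cache, it is probably worth running such a test both with a cold cache and with a warm one, and with both stores on the same machine.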