Hi Everyone,

I'm evaluating Xapian for the following -hard- use-case:

1) document structure: avg. 100kb full-text, 5x meta-data of 100 bytes each, 3x bool. flags
2) big index, i.e. full-text volume ~ 1TB/disk (2x HD, mirrored)
3) low query frequency (<1/sec)
4) 10 inserts/sec (on a 4-core host)
5) *high update frequency of meta-data*, mostly on the bool. flags: ~20-30/sec

Requirements 3 and 4 are no problem: inserts can be cached and mostly steered towards bulk disk I/O when the load allows for it.

The question is whether 5) can be achieved. It seems that an updateMyDoc(myDocId, meta-key, meta-value) implementation invariably ends up running some variation of the following on the (Flint) backend:

    docid = query(myDocId)
    doc = get_document(docid)
    // "updating" then maps to:
    //  * replace doc's meta-data in-memory
    //  * delete (mark-deleted?) old doc in the index
    //  * re-insert the new doc

The last two ops work on the index cache. The bottleneck seems to be the get_document operation, which apparently causes (un-cached**) disk seeks.

**Our RAM/Disk quotient is too small for the OS disk cache to be effective.

Is there any way to make get_document "lazier", i.e. not do lookups in the persistent index, and do the meta-data replace "dirty", i.e. simply write the new value in the cache and not make it persistent until flush()?

What are the performance dis-/advantages of modeling meta-data as prefix-terms vs. document values?

Did I leave out any important constraints/facts?

Otherwise: any help, hints, experiences would be *greatly* appreciated.

Thanks,
--jan
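
To make the question concrete, here is a minimal toy model of the two update strategies described above. It is only a sketch: ToyStore, update_eager, update_lazy and disk_reads are hypothetical names, the "disk" is a plain dict, and none of this reflects how Flint or any Xapian patch works internally. It only shows why a read-modify-replace cycle costs one persistent read per update, while a "dirty" update touches only the in-memory cache until flush().

```python
class ToyStore:
    """Toy document store: a 'disk' dict plus an in-memory cache of pending changes."""

    def __init__(self, docs):
        # self.disk stands in for the persistent index (hypothetical model).
        self.disk = {docid: dict(meta) for docid, meta in docs.items()}
        self.pending = {}     # docid -> changed fields, not yet persistent
        self.disk_reads = 0   # each read models one uncached disk seek

    def update_eager(self, docid, key, value):
        """Read-modify-replace: fetch the whole document, change it, queue it."""
        self.disk_reads += 1
        doc = dict(self.disk[docid])
        doc.update(self.pending.get(docid, {}))  # keep earlier pending changes
        doc[key] = value
        self.pending[docid] = doc

    def update_lazy(self, docid, key, value):
        """'Dirty' update: record only the changed field, no read from disk."""
        self.pending.setdefault(docid, {})[key] = value

    def flush(self):
        """Make all pending changes persistent in one pass."""
        for docid, changes in self.pending.items():
            self.disk[docid].update(changes)
        self.pending.clear()


store = ToyStore({1: {"seen": "0", "spam": "0"}})
store.update_lazy(1, "seen", "1")   # no disk read, not yet persistent
store.flush()                       # now the change reaches the 'disk'
```

In this model the lazy path never increments disk_reads, which is the behaviour being asked for; whether a real backend can skip the read depends on what else it must update for the document.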
On Thu, Aug 13, 2009 at 09:18:40AM +0200, Jan wrote:
> Is there any way to make get_document "lazier" i.e. not do lookups in
> the persistent index - and do the meta-data replace "dirty" i.e. simply
> write the new value in the cache and don't make it persistent until
> flush() ?

This patch helps in many cases (for apt-xapian-index, it improved a testcase of updating just values from about 40 seconds to less than one):

http://oligarchy.co.uk/xapian/patches/xapian-flint-lazy-update-backport-for-1.0.patch

It's quite likely to be in 1.0.15 (and more success stories would make that more likely). It's already in the 1.1.x development releases.

> What are the performance dis-/advantages of modeling meta-data as
> prefix-terms vs. document values ?

It depends how you want to use it really. If you want to select one or a few of the possible values, a prefixed boolean term is good. But if you want to select potentially large ranges, or perform more complex tests than "is a member of" (e.g. geographical distance filtering) then values are more flexible.

With 1.1.x, you can also use externally stored meta-data and Xapian::PostingSource.

Cheers,
Olly
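
As a rough illustration of the term-versus-value distinction above, the following toy index keeps a posting list per term and a value table per document. It is a sketch under assumptions, not Xapian's data structures or API: ToyIndex, its methods and the term names such as "XFLAGspam" are hypothetical, and values are kept as integers for brevity.

```python
from collections import defaultdict


class ToyIndex:
    """Toy index: posting lists for boolean terms, a per-document value table."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of docids
        self.values = {}                  # docid -> {slot: value}

    def add(self, docid, terms=(), values=None):
        for term in terms:
            self.postings[term].add(docid)
        self.values[docid] = dict(values or {})

    def filter_by_term(self, term):
        """'Is a member of' test: a single posting-list lookup."""
        return set(self.postings.get(term, set()))

    def filter_by_value_range(self, slot, low, high):
        """Range test: compare the stored value of every document."""
        return {
            docid
            for docid, vals in self.values.items()
            if slot in vals and low <= vals[slot] <= high
        }


index = ToyIndex()
index.add(1, terms=["XFLAGseen"], values={0: 2007})
index.add(2, terms=["XFLAGseen", "XFLAGspam"], values={0: 2009})
index.add(3, values={0: 2012})

spam_docs = index.filter_by_term("XFLAGspam")             # exact membership
recent_docs = index.filter_by_value_range(0, 2008, 2012)  # arbitrary range
```

Selecting one flag is a direct lookup on the term side, whereas the value side can answer range or other computed tests at the cost of inspecting each candidate's stored value.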
Hi Olly,

> This patch helps in many cases (for apt-xapian-index, it improved a
> testcase of updating just values from about 40 seconds to less than one):
>
> http://oligarchy.co.uk/xapian/patches/xapian-flint-lazy-update-backport-for-1.0.patch

We (Jan and I) have tested the patch and the results are really amazing. For a simple test case, the update performance of values increased by a factor of >50.

Thanks a lot for the great work!

Regards,
Daniel
On Wed, Sep 02, 2009 at 11:14:18PM +0200, Daniel Etzold wrote:
> > http://oligarchy.co.uk/xapian/patches/xapian-flint-lazy-update-backport-for-1.0.patch
>
> we (Jan and me) have tested the patch and the results are really amazing.
> For a simple test case the update performance of values increased by a
> factor of >50.

Just to note, this patch was released in 1.0.15.

Cheers,
Olly