Greetings,

Using xapian revision 13300 (chert db).
Test chert database is about 4GB - 320,000 docs.

Performance for typical one or more keyword searches is quick. For example, search for [upload site page] yields the query:
Xapian::Query((upload:(pos=1) OR site:(pos=2) OR page:(pos=3)))
Takes a second.

However, searching for something like [co.uk] is mind-numbingly and _alarmingly_ slow.
Xapian::Query((co:(pos=1) PHRASE 2 uk:(pos=2)))
Looks like it interprets this search as a phrase.
Takes over _40_ seconds.

Typical phrase searches, such as ["your email"] take a few seconds longer than normal keyword searches (as expected), but nowhere near as slow as 40+s.

I'm trying to get a handle on how best to improve the situation, so having something to compare against would be informative. I notice that gmane.org has about 70 million articles, yet the same search [co.uk] returns in 4s. Yes, these are plain text and relatively small docs, but still...

I must be doing something wrong.

If I may:
What DB format is gmane.org using (chert/flint)?
What's the DB size on disk?
How many search servers is gmane.org using? Their approx. spec?

Any comments would be appreciated.

Thanks
Henry
Arjen van der Meijden
2009-Aug-27 14:48 UTC
[Xapian-discuss] Xapian performance on gmane.org compared
On 27-8-2009 16:06 Henry wrote:
> Using xapian revision 13300 (chert db).
> Test chert database is about 4GB - 320,000 docs.
>
> Performance for typical one or more keyword searches is quick. For
> example, search for [upload site page] yields the query:
> Xapian::Query((upload:(pos=1) OR site:(pos=2) OR page:(pos=3)))
> Takes a second.
>
> However, searching for something like [co.uk] is mind-numbingly and
> _alarmingly_ slow.
> Xapian::Query((co:(pos=1) PHRASE 2 uk:(pos=2)))
> Looks like it interprets this search as a phrase.
> Takes over _40_ seconds.

You could have a look at the size of the result for non-phrased co and uk (i.e. "co AND uk"). That should give you an idea of how many documents are loaded from disk for the initial selection and how fast that goes. But since the phrase query also touches another large table (the positional data), you can't use it as more than a simple baseline. We've seen pretty bad performance for some phrase queries with the flint database, but then our machine used to be I/O-bound.

> I'm trying to get a handle on how best to improve the situation, so
> having something to compare against would be informative. I notice
> that gmane.org has about 70 million articles, yet the same search
> [co.uk] returns in 4s. Yes, these are plain text and relatively small
> docs, but still...

4GB is a "very small" database, i.e. it can fit in an amount of RAM that is now becoming common for desktops. How much memory does your search machine have? If it doesn't have at least 4GB, and you can spare a bit of money, increase it.

If there are no other factors in play, and your query performance is solely or largely caused by lacking I/O performance, you could also install an SSD. In our benchmark, all phrase queries turned from I/O-limited into CPU-limited, simply because both the RAM and the SSDs in our server were easily fast enough to keep up.

Best regards,

Arjen
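To make the "co AND uk" baseline concrete, here is a minimal sketch in plain Python of why a phrase query costs more than the AND query it starts from. It is a toy in-memory positional index, not Xapian's implementation; the function names (`build_index`, `and_match`, `phrase_match`) and the sample documents are made up for illustration, and the phrase check only tests strict adjacency.

```python
from collections import defaultdict


def build_index(docs):
    """Map term -> {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def and_match(index, terms):
    """Doc ids containing every term; only the posting lists are needed."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()


def phrase_match(index, terms):
    """Doc ids where the terms occur at consecutive positions."""
    hits = set()
    for doc_id in and_match(index, terms):
        # Extra work per AND candidate: the position lists must be read too.
        following = [set(index[t][doc_id]) for t in terms[1:]]
        for start in index[terms[0]][doc_id]:
            if all(start + i + 1 in following[i] for i in range(len(following))):
                hits.add(doc_id)
                break
    return hits


# Hypothetical sample documents.
docs = {
    1: "mail me at example co uk today",
    2: "the uk branch of the co op",
    3: "co uk domains and co uk hosting",
    4: "nothing relevant here",
}
index = build_index(docs)
candidates = and_match(index, ["co", "uk"])
phrases = phrase_match(index, ["co", "uk"])
print(len(candidates), "AND candidates,", len(phrases), "phrase matches")
```

Every AND candidate has to have its positions checked, so two very common terms such as co and uk give a large candidate set and a correspondingly large amount of positional data to read.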
Richard Boulton
2009-Aug-27 16:06 UTC
[Xapian-discuss] Xapian performance on gmane.org compared
2009/8/27 Henry <henka at cityweb.co.za>
> Using xapian revision 13300 (chert db).
> Test chert database is about 4GB - 320,000 docs.

Just to note - chert is still in development, and there are certainly issues with its performance (eg http://trac.xapian.org/ticket/326 ). Have you tried this with flint, and if so, how do the times compare?

--
Richard
Richard Boulton
2009-Aug-27 16:10 UTC
[Xapian-discuss] Xapian performance on gmane.org compared
You might also like to try out the patch attached to ticket 394, which may speed up phrase searches for you significantly (let us know if you try it):

http://trac.xapian.org/ticket/394

--
Richard
Quoting "Richard Boulton" <richard at tartarus.org>:
> You might also like to try out the patch attached to ticket 394, which may
> speed up phrase searches for you significantly (let us know if you try it)
>
> http://trac.xapian.org/ticket/394

Quite an improvement: anecdotally, phrase searches improve by ~50%
Quoting Henry <henka at cityweb.co.za>:
> However, searching for something like [co.uk]
> Xapian::Query((co:(pos=1) PHRASE 2 uk:(pos=2)))
> Looks like it interprets this search as a phrase.

Any idea why (by default) the string co.uk is being parsed as a _phrase_ instead of two keywords?
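This question is not answered directly in the thread. One likely explanation, offered here as an assumption: the query parser treats certain punctuation between word characters (such as the "." in co.uk) as joining the words into a phrase rather than separating them into independent keywords. The toy parser below only mimics that behaviour so the two query shapes quoted above can be reproduced; `toy_parse` and the `PHRASE_GENERATORS` set are hypothetical and are not Xapian's actual code or character list.

```python
import re

# ASSUMPTION: an illustrative set of "phrase generator" characters.
# The real parser's set may differ.
PHRASE_GENERATORS = ".-/@"
_SPLIT_RE = re.compile("[" + re.escape(PHRASE_GENERATORS) + "]")


def toy_parse(query):
    """Return a readable description of how the toy parser reads a query."""
    parts = []
    for token in query.lower().split():
        words = [w for w in _SPLIT_RE.split(token) if w]
        if len(words) > 1:
            # Words glued together by punctuation become one phrase subquery.
            parts.append("(" + f" PHRASE {len(words)} ".join(words) + ")")
        elif words:
            parts.append(words[0])
    # Whitespace-separated pieces are combined with OR, as in the thread.
    return " OR ".join(parts)


print(toy_parse("upload site page"))
print(toy_parse("co.uk"))
```

Under this assumption, typing [co uk] with a space instead of a dot would produce two OR'd keywords and avoid the positional lookup, at the cost of matching documents where the two words are unrelated.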
Olly Betts
2009-Aug-28 01:17 UTC
[Xapian-discuss] Xapian performance on gmane.org compared
On Thu, Aug 27, 2009 at 04:06:06PM +0200, Henry wrote:
> I'm trying to get a handle on how best to improve the situation, so
> having something to compare against would be informative. I notice
> that gmane.org has about 70 million articles, yet the same search
> [co.uk] returns in 4s. Yes, these are plain text and relatively small
> docs, but still...

Note that gmane doesn't currently index positional information - the current search machine doesn't have enough disk space to!

> If I may:
> What DB format is gmane.org using (chert/flint)?

As documented on http://search.gmane.org, it's chert.

> What's the DB size on disk?

138GB.

> How many search servers is gmane.org using? Their approx. spec?

One, which also handles indexing - see "rain" in the list here:

http://gmane.org/host.php

I'm in the process of commissioning a replacement server ("plane" above) with a lot more disk space, but it isn't currently live.

As Richard says, my patch in #394 should help, but note that you can tune the size of the "pond" by setting POND_SIZE in the environment. The default is 100000, which was sane for the situation I wrote it for, but higher or lower might be better (and I'd be interested to hear what works best for other situations so we can set it sanely automatically). There's no benefit in setting it higher than the number of documents matched by the AND query of the terms in the phrase.

Cheers,
    Olly
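As a rough sketch of the tuning advice above, the helper below picks a POND_SIZE value capped at the number of documents matched by the AND query and puts it into an environment mapping for the search process. Only the POND_SIZE name and the 100000 default come from the message; the helper names, the example numbers, and the floor of 1 are assumptions, and when exactly the patched library reads the variable is not stated in the thread.

```python
import os

DEFAULT_POND_SIZE = 100000  # default quoted in the message above


def choose_pond_size(and_match_count, requested=DEFAULT_POND_SIZE):
    """Cap the requested pond size at the AND-query match count.

    The floor of 1 is an assumption to avoid emitting 0; whether the
    patch accepts 0 is unknown.
    """
    return max(1, min(requested, and_match_count))


def env_with_pond_size(and_match_count, requested=DEFAULT_POND_SIZE, base_env=None):
    """Return a copy of the environment with POND_SIZE set."""
    env = dict(os.environ if base_env is None else base_env)
    env["POND_SIZE"] = str(choose_pond_size(and_match_count, requested))
    return env


# Hypothetical numbers: 42000 documents match "co AND uk", 250000 requested.
env = env_with_pond_size(and_match_count=42000, requested=250000, base_env={})
print(env["POND_SIZE"])
```

The resulting mapping could be passed as `env=` to `subprocess.run` when launching whatever search program is being benchmarked, so that different pond sizes can be timed against each other without touching the parent shell's environment.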