chris
2009-Aug-15 10:58 UTC
[Xapian-discuss] search queries with less than 3 characters, memory goes nuts
Good morning list,

we've been evaluating Xapian with acts_as_xapian and Ruby for around 2 months, and it is really a great piece of software. Big thanks for giving us such a high-quality turbo finder.

But we're facing a problem with queries like:

Top+schwarz+40

As soon as mongrel hands the query over to Xapian, the memory usage of the webserver process goes up until the box runs out of RAM, and if I give the box 50GB of swap, it'll eat that too.

I could narrow the problem down to queries that contain parts which are less than 3 characters long. If no such queries come in, the webserver process never needs more than 100MB, no matter how complicated the query is or how long it runs.

The behaviour is not 100% consistent: sometimes such queries just take "a few gigabytes" and even return results. But the webservers still don't free the used memory, which is why they eat up all the RAM after a few of these queries anyway.

I don't understand this behaviour and, more importantly, I don't know what to do about it, as it's surprisingly difficult to prepare the query before handing it over to Xapian because of combinations like '+test -"abcdef+1-2"'. It also seems redundant to clean the query at all, as Xapian is most surely doing this much better than I could.

So my questions are:
- why does Xapian use countless gigabytes of RAM if I feed it such a query?
- is there a need to clean the query beforehand? I mean, could someone do something nasty with it? (except the usual HTML-security things, which we take care of by escaping the query before display)
- what can I do to prevent this?

I'm thankful for any suggestions, ideas or even a 'finished' solution ;)

Greets,
Chris

PS. We're using v1.0.12 on Linux and the index has ~3 million documents with around 1k of text each.
Olly Betts
2009-Aug-15 12:36 UTC
[Xapian-discuss] search queries with less than 3 characters, memory goes nuts
On Sat, Aug 15, 2009 at 12:58:53PM +0200, chris wrote:
> As soon as mongrel hands over the query to xapian, memory usage of the
> webserver-process goes up 'till the box runs out of ram and if i
> give the box 50GB swap, it'll eat them too.
>
> I could narrow the problem down to queries that contain parts, which are
> less than 3 characters.

There's nothing special regarding term length, though shorter terms tend to match more documents.

> So my questions are:
> - why does xapian use countless gigabytes of ram if i feed it such
> a query?

I've never seen it do so before.

> - is there a need to clean the query before? i mean, could someone do
> something nasty with it? (except the usual html-security things,
> which we take care of by escaping the query before display)

There shouldn't be a need.

> - what can i do to prevent this?

My guess is that acts_as_xapian is asking Xapian to return all possible matches, is getting a few million, and is storing them in a space-inefficient way. The code here seems to show @limit defaults to "-1", which I assume means "maximum unsigned integer" by the time Xapian sees it:

http://github.com/Overbryd/acts_as_xapian/blob/dc3517c66b18dbf66733aac3ba436c7bf4ffcab8/lib/acts_as_xapian.rb

It would be useful to narrow down which layer is causing this. Can you try running some of these "bad" queries without the Ruby layers involved? (examples/quest in xapian-core provides an easy way to run a query against a database.) If that works OK, try it using just the Ruby bindings (without acts_as_xapian); you may find examples/simplesearch.rb useful for that.

If the problem is in acts_as_xapian, you'll need to talk to its developers, or just pass a sane limit giving the number of matches you actually want. It's a good idea to do that anyway, since asking for all possible matches will disable various matcher optimisations and slow down searches.

Cheers,
Olly
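
To illustrate the "pass a sane limit" suggestion, here is a minimal Ruby sketch. It first shows why a default of -1 can turn into "give me everything" (assuming the count ends up in a 32-bit unsigned integer, which is an assumption rather than something confirmed in this thread), then clamps whatever limit the application receives before it reaches the search layer. The names `as_uint32`, `sane_limit`, `DEFAULT_LIMIT` and `MAX_LIMIT` are hypothetical and not part of Xapian or acts_as_xapian.

```ruby
# Hypothetical helpers for illustration; not part of Xapian or acts_as_xapian.

DEFAULT_LIMIT = 10   # assumed page size when no usable limit is given
MAX_LIMIT     = 100  # assumed upper bound on matches per request

# Reinterpret an integer as a 32-bit unsigned value, as would happen if a
# signed -1 were passed to a 32-bit unsigned parameter.
def as_uint32(n)
  n & 0xFFFF_FFFF
end

# Turn a user- or library-supplied limit into a bounded positive integer.
# nil, non-numeric, zero and negative values fall back to the default.
def sane_limit(requested, default = DEFAULT_LIMIT, max = MAX_LIMIT)
  n = begin
    Integer(requested)
  rescue ArgumentError, TypeError
    nil
  end
  return default if n.nil? || n <= 0
  [n, max].min
end

puts as_uint32(-1)          # -1 reinterpreted as unsigned 32-bit
puts sane_limit(-1)         # falls back to the default
puts sane_limit("25")       # numeric strings are accepted
puts sane_limit(5_000_000)  # capped at MAX_LIMIT
```

With the Ruby bindings used directly, the clamped value would be passed as the second argument of `enquire.mset(first, maxitems)`, for example `enquire.mset(0, sane_limit(requested_limit))`, where `requested_limit` stands for whatever the application received.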