Adam Sjøgren
2009-Mar-05 15:42 UTC
[Xapian-discuss] Stopping wildcard expansion at some point
Hi. Half a year ago, I ran into this annoying problem: Sometimes a user did a search containing wildcards, and our server would run out of memory. I found that the problem was that my index has _a lot_ of terms that start with the same thing, and the code that expands the wildcards (Term::as_wildcarded_query() in queryparser/queryparser.lemony) runs more or less amok in this particular case. But I would still like to have the wildcard functionality for the prefixes that do not expand to a gazillion terms... Back then I looked at the code and popped into the #xapian IRC channel and talked a little to richardb about it. He suggested that I could try adding a counter to as_wildcarded_query() and perhaps throwing an exception if the wildcards were expanding "too much". I quickly made a rough patch to throw an exception after 1000 terms - and it made our systems and, therefore, me happy. I simply catch the error and display a nice error-message to the user about the query being to broad. Now I have finally gotten the time to create a patch that is a little bit better, as it allows you to configure the maximum expansion on the QueryParser object by calling $qp->set_max_wildcard_expansion($max). The patch is attached to this email. It was made against trunk as of now (r12109), but I will happily make a patch against a branch, if need be. Is this the right way to go? Should I rather try to do something else? Do I need to adjust something? I know some C from way back, but never got around to learn C++, so I hope you'll bear with any stupidities - any and all feedback is welcome. I know I haven't done tests, all the documentation nor Changes-files, but I thought I would fly this by the mailinglist sooner rather than later (especially given how long it took me to get to this point...) Best regards, Adam -- "Soon we'll have spent a whole month at sea, Adam Sj?gren splitting atoms for no apparent reason" asjo at koldfront.dk -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-Add-set_max_wildcard_expansion-method-to-the-querypa.patch Type: text/x-diff Size: 0 bytes Desc: not available Url : http://lists.xapian.org/pipermail/xapian-discuss/attachments/20090305/b3fb9548/attachment.patch
Olly Betts
2009-Mar-06 01:24 UTC
[Xapian-discuss] Stopping wildcard expansion at some point
On Thu, Mar 05, 2009 at 04:42:06PM +0100, Adam Sj?gren wrote:> Now I have finally gotten the time to create a patch that is a little > bit better, as it allows you to configure the maximum expansion on the > QueryParser object by calling $qp->set_max_wildcard_expansion($max).> Is this the right way to go? Should I rather try to do something else?One potential problem - it might be good to push the wildcard expansion into the backend, as it may be able to optimise it better that way (it should at least be able to avoid generating a Query object per expanded term): http://trac.xapian.org/ticket/48 But adding this feature to QueryParser means that we have to expand the wildcard there, or at least calculate how many terms it would expand to which requires iterating all the terms at that point if a limit is set. So setting a limit would incur a cost even if it never kicked in. Perhaps this is fixing the symptom rather than the problem - if these are actually useful searches we're rejecting, rather than for example accidental invocation of the wildcard facility, or malicious users, it would be better to make these cases work more efficiently than impose a somewhat arbitrary limit. For example, by storing prefix terms: http://trac.xapian.org/ticket/207 Or perhaps we should allow a lower limit on the number of characters before the wildcard rather than a limit on the number of expansions (so if this limit were 3, a* and ab* wouldn't be allowed, but abc* would).> Do I need to adjust something?It should use Xapian::termcount rather than long. The error class should be QueryParserError - InvalidOperationError "indicates the API was used in an invalid way", which isn't the case here. The error message is likely to be shown to the user, so should really mention the wildcard expansion which was the problem, in case there's more than one in the query, or the user doesn't know about the wildcard syntax and accidentally invoked the feature. (Hmm, or would the limit be better per parsed query than per wildcard expansion?)> I know I haven't done tests, all the documentation nor Changes-files, > but I thought I would fly this by the mailinglist sooner rather than > later (especially given how long it took me to get to this point...)Sure. Cheers, Olly