Hi,

Two questions which I'm unsure about:

Stemming: I've turned on stemming, etc., but how can I confirm that it's being used in searches? What should I look/search for?

Stopwords: I'm trying out Xapian on a regional dataset (searching data from a *.co.us TLD, e.g.). I've noticed that searching for [bob co.us] results in *very* slow search times (tens of seconds), since it seems to be searching for two extremely common terms, "co" and "us" (almost every document will have something.co.us in it), plus the not-so-common "bob". Searching only for "bob" is quick.

Would it make sense to add "co" and "us" to the stopword list to prevent that kind of catastrophic slowdown in search time? Since the dataset is obviously about ".co.us", I feel it's kind of redundant to be searching for something you know is there...

Thanks
On Mon, Nov 15, 2010 at 10:35:59AM +0200, goran kent wrote:
> Stemming: I've turned on stemming, etc, but how can I confirm that
> it's being used in searches? What should I look/search for?

Look for Z-prefixed terms in the output of query.get_description().

> Stopwords: I'm trying out xapian on a regional dataset (searching
> data from a *.co.us TLD, eg). I've noticed that searching for [bob
> co.us] results in *very* slow search times (tens of seconds), since it
> seems to be searching for two extremely common (almost every document
> will have something.co.us in it) terms "co" and "us", and the
> not-so-common "bob". Searching only for "bob" is quick.
>
> Would it make sense to add "co" and "us" to the stopword list to
> prevent that kind of catastrophic slowdown in search time? Since the
> dataset is obviously about ".co.us" I feel it's kind of redundant to
> be searching for something you know is there...

It often does make sense to choose stopwords based on the vocabulary of the text collection you are working with. And "us" would probably be a stopword in English anyway. But here bob.co.us is interpreted as a phrase, and stopwords are included in phrases by the QueryParser.

In this case, I'm not sure you would want to ignore the ".co.us" part anyway - "bob.co.us" probably has a meaning sufficiently distinct from that of "bob" that you wouldn't want to conflate them.

If you aren't already using Xapian 1.2, phrase searching should be faster with the new default chert backend. The patch in this ticket can also make a huge difference to slow phrase cases:

http://trac.xapian.org/ticket/394

It really needs cleaning up and folding into trunk, but I've not had time to do so yet. If you try it, feedback would be much appreciated.

Another option would be to treat '.' as a word character when between two letters, and so tokenise bob.co.us as a single term, but that's not supported by TermGenerator and QueryParser currently, so you'd have to patch Xapian or tokenise documents and queries yourself.

Cheers,
Olly
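The "tokenise yourself" option Olly mentions can be sketched outside Xapian with a small regex tokeniser. This is a hypothetical pre-processing step, not part of the Xapian API; the regex and function name are made up for illustration:

```python
import re

# Treat '.' as a word character only when it sits between two word
# characters (letters or digits), so "bob.co.us" survives as a single
# term while the sentence-final "." is dropped. Hypothetical sketch;
# Xapian's TermGenerator/QueryParser do not behave this way out of
# the box.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:\.[a-z0-9]+)*", re.IGNORECASE)

def tokenise(text):
    """Return lowercased terms, keeping dotted hostnames intact."""
    return [t.lower() for t in TOKEN_RE.findall(text)]

print(tokenise("Searching bob.co.us. Results were slow."))
# -> ['searching', 'bob.co.us', 'results', 'were', 'slow']
```

You would have to apply the same tokenisation on both the indexing side and the query side, feeding the resulting terms to Xapian directly instead of going through TermGenerator/QueryParser.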
Also meant to ask: can I apply that patch to search-code only, or must it also go into the indexing code?
On 15.11.2010 09:35, goran kent wrote:
> Would it make sense to add "co" and "us" to the stopword list to
> prevent that kind of catastrophic slowdown in search time? Since the
> dataset is obviously about ".co.us" I feel it's kind of redundant to
> be searching for something you know is there...

I'd simply cut off .co.us from search queries (if present at all), and from the input to be indexed if it can be assumed to always be present.

One thing I tripped over while working on a Xapian-based search over data that isn't natural-language text: be aware that Xapian treats some characters specially. For example, if you throw a hyphen at the parser, it will also match the terms before and after it joined without the hyphen (i.e. as one word). This might not be what you want (if someone searches for "foo-bar.co.us", you might not want to show them results for "foobar.co.us").

Regards,
Marinos
Thanks to all for the comments.

I'm inclined to silently strip out co.us if it's present in the query string. However, I'll be running lots of tests to see what the effect is and whether doing this broadly makes sense from the end-user perspective.

Cheers
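For what it's worth, that stripping step can be sketched as plain string handling applied before the query ever reaches Xapian's QueryParser. The suffix pattern and function name here are assumptions for illustration only:

```python
import re

# Drop a "co.us" / ".co.us" suffix from each whitespace-separated query
# word, since the whole dataset lives under *.co.us anyway. Hypothetical
# pre-processing sketch, run before handing the query to QueryParser.
SUFFIX_RE = re.compile(r"(\.|^)co\.us$", re.IGNORECASE)

def strip_co_us(query):
    words = [SUFFIX_RE.sub("", w) for w in query.split()]
    # Discard words that consisted of nothing but the suffix itself.
    return " ".join(w for w in words if w)

print(strip_co_us("bob.co.us"))  # -> bob
print(strip_co_us("bob co.us"))  # -> bob
```

Anchoring the pattern at the end of each word avoids mangling terms that merely contain "co.us" somewhere in the middle.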