Hi,

Two questions which I'm unsure about:

Stemming: I've turned on stemming, etc., but how can I confirm that it's being used in searches? What should I look/search for?

Stopwords: I'm trying out Xapian on a regional dataset (searching data from a *.co.us TLD, e.g.). I've noticed that searching for [bob co.us] results in *very* slow search times (tens of seconds), since it seems to be searching for two extremely common terms, "co" and "us" (almost every document will have something.co.us in it), plus the not-so-common "bob". Searching only for "bob" is quick.

Would it make sense to add "co" and "us" to the stopword list to prevent that kind of catastrophic slowdown in search time? Since the dataset is obviously about ".co.us", I feel it's kind of redundant to be searching for something you know is there...

Thanks
On Mon, Nov 15, 2010 at 10:35:59AM +0200, goran kent wrote:
> Stemming: I've turned on stemming, etc, but how can I confirm that
> it's being used in searches? What should I look/search for?

Look for Z-prefixed terms in the output of query.get_description().

> Stopwords: I'm trying out xapian on a regional dataset (searching
> data from a *.co.us TLD, eg). I've noticed that searching for [bob
> co.us] results in *very* slow search times (tens of seconds), since it
> seems to be searching for two extremely common (almost every document
> will have something.co.us in it) terms "co" and "us", and the
> not-so-common "bob". Searching only for "bob" is quick.
>
> Would it make sense to add "co" and "us" to the stopword list to
> prevent that kind of catastrophic slowdown in search time? Since the
> dataset is obviously about ".co.us" I feel it's kind of redundant to
> be searching for something you know is there...

It often does make sense to choose stopwords based on the vocabulary of the text collection you are working with. And "us" would probably be a stopword in English anyway. But here bob.co.us is interpreted as a phrase, and stopwords are included in phrases by the QueryParser.

In this case, I'm not sure you would want to ignore the ".co.us" part anyway - "bob.co.us" probably has a meaning sufficiently distinct from that of "bob" that you wouldn't want to conflate them.

If you aren't already using Xapian 1.2, phrase searching should be faster with the new default chert backend. The patch in this ticket can also make a huge difference to slow phrase cases:

http://trac.xapian.org/ticket/394

It really needs cleaning up and folding into trunk, but I've not had time to do so yet. If you try it, feedback would be much appreciated.

Another option would be to treat '.' as a word character when between two letters, and so tokenise bob.co.us as a single term, but that's not supported by TermGenerator and QueryParser currently, so you'd have to patch Xapian or tokenise documents and queries yourself.

Cheers,
Olly
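The "tokenise yourself" option Olly mentions can be sketched outside Xapian with a small regex tokeniser. This is a hypothetical pre-processing step, not part of the Xapian API; the regex and function name are made up for illustration:

```python
import re

# Treat '.' as a word character only when it sits between two word
# characters (letters or digits), so "bob.co.us" survives as a single
# term while the sentence-final "." is dropped. Hypothetical sketch;
# Xapian's TermGenerator/QueryParser do not behave this way out of
# the box.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:\.[a-z0-9]+)*", re.IGNORECASE)

def tokenise(text):
    """Return lowercased terms, keeping dotted hostnames intact."""
    return [t.lower() for t in TOKEN_RE.findall(text)]

print(tokenise("Searching bob.co.us. Results were slow."))
# -> ['searching', 'bob.co.us', 'results', 'were', 'slow']
```

You would have to apply the same tokenisation on both the indexing side and the query side, feeding the resulting terms to Xapian directly instead of going through TermGenerator/QueryParser.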
Also meant to ask: can I apply that patch to search-code only, or must it also go into the indexing code?
On 15.11.2010 09:35, goran kent wrote:
> Would it make sense to add "co" and "us" to the stopword list to
> prevent that kind of catastrophic slowdown in search time? Since the
> dataset is obviously about ".co.us" I feel it's kind of redundant to
> be searching for something you know is there...

I'd simply cut off .co.us from search queries (if present at all), and from the input to be indexed if it can be assumed to always be present.

One thing I tripped over while working on a Xapian-based search over data that isn't natural-language text: be aware that Xapian treats some characters specially. For example, if you throw a hyphen at the parser, it will also match the terms before and after it joined without the hyphen (i.e. as one word). This might not be what you want (if someone searches for "foo-bar.co.us", you might not want to show them results for "foobar.co.us").

Regards,
Marinos
Thanks to all for the comments.

I'm inclined to silently strip out co.us if it's present in the query string. However, I'll be running lots of tests to see what the effect is and whether doing this broadly makes sense from the end-user perspective.

Cheers
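For what it's worth, that stripping step can be sketched as plain string handling applied before the query ever reaches Xapian's QueryParser. The suffix pattern and function name here are assumptions for illustration only:

```python
import re

# Drop a "co.us" / ".co.us" suffix from each whitespace-separated query
# word, since the whole dataset lives under *.co.us anyway. Hypothetical
# pre-processing sketch, run before handing the query to QueryParser.
SUFFIX_RE = re.compile(r"(\.|^)co\.us$", re.IGNORECASE)

def strip_co_us(query):
    words = [SUFFIX_RE.sub("", w) for w in query.split()]
    # Discard words that consisted of nothing but the suffix itself.
    return " ".join(w for w in words if w)

print(strip_co_us("bob.co.us"))  # -> bob
print(strip_co_us("bob co.us"))  # -> bob
```

Anchoring the pattern at the end of each word avoids mangling terms that merely contain "co.us" somewhere in the middle.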