thr3ads.net - Xapian discuss - [Xapian-discuss] Stopper Problems [Mar 2007]

Colin Bell

2007-Mar-07 16:09 UTC

[Xapian-discuss] Stopper Problems

Hi All

Thanks to everyone for a great bit of software. I have made good
progress with it and I am currently just getting over some of the
last hurdles.

I have made a stopper which I use for indexing (and retrieval), but it
doesn't seem to stop much (I've pasted a trimmed example below).
Loads of the words in the stopper keep coming through.

I use it as follows:

    QueryParser qp;
    qp.set_stemmer(Xapian::Stem("english"));
    qp.set_stemming_strategy(Xapian::QueryParser::STEM_ALL);
    qp.set_stopper(new MyStopper());
    qp.set_default_op(Xapian::Query::OP_AND);
    Xapian::Query query = qp.parse_query(text);


Can anyone see what I might be doing wrong?

Many thanks

class MyStopper : public Xapian::Stopper {
  public:
    // Returns true if the term t is a stop word.
    bool operator()(const string &t) const {
        // Dispatch on the first character to keep each comparison list short.
        switch (t[0]) {
            case 'b':
                return (t == "b" || t == "bad" || t == "be" ||
                        t == "became" || t == "because" || t == "become" ||
                        t == "becomes" || t == "becoming" || t == "been" ||
                        t == "before" || t == "began" || t == "behind" ||
                        t == "below" || t == "beside" || t == "besides" ||
                        t == "best" || t == "better" || t == "between" ||
                        t == "big" || t == "bigg" || t == "bigger" ||
                        t == "biggest" || t == "both" || t == "bring" ||
                        t == "bringing" || t == "brings" || t == "brought" ||
                        t == "but" || t == "by");
            case '1':
                return (t == "1" || t == "10th" || t == "11th" ||
                        t == "12th" || t == "13th" || t == "14th" ||
                        t == "15th" || t == "16th" || t == "17th" ||
                        t == "18th" || t == "19th" || t == "1st");
            case '2':
                return (t == "2" || t == "20th" || t == "21st" ||
                        t == "22nd" || t == "23rd" || t == "24th" ||
                        t == "25th" || t == "26th" || t == "27th" ||
                        t == "28th" || t == "29th" || t == "2nd");
            case '3':
                return (t == "3" || t == "30th" || t == "31st" ||
                        t == "3rd");
            case '4':
                return (t == "4" || t == "4th");
            default:
                return false;
        }
    }
};

Olly Betts

2007-Mar-07 18:05 UTC

[Xapian-discuss] Stopper Problems

On Wed, Mar 07, 2007 at 04:08:57PM +0000, Colin Bell wrote:
> I have made a stopper which I use for indexing (and retrieval) which
> doesn't seem to stop much. (I've pasted a trimmed example below.)
> Loads of the words in the stopper keep coming through.
>
> I use it as follows:

The code looks plausible to me.  It's not a complete program, so I
couldn't easily try running it.

Do you have some example queries where stop words aren't removed?

Note that the QueryParser doesn't stop words in phrases, or with "+"
in front of them.  There are a few other cases too.  Essentially it
expects search-time stop word removal rather than index-time, but the
behaviour with index-time stop word removal is to fail to match
phrases and "+" terms which contain/are stop words, which isn't too
unreasonable.  If we can decide on a better way to handle these cases
when the word wasn't indexed, it wouldn't be hard to change this.

Cheers,
    Olly

Olly Betts

2007-Mar-08 15:51 UTC

[Xapian-discuss] Stopper Problems

Please keep discussion on the lists - others may be interested too!

On Thu, Mar 08, 2007 at 09:09:21AM +0000, Colin Bell wrote:
> Thanks Olly, as always, I really appreciate the time you take to
> answer these questions. I couldn't post the whole Stopper because the
> mailing list keeps putting it on hold because it says the body of the
> message is too long. The stopper just contains more words for each
> letter of the alphabet.

I was really just indicating why I hadn't actually tried the code, but a
cut-down example or a URL for the example would have worked.

> It turns out that it was the punctuation which was causing some of
> the problems. If a word had a comma after it or if the word had an
> apostrophe in it.

In the stopword list or the query?

Currently (in 0.9.x) apostrophes are treated as "phrase generators", so
<doesn't> is the same as <doesn-t>.  This is really a misfeature, as
it produces phrase searches where we don't need them (and some of the
slower cases of phrase searches too) and it's not useful to be able to
search for the two parts separately, except in one case - the possessive
<'s> (e.g. <Olly's>) which is better handled by the stemmers (and the
latest Snowball English stemmer does this).  In SVN trunk, an apostrophe
between two word characters is included in the term.

So currently your stopper will never get passed <doesn't>; it will
be offered <doesn> and <t> instead.  But a query string containing
a word with a comma after it should work as expected.  For example, the
query string "the, comma" should cause <the> and <comma> to be passed to
the stopper.
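
To make the two cases above concrete, here is a rough stand-alone sketch.
It is NOT Xapian's tokenizer; words_offered_to_stopper is a hypothetical
helper that only mimics the 0.9.x behaviour described above by ending a
word at any character that isn't a letter or digit (space, comma,
apostrophe).

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper for illustration only (not part of Xapian).
// Splits a query string at every non-alphanumeric character and returns
// the pieces that would then be offered to a stopper one at a time.
std::vector<std::string> words_offered_to_stopper(const std::string &query) {
    std::vector<std::string> words;
    std::string current;
    for (unsigned char ch : query) {
        if (std::isalnum(ch)) {
            // Still inside a word: keep accumulating.
            current += static_cast<char>(ch);
        } else if (!current.empty()) {
            // Punctuation or whitespace ends the current word.
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}
```

Under that assumption, "the, comma" yields <the> and <comma>, while
"doesn't" yields <doesn> and <t>, so a stop list entry for <doesn't>
would never be consulted.
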
> Is there any way to adjust my stopper to stop any terms shorter than 3
> chars?

Just insert this before any other checks (assuming you mean strictly
shorter of course):

    if (t.size() < 3) return true;
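
For anyone wanting to see where that line sits, here is a minimal
stand-alone sketch.  ShortWordStopper and its tiny word list are made up
for illustration; in a real program the class would derive from
Xapian::Stopper as MyStopper does, and the base class is omitted here
only so the example compiles without Xapian installed.

```cpp
#include <string>
#include <unordered_set>

// Illustrative stand-in for a Xapian::Stopper subclass (assumption: the
// real class would inherit from Xapian::Stopper and keep this signature).
class ShortWordStopper {
  public:
    bool operator()(const std::string &t) const {
        // Length check first: stop anything strictly shorter than 3 chars.
        if (t.size() < 3) return true;
        // Then consult the word list (a tiny subset, for illustration).
        static const std::unordered_set<std::string> stopwords = {
            "bad", "became", "because", "but", "10th", "1st"
        };
        return stopwords.count(t) != 0;
    }
};
```

With the length check in place, one- and two-character entries such as
"b", "be" or "by" no longer need to be listed explicitly.  Note that
size() counts bytes, so a term containing multi-byte UTF-8 characters may
be longer than it looks.
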

Cheers,
    Olly
