Hi all,

Thanks to everyone for a great bit of software. I have made good progress with it and I am currently just getting over some of the last hurdles.

I have made a stopper which I use for indexing (and retrieval), but it doesn't seem to stop much (I've pasted a trimmed example below). Loads of the words in the stopper keep coming through.

I use it as follows:

    QueryParser qp;
    qp.set_stemmer(Xapian::Stem("english"));
    qp.set_stemming_strategy(Xapian::QueryParser::STEM_ALL);
    qp.set_stopper(new MyStopper());
    qp.set_default_op(Xapian::Query::OP_AND);
    Xapian::Query query = qp.parse_query(text);

Can anyone see what I might be doing wrong?

Many thanks

    class MyStopper : public Xapian::Stopper {
      public:
        bool operator()(const string &t) const {
            switch (t[0]) {
                case 'b':
                    return (t == "b" || t == "bad" || t == "be" ||
                            t == "became" || t == "because" || t == "become" ||
                            t == "becomes" || t == "becoming" || t == "been" ||
                            t == "before" || t == "began" || t == "behind" ||
                            t == "below" || t == "beside" || t == "besides" ||
                            t == "best" || t == "better" || t == "between" ||
                            t == "big" || t == "bigg" || t == "bigger" ||
                            t == "biggest" || t == "both" || t == "bring" ||
                            t == "bringing" || t == "brings" || t == "brought" ||
                            t == "but" || t == "by");
                case '1':
                    return (t == "1" || t == "10th" || t == "11th" ||
                            t == "12th" || t == "13th" || t == "14th" ||
                            t == "15th" || t == "16th" || t == "17th" ||
                            t == "18th" || t == "19th" || t == "1st");
                case '2':
                    return (t == "2" || t == "20th" || t == "21st" ||
                            t == "22nd" || t == "23rd" || t == "24th" ||
                            t == "25th" || t == "26th" || t == "27th" ||
                            t == "28th" || t == "29th" || t == "2nd");
                case '3':
                    return (t == "3" || t == "30th" || t == "31st" || t == "3rd");
                case '4':
                    return (t == "4" || t == "4th");
                default:
                    return false;
            }
        }
    };
On Wed, Mar 07, 2007 at 04:08:57PM +0000, Colin Bell wrote:
> I have made a stopper which I use for indexing (and retrieval) which
> doesn't seem to stop much. (I've pasted a trimmed example below.)
> Loads of the words in the stopper keep coming through.
>
> I use it as follows:

The code looks plausible to me. It's not a complete program, so I couldn't easily try running it.

Do you have some example queries where stop words aren't removed?

Note that the QueryParser doesn't stop words in phrases, or words with "+" in front of them. There are a few other cases too.

Essentially it expects stop words to be removed at search time rather than at index time. If you also remove them at index time, phrases and "+" terms which contain (or are) stop words will fail to match, which isn't too unreasonable.

If we can decide on a better way to handle these cases when the word wasn't indexed, it wouldn't be hard to change this.

Cheers,
Olly
Please keep discussion on the lists - others may be interested too!

On Thu, Mar 08, 2007 at 09:09:21AM +0000, Colin Bell wrote:
> Thanks Olly, as always, I really appreciate the time you take to
> answer these questions. I couldn't post the whole Stopper because the
> mailing list keeps putting it on hold because it says the body of the
> message is too long. The stopper just contains more words for each
> letter of the alphabet.

I was really just indicating why I hadn't actually tried the code, but a cut-down example or a URL for the example would have worked.

> It turns out that it was the punctuation which was causing some of
> the problems. If a word had a comma after it or if the word had an
> apostrophe in it.

In the stopword list or the query?

Currently (in 0.9.x) apostrophes are treated as "phrase generators", so <doesn't> is the same as <doesn-t>. This is really a misfeature: it produces phrase searches where we don't need them (including some of the slower cases of phrase searches), and it's not useful to be able to search for the two parts separately. The one exception is the possessive <'s> (e.g. <Olly's>), which is better handled by the stemmers (and the latest Snowball English stemmer does this). In SVN trunk, an apostrophe between two word characters is included in the term.

So currently your stopper will never get passed <doesn't>; it will be offered <doesn> and <t> instead.

But a query string containing a word with a comma after it should work as expected. For example, the query string "the, comma" should cause <the> and <comma> to be passed to the stopper.

> Is there any way to adjust my stopper to stop any terms shorter than 3
> chars?

Just insert this before any other checks (assuming you mean strictly shorter, of course):

    if (t.size() < 3) return true;

Cheers,
Olly
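
To make the last suggestion concrete, here is a minimal sketch of a stopper that applies the length check first and then falls back to a word list. It is not code from the thread: the class name LengthAndListStopper, the helper make_example_stopper() and the use of std::set are assumptions made for illustration, and the class deliberately does not derive from Xapian::Stopper so that it compiles without the Xapian headers. In real use it would derive from Xapian::Stopper, as MyStopper does in the original post.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for a class derived from Xapian::Stopper.
// Kept free of Xapian headers so the sketch compiles on its own.
class LengthAndListStopper {
    std::set<std::string> words_;  // stop words of 3 or more bytes

  public:
    // Register a stop word.
    void add(const std::string &w) { words_.insert(w); }

    // Return true if term t should be dropped.
    bool operator()(const std::string &t) const {
        // Length check first, as suggested in the thread. size() counts
        // bytes, so a multi-byte UTF-8 character counts more than once.
        if (t.size() < 3) return true;
        // Otherwise fall back to the explicit word list.
        return words_.count(t) != 0;
    }
};

// Build a small example stopper using a few words from the original post.
inline LengthAndListStopper make_example_stopper() {
    LengthAndListStopper s;
    s.add("because");
    s.add("between");
    s.add("10th");
    return s;
}
```

With a stopper like this, the <t> half of <doesn't> is dropped by the length check, while <doesn> still gets through unless it is added to the list explicitly. Short entries such as "b", "be" and "by" no longer need listing, since the length check already covers them.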