hightman
2012-Jan-05 05:28 UTC
[Xapian-discuss] Enhance synonyms feature of the query parser (patch included)
Very few people seem to be using synonyms in Xapian. I recently found some problems when using them. Normally, I think the synonym table should not contain any prefix information other than the 'Z' stem marker. For example, I have the following synonyms and prefix setup:

    db.add_synonym("search", "find");
    db.add_synonym("Zsearch", "Zfind");
    db.add_synonym("foo bar", "foobar");
    qp.add_prefix("title", "T");

I think the query parser should then produce results like this:

    "search something" ==> "(Zsearch:(pos=1) SYNONYM find:(pos=1)) AND Zsometh:(pos=2)"
    "title:search"     ==> "ZTsearch:(pos=1) SYNONYM Tfind:(pos=1)"
    "title:searching"  ==> "ZTsearch:(pos=1) SYNONYM ZTfind:(pos=1)"
    "title:(foo bar)"  ==> "(ZTfoo:(pos=1) AND ZTbar:(pos=2)) SYNONYM Tfoobar:(pos=1)"
    ...

In general, I would like the query parser to add the prefix to synonym terms automatically, but this is not supported in the current Xapian version.

In addition, I have a question about the prefix_info of the Term object: it holds a list of prefixes, but I don't know when a term would have more than one prefix. This makes me worry about my change for multi-word synonyms, because it only considers the first prefix.
--- PATCH CONTENT BEGIN 'queryparser/queryparser.lemony' ---
*** queryparser.lemony      2012-01-05 12:28:39.000000000 +0800
--- queryparser.lemony.new  2012-01-05 12:52:56.000000000 +0800
***************
*** 307,316 ****
--- 307,318 ----
      for (piter = prefixes.begin(); piter != prefixes.end(); ++piter) {
          // First try the unstemmed term:
          string term;
+ #ifndef HAVE_SYNONYMS_ENH
          if (!piter->empty()) {
              term += *piter;
              if (prefix_needs_colon(*piter, name[0])) term += ':';
          }
+ #endif
          term += name;

          Xapian::Database db = state->get_database();
***************
*** 319,334 ****
--- 321,347 ----
          if (syn == end && stem != QueryParser::STEM_NONE) {
              // If that has no synonyms, try the stemmed form:
              term = 'Z';
+ #ifndef HAVE_SYNONYMS_ENH
              if (!piter->empty()) {
                  term += *piter;
                  if (prefix_needs_colon(*piter, name[0])) term += ':';
              }
+ #endif
              term += state->stem_term(name);
              syn = db.synonyms_begin(term);
              end = db.synonyms_end(term);
          }
          while (syn != end) {
+ #ifdef HAVE_SYNONYMS_ENH
+             string sterm = *syn;
+             if (!piter->empty()) {
+                 if (sterm[0] == 'Z') sterm = "Z" + *piter + sterm.substr(1);
+                 else sterm = *piter + sterm;
+             }
+             q = Query(Query::OP_SYNONYM, q, Query(sterm, 1, pos));
+ #else
              q = Query(Query::OP_SYNONYM, q, Query(*syn, 1, pos));
+ #endif
              ++syn;
          }
      }
***************
*** 1356,1362 ****
--- 1369,1379 ----
      Query::op default_op = state->default_op();
      vector<Query> subqs;
      subqs.reserve(terms.size());
+ #ifdef HAVE_SYNONYMS_ENH
+     if ((state->flags & QueryParser::FLAG_AUTO_MULTIWORD_SYNONYMS) == QueryParser::FLAG_AUTO_MULTIWORD_SYNONYMS) {
+ #else
      if (state->flags & QueryParser::FLAG_AUTO_MULTIWORD_SYNONYMS) {
+ #endif
          // Check for multi-word synonyms.
          Database db = state->get_database();

***************
*** 1432,1440 ****
--- 1449,1467 ----
                  // Use the position of the first term for the synonyms.
                  Xapian::termpos pos = (*begin)->pos;
+ #ifdef HAVE_SYNONYMS_ENH
+                 string prefix;
+                 const list<string> & prefixes = (*begin)->prefix_info->prefixes;
+                 if (prefixes.begin() != prefixes.end())
+                     prefix = *(prefixes.begin());
+ #endif
                  begin = i;
                  while (syn != end) {
+ #ifdef HAVE_SYNONYMS_ENH
+                     subqs2.push_back(Query(prefix + *syn, 1, pos));
+ #else
                      subqs2.push_back(Query(*syn, 1, pos));
+ #endif
                      ++syn;
                  }
                  Query q_synonym_terms(Query::OP_SYNONYM, subqs2.begin(), subqs2.end());
--- PATCH CONTENT END ---
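The core of the single-term change in the patch is how a field prefix is attached to a synonym read from a prefix-free table. A minimal standalone sketch of that step is below. The name apply_prefix is hypothetical and not part of Xapian; like the patch, the sketch keeps a leading 'Z' in front of the prefix and does not insert the ':' separator that prefix_needs_colon() handles elsewhere in the query parser.

```cpp
#include <string>

// Hypothetical helper (not a Xapian function): attach a field prefix to a
// term taken from a synonym table that stores terms without prefixes.
// A leading 'Z' (the stemmed-term marker) stays in front of the prefix,
// mirroring the logic in the patch.
std::string apply_prefix(const std::string& prefix, const std::string& synonym) {
    if (prefix.empty()) return synonym;            // unprefixed field: nothing to add
    if (!synonym.empty() && synonym[0] == 'Z')
        return "Z" + prefix + synonym.substr(1);   // "Zfind" with "T" gives "ZTfind"
    return prefix + synonym;                       // "find" with "T" gives "Tfind"
}
```

With the prefix "T" from the example, "find" and "Zfind" map to the terms "Tfind" and "ZTfind" that appear in the expected query descriptions.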
Olly Betts
2012-Jan-09 04:31 UTC
[Xapian-discuss] Enhance synonyms feature of the query parser (patch included)
On Thu, Jan 05, 2012 at 01:28:36PM +0800, hightman wrote:
> Very few people seem to be using synonym in Xapian, I recently found
> some problems in the use of synonyms.

It's hard to get a good grasp on such things, but we certainly have had a number of questions about synonyms since they were added.

> Normally, I think we should not contain any prefix info in synonym
> table except that 'Z'.

I understand what you're suggesting. The problem with this change is that it makes it impossible to have different synonyms for different prefixes (which is currently easy to do). As an example, you might want to give "brown" synonyms of "beige" and "tan" in text, but not want to find books by Amy Tan when the user is searching for Dan Brown.

You can have the same synonyms for different prefixes currently, though you need to add them for each prefix, which certainly isn't ideal. But trading a non-ideal way to achieve one setup for making another setup impossible doesn't seem desirable.

I think what's probably needed is a way to say "instead of looking up the synonyms for this term, look them up for that term instead", via a functor object which gets passed a term and can return a different term to look up instead, or something like that.

Cheers,
Olly
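One way to picture the functor idea is the standalone sketch below. Everything in it is hypothetical: SynonymKeyMapper, StripTitlePrefix, SynonymTable and lookup_synonyms are illustrative names, not existing Xapian APIs, and a std::map stands in for the database's synonym table.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical interface: given the term the parser is about to look up,
// return the term whose synonyms should be used instead.
struct SynonymKeyMapper {
    virtual ~SynonymKeyMapper() {}
    // Default behaviour: look up the term unchanged.
    virtual std::string operator()(const std::string& term) const { return term; }
};

// Example policy: title terms ("T" prefix) share the synonyms of the
// unprefixed term; terms with any other prefix are left alone.
struct StripTitlePrefix : SynonymKeyMapper {
    std::string operator()(const std::string& term) const override {
        if (term.size() > 1 && term[0] == 'T') return term.substr(1);
        return term;
    }
};

// Stand-in for the synonym table stored in the database.
typedef std::map<std::string, std::vector<std::string> > SynonymTable;

// Look up synonyms for whichever key the mapper chooses.
std::vector<std::string> lookup_synonyms(const SynonymTable& table,
                                         const SynonymKeyMapper& mapper,
                                         const std::string& term) {
    SynonymTable::const_iterator it = table.find(mapper(term));
    if (it == table.end()) return std::vector<std::string>();
    return it->second;
}
```

With a table that maps "brown" to "beige" and "tan", StripTitlePrefix makes "Tbrown" reuse those synonyms, while an author term such as "Abrown" (assuming an "A" prefix for authors) still finds none, so a search for Dan Brown does not pull in Amy Tan. The sketch only chooses the lookup key; how the returned synonyms get their prefix back is left open.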