Hi,

I'm exploring using Xapian to replace a database-backed people lookup. It's working well, but I'd like to know if there's anything more I can do to increase accuracy.

It handles partial phrases well, so long as the first part is complete (e.g. "Peter Bow" expands well). If instead I type "P Bow" it fails to work, as the expansion is only done on the word at the end.

Is there a good way to handle this? I tried to add a wildcard in the string and skip the query parser, but ended up with zero results.

Also sometimes (though not always) substring matches would help - the Ann examples in the notebook illustrate this.

I've put an interactive Jupyter notebook with my code at https://colab.research.google.com/drive/1Y_G0hifgHWedy192KwwX7-akNj_BZxVA. If you're logged into a Google account you can run it.

The dummy data I used, which you can use to re-run the notebook, is stored at https://gist.github.com/pbowyer/f8d28190fcb2a819c58d8293c602f31d

Thanks,
Peter

On Tue, Sep 17, 2019 at 01:27:08PM +0100, Peter Bowyer wrote:
> It handles partial phrases well, so long as the first part is complete
> (e.g. "Peter Bow" expands well). If instead I type "P Bow" it fails to
> work, as the expansion is done at the end.

The QueryParser::FLAG_PARTIAL feature aims to support a "search as you type" feature, so it only expands a potentially incomplete word at the end of the query (and the expansion won't happen if there's a space entered after that, so e.g. `Peter Bow ` is left alone).

> Is there a good way to handle this? I tried to add a wildcard in the
> string and skip the query parser, but ended up with zero results.

If you mean something like Xapian::Query("Peter Bow*") that will try to search for the single literal term `Peter Bow*`, which indeed wouldn't match anything in most databases.

If you really wanted to wildcard expand all words in a query string, you'd have to parse it yourself, turn each word into an OP_WILDCARD query and combine those.

I'd think that's likely to create a lot of false matches though, and wildcards are relatively expensive so you might want to limit how many words get wildcarded in a single query to avoid problems.

> Also sometimes (though not always) substring matches would help - the Ann
> examples in the notebook illustrate this.

There's expanded support for wildcards on git master, so you could create an OP_WILDCARD query for `*ann*`, though that seems even more likely to result in a lot of false matches and will tend to be more expensive too.

Cheers,
Olly
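
For illustration, the "parse it yourself, one OP_WILDCARD per word" idea from the reply above might look roughly like this with the Python bindings. This is only a sketch: the helper names `split_words` and `build_wildcard_query` and the `max_wildcards` cap are made up for the example, and it assumes names were indexed as lowercase, unstemmed, unprefixed terms, which may not match the notebook.

```python
def split_words(query_string, max_wildcards=3):
    """Lowercase and split the input on whitespace.

    Returns (words_to_wildcard, words_to_match_exactly). Only the first
    max_wildcards words are wildcarded, to cap the cost of the query.
    """
    words = query_string.lower().split()
    return words[:max_wildcards], words[max_wildcards:]


def build_wildcard_query(query_string, max_wildcards=3):
    """Combine one OP_WILDCARD subquery per word with OP_AND.

    Assumption: terms in the index are lowercase and have no prefix.
    """
    # Imported here so split_words() can be used without Xapian installed.
    import xapian

    wild, plain = split_words(query_string, max_wildcards)
    # Each wildcarded word matches every term starting with that word.
    subqueries = [xapian.Query(xapian.Query.OP_WILDCARD, w) for w in wild]
    # Words past the cap are matched as ordinary exact terms.
    subqueries += [xapian.Query(w) for w in plain]
    return xapian.Query(xapian.Query.OP_AND, subqueries)
```

With this, `build_wildcard_query("P Bow")` would require a term starting with "p" and a term starting with "bow" in the same document. OP_AND is a choice made for the sketch; OP_OR would be more forgiving but is likely to add to the false matches mentioned above.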

Incidentally, if you're actually aiming to match different forms of a name (Peter vs Pete, Ann vs Anne vs Annette) then you might find the synonym feature a better option than wildcarding.

You'd need to give it a list of names to treat as synonyms, but it should have many fewer false positives, and it can also handle forms which aren't just a substring (e.g. Robert vs Rob vs Bob vs Bobby) or which look entirely different (e.g. Terence vs Terrence vs Spike or Margaret vs Peggy vs Daisy).

Cheers,
Olly
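
As a rough sketch of the synonym approach, a hand-maintained list of name groups could be loaded into the database's synonym table like this. The `NAME_GROUPS` data and both helper names are hypothetical, and the terms are assumed to be lowercase and unstemmed; `db` is expected to be a `xapian.WritableDatabase`.

```python
# Hypothetical, hand-maintained groups of names to treat as equivalent.
NAME_GROUPS = [
    ["robert", "rob", "bob", "bobby"],
    ["margaret", "peggy", "daisy"],
]


def synonym_pairs(groups):
    """Yield every ordered (term, synonym) pair within each group."""
    for group in groups:
        for term in group:
            for other in group:
                if other != term:
                    yield term, other


def store_synonyms(db, groups=NAME_GROUPS):
    """Record each pair using WritableDatabase.add_synonym().

    Pairs are stored in both directions so that any form of a name
    can expand to all the others.
    """
    for term, other in synonym_pairs(groups):
        db.add_synonym(term, other)
```

At search time the QueryParser needs to be given the database with `set_database(db)`, and synonym expansion has to be enabled with a flag such as `QueryParser.FLAG_AUTO_SYNONYMS` when calling `parse_query()`.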

Hi Olly,

On Wed, 18 Sep 2019 at 22:09, Olly Betts <olly at survex.com> wrote:
> If you really wanted to wildcard expand all words in a query string,
> you'd have to parse it yourself, turn each word into an OP_WILDCARD
> query and combine those.

Now you say it, that makes total sense (and seems so obvious). Testing, it matches as I want it to for "p bow".

> I'd think that's likely to create a lot of false matches though, and
> wildcards are relatively expensive so you might want to limit how many
> words get wildcarded in a single query to avoid problems.

I was aiming to work round the false matches by also doing an exact match of the word(s) with a higher ranking factor. So if what's being wildcarded has an exact match also, it ranks higher.

> There's expanded support for wildcards on git master, so you could
> create an OP_WILDCARD query for `*ann*`, though that seems even more
> likely to result in a lot of false matches and will tend to be more
> expensive too.

Yes, I'm discarding this idea. Others are more important.

Best,
Peter

--
Maple Design Ltd
http://www.mapledesign.co.uk
+44 (0)330 122 0034
Reg. in England no. 05920531
Prices exclude VAT where applicable
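
One possible way to express the "exact match ranks higher" idea is to OR each word's wildcard query with an exact-term query whose weight is multiplied using OP_SCALE_WEIGHT. This is a sketch only, not necessarily what the notebook does: the function name and the `boost` value of 2.0 are arbitrary choices, and lowercase, unprefixed terms are assumed.

```python
def build_boosted_query(query_string, boost=2.0):
    """Per word: (exact term, weight scaled by boost) OR (wildcard).

    The per-word queries are then combined with OP_AND.
    Assumption: terms in the index are lowercase and have no prefix.
    """
    import xapian

    per_word = []
    for word in query_string.lower().split():
        # Exact term, with its weight contribution multiplied by boost.
        exact = xapian.Query(
            xapian.Query.OP_SCALE_WEIGHT, xapian.Query(word), boost
        )
        # Any term starting with the word (this includes the word itself).
        prefix = xapian.Query(xapian.Query.OP_WILDCARD, word)
        per_word.append(xapian.Query(xapian.Query.OP_OR, exact, prefix))
    return xapian.Query(xapian.Query.OP_AND, per_word)
```

A document containing "bow" as a complete term then scores from both branches for that word, while one that only contains "bowyer" scores from the wildcard branch alone. The boost factor would need tuning against real data.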