Philip Neustrom
2008-Jan-15 09:24 UTC
[Xapian-discuss] Spelling based on frequency and not just distance
Hey all,

After implementing the new spelling functionality on http://wikispot.org I noticed that terms like "wikipeda" weren't yielding spelling suggestions. Taking a quick look at the code, it looks like if we find an exact match, even if it has a frequency less than another match within the provided delta, we don't suggest anything. This is probably fine for sites with documents where you can be assured the data is properly spelled -- but not suitable for something like a wiki or the web in general.

I did something simple, attached in a patch. Maybe someone has a better idea of how to weigh the different options, but my quick fix seemed to give much better results than the "give up on exact or edit-distance-closest match" code that was there already.

--Philip Neustrom

-------------- next part --------------
A non-text attachment was scrubbed...
Name: spelling_frequency.diff
Type: text/x-diff
Size: 1622 bytes
Desc: not available
Url : http://lists.tartarus.org/pipermail/xapian-discuss/attachments/20080115/892dd069/spelling_frequency.bin
Philip Neustrom
2008-Jan-15 12:57 UTC
[Xapian-discuss] Re: Spelling based on frequency and not just distance
The patch attached to this email is better than the previous. Hopefully somebody can come up with something better entirely, as I'm not totally happy with what I have -- it tends to suggest things like "plant" for "plants" and then "plan" for "plant" :)

--Philip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: spelling_frequency.diff
Type: text/x-diff
Size: 2638 bytes
Desc: not available
Url : http://lists.tartarus.org/pipermail/xapian-discuss/attachments/20080115/a09ce691/spelling_frequency.bin
Olly Betts
2008-Jan-17 03:05 UTC
[Xapian-discuss] Spelling based on frequency and not just distance
On Tue, Jan 15, 2008 at 01:24:33AM -0800, Philip Neustrom wrote:
> After implementing the new spelling functionality on http://wikispot.org I
> noticed that terms like "wikipeda" weren't yielding spelling suggestions.
> Taking a quick look at the code, it looks like if we find an exact match,
> even if it has a frequency less than another match within the provided
> delta, we don't suggest anything. This is probably fine for sites with
> documents where you can be assured the data is properly spelled -- but not
> suitable for something like a wiki or the web in general.

I'm not sure I believe there's any non-trivial collection of documents without spelling mistakes, but certainly the current spelling correction can be problematic.

The current scheme actually does OK when misspelling is rampant, since the incorrect spelling will then typically return a useful set of results too. For example, searching Google for "wikipeda" finds "about 64,400" results, and a quick look suggests most are relevant. The sheer size of the index helps here too.

It's not just misspellings in documents which are problematic with the current scheme. A genuine word which is also a typo for another word is too. In some cases, both words are common and not a lot can really be done anyway (e.g. "biking" vs "bikini"). In others, one word is common and the other sufficiently obscure that it's unlikely to be what the user meant to search for (e.g. "agent" vs "ahent" - I don't actually even know what "ahent" means - I'd guess it's related to "hent" - but it's a valid play in Scrabble!)

Some heuristic based on the relative frequencies (and possibly something like the average frequency, which I don't think we currently know but could track easily enough) seems like a good approach.
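To make the relative-frequency idea concrete, here is a minimal sketch in Python. It is not the attached patch and not Xapian's implementation: the function names, the `min_ratio` threshold, the sample frequencies, and the use of plain Levenshtein distance (which may differ from the edit distance Xapian uses) are all assumptions made for illustration. The idea is that a word which is itself indexed is only overridden by a candidate within the edit-distance limit that is much more frequent.

```python
from typing import Dict, Optional


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(word: str, freqs: Dict[str, int],
            max_distance: int = 2, min_ratio: float = 10.0) -> Optional[str]:
    """Return a suggested correction for `word`, or None.

    Hypothetical heuristic: if `word` is itself indexed, a candidate must be
    at least `min_ratio` times as frequent to replace it. Among surviving
    candidates, prefer the smallest edit distance, then the highest frequency.
    """
    own_freq = freqs.get(word, 0)
    best, best_key = None, None
    for cand, freq in freqs.items():
        if cand == word:
            continue
        dist = edit_distance(word, cand)
        if dist > max_distance:
            continue
        if own_freq and freq < own_freq * min_ratio:
            continue  # not frequent enough to override an indexed word
        key = (dist, -freq)
        if best_key is None or key < best_key:
            best, best_key = cand, key
    return best


# Made-up term frequencies, purely for illustration.
sample_freqs = {"wikipedia": 5000, "wikipeda": 3, "plants": 400,
                "plant": 900, "plan": 1200, "biking": 300, "bikini": 350}

print(suggest("wikipeda", sample_freqs))               # wikipedia
print(suggest("plants", sample_freqs))                 # None
print(suggest("plants", sample_freqs, min_ratio=1.0))  # plant
print(suggest("plant", sample_freqs, min_ratio=1.0))   # plan
```

With the threshold lowered to 1.0, this sketch reproduces the kind of "plants" to "plant" to "plan" chain mentioned earlier in the thread, which suggests the ratio has to be fairly large before an indexed word is overridden. The right value would depend on the collection.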
Another source of spelling information is logs - what offered spelling corrections users have previously accepted is obviously interesting, but also just what has been searched for and whether the user performed another search within a short time interval may be a source of useful information. It's hard to see how exactly to feed such information in though.

> The patch attached to this email is better than the previous. Hopefully
> somebody can come up with something better entirely, as I'm not totally
> happy with what I have -- it tends to suggest things like "plant" for
> "plants" and then "plan" for "plant" :)

That sounds rather undesirable though.

Probably the best thing to do is open a bug and attach your patch so it doesn't just get forgotten.

Cheers,
Olly