thr3ads.net - Xapian discuss - [Xapian-discuss] Spelling based on frequency and not just distance [Jan 2008]

Philip Neustrom

2008-Jan-15 09:24 UTC

[Xapian-discuss] Spelling based on frequency and not just distance

Hey all,

After implementing the new spelling functionality on http://wikispot.org I
noticed that terms like "wikipeda" weren't yielding spelling suggestions.
Taking a quick look at the code, it looks like if we find an exact match,
even if it has a frequency less than another match within the provided
delta, we don't suggest anything.  This is probably fine for sites with
documents where you can be assured the data is properly spelled -- but not
suitable for something like a wiki or the web in general.
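
A minimal sketch of the behaviour described above, in plain Python rather than Xapian's own code. The function name, the toy frequency table, and the use of plain Levenshtein distance are all assumptions made for illustration; the point is only that an exact match in the index suppresses any suggestion, however rare that match is.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def suggest_exact_wins(word, freqs, max_dist=2):
    """Hypothetical model: give up as soon as the word itself is indexed."""
    if word in freqs:
        return None  # exact match, so no suggestion regardless of frequency
    candidates = [(edit_distance(word, term), -freq, term)
                  for term, freq in freqs.items()]
    candidates = [c for c in candidates if c[0] <= max_dist]
    # Closest term wins; ties are broken by higher frequency.
    return min(candidates)[2] if candidates else None


# Toy term frequencies (made up): the typo occurs in a few documents.
freqs = {"wikipedia": 5000, "wikipeda": 3, "wiki": 800}

print(suggest_exact_wins("wikipeda", freqs))  # None: the typo is indexed
print(suggest_exact_wins("wikipdia", freqs))  # wikipedia
```

With user-edited content, a handful of documents containing the typo is enough to switch the suggestion off.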

I did something simple, attached in a patch.  Maybe someone has a better
idea of how to weigh the different options, but my quick fix seemed to give
much better results than the "give up on exact or edit-distance-closest
match" code that was there already.

--Philip Neustrom
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spelling_frequency.diff
Type: text/x-diff
Size: 1622 bytes
Desc: not available
Url :
http://lists.tartarus.org/pipermail/xapian-discuss/attachments/20080115/892dd069/spelling_frequency.bin

Philip Neustrom

2008-Jan-15 12:57 UTC

[Xapian-discuss] Re: Spelling based on frequency and not just distance

The patch attached to this email is better than the previous.  Hopefully
somebody can come up with something better entirely, as I'm not totally
happy with what I have -- it tends to suggest things like "plant" for
"plants" and then "plan" for "plant" :)

--Philip

On Jan 15, 2008 1:24 AM, Philip Neustrom <philipn@gmail.com> wrote:
> Hey all,
>
> After implementing the new spelling functionality on http://wikispot.org I
> noticed that terms like "wikipeda" weren't yielding spelling suggestions.
> Taking a quick look at the code, it looks like if we find an exact match,
> even if it has a frequency less than another match within the provided
> delta, we don't suggest anything.  This is probably fine for sites with
> documents where you can be assured the data is properly spelled -- but not
> suitable for something like a wiki or the web in general.
>
> I did something simple, attached in a patch.  Maybe someone has a better
> idea of how to weigh the different options, but my quick fix seemed to give
> much better results than the "give up on exact or edit-distance-closest
> match" code that was there already.
>
> --Philip Neustrom
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spelling_frequency.diff
Type: text/x-diff
Size: 2638 bytes
Desc: not available
Url :
http://lists.tartarus.org/pipermail/xapian-discuss/attachments/20080115/a09ce691/spelling_frequency.bin

Olly Betts

2008-Jan-17 03:05 UTC

[Xapian-discuss] Spelling based on frequency and not just distance

On Tue, Jan 15, 2008 at 01:24:33AM -0800, Philip Neustrom wrote:
> After implementing the new spelling functionality on http://wikispot.org I
> noticed that terms like "wikipeda" weren't yielding spelling suggestions.
> Taking a quick look at the code, it looks like if we find an exact match,
> even if it has a frequency less than another match within the provided
> delta, we don't suggest anything.  This is probably fine for sites with
> documents where you can be assured the data is properly spelled -- but not
> suitable for something like a wiki or the web in general.

I'm not sure I believe there's any non-trivial collection of documents
without spelling mistakes, but certainly the current spelling correction
can be problematic.  The current scheme actually does OK when
misspelling is rampant, since the incorrect spelling will then typically
return a useful set of results too.

For example, searching Google for "wikipeda" finds "about 64,400"
results, and a quick look suggests most are relevant.  The sheer size of
the index helps here too.

It's not just misspellings in documents which are problematic with the
current scheme.  A genuine word which is also a typo for another word
is too.  In some cases, both words are common and not a lot can really
be done anyway (e.g. "biking" vs "bikini").  In others, one word is
common and the other sufficiently obscure that it's unlikely to be what
the user meant to search for (e.g. "agent" vs "ahent" - I don't actually
even know what "ahent" means - I'd guess it's related to "hent" - but
it's a valid play in Scrabble!)

Some heuristic based on the relative frequencies (and possibly something
like the average frequency, which I don't think we currently know but
could track easily enough) seems like a good approach.
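
One way such a relative-frequency heuristic could look, as a rough Python sketch. This is neither the attached patch (which the archive does not reproduce) nor Xapian's implementation. The function name, the `min_ratio` threshold, and the toy frequencies are assumptions for illustration. A candidate is only suggested if it is at least `min_ratio` times more frequent than the word the user typed.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def suggest_by_relative_frequency(word, freqs, max_dist=2, min_ratio=10):
    """Suggest a nearby term only if it is far more frequent than `word`.

    `min_ratio` is a hypothetical tuning knob, not a Xapian parameter.
    """
    own = freqs.get(word, 0)  # 0 if the word is not indexed at all
    best = None
    for term, freq in freqs.items():
        if term == word:
            continue
        dist = edit_distance(word, term)
        if dist > max_dist or freq < min_ratio * own:
            continue
        key = (dist, -freq, term)  # prefer closer, then more frequent
        if best is None or key < best:
            best = key
    return best[2] if best else None


# Toy term frequencies, made up for the example.
freqs = {
    "wikipedia": 5000, "wikipeda": 3,
    "agent": 700, "ahent": 1,
    "biking": 300, "bikini": 350,
    "plants": 400, "plant": 900, "plan": 1200,
}

for query in ("wikipeda", "ahent", "biking", "plants"):
    print(query, "->", suggest_by_relative_frequency(query, freqs))
```

With these numbers "wikipeda" and "ahent" are corrected, while "biking" and "plants" are left alone because no neighbour is ten times more frequent. How to pick the threshold, and whether it should depend on the average term frequency, is left open here.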

Another source of spelling information is logs - what offered spelling
corrections users have previously accepted is obviously interesting, but
also just what has been searched for and whether the user performed
another search within a short time interval may be a source of useful
information.  It's hard to see how exactly to feed such information
in though.
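
As a purely illustrative sketch of the "another search within a short time interval" idea: the log format (user, timestamp in seconds, query), the 30-second window, and the function name below are all hypothetical, and nothing here is part of Xapian. It counts pairs of consecutive queries from the same user that are close in time and in edit distance, which might serve as weak evidence that the second query corrects the first.

```python
from collections import Counter


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def implicit_corrections(log, window=30, max_dist=2):
    """Count (earlier, later) query pairs that look like a typo being fixed."""
    counts = Counter()
    last = {}  # user -> (timestamp, query) of that user's previous search
    for user, ts, query in sorted(log, key=lambda rec: rec[1]):
        if user in last:
            prev_ts, prev_query = last[user]
            if (query != prev_query
                    and ts - prev_ts <= window
                    and edit_distance(prev_query, query) <= max_dist):
                counts[(prev_query, query)] += 1
        last[user] = (ts, query)
    return counts


# Hypothetical search log: (user, seconds since start, query).
log = [
    ("u1", 0, "wikipeda"), ("u1", 8, "wikipedia"),    # quick retry
    ("u2", 5, "wikipeda"), ("u2", 400, "wikipedia"),  # too far apart
    ("u3", 10, "wikipeda"), ("u3", 21, "wikipedia"),  # quick retry
    ("u4", 50, "biking"), ("u4", 55, "weather"),      # unrelated query
]

print(implicit_corrections(log))
```

How counts like these would be weighed against index frequencies is exactly the open question raised above.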

> The patch attached to this email is better than the previous.  Hopefully
> somebody can come up with something better entirely, as I'm not totally
> happy with what I have -- it tends to suggest things like "plant" for
> "plants" and then "plan" for "plant" :)

That sounds rather undesirable though.

Probably the best thing to do is open a bug and attach your patch so it
doesn't just get forgotten.

Cheers,
    Olly
