Nikita Smetanin
2011-Mar-20 14:53 UTC
[Xapian-devel] GSoC 2011: Improve Spelling Correction
Hello, I am Nikita Smetanin (ntz), a Russian student. I'm interested in fuzzy search algorithms (also known as similarity search and spelling correction), and I have written some articles and open-source implementations of related algorithms. I also have good experience in enterprise software development (Java/C++/C# and related stuff) and in small projects.

I want to work on your project "Improve spelling correction", but I would also like to suggest some additions to that project:

- One or several phonetic matching algorithms to improve name and surname search.
- An alternative algorithm for correction candidate search that is faster than the trigram approach.
- A more sophisticated word distance metric to improve result set relevance.
- Something about improving stemming quality.
- Language detection for automatic selection of language-specific algorithms.

I'll be happy to participate in this project during the Google Summer of Code 2011 program and implement most of these ideas.
On Sun, Mar 20, 2011 at 07:53:56PM +0500, Nikita Smetanin wrote:
> Hello, I am Nikita Smetanin (ntz), russian student. I'm interested in
> fuzzy search algorithms (also known as similarity search and spelling
> correction), I have some articles and open-source implementations of
> related algorithms. I also have good experience in enterprise software
> development (Java/C++/C# and related stuff) and in small projects.
>
> I want to work on your project "Improve spelling correction", but I
> want to suggest some additions to that project:

That's cool - I actually added a new sentence to the ideas page earlier to make this clearer (http://trac.xapian.org/wiki/GSoCProjectIdeas):

  Note that these are ideas - some are more fully formed than others, but don't be afraid to take them and extend or adapt them in your proposal to produce something you're more interested in working on.

> - One or several phonetic matching algorithms to improve name and
> surname search.

How would you apply these? Just as something which could be applied to a field known to contain a name (e.g. author), or something more complex?

> - Alternative faster (than trigram) algorithm for correction candidate search.
> - More complicated word distance metric to improve result set relevance.
> - Something about improving stemming quality.
> - Language detection for automatic language-specific algorithms selection.
>
> I'll be happy to participate in this project during Google Summer of
> Code 2011 program and implement most of these ideas.

Cool - I know you've discussed a lot of this on IRC already, but feel free to ask/discuss further. And if you get a chance to translate any of your papers into English, I'd be interested to read them.

Cheers,
Olly