Not sure how much this will interest people but I don''t have a blog so I''m posting something I threw together today cause I think it might be useful. In what little free time I have I''ve been wanting to put together a Rails/Ferret based restful dictionary. So I finally got a chance to get started today so the first thing I wanted to do was implement a metaphone analyzer and filter. Some links for more info on the metaphone algorithms: http://en.wikipedia.org/wiki/Metaphone http://en.wikipedia.org/wiki/Double_Metaphone The jist of it is that it breaks words down into its phonetic parts. For example, the word ''cool'' and ''kewl'' both become ''KL'' in the double-metaphone algorithm. Indexing dictionary words in this manner is almost essential so that users can find the proper spelling of a word by spelling it how it sounds. The first thing I did was create a MetaphoneFilter class that would run the metaphone algorithm over a token stream. It''s a fairly simple class, but does require the ''Text'' gem be installed. require ''ferret'' require ''text'' module Curtis module Analysis # TODO write tests! class MetaphoneFilter < Ferret::Analysis::TokenStream def initialize(token_stream, version = :double) @input = token_stream @version = version end def next t = @input.next return nil if t.nil? t.text = @version.eql?(:double) ? Text::Metaphone.double_metaphone(t.text) : Text::Metaphone.metaphone(t.text) end end end end Second I created a MetaphoneAnalyzer class that would use the MetaphoneFilter created above. The MetaphoneAnalyzer also makes use of the StemFilter so that words like "eat" and "eating" both equal to "eat". require ''ferret'' # TODO write tests module Curtis module Analysis class MetaphoneAnalyzer < Ferret::Analysis::Analyzer include Ferret::Analysis def initialize(version = :double, stop_words = ENGLISH_STOP_WORDS) @stop_words = stop_words @version = version end def token_stream(field, str) MetaphoneFilter.new(StemFilter.new(StopFilter.new(LowerCaseFilter.new(StandardTokenizer.new(str)), @stop_words)), @version) end end end end I saved both of these files, ''metaphone_filter.rb'' and ''metaphone_analyzer.rb'' to RAILS_ROOT/extras. Next I added the following line to my ''config/environments.rb'' file: config.load_paths += %W{ #{RAILS_ROOT}/extras } after that i fired up script/console to test it all out:>> require ''metaphone_analyzer''=> true>> include Curtis::Analysis=> Object>> ts = MetaphoneAnalyzer.new.token_stream(nil, "the quick brown fox jumpedover the lazy dog") => $<Curtis::Analysis::Metaphonefilter...... at version=:double>>> while token = ts.next >> p token >> end["KK", nil] ["PRN", nil] ["FKS", nil] ["JMP", "AMP"] ["AFR", nil] ["LS", nil] ["TK", nil] => nil As you can see it has been metaphoned. Now if someone were to search but inadvertently type ''qwick'' instead of ''quick'' it would still match because ''qwick'' metaphoned also becomes ''KK''. Still a lot to do, such as test it with AAF, and see how it interacts with using slop (which measures the Levenshtein distance, http://en.wikipedia.org/wiki/Levenshtein, between two terms) so that I can put in a "Did you mean xxx" feature (where xxx is a list of terms within a certain distance of the original query). Plus many other ideas also, such as thesaurus searching. Hopefully this has been informative. Wanted to show how to create new Analyzers and Filters for anyone who was curious (I know I was until today), as well as give a general idea for how I''m going to put them to use. I''d be happy to hear any questions or comments on the above. Oh, one last thing.. the MetaphoneAnalyzer and MetaphoneFilter default to the double-metaphone algorithm. Just pass pass nil (or anything other than :double when constructing the analyzer to use just the metaphone algorithm. Thanks, Curtis
On 25.11.2006, at 22:38, Curtis Hatter wrote:> As you can see it has been metaphoned. Now if someone were to > search but > inadvertently type ''qwick'' instead of ''quick'' it would still match > because > ''qwick'' metaphoned also becomes ''KK''.You can achieve almost the same result using Ferret''s built-in FuzzyQuery. It works even better for misspellings than phonetic algorithms, and it''s language-neutral. Consider: i = Ferret::I.new i << "the quick brown fox" i.search("quikc~").total_hits => 1 i.search("qwick~").total_hits => 1 Whereas metaphone yields: Text::Metaphone.double_metaphone("quick") => ["KK", nil] Text::Metaphone.double_metaphone("quikc") => ["KKK", nil] Cheers, Andy
On Saturday 25 November 2006 17:21, Andreas Korth wrote:> On 25.11.2006, at 22:38, Curtis Hatter wrote: > > As you can see it has been metaphoned. Now if someone were to > > search but > > inadvertently type ''qwick'' instead of ''quick'' it would still match > > because > > ''qwick'' metaphoned also becomes ''KK''. > > You can achieve almost the same result using Ferret''s built-in > FuzzyQuery. It works even better for misspellings than phonetic > algorithms, and it''s language-neutral. > > Consider: > > i = Ferret::I.new > i << "the quick brown fox" > > i.search("quikc~").total_hits > => 1 > i.search("qwick~").total_hits > => 1 > > Whereas metaphone yields: > > Text::Metaphone.double_metaphone("quick") > => ["KK", nil] > Text::Metaphone.double_metaphone("quikc") > => ["KKK", nil] >I''m looking at trying to use both. My reason: i = Ferret::I.new i << "The quick brown fox" i.search("qwik~").total_hits => 0 Where as double metaphoning "quick" or "qwik" both become "KK". What I''m thinking might be a good solution is to index the word and it''s double-metaphone equivalent. Then search for exact hits against the metaphone and fuzzy hits against the word field. Then sort based on score, with hopefully exact matches being 100. Still investigating the best solutions. Thanks for the ideas, Curtis Curtis
On 26.11.2006, at 22:35, Curtis Hatter wrote:>> i = Ferret::I.new >> i << "the quick brown fox" >> >> i.search("quikc~").total_hits >> => 1 >> i.search("qwick~").total_hits >> => 1 >> >> Whereas metaphone yields: >> >> Text::Metaphone.double_metaphone("quick") >> => ["KK", nil] >> Text::Metaphone.double_metaphone("quikc") >> => ["KKK", nil] >> > > I''m looking at trying to use both. My reason: > > i = Ferret::I.new > i << "The quick brown fox" > > i.search("qwik~").total_hits > => 0Which is OK I guess, since ''qwik'' and ''quick'' are quite different. Still, you can adjust the tolerance of FuzzyQuery if desired: i.search("qwik~0.4").total_hits => 1> Where as double metaphoning "quick" or "qwik" both become "KK".Yep. In the same way as ''bag'', ''pack'', ''back'', ''poke'' and ''pike'' all become ''PK''. I think the accurracy of this particular phonetic algorithm is disputable.> What I''m thinking might be a good solution is to index the word and > it''s > double-metaphone equivalent. Then search for exact hits against the > metaphone > and fuzzy hits against the word field. Then sort based on score, with > hopefully exact matches being 100.You should in any case index the actual terms, because the metaphones alone would make exact matches impossible. If you use FuzzySearch, you don''t need an extra field and you autmatically get a score based on how close the match is. Example: i = Ferret::I.new i << "quick" i << "quikc" i << "quack" i << "quake" i << "quark" i << "quid" i << "quiche" i.search_each("quikc~0.3") do |doc, score| printf "%6s %1.2f\n", i[doc][:id], score end quikc 0.88 quick 0.53 quake 0.53 quid 0.44 quack 0.35 quark 0.35 quiche 0.35 As you can see, the exact match ranks highest. -- Andy
> Yep. In the same way as ''bag'', ''pack'', ''back'', ''poke'' and ''pike'' all > become ''PK''. I think the accurracy of this particular phonetic > algorithm is disputable.true.. and had I not been introduced to guitar hero 2 this weekend I think I might have realized that myself. I haven''t good success with the Soundex algorithm either. Metaphone seemed good at first but I think you''ve convinced me otherwise. ''Pike'' and ''bag'' should not be the same.> You should in any case index the actual terms, because the metaphones > alone would make exact matches impossible. > > If you use FuzzySearch, you don''t need an extra field and you > autmatically get a score based on how close the match is. > > Example: > > i = Ferret::I.new > > i << "quick" > i << "quikc" > i << "quack" > i << "quake" > i << "quark" > i << "quid" > i << "quiche" > > i.search_each("quikc~0.3") do |doc, score| > printf "%6s %1.2f\n", i[doc][:id], score > end > > quikc 0.88 > quick 0.53 > quake 0.53 > quid 0.44 > quack 0.35 > quark 0.35 > quiche 0.35 > > As you can see, the exact match ranks highest.I think I''ll try this approach first and add in a phonetic algorithm if necessary. At least I discovered how to write filters for Ferret, which was much easier that I would have imagined. Thanks for the information, is nice to learn a bit more about the things Ferret can already do so well. Curtis
On 27.11.2006, at 04:26, Curtis Hatter wrote:>> Yep. In the same way as ''bag'', ''pack'', ''back'', ''poke'' and ''pike'' all >> become ''PK''. I think the accurracy of this particular phonetic >> algorithm is disputable. > > true.. and had I not been introduced to guitar hero 2 this weekend > I think I > might have realized that myself.Sounds like a lot of FN.> I think I''ll try this approach first and add in a phonetic > algorithm if > necessary.It depends on what you want to achieve. If you want to compensate for typos, FuzzyQuery is probably better. A simple letter-swap can easily trick a phonetic algorithm: metaphone => MTFN metahpone => MTPN FuzzyQuery will catch this even at a relatively low sensitivity of 0.75 (''metahpone~0.75'') I used FuzzyQuery to build a doublet detection into my Rails app. When a user creates a new Person record, a fuzzy search is run in the background as the form fields are filled out. Possible doublets are displayed next to the form in a "Did you mean..." fashion. For example, if the user enters "Rachel Welsh", the doublet detection would find "Raquel Welch". Before Ferret I tried to achieve this with MySQL''s SOUNDEX function, which didn''t work quite as well. (Although SOUNDEX, which is based on the algorithm of the same name, still works way better than metaphone.)> At least I discovered how to write filters for Ferret, which was > much easier > that I would have imagined.Yep. It''s great that Ferret can be extended and customized in so many ways.> Thanks for the information, is nice to learn a bit more about the > things > Ferret can already do so well.Ferret is the single best Ruby library I''ve come across in the past two years. It just rocks. Period. Thanks David for giving it to us! Cheers, Andreas