Hi, everyone, Just wondering if someone had come up with a good way to do fuzzy searches if you use MySQL as your database (we tried switching to PostgreSQL, but that ended up adding even more problems). Ferret sounds great, but reading through the discussion it looks like we need to solve the problem of write conflicts. I just wrote a post in ruby-talk about using KirbyBase maybe to solve this. Of course, I''m hoping someone has come up with an even better solution! ;-) Anyway, any advice would be appreciated. Thanks! Jen
Here''s a simple fuzzy-search algorithm you can plug in to whatever you
decide to use.
Most of this code was stolen from php: http://us2.php.net/similar_text (in
comments section).
O(m*n), where m and n are lengths of the two target strings.
You may want to rewrite it to make it look more Rubyish.
# this is the main algorithm.
# compute longest common subsequence btn strings s1 and s2
def lcs_length(s1, s2)
m = [s1.length,128].min
n = [s2.length,128].min
x = 1 + [m,n].max
# this table will be used to compute the LCS-Length,
# only the first 128 chars per string are considered
lcs_tbl = []
# initialize 2-dimensional array to 0
(0...x).each { || lcs_tbl << Array.new(x,0) }
(1..m).each { |i|
(1..n).each { |j|
if (s1[i-1] == s2[j-1])
lcs_tbl[i][j] = lcs_tbl[i-1][j-1] + 1
elsif (lcs_tbl[i-1][j] >= lcs_tbl[i][j-1])
lcs_tbl[i][j] = lcs_tbl[i-1][j]
else
lcs_tbl[i][j] = lcs_tbl[i][j-1]
end
}
}
return lcs_tbl[m][n]
end
def get_lcs(s1, s2, verbose=false)
# ok, now replace all spaces and weird characters with
''normal'' ascii-7
s1 = normalize_string(s1);
s2 = normalize_string(s2);
# check for pathological cases
return 0 if [s1.length, s2.length].max < 1
puts "comparing strings #{s1} and #{s2}" if verbose
lcs = lcs_length(s1,s2) # compute longest common subsequence
# average string length
ms = (s1.length + s2.length) / 2.0
(puts "ms: #{ms} lcs: #{lcs}";puts) if verbose
return (lcs*100)/ms
end
# code stolen from one of the ruby MLs.
def normalize_string(foo)
foo = foo.downcase.strip
foo.gsub!(/[ÀÁÂÃâäàãáäåāăąǎǟǡǻȁȃȧẵặ]/,''a'')
foo.gsub!(/[ëêéèẽēĕėẻȅȇẹȩęḙḛềếễểḕḗệḝ]/,''e'')
foo.gsub!(/[ÌÍÎĨÏiìíîĩīĭïỉǐịįȉȋḭɨḯ]/,''i'')
foo.gsub!(/[ÒÓÔÕÖòóôõōŏȯöỏőǒȍȏơǫọɵøồốỗổȱȫȭṍṏṑṓờớỡởợǭộǿ]/,''o'')
foo.gsub!(/[ÙÚÛŨÜùúûũūŭüủůűǔȕȗưụṳųṷṵṹṻǖǜǘǖǚừứữửự]/,''u'')
foo.gsub!(/[ỳýŷỹȳẏÿỷẙƴỵ]/,''y'')
foo.gsub!(/[œ]/,''oe'')
foo.gsub!(/[ÆǼǢæ]/,''ae'')
foo.gsub!(/[ñǹń]/,''n'')
foo.gsub!(/[Çç]/,''c'')
foo.gsub!(/[ß]/,''b'')
foo.gsub!(/[œ]/,''oe'')
foo.gsub!(/[ij]/,''ij'')
foo.gsub!(/[\s\''\"\\\/\?\.\=\+\&\%]/,''_'')
foo.gsub!(/_+/,''_'')
foo
end
# a little test
[
["a","a"],["ab","ab"],["abc","abc"],
["a","b"],["ab","bb"],["abc","abb"],
["a","1"],["ab","_b"],["abcd","a
cd"],
["abcd","abcde"],["abcd","abcdefghi"],["abcde","abcdf"],
["abcd","zbcd"],["abcd","efghiabcd"],["abcde","fabcd"],
["abc","acb"],["abcde","abzde"],["abcde","abde"],
["abc","ac"],["abcde","ab1de"],["abcde","ab
de"]
].each { |e|
puts get_lcs(e[0],e[1],true)
puts
}
jennyw wrote:> Hi, everyone,
>
> Just wondering if someone had come up with a good way to do fuzzy
> searches if you use MySQL as your database (we tried switching to
> PostgreSQL, but that ended up adding even more problems). Ferret sounds
> great, but reading through the discussion it looks like we need to solve
> the problem of write conflicts. I just wrote a post in ruby-talk about
> using KirbyBase maybe to solve this. Of course, I''m hoping someone
has
> come up with an even better solution! ;-)
>
> Anyway, any advice would be appreciated.
>
> Thanks!
>
> Jen
On 11/26/05, Lou Vanek <vanek-9H8CmIPm+GA@public.gmane.org> wrote:> Here''s a simple fuzzy-search algorithm you can plug in to whatever you decide to use. > Most of this code was stolen from php: http://us2.php.net/similar_text (in comments section). > O(m*n), where m and n are lengths of the two target strings. > You may want to rewrite it to make it look more Rubyish.I believe that Jen was looking for a solution for doing the fuzzy search on the MySQL server itself. Sincerely, Tom Lieber tom-V0YqjHVuocLQT0dZR+AlfA@public.gmane.org http://AllTom.com/
MySQL does not have a fuzzy search capability, nor does the stored procedure language have arrays, so fuzzy search embedded w/in the MySQL database is not possible to do efficiently (although you can fake arrays either with temp tables or gobs of string ops). Which leaves you with downloading the data and performing the match outside of the db. On the surface this appears inefficient, because, after all, the data has to be moved to the client each time a search is processed. But considering that the search itself is O(m*n), data transfer is not the overriding concern (unless you have an unbearably slow transfer speed). In fact, many fuzzy search algorithms have a much worse running-time order [O(n^3)], so pulling the data off the server so that a more efficient algorithm can be used may even be faster than doing the computation centrally on the server. Tom Lieber wrote:> On 11/26/05, Lou Vanek <vanek-9H8CmIPm+GA@public.gmane.org> wrote: > >>Here''s a simple fuzzy-search algorithm you can plug in to whatever you decide to use. >>Most of this code was stolen from php: http://us2.php.net/similar_text (in comments section). >>O(m*n), where m and n are lengths of the two target strings. >>You may want to rewrite it to make it look more Rubyish. > > > I believe that Jen was looking for a solution for doing the fuzzy > search on the MySQL server itself. > > Sincerely, > > Tom Lieber > tom-V0YqjHVuocLQT0dZR+AlfA@public.gmane.org > http://AllTom.com/
I may try that algorithm Lou posted. I''m going to choose between that and using Ferret to do this ... David Balmain said that locking in Ferret should be able to handle this without a server process except for really high concurrency applications. Since I''m dealing with less than 10,000 records, I don''t think this will be a big issue. Thanks! Jen
I ran across some odd behavior with multiple subclasses of ActiveRecord. Sometimes find() and find_all() would find subclass records and sometimes not, depending on whether a subclass had been loaded or not. I can imagine why this happens, and how it might be difficult to fix, but it was surprising and took me a while to figure out. Here''s a concrete example. Say I have three classes that inherit from ActiveRecord, and I have one instance of each in my ''animals'' table: Dog < Mammal < Animal < ActiveRecord::Base Run these command in order and you get: Mammal.count() = 1 Dog.count() = 1 Mammal.count() = 2! Here''s the SQL: SHOW FIELDS FROM animals SELECT COUNT(*) FROM animals WHERE ( (animals.`type` = ''Mammal'' ) ) SHOW FIELDS FROM animals SELECT COUNT(*) FROM animals WHERE ( (animals.`type` = ''Dog'' ) ) SELECT COUNT(*) FROM animals WHERE ( (animals.`type` = ''Mammal'' OR animals.`type` = ''Dog'' ) ) Sure looks like Mammal/Rails doesn''t know about Dog until I use Dog. There''s a workaround: Dog Mammal.count() = 2 Dog.count() = 1 Mammal.count() = 2 Not sure how easy it would be to fix, or if it even needs fixing. If anyone wants a simple example, download: http://www.butlerpress.com/sti.zip Scott
Scott Willson wrote:> Not sure how easy it would be to fix, or if it even needs fixing. If > anyone wants a simple example, download: > http://www.butlerpress.com/sti.zipWhat about explicitly using the ''model :x'' statements? --J -- Posted via http://www.ruby-forum.com/.