thr3ads.net - Rails - Fuzzy searching [Nov 2005]

If this information is useful, please help other people find it:
Share via:

jennyw

2005-Nov-26 19:15 UTC

Fuzzy searching

Hi, everyone,

Just wondering if someone had come up with a good way to do fuzzy 
searches if you use MySQL as your database (we tried switching to  
PostgreSQL, but that ended up adding even more problems). Ferret sounds 
great, but reading through the discussion it looks like we need to solve 
the problem of write conflicts. I just wrote a post in ruby-talk about 
using KirbyBase maybe to solve this. Of course, I''m hoping someone has 
come up with an even better solution! ;-)

Anyway, any advice would be appreciated.

Thanks!

Jen

Lou Vanek

2005-Nov-26 23:47 UTC

head link

Re: Fuzzy searching

Here''s a simple fuzzy-search algorithm you can plug in to whatever you
decide to use.
Most of this code was stolen from php: http://us2.php.net/similar_text (in
comments section).
O(m*n), where m and n are lengths of the two target strings.
You may want to rewrite it to make it look more Rubyish.


# this is the main algorithm.
# compute longest common subsequence btn strings s1 and s2
def lcs_length(s1, s2)

   m = [s1.length,128].min
   n = [s2.length,128].min
   x = 1 + [m,n].max

   # this table will be used to compute the LCS-Length,
   # only the first 128 chars per string are considered
   lcs_tbl = []
   # initialize 2-dimensional array to 0
   (0...x).each { || lcs_tbl << Array.new(x,0) }

   (1..m).each { |i|
	  (1..n).each { |j|
			 if (s1[i-1] == s2[j-1])
			   lcs_tbl[i][j] = lcs_tbl[i-1][j-1] + 1
			 elsif (lcs_tbl[i-1][j] >= lcs_tbl[i][j-1])
			   lcs_tbl[i][j] = lcs_tbl[i-1][j]
			 else
			   lcs_tbl[i][j] = lcs_tbl[i][j-1]
			 end
	  }
   }

   return lcs_tbl[m][n]
end




def get_lcs(s1, s2, verbose=false)

   # ok, now replace all spaces and weird characters with
''normal'' ascii-7
   s1 = normalize_string(s1);
   s2 = normalize_string(s2);

   # check for pathological cases
   return 0 if [s1.length, s2.length].max < 1

   puts "comparing strings #{s1} and #{s2}" if verbose
   lcs = lcs_length(s1,s2) # compute longest common subsequence

   # average string length
   ms = (s1.length + s2.length) / 2.0

   (puts "ms: #{ms} lcs: #{lcs}";puts) if verbose
   return (lcs*100)/ms
end




   # code stolen from one of the ruby MLs.
   def normalize_string(foo)
     foo = foo.downcase.strip
     foo.gsub!(/[ÀÁÂÃâäàãáäåāăąǎǟǡǻȁȃȧẵặ]/,''a'')
     foo.gsub!(/[ëêéèẽēĕėẻȅȇẹȩęḙḛềếễểḕḗệḝ]/,''e'')
     foo.gsub!(/[ÌÍÎĨÏiìíîĩīĭïỉǐịįȉȋḭɨḯ]/,''i'')
    
foo.gsub!(/[ÒÓÔÕÖòóôõōŏȯöỏőǒȍȏơǫọɵøồốỗổȱȫȭṍṏṑṓờớỡởợǭộǿ]/,''o'')
     foo.gsub!(/[ÙÚÛŨÜùúûũūŭüủůűǔȕȗưụṳųṷṵṹṻǖǜǘǖǚừứữửự]/,''u'')
     foo.gsub!(/[ỳýŷỹȳẏÿỷẙƴỵ]/,''y'')
     foo.gsub!(/[œ]/,''oe'')
     foo.gsub!(/[ÆǼǢæ]/,''ae'')
     foo.gsub!(/[ñǹń]/,''n'')
     foo.gsub!(/[Çç]/,''c'')
     foo.gsub!(/[ß]/,''b'')
     foo.gsub!(/[œ]/,''oe'')
     foo.gsub!(/[ĳ]/,''ij'')
    
foo.gsub!(/[\s\''\"\\\/\?\.\=\+\&\%]/,''_'')
     foo.gsub!(/_+/,''_'')
     foo
   end


   # a little test
   [
	 
["a","a"],["ab","ab"],["abc","abc"],
	 
["a","b"],["ab","bb"],["abc","abb"],
	 
["a","1"],["ab","_b"],["abcd","a
cd"],
	 
["abcd","abcde"],["abcd","abcdefghi"],["abcde","abcdf"],
	 
["abcd","zbcd"],["abcd","efghiabcd"],["abcde","fabcd"],
	 
["abc","acb"],["abcde","abzde"],["abcde","abde"],
	 
["abc","ac"],["abcde","ab1de"],["abcde","ab
de"]
   ].each { |e|
   	puts get_lcs(e[0],e[1],true)
	puts
   }



jennyw wrote:> Hi, everyone,
> 
> Just wondering if someone had come up with a good way to do fuzzy 
> searches if you use MySQL as your database (we tried switching to  
> PostgreSQL, but that ended up adding even more problems). Ferret sounds 
> great, but reading through the discussion it looks like we need to solve 
> the problem of write conflicts. I just wrote a post in ruby-talk about 
> using KirbyBase maybe to solve this. Of course, I''m hoping someone
has
> come up with an even better solution! ;-)
> 
> Anyway, any advice would be appreciated.
> 
> Thanks!
> 
> Jen

Tom Lieber

2005-Nov-27 03:23 UTC

head link

Re: Re: Fuzzy searching

On 11/26/05, Lou Vanek <vanek-9H8CmIPm+GA@public.gmane.org>
wrote:> Here''s a simple fuzzy-search algorithm you can plug in to whatever
you decide to use.
> Most of this code was stolen from php: http://us2.php.net/similar_text (in
comments section).
> O(m*n), where m and n are lengths of the two target strings.
> You may want to rewrite it to make it look more Rubyish.
I believe that Jen was looking for a solution for doing the fuzzy
search on the MySQL server itself.

Sincerely,

Tom Lieber
tom-V0YqjHVuocLQT0dZR+AlfA@public.gmane.org
http://AllTom.com/

Lou Vanek

2005-Nov-27 13:21 UTC

head link

Re: Fuzzy searching

MySQL does not have a fuzzy search capability, nor does the stored
procedure language have arrays, so fuzzy search embedded w/in
the MySQL database is not possible to do efficiently (although you
can fake arrays either with temp tables or gobs of string ops).
Which leaves you with downloading the data and performing the
match outside of the db. On the surface this appears inefficient,
because, after all, the data has to be moved to the client each
time a search is processed. But considering that the search itself
is O(m*n), data transfer is not the overriding concern (unless you
have an unbearably slow transfer speed).
In fact, many fuzzy search algorithms have a much worse running-time
order [O(n^3)], so pulling the data off the server so that a more
efficient algorithm can be used may even be faster than doing the
computation centrally on the server.

Tom Lieber wrote:
> On 11/26/05, Lou Vanek <vanek-9H8CmIPm+GA@public.gmane.org> wrote:
> 
>>Here''s a simple fuzzy-search algorithm you can plug in to
whatever you decide to use.
>>Most of this code was stolen from php: http://us2.php.net/similar_text
(in comments section).
>>O(m*n), where m and n are lengths of the two target strings.
>>You may want to rewrite it to make it look more Rubyish.
> 
> 
> I believe that Jen was looking for a solution for doing the fuzzy
> search on the MySQL server itself.
> 
> Sincerely,
> 
> Tom Lieber
> tom-V0YqjHVuocLQT0dZR+AlfA@public.gmane.org
> http://AllTom.com/

jennyw

2005-Nov-28 01:01 UTC

head link

Re: Re: Fuzzy searching

I may try that algorithm Lou posted. I''m going to choose between that 
and using Ferret to do this ... David Balmain said that locking in 
Ferret should be able to handle this without a server process except for 
really high concurrency applications. Since I''m dealing with less than 
10,000 records, I don''t think this will be a big issue.

Thanks!

Jen

Scott Willson

2005-Dec-15 02:41 UTC

head link

ActiveRecord Subclass Finders

I ran across some odd behavior with multiple subclasses of  
ActiveRecord. Sometimes find() and find_all() would find subclass  
records and sometimes not, depending on whether a subclass had been  
loaded or not. I can imagine why this happens, and how it might be  
difficult to fix, but it was surprising and took me a while to figure  
out.

Here''s a concrete example. Say I have three classes that inherit from  
ActiveRecord, and I have one instance of each in my ''animals''
table:
Dog < Mammal < Animal < ActiveRecord::Base

Run these command in order and you get:
Mammal.count() = 1
Dog.count() = 1
Mammal.count() = 2!

Here''s the SQL:
SHOW FIELDS FROM animals
SELECT COUNT(*) FROM animals WHERE ( (animals.`type` =
''Mammal'' ) )
SHOW FIELDS FROM animals
SELECT COUNT(*) FROM animals WHERE ( (animals.`type` = ''Dog'' )
)
SELECT COUNT(*) FROM animals WHERE ( (animals.`type` =
''Mammal'' OR
animals.`type` = ''Dog'' ) )

Sure looks like Mammal/Rails doesn''t know about Dog until I use Dog.  
There''s a workaround:
Dog
Mammal.count() = 2
Dog.count() = 1
Mammal.count() = 2

Not sure how easy it would be to fix, or if it even needs fixing. If  
anyone wants a simple example, download:
http://www.butlerpress.com/sti.zip

Scott

jonathan

2005-Dec-15 06:16 UTC

head link

Re: ActiveRecord Subclass Finders

Scott Willson wrote:> Not sure how easy it would be to fix, or if it even needs fixing. If
> anyone wants a simple example, download:
> http://www.butlerpress.com/sti.zip
What about explicitly using the ''model :x'' statements?

--J

-- 
Posted via http://www.ruby-forum.com/.

Reasonably Related Threads

Search for more possibly parallel threads

Rails - Nov 2005 - Fuzzy searching

Fuzzy searching

Re: Fuzzy searching

Re: Re: Fuzzy searching

Re: Fuzzy searching

Re: Re: Fuzzy searching

ActiveRecord Subclass Finders

Re: ActiveRecord Subclass Finders

Reasonably Related Threads