I am trying to do a search for a crazy problem. Anyways, the search can only contain one word and the database it''s searching contains one word per row. Basically if someone tries to search for a word I want to find the closest match. Ex: If someone types "cat" and "cat" is not in my database, but "bat" is, how could I select bat? Also, if someone types "cat" and I dont have anything remotely close to that then I want to return nothing. Lastly, these are not all english words either. Any help is greatly appreciated. -- Posted via http://www.ruby-forum.com/.
Ben Johnson <bjohnson@mediamanifest.com> writes:> I am trying to do a search for a crazy problem. Anyways, the search can > only contain one word and the database it''s searching contains one word > per row. Basically if someone tries to search for a word I want to find > the closest match. > > Ex: > > If someone types "cat" and "cat" is not in my database, but "bat" is, > how could I select bat? > > Also, if someone types "cat" and I dont have anything remotely close to > that then I want to return nothing. > > Lastly, these are not all english words either. > > Any help is greatly appreciated.a simple brute force solution would be. Search for cat. Then search for, ''.at'' (notice regexp), then ''c.t'', etc. definitely above can be made smarter. HTH. -- Surendra Singhi http://ssinghi.kreeti.com, http://www.kreeti.com Read my blog at: http://cuttingtheredtape.blogspot.com/ ,---- | "O thou my friend! The prosperity of Crime is like unto the lightning, | whose traitorous brilliancies embellish the atmosphere but for an | instant, in order to hurl into death''s very depths the luckless one | they have dazzled." -- Marquis de Sade `----
Ben Johnson wrote:> I am trying to do a search for a crazy problem. Anyways, the search can > only contain one word and the database it''s searching contains one word > per row. Basically if someone tries to search for a word I want to find > the closest match. > > Ex: > > If someone types "cat" and "cat" is not in my database, but "bat" is, > how could I select bat? > > Also, if someone types "cat" and I dont have anything remotely close to > that then I want to return nothing.Compute some kind of signatures for each word (preferably numeric), store them in the database with each word. These signatures should have such property, that for similar (whatever that means in your context) words they have similar values. Then just compute signature of the word you want to look for and search based on this. The problem of course is with finding a suitable signature generation algorithm. But the problem isn''t new and you should be able to find something on that topic (I have never done such a thing but have heard of this kind of solution). -- Posted via http://www.ruby-forum.com/.
Marcin Simonides wrote:> Ben Johnson wrote: >> I am trying to do a search for a crazy problem. Anyways, the search can >> only contain one word and the database it''s searching contains one word >> per row. Basically if someone tries to search for a word I want to find >> the closest match. >> >> Ex: >> >> If someone types "cat" and "cat" is not in my database, but "bat" is, >> how could I select bat? >> >> Also, if someone types "cat" and I dont have anything remotely close to >> that then I want to return nothing. > > Compute some kind of signatures for each word (preferably numeric), > store them in the database with each word. These signatures should have > such property, that for similar (whatever that means in your context) > words they have similar values. > Then just compute signature of the word you want to look for and search > based on this. > > The problem of course is with finding a suitable signature generation > algorithm. But the problem isn''t new and you should be able to find > something on that topic (I have never done such a thing but have heard > of this kind of solution).That would actually be perfect, because I could easily find the term that is closest to that number. The only problem is that I have absolutely no idea how to write that algorithm. Any idea what to search for on google or do you know where I can start with this? I searched Google for a good hour and basically ended up with nothing. Thanks. -- Posted via http://www.ruby-forum.com/.
Marcin Simonides wrote:> Ben Johnson wrote: >> I am trying to do a search for a crazy problem. Anyways, the search can >> only contain one word and the database it''s searching contains one word >> per row. Basically if someone tries to search for a word I want to find >> the closest match. >> >> Ex: >> >> If someone types "cat" and "cat" is not in my database, but "bat" is, >> how could I select bat? >> >> Also, if someone types "cat" and I dont have anything remotely close to >> that then I want to return nothing. > > Compute some kind of signatures for each word (preferably numeric), > store them in the database with each word. These signatures should have > such property, that for similar (whatever that means in your context) > words they have similar values. > Then just compute signature of the word you want to look for and search > based on this. > > The problem of course is with finding a suitable signature generation > algorithm. But the problem isn''t new and you should be able to find > something on that topic (I have never done such a thing but have heard > of this kind of solution).Also, I guess what I''m talking about is very similar to a spell check algorithm, but my words that I''m suggesting are not words from the dictionary, just random strings in a database. -- Posted via http://www.ruby-forum.com/.
Hi Ben, you might have a look at the fuzzy search capabilities of ferret ( http://ferret.davebalmain.com) or at least the fuzzy search algorithm that lucene / ferret are using. Cheers, Jan On 8/12/06, Ben Johnson <bjohnson@mediamanifest.com> wrote:> > Marcin Simonides wrote: > > Ben Johnson wrote: > >> I am trying to do a search for a crazy problem. Anyways, the search can > >> only contain one word and the database it''s searching contains one word > >> per row. Basically if someone tries to search for a word I want to find > >> the closest match. > >> > >> Ex: > >> > >> If someone types "cat" and "cat" is not in my database, but "bat" is, > >> how could I select bat? > >> > >> Also, if someone types "cat" and I dont have anything remotely close to > >> that then I want to return nothing. > > > > Compute some kind of signatures for each word (preferably numeric), > > store them in the database with each word. These signatures should have > > such property, that for similar (whatever that means in your context) > > words they have similar values. > > Then just compute signature of the word you want to look for and search > > based on this. > > > > The problem of course is with finding a suitable signature generation > > algorithm. But the problem isn''t new and you should be able to find > > something on that topic (I have never done such a thing but have heard > > of this kind of solution). > > That would actually be perfect, because I could easily find the term > that is closest to that number. The only problem is that I have > absolutely no idea how to write that algorithm. Any idea what to search > for on google or do you know where I can start with this? I searched > Google for a good hour and basically ended up with nothing. > > Thanks. > > -- > Posted via http://www.ruby-forum.com/. > _______________________________________________ > Rails mailing list > Rails@lists.rubyonrails.org > http://lists.rubyonrails.org/mailman/listinfo/rails >-------------- next part -------------- An HTML attachment was scrubbed... URL: http://wrath.rubyonrails.org/pipermail/rails/attachments/20060812/39ac0935/attachment-0001.html
On Saturday 12 August 2006 05:29, Ben Johnson wrote:> I am trying to do a search for a crazy problem. Anyways, the search > can only contain one word and the database it''s searching contains > one word per row. Basically if someone tries to search for a word I > want to find the closest match.If you don''t need "closest match", but "match from an indordinately large set" will do, then look at Bloom filters: http://dataspill.org/pages/projects/ruby-bloomfilter Michael -- Michael Schuerig mailto:michael@schuerig.de http://www.schuerig.de/michael/
> That would actually be perfect, because I could easily find the term > that is closest to that number. The only problem is that I have > absolutely no idea how to write that algorithm. Any idea what to search > for on google or do you know where I can start with this? I searched > Google for a good hour and basically ended up with nothing.The problem you describe has not yet been completely solved in computer sciences - at least to my knowledge. You are dealing with a set of words and search the "closest" match. Thus, first you have to define what "close" means. In mathematics and CS, the "function" that returns the numeric distance between two elements of a set is called the set''s "metric". An example is the euclidian metric sqrt(sqr(x) + sqr(y)). Try to search the internet for "metric" and "word metric" to find out about metrics in general. The problem that presents itself now is to index a metric space without much more special properties (as vector spaces would have). The only index structure I know is called "M Tree". Why index? Well, consider the following approach: Query all datasets and then for each data set compute the difference between the query word and the word in the data set. Sort the differences and select the 5 smallest. Sloooow for large sets. So the indexing has to take place in the database. I do not know any database that implements M trees as of now and extending PostgreSQL, for example, to support M Tree indexes would propably break your project''s scope. So what to do? Settle for a heuristic solution! The easiest thing here would be either to use the full text index engine of the RDBMS of your choice or fall back to a separate server process running on your server like Lucene or something that uses the BloomFilters or a similar approach. For each word you enter into your database, you add the word with the primary key of that entry to your word database server process. Another possibility would be to "factorize" similar words, i.e. similar sounding words and/or letters are changed to the same. For example, "similar" and "cimilar" could be considered the same. Hope That Helps, Manuel
Mysql does soundex, which isn''t exactly what you described, but might be close enough: http://www.madirish.net/tech.php?section=4&article=85 -- Posted via http://www.ruby-forum.com/.
Juan Felipe Garcia wrote:> Mysql does soundex, which isn''t exactly what you described, but might be > close enough: > http://www.madirish.net/tech.php?section=4&article=85I actually found this, which is pretty good: http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Edit/ The only problem is performance. I have to run this algorithm against every word in the db. So as the first poster mentioned, if I could have an algorithm that basically stored each word as an integer I could simply find words that are similar with a range. So if I get "bat" and let''s say it returns 20. I could do: select word from words where rating < 30 and rating > 10 Anyways, that would be the optimal solution, if I could just find the algorithm. -- Posted via http://www.ruby-forum.com/.