I am looking for a library/function in R that can compare two phrases and give me a score, or otherwise classify them as accurately as possible.

The "phrases" are obfuscated/messy. I am not concerned about which is "correct" (as in spell checking); I am only concerned with grouping them so that I know which are the closest match.

Example:

I have ROW1 and ROW2 like so:

ROW1                  ROW2
hamburger helper      bigmc heartkcatta
chicken nuggets       chicke, nuggets, jss
bigmac heartattack    some sombody somehwere
somebody somehwere    repleh regrubmah

I am looking for something that can tell me that the best match for hamburger helper is repleh regrubmah, and likewise for each other row.

So my goal is to write a program that, for each phrase in ROW1, runs this function against ROW2 and gives me the phrase that scored best.

I have read over many of the NLP packages at http://cran.r-project.org/web/views/NaturalLanguageProcessing.html

I thought lsa might be a good fit, but I am not sure. I have limited time, so I am hoping someone can point me in the direction of what I am looking for.

I have been searching for "text classifiers"; perhaps this problem is referred to as something else.

Brian
On Sat, Nov 17, 2012 at 11:00 PM, Brian Feeny <bfeeny at mac.com> wrote:
> I am looking for a library/function in R that can compare two phrases and give me a score, or somehow classify them as correct as possible.
> [...]

This is outside my expertise, but if memory serves, you might benefit from googling the Levenshtein (spelling?) distance, which allows this sort of fuzzy matching of strings.

MW
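For readers unfamiliar with the Levenshtein distance mentioned above: it counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal sketch in R follows, assuming the base function adist() from the utils package with its default unit costs; the "kitten"/"sitting" pair is a textbook illustration and is not from the original post.

```r
# adist() lives in the utils package, which is attached by default in R.
# With default costs, each insertion, deletion, or substitution counts as 1.

# Textbook pair (an illustration, not from the thread):
# 2 substitutions plus 1 insertion.
d1 <- adist("kitten", "sitting")

# One pair taken from the example table in the original post:
# 1 substitution ("n" -> ",") plus 5 inserted characters (", jss").
d2 <- adist("chicken nuggets", "chicke, nuggets, jss")

# adist() returns a matrix, here 1 x 1 for a single pair.
print(d1[1, 1])  # 3
print(d2[1, 1])  # 6
```

A smaller distance means a closer match, so the score can be used directly for ranking candidates.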
Thank you Michael and David. I am onto agrep and adist, and they look very useful for what I want to do. My initial results are promising!

Brian

On Nov 17, 2012, at 6:20 PM, R. Michael Weylandt wrote:
> This is outside my expertise, but if memory serves, you might benefit
> from googling the Levenshtein (spelling?) distance which allows this
> sort of fuzzy matching of strings.
>
> MW
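For anyone finding this thread later, here is one way the adist approach might look on the example data from the original post. This is only a sketch under stated assumptions, not necessarily what Brian ended up using. Since one of the example phrases is written backwards, the sketch also compares each phrase against the character-reversed candidates and keeps the smaller distance; that reversal step and the helper reverse_string() are assumptions added here for illustration and are not part of base R.

```r
row1 <- c("hamburger helper", "chicken nuggets",
          "bigmac heartattack", "somebody somehwere")
row2 <- c("bigmc heartkcatta", "chicke, nuggets, jss",
          "some sombody somehwere", "repleh regrubmah")

# Hypothetical helper (not in base R): reverse each string character by character.
reverse_string <- function(x) {
  vapply(strsplit(x, ""),
         function(ch) paste(rev(ch), collapse = ""),
         character(1))
}

# Edit-distance matrices: one row per phrase in row1, one column per phrase in row2.
d_plain    <- adist(row1, row2)
d_reversed <- adist(row1, reverse_string(row2))

# Assumption: a candidate may be written backwards, so keep the smaller distance.
d_best <- pmin(d_plain, d_reversed)

# For each phrase in row1, pick the row2 candidate with the smallest distance.
best_idx <- apply(d_best, 1, which.min)
result <- data.frame(
  phrase     = row1,
  best_match = row2[best_idx],
  distance   = d_best[cbind(seq_along(row1), best_idx)],
  stringsAsFactors = FALSE
)
print(result)
```

On this data the sketch pairs hamburger helper with repleh regrubmah (distance 0 after reversal), chicken nuggets with chicke, nuggets, jss, bigmac heartattack with bigmc heartkcatta, and somebody somehwere with some sombody somehwere. Plain adist without the reversal step should not be expected to recover the backwards phrase reliably. agrep differs in that it returns the candidates approximately matching a pattern within a threshold rather than a score matrix, so adist is the more direct fit for "give me the phrase that scored best".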