Hi,

I have to match names that may be recorded with errors or additions, so I am looking for a string search function that always returns the "closest" match. For example, searching for "Washington" should return only "Washington", not "Washington, D.C.". It could also be that the list contains only "Hamburg" but the record I am searching for is "Hamburg OTM", in which case we still want to find "Hamburg". Or the list may contain both "Hamburg" and "Hamberg" while we are searching for "Hamburg", and then only "Hamburg" should be returned.

agrep() returns all "close" matches but unfortunately does not return the degree of closeness; otherwise selection would be easy. Is there such a function already implemented?

Thanks a million for your help,
Werner
Richard.Cotton at hsl.gov.uk
2008-Aug-26 13:10 UTC
[R] String search: Return "closest" match
> I have to match names where names can be recorded with errors or additions.
> Now I am searching for a string search function which returns always
> the "closest" match. E.g. searching for "Washington" it should
> return only Washington but not Washington, D.C. But it also could be
> that the list contains only "Hamburg" but the record I am searching
> for is "Hamburg OTM" and then we still want to find "Hamburg". Or
> maybe the list contains "Hamburg" and "Hamberg" but we are searching
> for "Hamburg" and thus only this one should be returned.
>
> agrep() returns all "close" matches but unfortunately does not
> return the degree of closeness otherwise selection would be easy.
> Is there such a function already implemented?

The Levenshtein distance is a common metric for determining how close two strings are (in fact, agrep uses this). There's a function to calculate it on the R wiki.

http://wiki.r-project.org/rwiki/doku.php?id=tips:data-strings:levenshtein

You can use this to find the closest string: compute the distance from the search term to every name in the list and keep the name with the smallest distance. (If your set of cities is large, it may be quickest to use agrep to narrow the selection first, since the pure R implementation of levenshtein is likely to be slow.)

Regards,
Richie.

Mathematical Sciences Unit
HSL
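The approach described above can be sketched as follows. This is a minimal illustration, not the wiki function: it assumes a current R installation where utils::adist() is available to compute the edit distance, and the names closest_match(), prefilter and cities are hypothetical, chosen only for this example.

```r
# Sketch only: closest_match() and its arguments are hypothetical names,
# not part of base R or of the wiki function mentioned above.
closest_match <- function(query, candidates, prefilter = FALSE) {
  pool <- candidates
  if (prefilter) {
    # Optionally narrow the pool with agrep(); fall back to the full
    # list if the approximate search finds nothing.
    hits <- agrep(query, candidates, value = TRUE)
    if (length(hits) > 0) pool <- hits
  }
  # adist() returns a matrix of generalized Levenshtein distances;
  # with a single query string it has exactly one row.
  d <- utils::adist(query, pool)[1, ]
  best <- min(d)
  # All ties are returned so the caller can decide between them.
  data.frame(match = pool[d == best], distance = best,
             stringsAsFactors = FALSE)
}

# Example data taken from the question.
cities <- c("Washington", "Washington, D.C.", "Hamburg", "Hamberg")
closest_match("Washington", cities)
closest_match("Hamburg OTM", cities)
closest_match("Hamburg", cities, prefilter = TRUE)
```

Returning the distance alongside the match gives the "degree of closeness" asked for, so a cutoff can be applied afterwards. Note that the agrep() prefilter may discard the best candidate when the search term is longer than the listed name (as with "Hamburg OTM"), which is why the sketch leaves it off by default and falls back to the full list when nothing is found.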
That works perfectly, great. Thanks a lot for that, Richard!

Werner