Dear list,

I spent about two hours searching the message archive, to no avail.
I have a list of people who have to pass an on-line test, but only a fraction of them do it. Moreover, as they type in their own names, the resulting strings do not always match the names I have in my database.

I would like to do two things:

1. Match any strings that are 90% the same.
Example:
name1 <- "Harry Harrington"
name2 <- "Harry Harington"
I need a function that would declare those strings a match (ideally with an argument that would allow using 80% instead of 90%).

2. Arrange a final table that would take me from:

Table1 (the complete list of people from my database)
No  Name
1   Byron C. Andrew
2   Friedman Bob
3   Harrington Harry

Table2 (the people who have been tested)
No  Name             Score
1   Harry Harington  13
2   Byron Andrew     28

to:

No  Name1             Name2            Score
1   Byron C. Andrew   Byron Andrew     28
2   Friedman Bob
3   Harrington Harry  Harry Harington  13

Thank you in advance, any help is highly appreciated.
Adrian
On Wed, 5 Jan 2005 adi at roda.ro wrote:

> Dear list,
>
> I spent about two hours searching the message archive, to no avail.
> I have a list of people who have to pass an on-line test, but only a fraction
> of them do it. Moreover, as they type in their own names, the resulting strings
> do not always match the names I have in my database.
>
> I would like to do two things:
>
> 1. Match any strings that are 90% the same.
> Example:
> name1 <- "Harry Harrington"
> name2 <- "Harry Harington"
> I need a function that would declare those strings a match (ideally with an
> argument that would allow using 80% instead of 90%).

agrep() does something very similar to this. It works with a maximum edit distance (its max.distance argument) rather than a % similarity, but you should be able to tune it to do what you want.

> 2. Arrange a final table that would take me from:
>
> Table1 (the complete list of people from my database)
> No  Name
> 1   Byron C. Andrew
> 2   Friedman Bob
> 3   Harrington Harry
>
> Table2 (the people who have been tested)
> No  Name             Score
> 1   Harry Harington  13
> 2   Byron Andrew     28
>
> to:
>
> No  Name1             Name2            Score
> 1   Byron C. Andrew   Byron Andrew     28
> 2   Friedman Bob
> 3   Harrington Harry  Harry Harington  13
>

This may not be very well-defined, since 90% agreement is not an equivalence relation (A can match B and B can match C without A matching C). Assuming that sets of matches are either identical or disjoint, you could construct a numeric variable in table 2 that indicates which row of table 1 to match, by using agrep() in a loop.

-thomas
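The agrep() loop suggested above could look roughly like the sketch below. It is only one possible reading of the suggestion, not code from the thread: the helpers normalize_name() and match_rows() are hypothetical names, and sorting the words of each name is an added assumption, needed here because the sample database stores "Harrington Harry" while the test-taker typed "Harry Harington". A name is accepted only when agrep() finds exactly one candidate row; anything ambiguous or unmatched is left as NA.

```r
# Example data copied from the original post.
table1 <- data.frame(No = 1:3,
                     Name = c("Byron C. Andrew", "Friedman Bob",
                              "Harrington Harry"),
                     stringsAsFactors = FALSE)
table2 <- data.frame(No = 1:2,
                     Name = c("Harry Harington", "Byron Andrew"),
                     Score = c(13, 28),
                     stringsAsFactors = FALSE)

# Hypothetical helper (an assumption, not part of agrep): lower-case the
# name, drop punctuation, and sort its words so that word order is ignored.
normalize_name <- function(x) {
  x <- gsub("[[:punct:]]", "", tolower(x))
  vapply(strsplit(x, "[[:space:]]+"),
         function(w) paste(sort(w[nzchar(w)]), collapse = " "),
         character(1))
}

# Hypothetical helper: for each tested name, the row of the database it
# matches, or NA when agrep() finds no candidate or more than one.
match_rows <- function(tested, database, max.distance = 0.1) {
  db <- normalize_name(database)
  vapply(normalize_name(tested), function(p) {
    hit <- agrep(p, db, max.distance = max.distance)
    if (length(hit) == 1L) hit else NA_integer_
  }, integer(1), USE.NAMES = FALSE)
}

# The "numeric variable in table 2" that points at a row of table 1.
table2$row1 <- match_rows(table2$Name, table1$Name)

# Build the combined table; untested people get NA in Name2 and Score.
idx <- match(seq_len(nrow(table1)), table2$row1)
result <- data.frame(No = table1$No,
                     Name1 = table1$Name,
                     Name2 = table2$Name[idx],
                     Score = table2$Score[idx],
                     stringsAsFactors = FALSE)
print(result)
```

Raising max.distance (for example to 0.2) loosens the match. Note that agrep() looks for the pattern approximately inside each database entry, which is why "Byron Andrew" can match "Byron C. Andrew" after normalization; on a larger database that same leniency may produce several candidates per name, and those cases stay NA for manual review.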
It sounds like what you want is a rudimentary spell-checker whose "word" is the input name and whose "dictionary" is an array of your database names. Spell-checking rules are designed to find missing repeated letters, transposed letters, extra letters... precisely the reasons you're not matching your names to your database.

Anyway, as I don't believe R has something like this, what I would do is simply adapt one of the dozens of Perl or C spell checkers to fit your needs (such as Aspell / Ispell), then invoke it as a script from R using the "system" call, passing in the student name and your database of names. And as R can use Perl-like regular expressions (?regexpr), you could (if you really wanted to!) rewrite this in R after the fact, although this would likely be a waste of time, since expression matching is what Perl is so good at.

You'll also need to think about what this percentage argument means. It's not obvious to me what percentage of closeness "Robert" and "Robret" have vs. "Robert" and "RobQQto".

ex:
http://tomacorp.com/perl/lingua/style.html
http://aspell.sourceforge.net/

Robert

-----Original Message-----
From: adi at roda.ro [mailto:adi at roda.ro]
Sent: Wednesday, January 05, 2005 12:36 PM
To: r-help at stat.math.ethz.ch
Subject: [R] Tuning string matching

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
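One way to make the "percentage of closeness" question concrete is to divide an edit distance by the length of the longer string. The sketch below assumes that definition, which is only one of several reasonable ones, and uses adist() from the utils package, a function that is not mentioned in the thread and only exists in later versions of R. The names similarity() and is_match() are hypothetical.

```r
# Hypothetical helper: similarity in [0, 1], defined here (as an assumption)
# as 1 - Levenshtein distance / length of the longer string.
similarity <- function(a, b) {
  d <- adist(a, b)[1, 1]  # generalized Levenshtein distance (utils package)
  1 - d / max(nchar(a), nchar(b))
}

# Hypothetical helper: declare a match when the similarity reaches the
# threshold; pass threshold = 0.8 to accept 80% instead of 90%.
is_match <- function(a, b, threshold = 0.9) {
  similarity(a, b) >= threshold
}

similarity("Harry Harrington", "Harry Harington")  # one missing letter
similarity("Robert", "Robret")                     # two letters swapped
similarity("Robert", "RobQQto")                    # mostly different ending
is_match("Harry Harrington", "Harry Harington")
```

Under this definition a swap of two adjacent letters counts as two edits, so "Robret" scores only a little higher than "RobQQto" against "Robert". That is the ambiguity raised above: a spell-checker style distance that treats a transposition as a single edit would rank the two cases quite differently.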
This is a rather complex problem. I'm not aware of an R function / package that can do something like this, but in case you need to build it from scratch, read
http://support.sas.com/documentation/periodicals/obs/obswww15/index.html
If you're familiar with SAS you could translate the code to R.

HTH,
b.