thr3ads.net - R help - [R] Tuning string matching [Jan 2005]

adi@roda.ro

2005-Jan-05 17:35 UTC

[R] Tuning string matching

Dear list,

I spent about two hours searching the message archive, to no avail.
I have a list of people that have to pass an on-line test, but only a fraction
of them do it. Moreover, as they input their names, the resulting strings do not
always match the names I have in my database.

I would like to do two things:

1. Match any strings that are 90% the same
Example:
name1 <- "Harry Harrington"
name2 <- "Harry Harington"
I need a function that would declare those strings as a match (ideally having an
argument that would allow introducing 80% instead of 90%)

2. Arrange a final table that would take me from:

Table1 (the complete list of people from my database)
No Name
1  Byron C. Andrew
2  Friedman Bob
3  Harrington Harry

Table2 (the people having been tested)
No Name               Score
1  Harry Harington    13
2  Byron Andrew       28

to:

No Name1              Name2              Score
1  Byron C. Andrew    Byron Andrew       28
2  Friedman Bob
3  Harrington Harry   Harry Harington    13

Thank you in advance, any help is highly appreciated.
Adrian

Thomas Lumley

2005-Jan-05 18:54 UTC

On Wed, 5 Jan 2005 adi at roda.ro wrote:
> Dear list,
>
> I spent about two hours searching the message archive, to no avail.
> I have a list of people that have to pass an on-line test, but only a fraction
> of them do it. Moreover, as they input their names, the resulting strings do not
> always match the names I have in my database.
>
> I would like to do two things:
>
> 1. Match any strings that are 90% the same
> Example:
> name1 <- "Harry Harrington"
> name2 <- "Harry Harington"
> I need a function that would declare those strings as a match (ideally having an
> argument that would allow introducing 80% instead of 90%)
agrep() does something very similar to this.  It has an edit distance 
rather than a % similarity, but you should be able to tune it to do what 
you want.
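
A minimal sketch of the agrep() suggestion, using the two names from the question. The wrapper fuzzy_match() and its percent argument are hypothetical names introduced here for illustration; the sketch assumes that a "percent same" can be mapped onto agrep's max.distance as 1 - percent/100. Note that agrep() looks for the pattern as an approximate substring of x, so the comparison is not symmetric.

```r
name1 <- "Harry Harrington"
name2 <- "Harry Harington"

# A max.distance below 1 is read as a fraction of the pattern length
# (times the maximal transformation cost), rounded up to an integer.
agrepl(name2, name1, max.distance = 0.1)

# Hypothetical wrapper: translate "percent the same" into max.distance.
# This mapping is an assumption, not something agrep defines itself.
fuzzy_match <- function(pattern, x, percent = 90) {
  agrepl(pattern, x, max.distance = 1 - percent / 100)
}

fuzzy_match(name2, name1)                # strict setting
fuzzy_match(name2, name1, percent = 80)  # looser setting
fuzzy_match(name2, "Friedman Bob")       # an unrelated name
```

Lowering percent loosens the match, which is the tuning knob the question asks for.
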
> 2. Arrange a final table that would take me from:
>
> Table1 (the complete list of people from my database)
> No Name
> 1  Byron C. Andrew
> 2  Friedman Bob
> 3  Harrington Harry
>
> Table2 (the people having been tested)
> No Name               Score
> 1  Harry Harington    13
> 2  Byron Andrew       28
>
> to:
>
> No Name1              Name2              Score
> 1  Byron C. Andrew    Byron Andrew       28
> 2  Friedman Bob
> 3  Harrington Harry   Harry Harington    13
>
This may not be very well-defined, since 90% agreement is not an 
equivalence relation.
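
A small illustration of why this is not an equivalence relation, sketched with adist() from the utils package in more recent versions of R. The three strings are made up for the example. With 10-character strings, 90% agreement is taken here to mean at most one edit.

```r
# Three made-up 10-character strings
s1 <- "abcdefghij"
s2 <- "abcdefghiX"
s3 <- "abcdefghXX"

# Pairwise generalized Levenshtein distances (3 x 3 matrix)
d <- adist(c(s1, s2, s3))

d[1, 2]  # s1 vs s2: one edit, so they "match"
d[2, 3]  # s2 vs s3: one edit, so they "match"
d[1, 3]  # s1 vs s3: two edits, so they do not
```

So s1 matches s2 and s2 matches s3, yet s1 does not match s3: the relation is not transitive.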

Assuming that sets of matches are either identical or disjoint you could 
construct a numeric variable in table 2 that indicates which row of table 
1 to match, by using agrep() in a loop.
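
The loop described above might look like the following sketch, using the tables from the question. The helper normalize_name() and the column row1 are hypothetical names. Sorting the words of each name is an added assumption, needed because "Harrington Harry" and "Harry Harington" differ in word order, which an edit distance alone would not absorb. The max.distance of 0.1 is likewise only a starting value to tune.

```r
table1 <- data.frame(
  No = 1:3,
  Name = c("Byron C. Andrew", "Friedman Bob", "Harrington Harry"),
  stringsAsFactors = FALSE
)
table2 <- data.frame(
  No = 1:2,
  Name = c("Harry Harington", "Byron Andrew"),
  Score = c(13, 28),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower-case, drop punctuation, sort the words,
# so that word order and stray dots do not count as differences.
normalize_name <- function(x) {
  x <- tolower(gsub("[[:punct:]]", "", x))
  vapply(strsplit(x, "[[:space:]]+"),
         function(w) paste(sort(w), collapse = " "),
         character(1))
}

key1 <- normalize_name(table1$Name)
key2 <- normalize_name(table2$Name)

# For each tested person, record the matching row of table1.
# Ambiguous or missing matches are left as NA.
table2$row1 <- NA_integer_
for (i in seq_along(key2)) {
  hits <- agrep(key2[i], key1, max.distance = 0.1)
  if (length(hits) == 1L) table2$row1[i] <- hits
}

# Line table2 up against every row of table1.
idx <- match(seq_len(nrow(table1)), table2$row1)
result <- data.frame(
  No = table1$No,
  Name1 = table1$Name,
  Name2 = table2$Name[idx],
  Score = table2$Score[idx],
  stringsAsFactors = FALSE
)
result
```

Rows of table1 without a tested counterpart keep NA in Name2 and Score, which corresponds to the blank cells in the desired table.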


 	-thomas

McGehee, Robert

2005-Jan-05 19:36 UTC

It sounds like what you want is a rudimentary spell-checker whose "word"
is the input name, and whose "dictionary" is an array of your database
names. Spell checking rules are designed to find missing repeats,
transposed letters, extra letters... precisely the reasons you're not
matching your names to your database.

Anyway, as I don't believe R has something like this, what I would do is
simply rewrite one of the dozens of Perl or C spell checkers to fit your
needs (such as Aspell / Ispell), then invoke a script under R using the
"system" call, passing in the student name and your database of names.
And as R can use Perl-like regular expressions (?regexpr), you could (if
you really wanted to!) rewrite this into R after the fact, although this
would likely be a waste of time since expression matching is what Perl
is so good for.

You'll also need to think about what this percentage argument is. It's
not obvious to me what percentage of closeness "Robert" and "Robret" are
vs. "Robert" and "RobQQto".
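
One possible definition of the percentage, offered only as a sketch: one minus the edit distance divided by the length of the longer string. It uses adist() from the utils package in more recent versions of R. The functions similarity() and is_match() are hypothetical names, and other definitions (for example, counting a transposition as a single edit) would give different numbers.

```r
# Hypothetical helper: 1 - (Levenshtein distance / length of longer string)
similarity <- function(a, b) {
  d <- as.numeric(adist(a, b))   # generalized Levenshtein distance
  1 - d / max(nchar(a), nchar(b))
}

# Hypothetical helper: declare a match at or above the threshold
is_match <- function(a, b, threshold = 0.9) {
  similarity(a, b) >= threshold
}

similarity("Harry Harrington", "Harry Harington")  # 1 edit over 16 characters
similarity("Robert", "Robret")                     # 2 edits over 6 characters
similarity("Robert", "RobQQto")                    # 3 edits over 7 characters

is_match("Harry Harrington", "Harry Harington")          # default 90%
is_match("Robert", "Robret", threshold = 0.8)
```

Under this definition the swapped letters in "Robret" cost two edits, so short names drop below 90% quickly; that is one reason the threshold needs tuning.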

ex: http://tomacorp.com/perl/lingua/style.html
http://aspell.sourceforge.net/

Robert

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html

bogdan romocea

2005-Jan-05 19:46 UTC

This is a rather complex problem. I'm not aware of an R function /
package that can do something like this, but in case you need to build
it from scratch read
http://support.sas.com/documentation/periodicals/obs/obswww15/index.html
If you're familiar with SAS you could translate the code to R.

HTH,
b.


