thr3ads.net - R help - [R] Is there a package that can do Fuzzy name matching to standardize names in a single column [Jun 2022]


Gregg Powell

2022-Jun-15 14:57 UTC

[R] Is there a package that can do Fuzzy name matching to standardize names in a single column

I have data sets with names in the first column, client names in the
second, and client start dates in the third.

There are thousands of these records with thousands of names/clients/client
start dates. The name is entered each time the person begins with a new client
such that each person has many entries in the name column. Often the names were
not entered in a consistent way: with and without a middle initial or middle
name, or with various abbreviations such as ",RN" at the end of the name.

Is there a package that can do fuzzy name matching so that the names in the
name column get replaced with a "standardized" format, where some type of
machine learning picks the most common spelling of each repeated name and
replaces the different variations with that spelling?

I included an example below. First table includes the names with the various
spellings. Second table depicts what I hope to achieve.

Again - this is on a large scale - there are something like 10,000 records with
names that need to be standardized.


Name              Client    Client Start Date
John Good         Client 1  1/1/2020
Joe Jackson       Client 2  6/1/2020
Bob A. Barker     Client 3  8/1/2020
John B. Good      Client 4  10/1/2020
Joe J. Jackson    Client 5  12/1/2020
Bob Allen Barker  Client 6  1/1/2021
John Good         Client 7  5/1/2021
Joe Jack Jackson  Client 8  8/1/2021
Bob Barker        Client 9  12/1/2021


Name              Client    Client Start Date
John Good         Client 1  1/1/2020
Joe J. Jackson    Client 2  6/1/2020
Bob A. Barker     Client 3  8/1/2020
John Good         Client 4  10/1/2020
Joe J. Jackson    Client 5  12/1/2020
Bob A. Barker     Client 6  1/1/2021
John Good         Client 7  5/1/2021
Joe J. Jackson    Client 8  8/1/2021
Bob A. Barker     Client 9  12/1/2021



THANKS!

Gregg Powell

Arizona, USA

Ashim Kapoor

2022-Jun-15 15:04 UTC


Dear Gregg,

Check this out:

library(fuzzyjoin)
?stringdist_left_join
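
As a rough illustration of the idea behind a string-distance join, here is a minimal base R sketch that uses utils::adist() instead of fuzzyjoin, so that it runs without extra packages. It assumes a reference list of canonical spellings already exists; the `canonical` vector and the `max_dist` cut-off are made up for this example.

```r
# Observed names (from the example) and a hypothetical list of canonical spellings.
observed  <- c("John Good", "Joe Jackson", "Bob A. Barker", "John B. Good",
               "Joe J. Jackson", "Bob Allen Barker", "Joe Jack Jackson",
               "Bob Barker")
canonical <- c("John Good", "Joe J. Jackson", "Bob A. Barker")

# Edit distance between every observed name (rows) and canonical name (columns).
d <- utils::adist(observed, canonical, ignore.case = TRUE)

# For each observed name, take the closest canonical spelling.
best <- apply(d, 1, which.min)
matched <- data.frame(
  observed  = observed,
  canonical = canonical[best],
  distance  = d[cbind(seq_along(observed), best)],
  stringsAsFactors = FALSE
)

# Matches beyond the (hypothetical) cut-off are set to NA for manual review.
max_dist <- 5
matched$canonical[matched$distance > max_dist] <- NA

print(matched)
```

The cut-off matters in practice: too small and "Bob Allen Barker" goes unmatched, too large and unrelated people get merged.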

Best Regards,
Ashim


Bert Gunter

2022-Jun-15 15:38 UTC


As these are English names and appear to be present always as **first
last** (you didn't specify but that's how your example shows it),
maybe something like the following might be a start:

1. Use strsplit() to split the names into their constituent parts.
2. Find the last *meaningful* part in each vector (e.g. Joe Smith Jr.
should exclude Jr. and choose Smith)
3. Split the names into the groups of identical unique last parts
4. Split each of the groups of last names into subgroups based on the
first one or more letters of first name so that, e.g. Joe and Joseph
would be in the same subgroup of Smith. Of course Joe and John would
be also, so you see the problem...
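
The four steps above could be sketched roughly as follows. This is only one possible reading of them: the `suffixes` list, the `n_letters` setting, and the extra test names "Joe Smith Jr." and "John Good, RN" are assumptions added for illustration.

```r
nm <- c("John Good", "Joe Jackson", "Bob A. Barker", "John B. Good",
        "Joe J. Jackson", "Bob Allen Barker", "John Good",
        "Joe Jack Jackson", "Bob Barker", "Joe Smith Jr.", "John Good, RN")

# Step 1: split each name into its parts on whitespace and commas.
parts <- strsplit(trimws(nm), "[[:space:],]+")

# Step 2: find the last *meaningful* part, skipping a hypothetical suffix list.
suffixes <- c("jr", "sr", "ii", "iii", "rn", "md")
last_part <- function(p) {
  key  <- tolower(gsub("[[:punct:]]", "", p))   # "Jr." becomes "jr"
  keep <- p[!(key %in% suffixes)]
  if (length(keep) == 0) NA_character_ else keep[length(keep)]
}
last  <- vapply(parts, last_part, character(1))
first <- vapply(parts, `[`, character(1), 1)

# Step 3: groups of identical last names.
by_last <- split(nm, last)

# Step 4: subgroups by the first letter(s) of the first name.
# With one or two letters, a Joe and a John sharing a last name would still
# land in the same subgroup, which is the problem noted above.
n_letters <- 1
groups <- split(nm, paste(last, substr(first, 1, n_letters)))

print(groups)
```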

Other Issues:
Are Joe Smith and Joe Smith Jr. the same person?
Misspellings? Typos?  Is Arlene Smith the same as Alene Smith?

Some sort of clustering of the names might also be appropriate. See
https://cran.r-project.org/web/views/Cluster.html  for ideas.
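
One way the clustering idea might look in base R is hierarchical clustering on edit distances, followed by picking a representative spelling per cluster. This is a sketch under assumptions, not a recommendation of a particular method: single linkage, the cut height `h = 5`, and the tie-breaking rule (the spelling with the smallest total distance to the others) are all choices made for this example.

```r
nm <- c("John Good", "Joe Jackson", "Bob A. Barker", "John B. Good",
        "Joe J. Jackson", "Bob Allen Barker", "John Good",
        "Joe Jack Jackson", "Bob Barker")

u  <- unique(nm)
d  <- utils::adist(u, ignore.case = TRUE)          # pairwise edit distances
hc <- stats::hclust(stats::as.dist(d), method = "single")
cl <- stats::cutree(hc, h = 5)                     # hypothetical cut height
names(cl) <- u

freq <- table(nm)[u]        # how often each distinct spelling occurs

# Representative of a cluster: the most frequent spelling; on a tie, the
# tied spelling with the smallest total edit distance to the cluster members.
pick <- function(members) {
  f   <- as.integer(freq[members])
  top <- members[f == max(f)]
  if (length(top) > 1) {
    tot <- rowSums(d[match(top, u), match(members, u), drop = FALSE])
    top <- top[which.min(tot)]
  }
  top
}
rep_by_cluster <- vapply(split(u, cl), pick, character(1))

# Replace every original entry with its cluster's representative.
standardized <- unname(rep_by_cluster[as.character(cl[nm])])
print(data.frame(original = nm, standardized = standardized))
```

On the nine example names this reproduces the desired second table, but on real data the cut height would need tuning and the resulting clusters should be checked by hand.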

Cheers,
Bert


Jan van der Laan

2022-Jun-15 19:31 UTC


The reclin2 package (by me) also has functionality for finding
duplicate records. So it should handle finding names that are likely to
be the same. See here for a vignette with an example where different
variants of town names are clustered:
https://cran.r-project.org/web/packages/reclin2/vignettes/deduplication.html

In some of the cases where I used this, the quality of the matches
improved immensely with a large amount of manual preprocessing of the
names. This was mostly done using regular expressions, for example
with gsub: removing accents and hyphens, and replacing common
variants of names with the most common one.
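
A minimal sketch of this kind of gsub-based preprocessing is shown below. The function name `clean_name`, the credential list in the first pattern, and the `variants` lookup are hypothetical and would need adapting to the real data. Accent removal is left out here because transliteration with iconv() behaves differently across platforms.

```r
clean_name <- function(x, variants = character(0)) {
  x <- trimws(x)
  # Drop a trailing credential or suffix such as ",RN" or " Jr." (assumed list).
  x <- gsub(",?\\s*\\b(RN|MD|PhD|Jr|Sr)\\.?$", "", x,
            ignore.case = TRUE, perl = TRUE)
  x <- gsub("-", " ", x, fixed = TRUE)      # hyphens become spaces
  x <- gsub("[[:punct:]]", "", x)           # drop remaining punctuation
  x <- tolower(trimws(gsub("\\s+", " ", x)))  # collapse whitespace, lower-case
  # Replace whole-word variants with the preferred form (hypothetical lookup).
  for (v in names(variants)) {
    x <- gsub(paste0("\\b", v, "\\b"), variants[[v]], x, perl = TRUE)
  }
  x
}

clean_name(c("John Good,RN", " Bob A. Barker ", "Mary Smith-Jones, RN"))
clean_name("Joe Jackson", variants = c(joe = "joseph"))
```

Running the matching on the cleaned strings while keeping the original column alongside makes it possible to map back to a preferred display spelling afterwards.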

HTH,

Jan




