Ashim Kapoor
2022-Jun-15 15:04 UTC
[R] Is there a package that can do Fuzzy name matching to standardize names in a single column
Dear Gregg,

Check this out:

library(fuzzyjoin)
?stringdist_left_join

Best Regards,
Ashim

On Wed, Jun 15, 2022 at 8:28 PM Gregg Powell via R-help <r-help at r-project.org> wrote:
>
> Have data sets where there are names, in the first column, client names in the second, and Client start date in the third.
>
> There are thousands of these records with thousands of names/clients/client start dates. The name is entered each time the person begins with a new client such that each person has many entries in the name column. Often the names were not entered in a consistent way. With and without middle initial, middle name, or various abbreviations such as ",RN" at the end of the name.
>
> Is there a package that can do fuzzy name matching so that the names in name column get replaced with a "standardized" format - where some type of machine learning can pick the most common spelling of each repeat name and replace the different variations with the common spelling?
>
> I included an example below. First table includes the names with the various spellings. Second table depicts what I hope to achieve.
>
> Again - this is on a large scale - there are something like 10,000 records with names that need to be standardized.
>
> Name              Client    Client Start Date
> John Good         Client 1  1/1/2020
> Joe Jackson       Client 2  6/1/2020
> Bob A. Barker     Client 3  8/1/2020
> John B. Good      Client 4  10/1/2020
> Joe J. Jackson    Client 5  12/1/2020
> Bob Allen Barker  Client 6  1/1/2021
> John Good         Client 7  5/1/2021
> Joe Jack Jackson  Client 8  8/1/2021
> Bob Barker        Client 9  12/1/2021
>
> Name              Client    Client Start Date
> John Good         Client 1  1/1/2020
> Joe J. Jackson    Client 2  6/1/2020
> Bob A. Barker     Client 3  8/1/2020
> John Good         Client 4  10/1/2020
> Joe J. Jackson    Client 5  12/1/2020
> Bob A. Barker     Client 6  1/1/2021
> John Good         Client 7  5/1/2021
> Joe J. Jackson    Client 8  8/1/2021
> Bob A. Barker     Client 9  12/1/2021
>
> THANKS!
>
> Gregg Powell
> Arizona, USA
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
Gregg Powell
2022-Jun-15 15:43 UTC
[R] Is there a package that can do Fuzzy name matching to standardize names in a single column
Hello Ashim, and thank you for taking the time to answer.

> library(fuzzyjoin)
> ?stringdist_left_join

This will join two tables, but what I am trying to do is standardize the similarly spelled duplicate names in the first column of a single table. I don't think fuzzyjoin will help me in that regard.

Thanks.

Gregg
Arizona, USA
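
One way the single-column case described in this thread might be handled without a second lookup table is to group similar names and replace each one with the most frequent spelling in its group. The following is only a minimal base R sketch, not a method proposed by anyone in the thread. The helper names `name_key` and `standardize_names`, the `max_dist` cutoff of 0.2, the comma rule for credentials, and the decision to ignore middle names and initials are all assumptions made for illustration. It uses `adist()` for edit distance and `hclust()`/`cutree()` for grouping.

```r
# Names from the example table in the original post.
df <- data.frame(
  Name = c("John Good", "Joe Jackson", "Bob A. Barker",
           "John B. Good", "Joe J. Jackson", "Bob Allen Barker",
           "John Good", "Joe Jack Jackson", "Bob Barker"),
  Client = paste("Client", 1:9),
  stringsAsFactors = FALSE
)

# Hypothetical helper: reduce a name to a lowercase "first last" key.
# Assumes credentials such as ",RN" follow a comma, and that the middle
# name or initial is not needed to tell two people apart.
name_key <- function(x) {
  x <- tolower(trimws(x))
  x <- sub(",.*$", "", x)            # drop everything after a comma
  x <- gsub("[[:punct:]]", "", x)    # drop punctuation such as "."
  parts <- strsplit(trimws(x), " +")
  vapply(parts, function(p) paste(p[1], p[length(p)]), character(1))
}

# Hypothetical helper: replace each name with the most frequent spelling
# among the names whose keys fall in the same cluster.
standardize_names <- function(x, max_dist = 0.2) {
  keys  <- name_key(x)
  ukeys <- unique(keys)
  if (length(ukeys) < 2) {
    cluster <- rep(1L, length(x))
  } else {
    # Levenshtein distance between keys, scaled by the longer key's length.
    d <- adist(ukeys)
    d <- d / outer(nchar(ukeys), nchar(ukeys), pmax)
    groups  <- cutree(hclust(as.dist(d), method = "average"), h = max_dist)
    cluster <- unname(groups[match(keys, ukeys)])
  }
  # Most frequent original spelling; ties go to the first one seen.
  pick <- function(v) {
    counts <- table(factor(v, levels = unique(v)))
    names(counts)[which.max(counts)]
  }
  ave(x, cluster, FUN = pick)
}

df$Standardized <- standardize_names(df$Name)
print(df)
```

On the nine example rows this produces three distinct names. Every John becomes "John Good", the only spelling that occurs more than once. The Joe and Bob variants each occur once, so the first-seen tie rule picks "Joe Jackson" and "Bob A. Barker"; getting "Joe J. Jackson" as in the desired table would need a different tie rule, such as preferring the spelling with a middle initial. With roughly 10,000 records the distance matrix is built over unique keys only, which keeps it manageable, but the cutoff should be checked by hand on a sample, since two different people who share a first and last name will be merged.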