Hello dear R-helpers, I'm working with R-2.15.2 on Windows 7 OS. I'm stucked with a merge of two data frames by characters. In each data frame I got two different list of names, that is my main-key to be merged. To figure out what I'm saying, I build up a modified "?merge" example, with errors by purpose: # Data for authors: authors <- data.frame( surname = I(c("Tukey", "Venable", "Terney", "Ripley", "McNeil")), nationality = c("US", "Australia", "US", "UK", "Australia"), deceased = c("yes", rep("no", 4))) "Venables" is without the final 's', and "Tierney, without "i". # Data for books: books <- data.frame( surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "Rippley", "McNeil", "R Core")), title = c("Exploratory Data Analysis", "Modern Applied Statistics ...", "LISP-STAT", "Spatial Statistics", "Stochastic Simulation", "Interactive Data Analysis", "An Introduction to R"), other.author = c(NA, "Ripley", NA, NA, NA, NA, "Venables & Smith")) With "surname" column instead of "name" (differs from original example for more easy going merge). And the second "Ripley" with double "p". So, if I ask for: merge(authors, books, all=TRUE) I got: But we know that "Rippley" corresponds to "Ripley", "Terney" to "Tierney" and "Venable" to "Venables". I was wondering if there was any way to work around this problem. My orginal data have around 27,000 name entries, and if I take "all=FALSE", this database drops out to around 17,000, most because mispelling (or truncated expressions). If I take "all=TRUE", I got many of this <NA> cases like the example above. Has anyone experienced this? Any idea how I can get out? I'm thinking to take the longest match possible to each entry. For example, in "Venable"/"Venables" there is a 87.5% match. As I have name and surname, and also auxiliary keys to this match, I think this could work. Thank you in advance. ----- Victor Delgado cedeplar.ufmg.br P.H.D. student www.fjp.mg.gov.br reseacher -- View this message in context: http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255.html Sent from the R help mailing list archive at Nabble.com.
You might try the 'soundex' function in the RecordLinkage package:> soundex('ripley')[1] "R140"> soundex('rippley')[1] "R140"> soundex('venable')[1] "V514"> soundex('venables')[1] "V514"> soundex('terney')[1] "T650"> soundex('tierney')[1] "T650" On Fri, Nov 2, 2012 at 2:20 PM, VictorDelgado <victor.maia at fjp.mg.gov.br> wrote:> Hello dear R-helpers, > > I'm working with R-2.15.2 on Windows 7 OS. I'm stucked with a merge of two > data frames by characters. > In each data frame I got two different list of names, that is my main-key to > be merged. > > To figure out what I'm saying, I build up a modified "?merge" example, with > errors by purpose: > > # Data for authors: > > authors <- data.frame( > surname = I(c("Tukey", "Venable", "Terney", "Ripley", "McNeil")), > nationality = c("US", "Australia", "US", "UK", "Australia"), > deceased = c("yes", rep("no", 4))) > > "Venables" is without the final 's', and "Tierney, without "i". > > # Data for books: > > books <- data.frame( > surname = I(c("Tukey", "Venables", "Tierney", > "Ripley", "Rippley", "McNeil", "R Core")), > title = c("Exploratory Data Analysis", > "Modern Applied Statistics ...", > "LISP-STAT", > "Spatial Statistics", "Stochastic Simulation", > "Interactive Data Analysis", > "An Introduction to R"), > other.author = c(NA, "Ripley", NA, NA, NA, NA, > "Venables & Smith")) > > With "surname" column instead of "name" (differs from original example for > more easy going merge). And the second "Ripley" with double "p". > > So, if I ask for: > > merge(authors, books, all=TRUE) > > I got: > > > But we know that "Rippley" corresponds to "Ripley", "Terney" to "Tierney" > and "Venable" to "Venables". I was wondering if there was any way to work > around this problem. My orginal data have around 27,000 name entries, and if > I take "all=FALSE", this database drops out to around 17,000, most because > mispelling (or truncated expressions). If I take "all=TRUE", I got many of > this <NA> cases like the example above. > > Has anyone experienced this? Any idea how I can get out? I'm thinking to > take the longest match possible to each entry. For example, in > "Venable"/"Venables" there is a 87.5% match. As I have name and surname, and > also auxiliary keys to this match, I think this could work. > > Thank you in advance. > > > > ----- > Victor Delgado > cedeplar.ufmg.br P.H.D. student > www.fjp.mg.gov.br reseacher > -- > View this message in context: http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255.html > Sent from the R help mailing list archive at Nabble.com. > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.-- Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it.
On Nov 2, 2012, at 11:20 AM, VictorDelgado wrote:> Hello dear R-helpers, > > I'm working with R-2.15.2 on Windows 7 OS. I'm stucked with a merge of two > data frames by characters. > In each data frame I got two different list of names, that is my main-key to > be merged. > > To figure out what I'm saying, I build up a modified "?merge" example, with > errors by purpose: > > # Data for authors: > > authors <- data.frame( > surname = I(c("Tukey", "Venable", "Terney", "Ripley", "McNeil")), > nationality = c("US", "Australia", "US", "UK", "Australia"), > deceased = c("yes", rep("no", 4))) > > "Venables" is without the final 's', and "Tierney, without "i". > > # Data for books: > > books <- data.frame( > surname = I(c("Tukey", "Venables", "Tierney", > "Ripley", "Rippley", "McNeil", "R Core")), > title = c("Exploratory Data Analysis", > "Modern Applied Statistics ...", > "LISP-STAT", > "Spatial Statistics", "Stochastic Simulation", > "Interactive Data Analysis", > "An Introduction to R"), > other.author = c(NA, "Ripley", NA, NA, NA, NA, > "Venables & Smith"))In your example the authors list has better spelling. The 'agrep' functions by default will return matches that are 90% ( or more precisely Levenshtein distance of less than or equalt to 0.1) : books$altname <- NA altidx <- unlist( sapply(books$surname, agrep, authors$surname) ) books$altname[seq(altidx)] <- authors$surname[altidx] books #----------------------- surname title other.author altname 1 Tukey Exploratory Data Analysis <NA> Tukey 2 Venables Modern Applied Statistics ... Ripley Venable 3 Tierney LISP-STAT <NA> Terney 4 Ripley Spatial Statistics <NA> Ripley 5 Rippley Stochastic Simulation <NA> Ripley 6 McNeil Interactive Data Analysis <NA> McNeil 7 R Core An Introduction to R Venables & Smith <NA> If you then match 'books' to 'authors' with a merge on authors$surname and books$altname, you should get closer to your goals -- David.> > With "surname" column instead of "name" (differs from original example for > more easy going merge). And the second "Ripley" with double "p". > > So, if I ask for: > > merge(authors, books, all=TRUE) > > I got: > > > But we know that "Rippley" corresponds to "Ripley", "Terney" to "Tierney" > and "Venable" to "Venables". I was wondering if there was any way to work > around this problem. My orginal data have around 27,000 name entries, and if > I take "all=FALSE", this database drops out to around 17,000, most because > mispelling (or truncated expressions). If I take "all=TRUE", I got many of > this <NA> cases like the example above. > > Has anyone experienced this? Any idea how I can get out? I'm thinking to > take the longest match possible to each entry. For example, in > "Venable"/"Venables" there is a 87.5% match. As I have name and surname, and > also auxiliary keys to this match, I think this could work. > > Thank you in advance. > > > > ----- > Victor Delgado > cedeplar.ufmg.br P.H.D. student > www.fjp.mg.gov.br reseacher > -- > View this message in context: http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255.html > Sent from the R help mailing list archive at Nabble.com. > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.David Winsemius, MD Alameda, CA, USA