thr3ads.net - R help - [R] Merge data frame with mispelling characters [Nov 2012]

If this information is useful, please help other people find it:
Share via:

VictorDelgado

2012-Nov-02 18:20 UTC

[R] Merge data frame with mispelling characters

Hello dear R-helpers,

I'm working with R-2.15.2 on Windows 7 OS. I'm stucked with a merge of
two
data frames by characters. 
In each data frame I got two different list of names, that is my main-key to
be merged.

To figure out what I'm saying, I build up a modified "?merge"
example, with
errors by purpose:

# Data for authors:

authors <- data.frame(
    surname = I(c("Tukey", "Venable", "Terney",
"Ripley", "McNeil")),
    nationality = c("US", "Australia", "US",
"UK", "Australia"),
    deceased = c("yes", rep("no", 4)))

"Venables" is without  the final 's', and "Tierney,
without "i".

# Data for books:

books <- data.frame(
    surname = I(c("Tukey", "Venables", "Tierney",
             "Ripley", "Rippley", "McNeil",
"R Core")),
    title = c("Exploratory Data Analysis",
              "Modern Applied Statistics ...",
              "LISP-STAT",
              "Spatial Statistics", "Stochastic Simulation",
              "Interactive Data Analysis",
              "An Introduction to R"),
    other.author = c(NA, "Ripley", NA, NA, NA, NA,
                     "Venables & Smith"))
 
With "surname" column instead of "name" (differs from
original example for
more easy going merge). And the second "Ripley" with double
"p".

So, if I ask for:

merge(authors, books, all=TRUE)

I got:


But we know that "Rippley" corresponds to "Ripley",
"Terney" to "Tierney"
and "Venable" to "Venables". I was wondering if there was
any way to work
around this problem. My orginal data have around 27,000 name entries, and if
I take "all=FALSE", this database drops out to around 17,000, most
because
mispelling (or truncated expressions). If I take "all=TRUE", I got
many of
this <NA> cases like the example above.

Has anyone experienced this? Any idea how I can get out? I'm thinking to
take the longest match possible to each entry. For example, in
"Venable"/"Venables" there is a 87.5% match. As I have name
and surname, and
also auxiliary keys to this match, I think this could work.

Thank you in advance.



-----
Victor Delgado
cedeplar.ufmg.br P.H.D. student
www.fjp.mg.gov.br reseacher
--
View this message in context:
http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255.html
Sent from the R help mailing list archive at Nabble.com.

jim holtman

2012-Nov-02 19:02 UTC

head link

[R] Merge data frame with mispelling characters

You might try the 'soundex' function in the RecordLinkage package:
> soundex('ripley')
[1] "R140"> soundex('rippley')
[1] "R140"> soundex('venable')
[1] "V514"> soundex('venables')
[1] "V514"> soundex('terney')
[1] "T650"> soundex('tierney')[1] "T650"


On Fri, Nov 2, 2012 at 2:20 PM, VictorDelgado <victor.maia at
fjp.mg.gov.br> wrote:> Hello dear R-helpers,
>
> I'm working with R-2.15.2 on Windows 7 OS. I'm stucked with a merge
of two
> data frames by characters.
> In each data frame I got two different list of names, that is my main-key
to
> be merged.
>
> To figure out what I'm saying, I build up a modified "?merge"
example, with
> errors by purpose:
>
> # Data for authors:
>
> authors <- data.frame(
>     surname = I(c("Tukey", "Venable",
"Terney", "Ripley", "McNeil")),
>     nationality = c("US", "Australia", "US",
"UK", "Australia"),
>     deceased = c("yes", rep("no", 4)))
>
> "Venables" is without  the final 's', and "Tierney,
without "i".
>
> # Data for books:
>
> books <- data.frame(
>     surname = I(c("Tukey", "Venables",
"Tierney",
>              "Ripley", "Rippley", "McNeil",
"R Core")),
>     title = c("Exploratory Data Analysis",
>               "Modern Applied Statistics ...",
>               "LISP-STAT",
>               "Spatial Statistics", "Stochastic
Simulation",
>               "Interactive Data Analysis",
>               "An Introduction to R"),
>     other.author = c(NA, "Ripley", NA, NA, NA, NA,
>                      "Venables & Smith"))
>
> With "surname" column instead of "name" (differs from
original example for
> more easy going merge). And the second "Ripley" with double
"p".
>
> So, if I ask for:
>
> merge(authors, books, all=TRUE)
>
> I got:
>
>
> But we know that "Rippley" corresponds to "Ripley",
"Terney" to "Tierney"
> and "Venable" to "Venables". I was wondering if there
was any way to work
> around this problem. My orginal data have around 27,000 name entries, and
if
> I take "all=FALSE", this database drops out to around 17,000,
most because
> mispelling (or truncated expressions). If I take "all=TRUE", I
got many of
> this <NA> cases like the example above.
>
> Has anyone experienced this? Any idea how I can get out? I'm thinking
to
> take the longest match possible to each entry. For example, in
> "Venable"/"Venables" there is a 87.5% match. As I have
name and surname, and
> also auxiliary keys to this match, I think this could work.
>
> Thank you in advance.
>
>
>
> -----
> Victor Delgado
> cedeplar.ufmg.br P.H.D. student
> www.fjp.mg.gov.br reseacher
> --
> View this message in context:
http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

David Winsemius

2012-Nov-02 19:19 UTC

head link

[R] Merge data frame with mispelling characters

On Nov 2, 2012, at 11:20 AM, VictorDelgado wrote:
> Hello dear R-helpers,
> 
> I'm working with R-2.15.2 on Windows 7 OS. I'm stucked with a merge
of two
> data frames by characters. 
> In each data frame I got two different list of names, that is my main-key
to
> be merged.
> 
> To figure out what I'm saying, I build up a modified "?merge"
example, with
> errors by purpose:
> 
> # Data for authors:
> 
> authors <- data.frame(
>    surname = I(c("Tukey", "Venable",
"Terney", "Ripley", "McNeil")),
>    nationality = c("US", "Australia", "US",
"UK", "Australia"),
>    deceased = c("yes", rep("no", 4)))
> 
> "Venables" is without  the final 's', and "Tierney,
without "i".
> 
> # Data for books:
> 
> books <- data.frame(
>    surname = I(c("Tukey", "Venables",
"Tierney",
>             "Ripley", "Rippley", "McNeil",
"R Core")),
>    title = c("Exploratory Data Analysis",
>              "Modern Applied Statistics ...",
>              "LISP-STAT",
>              "Spatial Statistics", "Stochastic
Simulation",
>              "Interactive Data Analysis",
>              "An Introduction to R"),
>    other.author = c(NA, "Ripley", NA, NA, NA, NA,
>                     "Venables & Smith"))
In your example the authors list has better spelling. The 'agrep'
functions by default will return matches that are 90% ( or more precisely
Levenshtein distance of less than or equalt to 0.1) :


books$altname <- NA
altidx <- unlist( sapply(books$surname, agrep, authors$surname) )
books$altname[seq(altidx)] <- authors$surname[altidx]
books
#-----------------------
   surname                         title     other.author altname
1    Tukey     Exploratory Data Analysis             <NA>   Tukey
2 Venables Modern Applied Statistics ...           Ripley Venable
3  Tierney                     LISP-STAT             <NA>  Terney
4   Ripley            Spatial Statistics             <NA>  Ripley
5  Rippley         Stochastic Simulation             <NA>  Ripley
6   McNeil     Interactive Data Analysis             <NA>  McNeil
7   R Core          An Introduction to R Venables & Smith    <NA>

If you then match 'books' to 'authors' with a merge on
authors$surname and books$altname, you should get closer to your goals

-- 
David. > 
> With "surname" column instead of "name" (differs from
original example for
> more easy going merge). And the second "Ripley" with double
"p".
> 
> So, if I ask for:
> 
> merge(authors, books, all=TRUE)
> 
> I got:
> 
> 
> But we know that "Rippley" corresponds to "Ripley",
"Terney" to "Tierney"
> and "Venable" to "Venables". I was wondering if there
was any way to work
> around this problem. My orginal data have around 27,000 name entries, and
if
> I take "all=FALSE", this database drops out to around 17,000,
most because
> mispelling (or truncated expressions). If I take "all=TRUE", I
got many of
> this <NA> cases like the example above.
> 
> Has anyone experienced this? Any idea how I can get out? I'm thinking
to
> take the longest match possible to each entry. For example, in
> "Venable"/"Venables" there is a 87.5% match. As I have
name and surname, and
> also auxiliary keys to this match, I think this could work.
> 
> Thank you in advance.
> 
> 
> 
> -----
> Victor Delgado
> cedeplar.ufmg.br P.H.D. student
> www.fjp.mg.gov.br reseacher
> --
> View this message in context:
http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255.html
> Sent from the R help mailing list archive at Nabble.com.
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
Alameda, CA, USA

Maybe Matching Threads

Search for more reasonably related threads

R help - Nov 2012 - Merge data frame with mispelling characters

[R] Merge data frame with mispelling characters

[R] Merge data frame with mispelling characters

[R] Merge data frame with mispelling characters

Maybe Matching Threads