thr3ads.net - R help - [R] Matching names with non-English characters [May 2013]

Spencer Graves

2013-May-13 16:05 UTC

[R] Matching names with non-English characters

Hello:


       How can one match names containing non-English characters that
appear differently in different but related data files?  For example, I
have data on Raúl Grijalva, who represents the third district of Arizona
in the US House of Representatives.  This first name appears as "RaÃºl"
in data read from one file and "Raul" from another.


       The ideal would convert both "RaÃºl" and "Raúl" to "Raul".  A
reasonable alternative would identify the non-English characters and
match on everything else ("^Ra" and "l$" in this case).  The files all
contain state and district, so "AZ-3" could be part of the solution.
However, the file also contains data on Grijalva's predecessor in that
office, Ben Quayle, so "AZ-3" is not enough.


       Thanks,
       Spencer


p.s.  My current data contains other similar cases, e.g.:


Recipient         District
RaÃºl Grijalva    AZ House 3
Tony CÃ¡rdenas    CA House 29
Linda SÃ¡nchez    CA House 38
RaÃºl Labrador    ID House 1
AndrÃ© Carson     IN House 7
Bob MenÃ©ndez     NJ Senate
Ben Ray LujÃ¡n    NM House 3
JosÃ© Serrano     NY House 15
Nydia VelÃ¡zquez  NY House 7
RubÃ©n Hinojosa   TX House 15


       These names all appear differently in another file I have. I've 
written an ugly function that can identify "nonstandard characters". 
I'm confident I can solve this problem.  However, I'm adding things like
this to the Ecdat package, and it would be more useful for others if I 
made better use of other capabilities in R.

Jeff Newmiller

2013-May-13 16:18 UTC

Build a lookup table for your data.

I think it is a fool's errand to think that you can automatically
"normalize" arbitrary Unicode characters to an ASCII form that
everyone will agree on.

BTW: To avoid propagating open joins, your data should probably have some
kind of id for the term those Representatives are serving.
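
A minimal sketch of the lookup-table idea in R, for illustration only.  The
table contents, the column names (variant, canonical, Name, term) and the
term labels "T1"/"T2" are all made up here and are not from the original
data; a real table would be built by hand from the spellings actually seen
in the files.

```r
# Hypothetical lookup table: one row per spelling variant seen in the raw files.
lookup <- data.frame(
  variant   = c("Ra\u00fal Grijalva", "Raul Grijalva",
                "Jos\u00e9 Serrano", "Jose Serrano"),
  canonical = c("Raul Grijalva", "Raul Grijalva",
                "Jose Serrano", "Jose Serrano"),
  stringsAsFactors = FALSE
)

# Hypothetical records; 'term' is an assumed id for the term being served,
# so that two people who held the same seat are not joined together.
records <- data.frame(
  Recipient = c("Ra\u00fal Grijalva", "Jose Serrano", "Ben Quayle"),
  District  = c("AZ House 3", "NY House 15", "AZ House 3"),
  term      = c("T2", "T2", "T1"),
  stringsAsFactors = FALSE
)

# Map each recipient to its canonical spelling; names that are not in the
# lookup table are left unchanged.
idx <- match(records$Recipient, lookup$variant)
records$Name <- ifelse(is.na(idx), records$Recipient, lookup$canonical[idx])
```

Merging two files on Name, District and term together would then keep
Grijalva and Quayle apart even though both appear under "AZ House 3".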


Duncan Murdoch

2013-May-13 16:58 UTC

On 13/05/2013 12:05 PM, Spencer Graves wrote:
> [...]
>         The ideal would convert both "RaÃºl" and "Raúl" to "Raul".

You shouldn't have both "RaÃºl" and "Raúl" in the same file.  They are
different encodings for the same characters.  (The first looks like
UTF-8, the second is your native encoding, presumably the Windows
Latin-1 variant, CP-1252.)  So your first problem is to identify the
encodings of your input files and read them all into a common
encoding.  Converting them to UTF-8 in R makes the most sense, because
it includes the characters from all other encodings you're ever likely
to see.
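
As a rough illustration of the encoding point, the sketch below builds the
name from raw bytes in both encodings and brings them to UTF-8 with iconv().
The variable names are made up for the example, and it assumes the
single-byte file really is Latin-1/CP-1252; for real files the encoding
would normally be declared when reading, e.g. with the fileEncoding argument
of read.csv().

```r
# The same name as it would be stored on disk in two encodings.
latin1_bytes <- as.raw(c(0x52, 0x61, 0xfa, 0x6c))        # Latin-1 / CP-1252
utf8_bytes   <- as.raw(c(0x52, 0x61, 0xc3, 0xba, 0x6c))  # UTF-8

# Convert the Latin-1 string to UTF-8.
from_latin1 <- iconv(rawToChar(latin1_bytes), from = "latin1", to = "UTF-8")

# The UTF-8 bytes only need to be declared as UTF-8; nothing is converted.
from_utf8 <- rawToChar(utf8_bytes)
Encoding(from_utf8) <- "UTF-8"

# Both now hold the same bytes.
same <- identical(charToRaw(from_latin1), charToRaw(from_utf8))

# Treating the UTF-8 bytes as Latin-1 produces the two-character garbage
# seen in the question (five characters instead of four).
mojibake <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")

# Reversing that mistake: convert back to single bytes, then declare UTF-8.
repaired <- iconv(mojibake, from = "UTF-8", to = "latin1")
Encoding(repaired) <- "UTF-8"
```

The repair step only works when every garbled character has a Latin-1
equivalent, so it is a heuristic and not a general fix.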

Having both "Raúl" and "Raul" in the same file is a different issue.
The second one is an error or a variant spelling.  In this case, you can
use

iconv("Raúl", to="ASCII//TRANSLIT")

on most platforms to find an ASCII approximation.  (This works on my
Windows system; your mileage may vary.)  As Jeff said, this is an
impossible problem in general, so you may well need some manual fixups
at the end.
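
One possible way to combine the transliteration with manual fixups is
sketched below.  The helper ascii_name() and its fixups argument are
hypothetical names invented for this example, not part of R or any package,
and the output of "ASCII//TRANSLIT" differs between platforms (some insert
quote marks or "?"), which is why anything unusual is flagged for review
instead of being trusted.

```r
# Hypothetical helper: transliterate to ASCII, then apply manual overrides.
# 'fixups' is a named character vector: names are the original spellings,
# values are the ASCII spellings to use instead.
ascii_name <- function(x, fixups = character()) {
  # Transliteration is platform dependent; unconvertible input gives NA.
  out <- iconv(enc2utf8(x), from = "UTF-8", to = "ASCII//TRANSLIT")
  hit <- x %in% names(fixups)
  out[hit] <- unname(fixups[x[hit]])
  out
}

# Made-up example data.
fixups <- setNames("Raul Grijalva", "Ra\u00fal Grijalva")
res <- ascii_name(c("Ra\u00fal Grijalva", "Ben Quayle"), fixups)

# Flag results that are NA or contain anything other than ASCII letters,
# spaces, dots and hyphens, so they can be checked by hand.
needs_review <- is.na(res) | !grepl("^[A-Za-z .-]+$", res, perl = TRUE)
```

Names flagged in needs_review are candidates for new entries in fixups.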

Duncan Murdoch
