Is there R software available for doing approximate matching of personal
names?
I have data about the same people produced by different organizations and
the only matching key I have is the name. I know that commercial solutions
exist, and I know I could code this from scratch, but I'd prefer to build on
some existing free solution if it exists.
Unfortunately, the names are not standardized, and there is also a certain
level of error:
  Danny Williams (nickname)
  Dan Williams (nickname)
  Daniel Williams (full first name)
  Dan William (spelling error)
  D. Williams (initials)
  Daniel "Danny" Williams (formal + nickname)
  Dan P. Williams (includes middle initial)
  Williams, Daniel (different convention)
  William Daniel (wrong order or missing comma + misspelling)
Is there any R software available to find likely matches, ideally with some
estimate of accuracy of match? Levenshtein distance as implemented in agrep
is a useful solution for some of these cases; I was wondering if there is
something that covers more cases.
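To show what I mean by the agrep approach, here is a minimal sketch in base R. The `variants` vector just repeats the examples above, and the 0.2 threshold is an arbitrary value I picked for illustration, not a tuned one. I am also assuming adist() from the utils package is available in your version of R.

```r
# Name variants copied from the examples above
variants <- c("Danny Williams", "Dan Williams", "Daniel Williams",
              "Dan William", "D. Williams", "Daniel \"Danny\" Williams",
              "Dan P. Williams", "Williams, Daniel", "William Daniel")

# agrep() looks for approximate matches to the pattern within each element,
# using generalized Levenshtein distance. max.distance = 0.2 is an
# arbitrary illustrative threshold (a fraction of the pattern length).
hits <- agrep("Daniel Williams", variants, max.distance = 0.2, value = TRUE)
print(hits)

# adist() returns the raw edit distances (a 1 x 9 matrix here), which can
# serve as a crude match score: smaller means closer.
d <- adist("Daniel Williams", variants)
print(setNames(as.vector(d), variants))
```

This catches the plain spelling variants reasonably well, but as far as I can tell it does nothing sensible with reordered names such as "Williams, Daniel", which is why I am looking for something more complete.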
For this particular application, I am not concerned with issues such as
variant latinizations/transliterations (e.g. Tsung-Dao Lee ~ T.D. Lee ~ Li
Zhengdao; Ghaddafi ~ Qaddhaffi), but of course if someone handles that as
well....
Thanks,
-s