Werner Wernersen
2006-Jan-10 15:49 UTC
[R] matching country name tables from different sources
Hi, Before I reinvent the wheel I wanted to kindly ask you for your opinion if there is a simple way to do it. I want to merge a larger number of tables from different data sources in R and the matching criterium are country names. The tables are of different size and sometimes the country names do differ slightly. Has anyone done this or any recommendation on what commands I should look at to automize this task as much as possible? Thanks a lot for your effort in advance. All the best, Werner --------------------------------- Telefonieren Sie ohne weitere Kosten mit Ihren Freunden von PC zu PC! [[alternative HTML version deleted]]
Gabor Grothendieck
2006-Jan-10 16:29 UTC
[R] matching country name tables from different sources
If they were the same you could use merge. To figure out the correspondence automatically or semiautomatically, try this: x <- c("Canada", "US", "Mexico") y <- c("Kanada", "United States", "Mehico") result <- outer(x, y, function(x,y) mapply(lcs2, x, y)) result[] <- sapply(result, nchar) # try both which.max and which.min and if you are lucky # one of them will give unique values and that is the one to use # In this case which.max does. apply(result, 1, which.max) # 1 2 3 # calculate longest common subsequence between 2 strings lcs2 <- function(s1,s2) { longest <- function(x,y) if (nchar(x) > nchar(y)) x else y # Make sure args are strings a <- as.character(s1); an <- nchar(s1)+1 b <- as.character(s2); bn <- nchar(s2)+1 # If one arg is an empty string, returns the length of the other if (nchar(a)==0) return(nchar(b)) if (nchar(b)==0) return(nchar(a)) # Initialize matrix for calculations m <- matrix("", nrow=an, ncol=bn) for (i in 2:an) for (j in 2:bn) m[i,j] <- if (substr(a,i-1,i-1)==substr(b,j-1,j-1)) paste(m[i-1,j-1], substr(a,i-1,i-1), sep = "") else longest(m[i-1,j], m[i,j-1]) # Returns the distance m[an,bn] } On 1/10/06, Werner Wernersen <pensterfuzzer at yahoo.de> wrote:> Hi, > > Before I reinvent the wheel I wanted to kindly ask you for your opinion if there is a simple way to do it. > > I want to merge a larger number of tables from different data sources in R and the matching criterium are country names. The tables are of different size and sometimes the country names do differ slightly. > > Has anyone done this or any recommendation on what commands I should look at to automize this task as much as possible? > > Thanks a lot for your effort in advance. > > All the best, > Werner > > > > --------------------------------- > Telefonieren Sie ohne weitere Kosten mit Ihren Freunden von PC zu PC! > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html >
bogdan romocea
2006-Jan-10 19:24 UTC
[R] matching country name tables from different sources
See http://en.wikipedia.org/wiki/Levenshtein_distance http://thread.gmane.org/gmane.comp.lang.r.general/31499> -----Original Message----- > From: r-help-bounces at stat.math.ethz.ch > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Werner > Wernersen > Sent: Tuesday, January 10, 2006 2:00 PM > To: Gabor Grothendieck > Cc: r-help at stat.math.ethz.ch > Subject: Re: [R] matching country name tables from different sources > > Thanks for the nice code, Gabor! > > Unfortunately, it seems not to work for my purpose, > confuses lots of countries when I compare two lists of over > 150 countries each. > Do you have any other suggestions? > > > > Gabor Grothendieck <ggrothendieck at gmail.com> schrieb: If > they were the same you could use merge. To figure out > the correspondence automatically or semiautomatically, try this: > > x <- c("Canada", "US", "Mexico") > y <- c("Kanada", "United States", "Mehico") > result <- outer(x, y, function(x,y) mapply(lcs2, x, y)) > result[] <- sapply(result, nchar) > # try both which.max and which.min and if you are lucky > # one of them will give unique values and that is the one to use > # In this case which.max does. > apply(result, 1, which.max) # 1 2 3 > > # calculate longest common subsequence between 2 strings > lcs2 <- function(s1,s2) { > longest <- function(x,y) if (nchar(x) > nchar(y)) x else y > # Make sure args are strings > a <- as.character(s1); an <- nchar(s1)+1 > b <- as.character(s2); bn <- nchar(s2)+1 > > > # If one arg is an empty string, returns the length of the other > if (nchar(a)==0) return(nchar(b)) > if (nchar(b)==0) return(nchar(a)) > > > # Initialize matrix for calculations > m <- matrix("", nrow=an, ncol=bn) > > for (i in 2:an) > for (j in 2:bn) > m[i,j] <- if (substr(a,i-1,i-1)==substr(b,j-1,j-1)) > paste(m[i-1,j-1], substr(a,i-1,i-1), sep = "") > else > longest(m[i-1,j], m[i,j-1]) > > # Returns the distance > m[an,bn] > } > > > > On 1/10/06, Werner Wernersen > wrote: > > Hi, > > > > Before I reinvent the wheel I wanted to kindly ask you for > your opinion if there is a simple way to do it. > > > > I want to merge a larger number of tables from different > data sources in R and the matching criterium are country > names. The tables are of different size and sometimes the > country names do differ slightly. > > > > Has anyone done this or any recommendation on what > commands I should look at to automize this task as much as possible? > > > > Thanks a lot for your effort in advance. > > > > All the best, > > Werner > > > > > > > > --------------------------------- > > Telefonieren Sie ohne weitere Kosten mit Ihren Freunden von > PC zu PC! > > > > [[alternative HTML version deleted]] > > > > ______________________________________________ > > R-help at stat.math.ethz.ch mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide! > http://www.R-project.org/posting-guide.html > > > > > > > > --------------------------------- > > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! > http://www.R-project.org/posting-guide.html >
McGehee, Robert
2006-Jan-10 19:28 UTC
[R] matching country name tables from different sources
I would throw a tolower() around s1 and s2 so that 'canada' matches with 'CANADA', and perhaps consider using a Levenshtein distance rather than the longest common subsequence. An algorithm for Levenshtein distance can be found here (courtesy of Stephen Upton) https://stat.ethz.ch/pipermail/r-help/2005-January/062254.html Robert -----Original Message----- From: Werner Wernersen [mailto:pensterfuzzer at yahoo.de] Sent: Tuesday, January 10, 2006 2:00 PM To: Gabor Grothendieck Cc: r-help at stat.math.ethz.ch Subject: Re: [R] matching country name tables from different sources Thanks for the nice code, Gabor! Unfortunately, it seems not to work for my purpose, confuses lots of countries when I compare two lists of over 150 countries each. Do you have any other suggestions? Gabor Grothendieck <ggrothendieck at gmail.com> schrieb: If they were the same you could use merge. To figure out the correspondence automatically or semiautomatically, try this: x <- c("Canada", "US", "Mexico") y <- c("Kanada", "United States", "Mehico") result <- outer(x, y, function(x,y) mapply(lcs2, x, y)) result[] <- sapply(result, nchar) # try both which.max and which.min and if you are lucky # one of them will give unique values and that is the one to use # In this case which.max does. apply(result, 1, which.max) # 1 2 3 # calculate longest common subsequence between 2 strings lcs2 <- function(s1,s2) { longest <- function(x,y) if (nchar(x) > nchar(y)) x else y # Make sure args are strings a <- as.character(s1); an <- nchar(s1)+1 b <- as.character(s2); bn <- nchar(s2)+1 # If one arg is an empty string, returns the length of the other if (nchar(a)==0) return(nchar(b)) if (nchar(b)==0) return(nchar(a)) # Initialize matrix for calculations m <- matrix("", nrow=an, ncol=bn) for (i in 2:an) for (j in 2:bn) m[i,j] <- if (substr(a,i-1,i-1)==substr(b,j-1,j-1)) paste(m[i-1,j-1], substr(a,i-1,i-1), sep = "") else longest(m[i-1,j], m[i,j-1]) # Returns the distance m[an,bn] } On 1/10/06, Werner Wernersen wrote:> Hi, > > Before I reinvent the wheel I wanted to kindly ask you for youropinion if there is a simple way to do it.> > I want to merge a larger number of tables from different data sourcesin R and the matching criterium are country names. The tables are of different size and sometimes the country names do differ slightly.> > Has anyone done this or any recommendation on what commands I shouldlook at to automize this task as much as possible?> > Thanks a lot for your effort in advance. > > All the best, > Werner > > > > --------------------------------- > Telefonieren Sie ohne weitere Kosten mit Ihren Freunden von PC zu PC! > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide!http://www.R-project.org/posting-guide.html>--------------------------------- [[alternative HTML version deleted]] ______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Gabor Grothendieck
2006-Jan-10 20:16 UTC
[R] matching country name tables from different sources
One other thing to try could be soundex. ITs normally used for last names but it might work here too. Google to find the soundex encoding rules. Reviewing the country names might suggest minor modifications to the soundex algorithm to improve it for your case. On 1/10/06, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:> You can improve it somewhat by first accepting all the largest > matches and removing the rows and columns for those and > repeatedly doing that with what is left. > > On 1/10/06, Werner Wernersen <pensterfuzzer at yahoo.de> wrote: > > Thanks for the nice code, Gabor! > > > > Unfortunately, it seems not to work for my purpose, confuses lots of > > countries when I compare two lists of over 150 countries each. > > Do you have any other suggestions? > > > > > > > > Gabor Grothendieck <ggrothendieck at gmail.com> schrieb: > > If they were the same you could use merge. To figure out > > the correspondence automatically or semiautomatically, try this: > > > > x <- c("Canada", "US", "Mexico") > > y <- c("Kanada", "United States", "Mehico") > > result <- outer(x, y, function(x,y) mapply(lcs2, x, y)) > > result[] <- sapply(result, nchar) > > # try both which.max and which.min and if you are lucky > > # one of them will give unique values and that is the one to use > > # In this case which.max does. > > apply(result, 1, which.max) # 1 2 3 > > > > # calculate longest common subsequence between 2 strings > > lcs2 <- function(s1,s2) { > > longest <- function(x,y) if (nchar(x) > nchar(y)) x else y > > # Make sure args are strings > > a <- as.character(s1); an <- nchar(s1)+1 > > b <- as.character(s2); bn <- nchar(s2)+1 > > > > > > # If one arg is an empty string, returns the length of the other > > if (nchar(a)==0) return(nchar(b)) > > if (nchar(b)==0) return(nchar(a)) > > > > > > # Initialize matrix for calculations > > m <- matrix("", nrow=an, ncol=bn) > > > > for (i in 2:an) > > for (j in 2:bn) > > m[i,j] <- if (substr(a,i-1,i-1)==substr(b,j-1,j-1)) > > paste(m[i-1,j-1], substr(a,i-1,i-1), sep = "") > > else > > longest(m[i-1,j], m[i,j-1]) > > > > # Returns the distance > > m[an,bn] > > } > > > > > > > > On 1/10/06, Werner Wernersen wrote: > > > Hi, > > > > > > Before I reinvent the wheel I wanted to kindly ask you for your opinion if > > there is a simple way to do it. > > > > > > I want to merge a larger number of tables from different data sources in R > > and the matching criterium are country names. The tables are of different > > size and sometimes the country names do differ slightly. > > > > > > Has anyone done this or any recommendation on what commands I should look > > at to automize this task as much as possible? > > > > > > Thanks a lot for your effort in advance. > > > > > > All the best, > > > Werner > > > > > > > > > > > > --------------------------------- > > > Telefonieren Sie ohne weitere Kosten mit Ihren Freunden von PC zu PC! > > > > > > [[alternative HTML version deleted]] > > > > > > ______________________________________________ > > > R-help at stat.math.ethz.ch mailing list > > > https://stat.ethz.ch/mailman/listinfo/r-help > > > PLEASE do read the posting guide! > > http://www.R-project.org/posting-guide.html > > > > > > > > > > > ________________________________ > > Telefonieren Sie ohne weitere Kosten mit Ihren Freunden von PC zu PC! > > Jetzt Yahoo! Messenger installieren! > > > > >
SAULEAU Erik-André
2006-Jan-11 08:44 UTC
[R] matching country name tables from different sources
dear all, yes but the problem with soundex for example is that it does not work when an error occur in the first place (Canada vs Kanada) as it keeps the fist character. It seems that you have to look after an approximate string matching algorithm (for example, a very good one if from Porter-Jaro and Winkler at the US Census bureau or have o look to the book of Navarro about classification of algorithm). HTH and an happy new year, erik. -----Message d'origine----- De: Gabor Grothendieck A: Werner Wernersen Cc: r-help@stat.math.ethz.ch Date: 10/01/2006 21:16 Objet: Re: [R] matching country name tables from different sources One other thing to try could be soundex. ITs normally used for last names but it might work here too. Google to find the soundex encoding rules. Reviewing the country names might suggest minor modifications to the soundex algorithm to improve it for your case. On 1/10/06, Gabor Grothendieck <ggrothendieck@gmail.com> wrote:> You can improve it somewhat by first accepting all the largest > matches and removing the rows and columns for those and > repeatedly doing that with what is left. > > On 1/10/06, Werner Wernersen <pensterfuzzer@yahoo.de> wrote: > > Thanks for the nice code, Gabor! > > > > Unfortunately, it seems not to work for my purpose, confuses lots of > > countries when I compare two lists of over 150 countries each. > > Do you have any other suggestions? > > > > > > > > Gabor Grothendieck <ggrothendieck@gmail.com> schrieb: > > If they were the same you could use merge. To figure out > > the correspondence automatically or semiautomatically, try this: > > > > x <- c("Canada", "US", "Mexico") > > y <- c("Kanada", "United States", "Mehico") > > result <- outer(x, y, function(x,y) mapply(lcs2, x, y)) > > result[] <- sapply(result, nchar) > > # try both which.max and which.min and if you are lucky > > # one of them will give unique values and that is the one to use > > # In this case which.max does. > > apply(result, 1, which.max) # 1 2 3 > > > > # calculate longest common subsequence between 2 strings > > lcs2 <- function(s1,s2) { > > longest <- function(x,y) if (nchar(x) > nchar(y)) x else y > > # Make sure args are strings > > a <- as.character(s1); an <- nchar(s1)+1 > > b <- as.character(s2); bn <- nchar(s2)+1 > > > > > > # If one arg is an empty string, returns the length of the other > > if (nchar(a)==0) return(nchar(b)) > > if (nchar(b)==0) return(nchar(a)) > > > > > > # Initialize matrix for calculations > > m <- matrix("", nrow=an, ncol=bn) > > > > for (i in 2:an) > > for (j in 2:bn) > > m[i,j] <- if (substr(a,i-1,i-1)==substr(b,j-1,j-1)) > > paste(m[i-1,j-1], substr(a,i-1,i-1), sep = "") > > else > > longest(m[i-1,j], m[i,j-1]) > > > > # Returns the distance > > m[an,bn] > > } > > > > > > > > On 1/10/06, Werner Wernersen wrote: > > > Hi, > > > > > > Before I reinvent the wheel I wanted to kindly ask you for youropinion if> > there is a simple way to do it. > > > > > > I want to merge a larger number of tables from different datasources in R> > and the matching criterium are country names. The tables are ofdifferent> > size and sometimes the country names do differ slightly. > > > > > > Has anyone done this or any recommendation on what commands Ishould look> > at to automize this task as much as possible? > > > > > > Thanks a lot for your effort in advance. > > > > > > All the best, > > > Werner > > > > > > > > > > > > --------------------------------- > > > Telefonieren Sie ohne weitere Kosten mit Ihren Freunden von PC zuPC!> > > > > > [[alternative HTML version deleted]] > > > > > > ______________________________________________ > > > R-help@stat.math.ethz.ch mailing list > > > https://stat.ethz.ch/mailman/listinfo/r-help > > > PLEASE do read the posting guide! > > http://www.R-project.org/posting-guide.html > > > > > > > > > > > ________________________________ > > Telefonieren Sie ohne weitere Kosten mit Ihren Freunden von PC zuPC!> > Jetzt Yahoo! Messenger installieren! > > > > >______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html ************************************************************************ ********** Afin d'eviter toute propagation de virus informatique, et en complement des dispositifs en place, ce message (et ses pieces jointes s'il y en a) a ete automatiquement analyse par un antivirus de messagerie. ************************************************************************ ********** ********************************************************************************** Afin d'eviter toute propagation de virus informatique, et en complement des dispositifs en place, ce message (et ses pieces jointes s'il y en a) a ete automatiquement analyse par un antivirus de messagerie. ********************************************************************************** [[alternative HTML version deleted]]