Hi,

I am a new user of R, running R 2.8.1 on Windows 2003. I have a CSV file with a single column containing 30,000 student names. Typographical errors were made while entering the names; the actual list of distinct names is < 1000, but we do not have that list available for a keyword search.

I am interested in grouping/clustering these names by letter-to-letter similarity. Are there any text clustering algorithms in R which can group names of similar type into segments of exactly matching, 90% matching, 80% matching, etc.?

Thanks in advance,

regards,
srinivas
statistical analyst.
Simply doing a tabulation and isolating the cases with only one entry might have been a possibility if the count discrepancy weren't so high. It appears you have a greater degree of corruption than would be expected just from "typos". Have you looked at the packages referenced at:

http://cran.r-project.org/web/views/NaturalLanguageProcessing.html

The Soundex algorithm is an old programming chestnut which I have seen implemented in R, but I understand there are improved versions. How well they perform on persons' names may depend strongly on the cultural origins of your population.

--
David Winsemius

On Jan 22, 2009, at 6:03 AM, srinivasa raghavan wrote:
> Are there any text clustering algorithms in R which can group names of
> similar type into segments of exactly matching, 90% matching, 80%
> matching, etc.?
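For illustration, a simplified Soundex encoder can be sketched in base R. This is only a sketch: it omits the H/W separator refinement of the full algorithm, and the `soundex()` function name and the example names are invented here, not taken from any package.

```r
# Simplified Soundex: keep the first letter, encode remaining
# consonants as digits, collapse adjacent repeats, drop vowels,
# and pad/truncate to four characters.
soundex <- function(name) {
  name <- toupper(gsub("[^A-Za-z]", "", name))
  if (nchar(name) == 0) return("")
  codes <- c(B = 1, F = 1, P = 1, V = 1,
             C = 2, G = 2, J = 2, K = 2, Q = 2, S = 2, X = 2, Z = 2,
             D = 3, T = 3, L = 4, M = 5, N = 5, R = 6)
  chars <- strsplit(name, NULL)[[1]]
  digs <- codes[chars]                 # NA for A, E, I, O, U, H, W, Y
  digs[is.na(digs)] <- 0
  keep <- c(TRUE, digs[-1] != digs[-length(digs)])  # collapse repeats
  body <- digs[keep][-1]               # drop the first letter's own code
  body <- body[body != 0]              # drop vowel/ignored positions
  substr(paste(c(chars[1], body, 0, 0, 0), collapse = ""), 1, 4)
}

# Names that sound alike map to the same code and can be grouped:
cds <- sapply(c("Robert", "Rupert", "Srinivas", "Shrinivas"), soundex)
split(names(cds), cds)
```

Grouping by the resulting codes (e.g. with split() or table()) would then collapse phonetically similar spellings into one bucket, which may or may not suit the cultural origins of the names, as noted above.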
Hans-Joerg Bibiko's levenshtein() function would help; cf. below for an example (very clumsy with two loops, but you can tweak that with apply stuff).

HTH,
STG

levenshtein <- function(string1, string2, case = TRUE, map = NULL) {
  ########
  # levenshtein algorithm in R
  #
  # Author  : Hans-Joerg Bibiko
  # Date    : 29/06/2006
  # Contact : bibiko at eva.mpg.de
  ########
  # string1, string2 := strings to compare
  # case = TRUE := case sensitivity; case = FALSE := case insensitivity
  # map := character vector of c(regexp1, replacement1, regexp2, replacement2, ...)
  #   example:
  #   map <- c("[aeiou]", "V", "[^aeiou]", "C")
  #     := replaces all vowels with V and all others with C
  #   levenshtein("Bank", "Bond", map = map) => 0
  ########
  if (!is.null(map)) {
    m <- matrix(map, ncol = 2, byrow = TRUE)
    s <- c(ifelse(case, string1, tolower(string1)),
           ifelse(case, string2, tolower(string2)))
    for (i in 1:dim(m)[1]) s <- gsub(m[i, 1], m[i, 2], s)
    string1 <- s[1]
    string2 <- s[2]
  }
  if (ifelse(case, string1, tolower(string1)) ==
      ifelse(case, string2, tolower(string2))) return(0)
  s1 <- strsplit(paste(" ", ifelse(case, string1, tolower(string1)), sep = ""), NULL)[[1]]
  s2 <- strsplit(paste(" ", ifelse(case, string2, tolower(string2)), sep = ""), NULL)[[1]]
  l1 <- length(s1)
  l2 <- length(s2)
  d <- matrix(nrow = l1, ncol = l2)
  for (i in 1:l1) d[i, 1] <- i - 1
  for (i in 1:l2) d[1, i] <- i - 1
  for (i in 2:l1)
    for (j in 2:l2)
      d[i, j] <- min(d[i - 1, j] + 1,
                     d[i, j - 1] + 1,
                     d[i - 1, j - 1] + ifelse(s1[i] == s2[j], 0, 1))
  d[l1, l2]
}  # end of function Hans-Joerg Bibiko's levenshtein

# generate names
set.seed(1)
all.names <- character(10)
for (i in 1:10) {
  all.names[i] <- paste(sample(letters, sample(4:10, 1), replace = TRUE),
                        collapse = "")
}
all.names

# generate matrix
sims <- matrix(0, nrow = 10, ncol = 10)
attr(sims, "dimnames") <- list(all.names, all.names)

# fill matrix (clumsy)
for (j in 1:9) {
  for (k in (j + 1):10) {
    sims[j, k] <- sims[k, j] <- levenshtein(all.names[j], all.names[k])
  }
}

plot(hclust(as.dist(sims)))
Srinivas,

I don't know of a clustering algorithm, but you might check out agrep() from the base package and stringMatch() from the MiscPsycho package. These can help to identify similar text sequences, and it may be possible to group similar names by applying these commands repeatedly.

Ed

--
Ed Merkle, PhD
Assistant Professor
Dept. of Psychology
Wichita State University
Wichita, KS 67260

> Date: Thu, 22 Jan 2009 16:33:03 +0530
> From: srinivasa raghavan <srinivasraghav at gmail.com>
> Subject: [R] text vector clustering
>
> Are there any text clustering algorithms in R which can group names of
> similar type into segments of exactly matching, 90% matching, 80%
> matching, etc.?
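To illustrate the "applying these commands repeatedly" idea with agrep(): its max.distance argument (a fraction of the pattern length) maps naturally onto the "90% / 80% matching" notion. The sketch below is my own greedy grouping scheme, not an established algorithm, and the names are invented:

```r
# Greedy grouping with base R's agrep(): each name joins the first
# existing group whose representative it approximately matches
# (within max.distance), otherwise it starts a new group.
group_names <- function(x, max.distance = 0.2) {
  reps <- character(0)              # one representative name per group
  grp  <- integer(length(x))
  for (i in seq_along(x)) {
    hit <- 0
    for (g in seq_along(reps)) {
      if (length(agrep(reps[g], x[i], max.distance = max.distance)) > 0) {
        hit <- g
        break
      }
    }
    if (hit == 0) {                 # no group matched: open a new one
      reps <- c(reps, x[i])
      hit <- length(reps)
    }
    grp[i] <- hit
  }
  split(x, grp)
}

groups <- group_names(c("Srinivas", "Srinivaas", "Raghavan",
                        "Ragavan", "Kumar"))
groups
```

One caveat: agrep() does approximate substring matching, so a short representative can occasionally match inside a longer, unrelated name; with 30,000 names the result will also depend on input order, since the first spelling seen becomes the group's representative.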
Dear srinivas,

You can try using trigrams, a special case of n-grams, often used in Natural Language Processing.

> Are there any text clustering algorithms in R which can group names of
> similar type into segments of exactly matching, 90% matching, 80%
> matching, etc.?

As an example:

# Suppose we have a list of locations.
# (Here we use a matrix; the second column holds sampling probabilities
# used only to create the sample, and is not otherwise relevant.)

# locations with errors
Poblacion_dist <- matrix(
  c("MADRIZ", 0.3, "BARÇELONA", 0.25, "BILAO", 0.135, "SEVILA", 0.1,
    "VALENÇIA", 0.1, "CORUNA", 0.025, "ALACANTE", 0.025,
    "VALLADOLI", 0.025, "SANTIAGO", 0.01, "SAN SEBASTIAN", 0.01,
    "CADIZ", 0.01, "ZARAGOZA", 0.01),
  ncol = 2, byrow = TRUE)

# true locations
Poblacion <- matrix(
  c("MADRID", 0.3, "BARCELONA", 0.25, "BILBAO", 0.135, "SEVILLA", 0.1,
    "VALENCIA", 0.1, "CORUÑA", 0.025, "ALICANTE", 0.025,
    "VALLADOLID", 0.025, "SANTIAGO", 0.01, "SAN_SEBASTIAN", 0.01,
    "CADIZ", 0.01, "ZARAGOZA", 0.01),
  ncol = 2, byrow = TRUE)

# draw names according to the given probabilities
muestrear <- function(que, cuantas_veces) {
  sample(que[, 1], prob = as.numeric(que[, 2]), cuantas_veces)
}

Provincias <- replicate(10, c(muestrear(Poblacion, 1),
                              muestrear(Poblacion_dist, 1)))
# now we have a vector with 20 locations
Provincias <- Provincias[1:length(Provincias)]

# next we need to process each location as a set of trigrams
word2trigram <- function(word) {
  trigramatrix <- matrix(c(seq(1, nchar(word) - 2),
                           seq(1, nchar(word) - 2) + 2),
                         ncol = 2, byrow = FALSE)
  trigram <- c()
  for (i in 1:nrow(trigramatrix)) {
    trigram <- append(trigram,
                      substr(word, trigramatrix[i, 1], trigramatrix[i, 2]))
  }
  return(trigram)
}

Prov2trigram <- lapply(Provincias, word2trigram)

# every trigram in the sample
Trigrams <- levels(factor(unlist(Prov2trigram)))

# count how many times each trigram appears in each location
ocrrnc.mtrx <- matrix(rep(0, length(Trigrams) * length(Prov2trigram)),
                      ncol = length(Prov2trigram))
for (i in 1:ncol(ocrrnc.mtrx)) {
  ocrrnc.mtrx[, i] <- as.integer(table(append(Prov2trigram[[i]], Trigrams)) - 1)
}

# calculate the cosine similarity (often used in NLP)
matrizCos <- function(X) {
  X <- t(X)
  nterm <- nrow(X)
  modulo <- c()
  cosen <- matrix(rep(0, nterm * nterm), ncol = nterm)
  for (i in 1:nterm) {
    Vec <- X[i, ]
    modulo[i] <- sqrt(Vec %*% Vec)
    cosen[, i] <- X %*% Vec
  }
  cosen <- (cosen / modulo) / matrix(rep(modulo, nterm), ncol = nterm, byrow = TRUE)
  cosen[is.nan(cosen)] <- 0
  return(cosen)
}

rslt.dst.mat <- matrizCos(ocrrnc.mtrx)

# and get the clusters
attr(rslt.dst.mat, "dimnames") <- list(Provincias, Provincias)
plot(hclust(as.dist(1 - rslt.dst.mat), method = "med"))

I hope this helps,

Eduardo San Miguel Martin
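To recover the requested "exact / 90% / 80% matching" segments from such a dendrogram, one can cut it at increasing heights with cutree(). A minimal, self-contained sketch: the similarity matrix below is toy data I made up for illustration; with the code above you would instead cut hclust(as.dist(1 - rslt.dst.mat)):

```r
# Toy cosine-style similarity matrix: two near-duplicate pairs
# (MADRID/MADRIZ, BILBAO/BILAO) plus one singleton.
nms <- c("MADRID", "MADRIZ", "BILBAO", "BILAO", "SEVILLA")
sim <- matrix(c(1.0, 0.9, 0.1, 0.1, 0.1,
                0.9, 1.0, 0.1, 0.1, 0.1,
                0.1, 0.1, 1.0, 0.8, 0.2,
                0.1, 0.1, 0.8, 1.0, 0.2,
                0.1, 0.1, 0.2, 0.2, 1.0),
              nrow = 5, ncol = 5, dimnames = list(nms, nms))

# cluster on dissimilarity = 1 - similarity
hc <- hclust(as.dist(1 - sim), method = "average")

# cutting at increasing heights gives progressively looser segments
cutree(hc, h = 0.1)   # only the closest pair merges
cutree(hc, h = 0.3)   # both near-duplicate pairs merge
```

The cut heights play the role of the "10% / 20% dissimilarity" thresholds; printing each cutree() result as a table of group labels shows which names fall in the same segment at that level.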