Hi,

I am a new user of R, running R 2.8.1 on Windows 2003. I have a CSV file with a single column containing 30,000 student names. Typographical errors were made while entering the names; the actual list of distinct names is < 1000, but we do not have that list available for a keyword search.

I am interested in grouping/clustering these names by letter-to-letter similarity. Are there any text clustering algorithms in R which can group names of similar type into segments of exactly matching, 90% matching, 80% matching, etc.?

Thanks in advance,

regards,
srinivas
statistical analyst.
Simply doing a tabulation and isolating the cases with only one entry might have been a possibility if the count discrepancy weren't so high. It appears you have a greater degree of corruption than would be expected just from "typos". Have you looked at the packages referenced at:

http://cran.r-project.org/web/views/NaturalLanguageProcessing.html

The Soundex algorithm is an old programming chestnut which I have seen implemented in R, but I understand there are improved versions. How well they perform on persons' names may depend strongly on the cultural origins of your population.

--
David Winsemius

On Jan 22, 2009, at 6:03 AM, srinivasa raghavan wrote:
> Are there any text clustering algorithms in R which can group names of
> similar type into segments of exactly matching, 90% matching, 80%
> matching, etc.?
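For illustration, a simplified Soundex encoder can be sketched in base R. This is only a sketch: it omits the H/W separator refinement of the full algorithm, and the `soundex()` function name and the example names are invented here, not taken from any package.

```r
# Simplified Soundex: keep the first letter, encode remaining
# consonants as digits, collapse adjacent repeats, drop vowels,
# and pad/truncate to four characters.
soundex <- function(name) {
  name <- toupper(gsub("[^A-Za-z]", "", name))
  if (nchar(name) == 0) return("")
  codes <- c(B = 1, F = 1, P = 1, V = 1,
             C = 2, G = 2, J = 2, K = 2, Q = 2, S = 2, X = 2, Z = 2,
             D = 3, T = 3, L = 4, M = 5, N = 5, R = 6)
  chars <- strsplit(name, NULL)[[1]]
  digs <- codes[chars]                 # NA for A, E, I, O, U, H, W, Y
  digs[is.na(digs)] <- 0
  keep <- c(TRUE, digs[-1] != digs[-length(digs)])  # collapse repeats
  body <- digs[keep][-1]               # drop the first letter's own code
  body <- body[body != 0]              # drop vowel/ignored positions
  substr(paste(c(chars[1], body, 0, 0, 0), collapse = ""), 1, 4)
}

# Names that sound alike map to the same code and can be grouped:
cds <- sapply(c("Robert", "Rupert", "Srinivas", "Shrinivas"), soundex)
split(names(cds), cds)
```

Grouping by the resulting codes (e.g. with split() or table()) would then collapse phonetically similar spellings into one bucket, which may or may not suit the cultural origins of the names, as noted above.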
Hans-Joerg Bibiko's levenshtein() function would help; cf. below for an example (very clumsy with two loops, but you can tweak that with apply stuff).

HTH,
STG

levenshtein <- function(string1, string2, case = TRUE, map = NULL) {
  ########
  # levenshtein algorithm in R
  #
  # Author  : Hans-Joerg Bibiko
  # Date    : 29/06/2006
  # Contact : bibiko at eva.mpg.de
  ########
  # string1, string2 := strings to compare
  # case = TRUE := case sensitivity; case = FALSE := case insensitivity
  # map := character vector of c(regexp1, replacement1, regexp2, replacement2, ...)
  #   example:
  #   map <- c("[aeiou]", "V", "[^aeiou]", "C")
  #     := replaces all vowels with V and all others with C
  #   levenshtein("Bank", "Bond", map = map) => 0
  ########
  if (!is.null(map)) {
    m <- matrix(map, ncol = 2, byrow = TRUE)
    s <- c(ifelse(case, string1, tolower(string1)),
           ifelse(case, string2, tolower(string2)))
    for (i in 1:dim(m)[1]) s <- gsub(m[i, 1], m[i, 2], s)
    string1 <- s[1]
    string2 <- s[2]
  }
  if (ifelse(case, string1, tolower(string1)) ==
      ifelse(case, string2, tolower(string2))) return(0)
  s1 <- strsplit(paste(" ", ifelse(case, string1, tolower(string1)), sep = ""), NULL)[[1]]
  s2 <- strsplit(paste(" ", ifelse(case, string2, tolower(string2)), sep = ""), NULL)[[1]]
  l1 <- length(s1)
  l2 <- length(s2)
  d <- matrix(nrow = l1, ncol = l2)
  for (i in 1:l1) d[i, 1] <- i - 1
  for (i in 1:l2) d[1, i] <- i - 1
  for (i in 2:l1)
    for (j in 2:l2)
      d[i, j] <- min(d[i - 1, j] + 1,
                     d[i, j - 1] + 1,
                     d[i - 1, j - 1] + ifelse(s1[i] == s2[j], 0, 1))
  d[l1, l2]
}  # end of function Hans-Joerg Bibiko's levenshtein

# generate names
set.seed(1)
all.names <- character(10)
for (i in 1:10) {
  all.names[i] <- paste(sample(letters, sample(4:10, 1), replace = TRUE),
                        collapse = "")
}
all.names

# generate matrix
sims <- matrix(0, nrow = 10, ncol = 10)
attr(sims, "dimnames") <- list(all.names, all.names)

# fill matrix (clumsy)
for (j in 1:9) {
  for (k in (j + 1):10) {
    sims[j, k] <- sims[k, j] <- levenshtein(all.names[j], all.names[k])
  }
}

plot(hclust(as.dist(sims)))
Srinivas,

I don't know of a clustering algorithm, but you might check out agrep() from the base package and stringMatch() from the MiscPsycho package. These can help to identify similar text sequences, and it may be possible to group similar names by applying these commands repeatedly.

Ed

--
Ed Merkle, PhD
Assistant Professor
Dept. of Psychology
Wichita State University
Wichita, KS 67260

> Date: Thu, 22 Jan 2009 16:33:03 +0530
> From: srinivasa raghavan <srinivasraghav at gmail.com>
> Subject: [R] text vector clustering
>
> Are there any text clustering algorithms in R which can group names of
> similar type into segments of exactly matching, 90% matching, 80%
> matching, etc.?
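To illustrate the "applying these commands repeatedly" idea with agrep(): its max.distance argument (a fraction of the pattern length) maps naturally onto the "90% / 80% matching" notion. The sketch below is my own greedy grouping scheme, not an established algorithm, and the names are invented:

```r
# Greedy grouping with base R's agrep(): each name joins the first
# existing group whose representative it approximately matches
# (within max.distance), otherwise it starts a new group.
group_names <- function(x, max.distance = 0.2) {
  reps <- character(0)              # one representative name per group
  grp  <- integer(length(x))
  for (i in seq_along(x)) {
    hit <- 0
    for (g in seq_along(reps)) {
      if (length(agrep(reps[g], x[i], max.distance = max.distance)) > 0) {
        hit <- g
        break
      }
    }
    if (hit == 0) {                 # no group matched: open a new one
      reps <- c(reps, x[i])
      hit <- length(reps)
    }
    grp[i] <- hit
  }
  split(x, grp)
}

groups <- group_names(c("Srinivas", "Srinivaas", "Raghavan",
                        "Ragavan", "Kumar"))
groups
```

One caveat: agrep() does approximate substring matching, so a short representative can occasionally match inside a longer, unrelated name; with 30,000 names the result will also depend on input order, since the first spelling seen becomes the group's representative.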
Dear srinivas,

You can try using trigrams, a special case of n-grams, often used in Natural Language Processing.

> Are there any text clustering algorithms in R which can group names of
> similar type into segments of exactly matching, 90% matching, 80%
> matching, etc.?

As an example:

# Suppose we have a list of locations.
# (Here we use a matrix; the second column holds sampling probabilities
# used only to create the sample, and is not otherwise relevant.)

# locations with errors
Poblacion_dist <- matrix(
  c("MADRIZ", 0.3, "BARÇELONA", 0.25, "BILAO", 0.135, "SEVILA", 0.1,
    "VALENÇIA", 0.1, "CORUNA", 0.025, "ALACANTE", 0.025,
    "VALLADOLI", 0.025, "SANTIAGO", 0.01, "SAN SEBASTIAN", 0.01,
    "CADIZ", 0.01, "ZARAGOZA", 0.01),
  ncol = 2, byrow = TRUE)

# true locations
Poblacion <- matrix(
  c("MADRID", 0.3, "BARCELONA", 0.25, "BILBAO", 0.135, "SEVILLA", 0.1,
    "VALENCIA", 0.1, "CORUÑA", 0.025, "ALICANTE", 0.025,
    "VALLADOLID", 0.025, "SANTIAGO", 0.01, "SAN_SEBASTIAN", 0.01,
    "CADIZ", 0.01, "ZARAGOZA", 0.01),
  ncol = 2, byrow = TRUE)

# draw names according to the given probabilities
muestrear <- function(que, cuantas_veces) {
  sample(que[, 1], prob = as.numeric(que[, 2]), cuantas_veces)
}

Provincias <- replicate(10, c(muestrear(Poblacion, 1),
                              muestrear(Poblacion_dist, 1)))
# now we have a vector with 20 locations
Provincias <- Provincias[1:length(Provincias)]

# next we need to process each location as a set of trigrams
word2trigram <- function(word) {
  trigramatrix <- matrix(c(seq(1, nchar(word) - 2),
                           seq(1, nchar(word) - 2) + 2),
                         ncol = 2, byrow = FALSE)
  trigram <- c()
  for (i in 1:nrow(trigramatrix)) {
    trigram <- append(trigram,
                      substr(word, trigramatrix[i, 1], trigramatrix[i, 2]))
  }
  return(trigram)
}

Prov2trigram <- lapply(Provincias, word2trigram)

# every trigram in the sample
Trigrams <- levels(factor(unlist(Prov2trigram)))

# count how many times each trigram appears in each location
ocrrnc.mtrx <- matrix(rep(0, length(Trigrams) * length(Prov2trigram)),
                      ncol = length(Prov2trigram))
for (i in 1:ncol(ocrrnc.mtrx)) {
  ocrrnc.mtrx[, i] <- as.integer(table(append(Prov2trigram[[i]], Trigrams)) - 1)
}

# calculate the cosine similarity (often used in NLP)
matrizCos <- function(X) {
  X <- t(X)
  nterm <- nrow(X)
  modulo <- c()
  cosen <- matrix(rep(0, nterm * nterm), ncol = nterm)
  for (i in 1:nterm) {
    Vec <- X[i, ]
    modulo[i] <- sqrt(Vec %*% Vec)
    cosen[, i] <- X %*% Vec
  }
  cosen <- (cosen / modulo) / matrix(rep(modulo, nterm), ncol = nterm, byrow = TRUE)
  cosen[is.nan(cosen)] <- 0
  return(cosen)
}

rslt.dst.mat <- matrizCos(ocrrnc.mtrx)

# and get the clusters
attr(rslt.dst.mat, "dimnames") <- list(Provincias, Provincias)
plot(hclust(as.dist(1 - rslt.dst.mat), method = "med"))

I hope this helps,

Eduardo San Miguel Martin
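To recover the requested "exact / 90% / 80% matching" segments from such a dendrogram, one can cut it at increasing heights with cutree(). A minimal, self-contained sketch: the similarity matrix below is toy data I made up for illustration; with the code above you would instead cut hclust(as.dist(1 - rslt.dst.mat)):

```r
# Toy cosine-style similarity matrix: two near-duplicate pairs
# (MADRID/MADRIZ, BILBAO/BILAO) plus one singleton.
nms <- c("MADRID", "MADRIZ", "BILBAO", "BILAO", "SEVILLA")
sim <- matrix(c(1.0, 0.9, 0.1, 0.1, 0.1,
                0.9, 1.0, 0.1, 0.1, 0.1,
                0.1, 0.1, 1.0, 0.8, 0.2,
                0.1, 0.1, 0.8, 1.0, 0.2,
                0.1, 0.1, 0.2, 0.2, 1.0),
              nrow = 5, ncol = 5, dimnames = list(nms, nms))

# cluster on dissimilarity = 1 - similarity
hc <- hclust(as.dist(1 - sim), method = "average")

# cutting at increasing heights gives progressively looser segments
cutree(hc, h = 0.1)   # only the closest pair merges
cutree(hc, h = 0.3)   # both near-duplicate pairs merge
```

The cut heights play the role of the "10% / 20% dissimilarity" thresholds; printing each cutree() result as a table of group labels shows which names fall in the same segment at that level.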