Hi,

I want to know which distance is used in the function kmeans, and whether we can change this distance. Indeed, in the function pam we can pass a distance matrix as an argument (with the line "pam <- pam(dist(matrixdata), k=7)"), but we can't do this in the function kmeans: we have to pass the data matrix directly ...

Thanks in advance,
Nicolas BOUGET
n.bouget wrote:
> Hi,
> I want to know which distance is used in the function kmeans,
> and whether we can change this distance.
> Indeed, in the function pam we can pass a distance matrix as
> an argument (with the line "pam <- pam(dist(matrixdata), k=7)"),
> but we can't do this in the function kmeans: we have to pass
> the data matrix directly ...
> Thanks in advance,
> Nicolas BOUGET

As the name says, kmeans() calculates *means* (centres) of clusters. It does not make any sense to do that on distances ...

Uwe Ligges

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
> n.bouget wrote:
> > Hi,
> > I want to know which distance is used in the function kmeans,
> > and whether we can change this distance.
> > Indeed, in the function pam we can pass a distance matrix as
> > an argument (with the line "pam <- pam(dist(matrixdata), k=7)"),
> > but we can't do this in the function kmeans: we have to pass
> > the data matrix directly ...

Yes, but how can we choose the distance used to calculate the centres?

> > Thanks in advance,
> > Nicolas BOUGET
>
> As the name says, kmeans() calculates *means* (centres) of clusters.
> It does not make any sense to do that on distances ...
>
> Uwe Ligges
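To make the contrast concrete, here is a minimal sketch of the two interfaces (the data matrix is random stand-in data, not from the original post): pam() in the "cluster" package accepts a precomputed "dist" object built with whatever method you like, while kmeans() only accepts the raw data matrix and implicitly assigns points to centres by (squared) Euclidean distance.

```r
# pam() vs kmeans(): only the former accepts a dissimilarity object.
library(cluster)                             # for pam()
set.seed(42)
matrixdata <- matrix(rnorm(200), ncol = 5)   # 40 points in 5 dimensions

# pam() takes any dissimilarity, e.g. Manhattan distances:
fit.pam <- pam(dist(matrixdata, method = "manhattan"), k = 7)

# kmeans() takes the data matrix itself; the distance is not a parameter:
fit.km <- kmeans(matrixdata, centers = 7)
```

Here fit.pam$clustering and fit.km$cluster both give one cluster label per row of matrixdata; the difference is that only pam() let us pick the dissimilarity.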
>>>>> "n" == n bouget <n>
>>>>> on Fri, 28 May 2004 09:37:35 +0200 writes:

    n> Hi, I want to know which distance is used in the function
    n> kmeans and whether we can change this distance. Indeed, in
    n> the function pam we can pass a distance matrix as an argument
    n> (with the line "pam <- pam(dist(matrixdata), k=7)"), but we
    n> can't do it in the function kmeans; we have to pass the data
    n> matrix directly ... Thanks in advance, Nicolas BOUGET

It might be interesting to look at this from the pam() perspective: what exactly is pam() lacking that kmeans() does for you?

Christian, are you suggesting that pam() could do the job if

 1) there was a dist(., method = "a la kmeans"), and
 2) pam() allowed being started from a user-specified set of medoids
    instead of the "Kaufman-Rousseeuw-optimal" ones?

Regards,
Martin Maechler
n.bouget wrote:
> Hi,
> I want to know which distance is used in the function kmeans,
> and whether we can change this distance.
> Indeed, in the function pam we can pass a distance matrix as
> an argument (with the line "pam <- pam(dist(matrixdata), k=7)"),
> but we can't do this in the function kmeans: we have to pass
> the data matrix directly ...
> Thanks in advance,
> Nicolas BOUGET

One solution is to transform the data in such a way that the Euclidean distance between the transformed values represents some other distance between the original values. This works at least for the Mahalanobis distance, when one applies a multivariate technique to a PCA-transformed and re-scaled matrix, but I don't know whether transformations exist for other distance measures.

Thomas P.
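A minimal sketch of that idea for the Mahalanobis distance (random stand-in data; this is one reading of the suggestion above, not code from the thread, and it assumes a full-rank covariance matrix). Rotating the data with PCA and dividing each score column by its standard deviation yields a space in which Euclidean distances equal Mahalanobis distances in the original space, so kmeans on the transformed matrix is kmeans under the Mahalanobis distance:

```r
# kmeans under Mahalanobis distance via PCA transform + re-scaling
# (full-rank covariance assumed; data are random stand-ins).
set.seed(1)
mdata <- matrix(rnorm(200), ncol = 4)        # 50 points in 4 dimensions

p <- prcomp(mdata)                           # PCA (centres the data)
white <- scale(p$x, center = FALSE, scale = p$sdev)  # unit-variance scores

# Euclidean distance in 'white' equals Mahalanobis distance in 'mdata',
# e.g. between the first two rows:
d.white <- as.numeric(dist(white[1:2, ]))
d.maha  <- sqrt(as.numeric(mahalanobis(mdata[1, ] - mdata[2, ],
                                       center = rep(0, 4),
                                       cov = cov(mdata))))

fit <- kmeans(white, centers = 3)            # "Mahalanobis k-means"
```

The check works because cov(white) is the identity matrix, so squared Euclidean distances in the transformed space are exactly the squared Mahalanobis distances in the original one.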
I don't exactly understand what you do. Could you show me the program that you execute to do that?

> One solution is to transform the data in such a way that the Euclidean
> distance between the transformed values represents some other distance
> between the original values. This works at least for the Mahalanobis
> distance, when one applies a multivariate technique to a
> PCA-transformed and re-scaled matrix, but I don't know whether
> transformations exist for other distance measures.
>
> Thomas P.
My thread broke as I wrote this at home, and there were no new messages on this subject after I got home. I hope this still reaches interested parties.

There are several methods that find centroids (means) from distance data. Centroid clustering methods do so, and so does classic scaling, a.k.a. metric multidimensional scaling, a.k.a. principal co-ordinates analysis (in the R function cmdscale, the means are found in the C function dblcen.c in the R sources). Strictly, this centroid finding only works with Euclidean distances, but these methods willingly handle any other dissimilarities (or distances). Sometimes this results in anomalies, like upper levels being below lower levels in cluster diagrams, or negative eigenvalues in cmdscale. In principle, kmeans could do the same if she only wanted.

Is it correct to use non-Euclidean dissimilarities when Euclidean distances were assumed? In my field (ecology) we know that Euclidean distances are often poor and some other dissimilarities have better properties, and I think it is OK to break the rules (or `violate the assumptions'). Now, we don't know what kind of dissimilarities were used in the original post (I think I never saw this specified), so we don't know whether they can be euclidized directly using the ideas of Petzold or Simpson. They might be semimetric or other sinful dissimilarities, too. These would be bad in the sense Uwe Ligges wrote: you wouldn't get centres of Voronoi polygons in the original space, not even non-overlapping polygons. Still, they might work better than the original space (who wants to be in the original space when there are better spaces floating around?).

The following trick handles the problem by euclidizing the space implied by any dissimilarity measure (metric or semimetric).
Here mdata is your original (rectangular) data matrix, and dis is any dissimilarity data:

    tmp <- cmdscale(dis, k = nrow(mdata) - 1, eig = TRUE)
    keep <- tmp$eig[seq_len(ncol(tmp$points))] > 0.01
    eucspace <- tmp$points[, keep, drop = FALSE]

The condition removes axes with negative or almost-zero eigenvalues that you will get with semimetric dissimilarities. (An arbitrary dissimilarity among n points may need up to n - 1 positive axes, hence k = nrow(mdata) - 1; and since cmdscale returns eigenvalues for all axes while possibly dropping non-positive columns from tmp$points, the logical index must be cut down to the columns actually returned.) Then just call kmeans with eucspace as the argument.

If your dis is Euclidean, this is only a rotation, and kmeans of eucspace and of mdata should be equal. For other types of dis (even for a semimetric dissimilarity), this maps your dissimilarities onto a Euclidean space, which in effect is the same as performing kmeans with your original dissimilarity.

Cheers, jari oksanen
--
Jari Oksanen, Oulu, Finland
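A runnable version of the trick end-to-end (random stand-in data; the Manhattan dissimilarity is just an example choice, and cmdscale may warn about non-positive eigenvalues, which is expected). For a genuinely Euclidean dis, the embedding reproduces the original pairwise distances exactly, which is the "only a rotation" case mentioned above:

```r
# Euclidize an arbitrary dissimilarity with cmdscale(), then run kmeans().
set.seed(2)
mdata <- matrix(rnorm(120), ncol = 4)        # 30 points in 4 dimensions
dis <- dist(mdata, method = "manhattan")     # any dissimilarity

tmp <- cmdscale(dis, k = nrow(mdata) - 1, eig = TRUE)
keep <- tmp$eig[seq_len(ncol(tmp$points))] > 0.01  # drop ~zero/negative axes
eucspace <- tmp$points[, keep, drop = FALSE]

fit <- kmeans(eucspace, centers = 3)         # kmeans in the embedded space

# Sanity check for the Euclidean special case: the embedding preserves
# all pairwise distances (up to rotation/reflection).
tmp.e <- cmdscale(dist(mdata), k = nrow(mdata) - 1, eig = TRUE)
eu <- tmp.e$points[, tmp.e$eig[seq_len(ncol(tmp.e$points))] > 0.01,
                   drop = FALSE]
```

With the Euclidean input, eu spans the same 4 dimensions as mdata, so dist(eu) and dist(mdata) agree to numerical precision.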