Dear R list, I'm trying to calculate Mahalanobis distances for 'Species' of 'iris' data as obtained below: Squared Distance to Species From Species: Setosa Versicolor Virginica Setosa 0 89.86419 179.38471 Versicolor 89.86419 0 17.20107 Virginica 179.38471 17.20107 0 These distances were obtained with proc 'CANDISC' of SAS, please, see Output 21.1.2: Iris Data: Squared Mahalanobis Distances from http://www.id.unizh.ch/software/unix/statmath/sas/sasdoc/stat/chap21/sect19.htm From these distances my intention is to make a cluster analysis as below, using the package 'mclust': In prior mail, my basic question was: how to obtain this matrix with R from 'iris' data? Well, I think that the basic soluction to calculate this distances is: # # --- Begin R script 1 --- # x = as.matrix(iris[,1:4]) tra = iris[,5] man = manova(x ~ tra) # Mahalanobis E = summary(man)$SS[2] #Matrix E S = as.matrix(E$Residuals)/man$df.residual InvS = solve(S) ms = matrix(unlist(by(x, tra, mean)), byrow=T, ncol=ncol(x)) colnames(ms) = names(iris[1:4]) rownames(ms) = c('Set', 'Ver', 'Vir') D2.12 = (ms[1,] - ms[2,])%*%InvS%*%(ms[1,] - ms[2,]) print(D2.12) D2.13 = (ms[1,] - ms[3,])%*%InvS%*%(ms[1,] - ms[3,]) print(D2.13) D2.23 = (ms[2,] - ms[3,])%*%InvS%*%(ms[2,] - ms[3,]) print(D2.23) # # --- End R script 1 --- # Well, I would like to generalize a soluction to obtain the matrices like 'Mah' (below) or a complete matrix like in the Output 21.1.2. Somebody could help me? # # --- Begin R script 2 --- # Mah = c( 0, 89.86419, 0, 179.38471, 17.20107, 0) n = 3 D = matrix(0, n, n) nam = c('Set', 'Ver', 'Vir') rownames(D) = nam colnames(D) = nam k = 0 for (i in 1:n) { for (j in 1:i) { k = k+1 D[i,j] = Mah[k] D[j,i] = Mah[k] } } D=sqrt(D) #D2 -> D library(mclust) dendroS = hclust(as.dist(D), method='single') dendroC = hclust(as.dist(D), method='complete') win.graph(w = 3.5, h = 6) split.screen(c(2, 1)) screen(1) plot(dendroS, main='Single', sub='', xlab='', ylab='', col='blue') screen(2) plot(dendroC, main='Complete', sub='', xlab='', col='red') # # --- End R script 2 --- # I always need of this type of analysis and I'm not founding how to make it in the CRAN documentation (Archives, packages: mclust, cluster, fpc and mva). Regards, -- Jose Claudio Faria Brasil/Bahia/UESC/DCET Estatistica Experimental/Prof. Adjunto mails: joseclaudio.faria at terra.com.br
Dear Jose, normal mixture clustering (mclust) operates on points times variables data and not on a distance matrix. Therefore it doesn't make sense to compute Mahalanobis distances before using mclust. Furthermore, cluster analysis based on distance matrices (hclust or pam, say) operates on a point by point distance matrix (be it Mahalanobis or something else). You show a group by group matrix below, for which I don't see any purpose in cluster analysis. Have you looked at function mahalanobis? Christian On Fri, 8 Jul 2005, Jose Claudio Faria wrote:> Dear R list, > > I'm trying to calculate Mahalanobis distances for 'Species' of 'iris' data > as obtained below: > > Squared Distance to Species From Species: > > Setosa Versicolor Virginica > Setosa 0 89.86419 179.38471 > Versicolor 89.86419 0 17.20107 > Virginica 179.38471 17.20107 0 > > These distances were obtained with proc 'CANDISC' of SAS, please, > see Output 21.1.2: Iris Data: Squared Mahalanobis Distances from > http://www.id.unizh.ch/software/unix/statmath/sas/sasdoc/stat/chap21/sect19.htm > > From these distances my intention is to make a cluster analysis as below, using > the package 'mclust': > > In prior mail, my basic question was: how to obtain this matrix with R > from 'iris' data? > > Well, I think that the basic soluction to calculate this distances is: > > # > # --- Begin R script 1 --- > # > x = as.matrix(iris[,1:4]) > tra = iris[,5] > > man = manova(x ~ tra) > > # Mahalanobis > E = summary(man)$SS[2] #Matrix E > S = as.matrix(E$Residuals)/man$df.residual > InvS = solve(S) > ms = matrix(unlist(by(x, tra, mean)), byrow=T, ncol=ncol(x)) > colnames(ms) = names(iris[1:4]) > rownames(ms) = c('Set', 'Ver', 'Vir') > D2.12 = (ms[1,] - ms[2,])%*%InvS%*%(ms[1,] - ms[2,]) > print(D2.12) > D2.13 = (ms[1,] - ms[3,])%*%InvS%*%(ms[1,] - ms[3,]) > print(D2.13) > D2.23 = (ms[2,] - ms[3,])%*%InvS%*%(ms[2,] - ms[3,]) > print(D2.23) > # > # --- End R script 1 --- > # > > Well, I would like to generalize a soluction to obtain > the matrices like 'Mah' (below) or a complete matrix like in the > Output 21.1.2. Somebody could help me? > > # > # --- Begin R script 2 --- > # > > Mah = c( 0, > 89.86419, 0, > 179.38471, 17.20107, 0) > > n = 3 > D = matrix(0, n, n) > > nam = c('Set', 'Ver', 'Vir') > rownames(D) = nam > colnames(D) = nam > > k = 0 > for (i in 1:n) { > for (j in 1:i) { > k = k+1 > D[i,j] = Mah[k] > D[j,i] = Mah[k] > } > } > > D=sqrt(D) #D2 -> D > > library(mclust) > dendroS = hclust(as.dist(D), method='single') > dendroC = hclust(as.dist(D), method='complete') > > win.graph(w = 3.5, h = 6) > split.screen(c(2, 1)) > screen(1) > plot(dendroS, main='Single', sub='', xlab='', ylab='', col='blue') > > screen(2) > plot(dendroC, main='Complete', sub='', xlab='', col='red') > # > # --- End R script 2 --- > # > > I always need of this type of analysis and I'm not founding how to make it in > the CRAN documentation (Archives, packages: mclust, cluster, fpc and mva). > > Regards, > -- > Jose Claudio Faria > Brasil/Bahia/UESC/DCET > Estatistica Experimental/Prof. Adjunto > mails: > joseclaudio.faria at terra.com.br > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html >*** NEW ADDRESS! *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
Well, as I did not get a satisfactory reply to the original question I tried to make a basic function that, I find, solve the question. I think it is not the better function, but it is working. So, perhaps it can be useful to other people. # # Calculate the matrix of Mahalanobis Distances between groups # from data.frames # # by: Jos?? Cl??udio Faria # date: 10/7/05 13:23:48 # D2Mah = function(y, x) { stopifnot(is.data.frame(y), !missing(x)) stopifnot(dim(y)[1] != dim(x)[1]) y = as.matrix(y) x = as.factor(x) man = manova(y ~ x) E = summary(man)$SS[2] #Matrix E S = as.matrix(E$Residuals)/man$df.residual InvS = solve(S) mds = matrix(unlist(by(y, x, mean)), byrow=T, ncol=ncol(y)) colnames(mds) = names(y) Objects = levels(x) rownames(mds) = Objects library(gtools) nObjects = nrow(mds) comb = combinations(nObjects, 2) tmpD2 = numeric() for (i in 1:dim(comb)[1]){ a = comb[i,1] b = comb[i,2] tmpD2[i] = (mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,]) } # Thanks Gabor for the below tmpMah = matrix(0, nObjects, nObjects, dimnames=list(Objects, Objects)) tmpMah[lower.tri(tmpMah)] = tmpD2 D2 = tmpMah + t(tmpMah) return(D2) } # # To try # D2M = D2Mah(iris[,1:4], iris[,5]) print(D2M) Thanks all for the complementary aid (specially to Gabor). Regards, -- Jose Claudio Faria Brasil/Bahia/UESC/DCET Estatistica Experimental/Prof. Adjunto mails: joseclaudio.faria at terra.com.br jc_faria at uesc.br jc_faria at uol.com.br tel: 73-3634.2779
I think you could simplify this by replacing everything after the nObjects = nrow(mds) line with just these two statements. f <- function(a,b) mapply(function(a,b) (mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,]), a,b) D2 <- outer(seq(nObjects), seq(nObjects), f) This also eliminates dependence on gtools and the complexity of dealing with triangular matrices. Regards. Here it is in full: D2Mah2 = function(y, x) { stopifnot(is.data.frame(y), !missing(x)) stopifnot(dim(y)[1] != dim(x)[1]) y = as.matrix(y) x = as.factor(x) man = manova(y ~ x) E = summary(man)$SS[2] #Matrix E S = as.matrix(E$Residuals)/man$df.residual InvS = solve(S) mds = matrix(unlist(by(y, x, mean)), byrow=T, ncol=ncol(y)) library(gtools) nObjects = nrow(mds) ### changed part is next two statements f <- function(a,b) mapply(function(a,b) (mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,]), a,b) D2 <- outer(seq(nObjects), seq(nObjects), f) } # # test # D2M2 = D2Mah2(iris[,1:4], iris[,5]) print(D2M2) On 7/10/05, Jose Claudio Faria <joseclaudio.faria at terra.com.br> wrote:> Well, as I did not get a satisfactory reply to the original question I tried to > make a basic function that, I find, solve the question. > > I think it is not the better function, but it is working. > > So, perhaps it can be useful to other people. > > # > # Calculate the matrix of Mahalanobis Distances between groups > # from data.frames > # > # by: Jos?? Cl??udio Faria > # date: 10/7/05 13:23:48 > # > > D2Mah = function(y, x) { > > stopifnot(is.data.frame(y), !missing(x)) > stopifnot(dim(y)[1] != dim(x)[1]) > y = as.matrix(y) > x = as.factor(x) > man = manova(y ~ x) > E = summary(man)$SS[2] #Matrix E > S = as.matrix(E$Residuals)/man$df.residual > InvS = solve(S) > mds = matrix(unlist(by(y, x, mean)), byrow=T, ncol=ncol(y)) > > colnames(mds) = names(y) > Objects = levels(x) > rownames(mds) = Objects > > library(gtools) > nObjects = nrow(mds) > comb = combinations(nObjects, 2) > > tmpD2 = numeric() > for (i in 1:dim(comb)[1]){ > a = comb[i,1] > b = comb[i,2] > tmpD2[i] = (mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,]) > } > > # Thanks Gabor for the below > tmpMah = matrix(0, nObjects, nObjects, dimnames=list(Objects, Objects)) > tmpMah[lower.tri(tmpMah)] = tmpD2 > D2 = tmpMah + t(tmpMah) > return(D2) > } > > # > # To try > # > D2M = D2Mah(iris[,1:4], iris[,5]) > print(D2M) > > Thanks all for the complementary aid (specially to Gabor). > > Regards, > -- > Jose Claudio Faria > Brasil/Bahia/UESC/DCET > Estatistica Experimental/Prof. Adjunto > mails: > joseclaudio.faria at terra.com.br > jc_faria at uesc.br > jc_faria at uol.com.br > tel: 73-3634.2779 > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html >
And here is one more simplification using the buildin mahalanobis function: D2Mah3 = function(y, x) { stopifnot(is.data.frame(y), !missing(x)) stopifnot(dim(y)[1] != dim(x)[1]) y = as.matrix(y) x = as.factor(x) man = manova(y ~ x) E = summary(man)$SS[2] #Matrix E S = as.matrix(E$Residuals)/man$df.residual InvS = solve(S) mds = matrix(unlist(by(y, x, mean)), byrow=T, ncol=ncol(y)) nObjects = nrow(mds) ### changed part is next two statements f <- function(a,b) mapply(function(a,b) mahalanobis(mds[a,],mds[b,],InvS,TRUE), a, b) D2 <- outer(seq(nObjects), seq(nObjects), f) } # # test # D2M3 = D2Mah3(iris[,1:4], iris[,5]) On 7/10/05, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:> I think you could simplify this by replacing everything after the > nObjects = nrow(mds) line with just these two statements. > > f <- function(a,b) mapply(function(a,b) > (mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,]), a,b) > > D2 <- outer(seq(nObjects), seq(nObjects), f) > > This also eliminates dependence on gtools and the complexity > of dealing with triangular matrices. > > Regards. > > Here it is in full: > > D2Mah2 = function(y, x) { > > stopifnot(is.data.frame(y), !missing(x)) > stopifnot(dim(y)[1] != dim(x)[1]) > y = as.matrix(y) > x = as.factor(x) > man = manova(y ~ x) > E = summary(man)$SS[2] #Matrix E > S = as.matrix(E$Residuals)/man$df.residual > InvS = solve(S) > mds = matrix(unlist(by(y, x, mean)), byrow=T, ncol=ncol(y)) > > library(gtools) > nObjects = nrow(mds) > > ### changed part is next two statements > f <- function(a,b) mapply(function(a,b) > (mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,]), a,b) > > D2 <- outer(seq(nObjects), seq(nObjects), f) > } > > # > # test > # > D2M2 = D2Mah2(iris[,1:4], iris[,5]) > print(D2M2) > > > > > On 7/10/05, Jose Claudio Faria <joseclaudio.faria at terra.com.br> wrote: > > Well, as I did not get a satisfactory reply to the original question I tried to > > make a basic function that, I find, solve the question. > > > > I think it is not the better function, but it is working. > > > > So, perhaps it can be useful to other people. > > > > # > > # Calculate the matrix of Mahalanobis Distances between groups > > # from data.frames > > # > > # by: Jos?? Cl??udio Faria > > # date: 10/7/05 13:23:48 > > # > > > > D2Mah = function(y, x) { > > > > stopifnot(is.data.frame(y), !missing(x)) > > stopifnot(dim(y)[1] != dim(x)[1]) > > y = as.matrix(y) > > x = as.factor(x) > > man = manova(y ~ x) > > E = summary(man)$SS[2] #Matrix E > > S = as.matrix(E$Residuals)/man$df.residual > > InvS = solve(S) > > mds = matrix(unlist(by(y, x, mean)), byrow=T, ncol=ncol(y)) > > > > colnames(mds) = names(y) > > Objects = levels(x) > > rownames(mds) = Objects > > > > library(gtools) > > nObjects = nrow(mds) > > comb = combinations(nObjects, 2) > > > > tmpD2 = numeric() > > for (i in 1:dim(comb)[1]){ > > a = comb[i,1] > > b = comb[i,2] > > tmpD2[i] = (mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,]) > > } > > > > # Thanks Gabor for the below > > tmpMah = matrix(0, nObjects, nObjects, dimnames=list(Objects, Objects)) > > tmpMah[lower.tri(tmpMah)] = tmpD2 > > D2 = tmpMah + t(tmpMah) > > return(D2) > > } > > > > # > > # To try > > # > > D2M = D2Mah(iris[,1:4], iris[,5]) > > print(D2M) > > > > Thanks all for the complementary aid (specially to Gabor). > > > > Regards, > > -- > > Jose Claudio Faria > > Brasil/Bahia/UESC/DCET > > Estatistica Experimental/Prof. Adjunto > > mails: > > joseclaudio.faria at terra.com.br > > jc_faria at uesc.br > > jc_faria at uol.com.br > > tel: 73-3634.2779 > > > > ______________________________________________ > > R-help at stat.math.ethz.ch mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html > > >
Hello, a proposed solution of Bill Venables is archieved on the S-News mailing list: http://www.biostat.wustl.edu/archives/html/s-news/2001-07/msg00035.html and if I remember it correctly (and if the variance matrix is estimated from the data), another similar way is simply to use the Euclidean distance of rescaled scores of a pricipal component analysis, e.g.: data(iris) dat <- iris[1:4] # without the species names z <- svd(scale(dat, scale=FALSE))$u cl <- hclust(dist(z), method="ward") plot(cl, labels=iris$Species) #### or alternatively: #### pc <- princomp(dat, cor=FALSE) pcdata <- as.data.frame(scale(pc$scores)) cl <- hclust(dist(pcdata), method="ward") plot(cl, labels=iris$Species) Hope it helps! Thomas P.