Dear R list,
I'm trying to calculate Mahalanobis distances for 'Species' of
'iris' data
as obtained below:
Squared Distance to Species From Species:
Setosa Versicolor Virginica
Setosa 0 89.86419 179.38471
Versicolor 89.86419 0 17.20107
Virginica 179.38471 17.20107 0
These distances were obtained with proc 'CANDISC' of SAS, please,
see Output 21.1.2: Iris Data: Squared Mahalanobis Distances from
http://www.id.unizh.ch/software/unix/statmath/sas/sasdoc/stat/chap21/sect19.htm
From these distances my intention is to make a cluster analysis as below, using
the package 'mclust':
In prior mail, my basic question was: how to obtain this matrix with R
from 'iris' data?
Well, I think that the basic soluction to calculate this distances is:
#
# --- Begin R script 1 ---
#
x = as.matrix(iris[,1:4])
tra = iris[,5]
man = manova(x ~ tra)
# Mahalanobis
E = summary(man)$SS[2] #Matrix E
S = as.matrix(E$Residuals)/man$df.residual
InvS = solve(S)
ms = matrix(unlist(by(x, tra, mean)), byrow=T, ncol=ncol(x))
colnames(ms) = names(iris[1:4])
rownames(ms) = c('Set', 'Ver', 'Vir')
D2.12 = (ms[1,] - ms[2,])%*%InvS%*%(ms[1,] - ms[2,])
print(D2.12)
D2.13 = (ms[1,] - ms[3,])%*%InvS%*%(ms[1,] - ms[3,])
print(D2.13)
D2.23 = (ms[2,] - ms[3,])%*%InvS%*%(ms[2,] - ms[3,])
print(D2.23)
#
# --- End R script 1 ---
#
Well, I would like to generalize a soluction to obtain
the matrices like 'Mah' (below) or a complete matrix like in the
Output 21.1.2. Somebody could help me?
#
# --- Begin R script 2 ---
#
Mah = c( 0,
89.86419, 0,
179.38471, 17.20107, 0)
n = 3
D = matrix(0, n, n)
nam = c('Set', 'Ver', 'Vir')
rownames(D) = nam
colnames(D) = nam
k = 0
for (i in 1:n) {
for (j in 1:i) {
k = k+1
D[i,j] = Mah[k]
D[j,i] = Mah[k]
}
}
D=sqrt(D) #D2 -> D
library(mclust)
dendroS = hclust(as.dist(D), method='single')
dendroC = hclust(as.dist(D), method='complete')
win.graph(w = 3.5, h = 6)
split.screen(c(2, 1))
screen(1)
plot(dendroS, main='Single', sub='', xlab='',
ylab='', col='blue')
screen(2)
plot(dendroC, main='Complete', sub='', xlab='',
col='red')
#
# --- End R script 2 ---
#
I always need of this type of analysis and I'm not founding how to make it
in
the CRAN documentation (Archives, packages: mclust, cluster, fpc and mva).
Regards,
--
Jose Claudio Faria
Brasil/Bahia/UESC/DCET
Estatistica Experimental/Prof. Adjunto
mails:
joseclaudio.faria at terra.com.br
Dear Jose, normal mixture clustering (mclust) operates on points times variables data and not on a distance matrix. Therefore it doesn't make sense to compute Mahalanobis distances before using mclust. Furthermore, cluster analysis based on distance matrices (hclust or pam, say) operates on a point by point distance matrix (be it Mahalanobis or something else). You show a group by group matrix below, for which I don't see any purpose in cluster analysis. Have you looked at function mahalanobis? Christian On Fri, 8 Jul 2005, Jose Claudio Faria wrote:> Dear R list, > > I'm trying to calculate Mahalanobis distances for 'Species' of 'iris' data > as obtained below: > > Squared Distance to Species From Species: > > Setosa Versicolor Virginica > Setosa 0 89.86419 179.38471 > Versicolor 89.86419 0 17.20107 > Virginica 179.38471 17.20107 0 > > These distances were obtained with proc 'CANDISC' of SAS, please, > see Output 21.1.2: Iris Data: Squared Mahalanobis Distances from > http://www.id.unizh.ch/software/unix/statmath/sas/sasdoc/stat/chap21/sect19.htm > > From these distances my intention is to make a cluster analysis as below, using > the package 'mclust': > > In prior mail, my basic question was: how to obtain this matrix with R > from 'iris' data? > > Well, I think that the basic soluction to calculate this distances is: > > # > # --- Begin R script 1 --- > # > x = as.matrix(iris[,1:4]) > tra = iris[,5] > > man = manova(x ~ tra) > > # Mahalanobis > E = summary(man)$SS[2] #Matrix E > S = as.matrix(E$Residuals)/man$df.residual > InvS = solve(S) > ms = matrix(unlist(by(x, tra, mean)), byrow=T, ncol=ncol(x)) > colnames(ms) = names(iris[1:4]) > rownames(ms) = c('Set', 'Ver', 'Vir') > D2.12 = (ms[1,] - ms[2,])%*%InvS%*%(ms[1,] - ms[2,]) > print(D2.12) > D2.13 = (ms[1,] - ms[3,])%*%InvS%*%(ms[1,] - ms[3,]) > print(D2.13) > D2.23 = (ms[2,] - ms[3,])%*%InvS%*%(ms[2,] - ms[3,]) > print(D2.23) > # > # --- End R script 1 --- > # > > Well, I would like to generalize a soluction to obtain > the matrices like 'Mah' (below) or a complete matrix like in the > Output 21.1.2. Somebody could help me? > > # > # --- Begin R script 2 --- > # > > Mah = c( 0, > 89.86419, 0, > 179.38471, 17.20107, 0) > > n = 3 > D = matrix(0, n, n) > > nam = c('Set', 'Ver', 'Vir') > rownames(D) = nam > colnames(D) = nam > > k = 0 > for (i in 1:n) { > for (j in 1:i) { > k = k+1 > D[i,j] = Mah[k] > D[j,i] = Mah[k] > } > } > > D=sqrt(D) #D2 -> D > > library(mclust) > dendroS = hclust(as.dist(D), method='single') > dendroC = hclust(as.dist(D), method='complete') > > win.graph(w = 3.5, h = 6) > split.screen(c(2, 1)) > screen(1) > plot(dendroS, main='Single', sub='', xlab='', ylab='', col='blue') > > screen(2) > plot(dendroC, main='Complete', sub='', xlab='', col='red') > # > # --- End R script 2 --- > # > > I always need of this type of analysis and I'm not founding how to make it in > the CRAN documentation (Archives, packages: mclust, cluster, fpc and mva). > > Regards, > -- > Jose Claudio Faria > Brasil/Bahia/UESC/DCET > Estatistica Experimental/Prof. Adjunto > mails: > joseclaudio.faria at terra.com.br > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html >*** NEW ADDRESS! *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
Well, as I did not get a satisfactory reply to the original question I tried to
make a basic function that, I find, solve the question.
I think it is not the better function, but it is working.
So, perhaps it can be useful to other people.
#
# Calculate the matrix of Mahalanobis Distances between groups
# from data.frames
#
# by: Jos?? Cl??udio Faria
# date: 10/7/05 13:23:48
#
D2Mah = function(y, x) {
stopifnot(is.data.frame(y), !missing(x))
stopifnot(dim(y)[1] != dim(x)[1])
y = as.matrix(y)
x = as.factor(x)
man = manova(y ~ x)
E = summary(man)$SS[2] #Matrix E
S = as.matrix(E$Residuals)/man$df.residual
InvS = solve(S)
mds = matrix(unlist(by(y, x, mean)), byrow=T, ncol=ncol(y))
colnames(mds) = names(y)
Objects = levels(x)
rownames(mds) = Objects
library(gtools)
nObjects = nrow(mds)
comb = combinations(nObjects, 2)
tmpD2 = numeric()
for (i in 1:dim(comb)[1]){
a = comb[i,1]
b = comb[i,2]
tmpD2[i] = (mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,])
}
# Thanks Gabor for the below
tmpMah = matrix(0, nObjects, nObjects, dimnames=list(Objects, Objects))
tmpMah[lower.tri(tmpMah)] = tmpD2
D2 = tmpMah + t(tmpMah)
return(D2)
}
#
# To try
#
D2M = D2Mah(iris[,1:4], iris[,5])
print(D2M)
Thanks all for the complementary aid (specially to Gabor).
Regards,
--
Jose Claudio Faria
Brasil/Bahia/UESC/DCET
Estatistica Experimental/Prof. Adjunto
mails:
joseclaudio.faria at terra.com.br
jc_faria at uesc.br
jc_faria at uol.com.br
tel: 73-3634.2779
I think you could simplify this by replacing everything after the
nObjects = nrow(mds) line with just these two statements.
f <- function(a,b) mapply(function(a,b)
(mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,]), a,b)
D2 <- outer(seq(nObjects), seq(nObjects), f)
This also eliminates dependence on gtools and the complexity
of dealing with triangular matrices.
Regards.
Here it is in full:
D2Mah2 = function(y, x) {
stopifnot(is.data.frame(y), !missing(x))
stopifnot(dim(y)[1] != dim(x)[1])
y = as.matrix(y)
x = as.factor(x)
man = manova(y ~ x)
E = summary(man)$SS[2] #Matrix E
S = as.matrix(E$Residuals)/man$df.residual
InvS = solve(S)
mds = matrix(unlist(by(y, x, mean)), byrow=T, ncol=ncol(y))
library(gtools)
nObjects = nrow(mds)
### changed part is next two statements
f <- function(a,b) mapply(function(a,b)
(mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,]), a,b)
D2 <- outer(seq(nObjects), seq(nObjects), f)
}
#
# test
#
D2M2 = D2Mah2(iris[,1:4], iris[,5])
print(D2M2)
On 7/10/05, Jose Claudio Faria <joseclaudio.faria at terra.com.br>
wrote:> Well, as I did not get a satisfactory reply to the original question I
tried to
> make a basic function that, I find, solve the question.
>
> I think it is not the better function, but it is working.
>
> So, perhaps it can be useful to other people.
>
> #
> # Calculate the matrix of Mahalanobis Distances between groups
> # from data.frames
> #
> # by: Jos?? Cl??udio Faria
> # date: 10/7/05 13:23:48
> #
>
> D2Mah = function(y, x) {
>
> stopifnot(is.data.frame(y), !missing(x))
> stopifnot(dim(y)[1] != dim(x)[1])
> y = as.matrix(y)
> x = as.factor(x)
> man = manova(y ~ x)
> E = summary(man)$SS[2] #Matrix E
> S = as.matrix(E$Residuals)/man$df.residual
> InvS = solve(S)
> mds = matrix(unlist(by(y, x, mean)), byrow=T, ncol=ncol(y))
>
> colnames(mds) = names(y)
> Objects = levels(x)
> rownames(mds) = Objects
>
> library(gtools)
> nObjects = nrow(mds)
> comb = combinations(nObjects, 2)
>
> tmpD2 = numeric()
> for (i in 1:dim(comb)[1]){
> a = comb[i,1]
> b = comb[i,2]
> tmpD2[i] = (mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,])
> }
>
> # Thanks Gabor for the below
> tmpMah = matrix(0, nObjects, nObjects, dimnames=list(Objects, Objects))
> tmpMah[lower.tri(tmpMah)] = tmpD2
> D2 = tmpMah + t(tmpMah)
> return(D2)
> }
>
> #
> # To try
> #
> D2M = D2Mah(iris[,1:4], iris[,5])
> print(D2M)
>
> Thanks all for the complementary aid (specially to Gabor).
>
> Regards,
> --
> Jose Claudio Faria
> Brasil/Bahia/UESC/DCET
> Estatistica Experimental/Prof. Adjunto
> mails:
> joseclaudio.faria at terra.com.br
> jc_faria at uesc.br
> jc_faria at uol.com.br
> tel: 73-3634.2779
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
>
And here is one more simplification using the buildin mahalanobis
function:
D2Mah3 = function(y, x) {
stopifnot(is.data.frame(y), !missing(x))
stopifnot(dim(y)[1] != dim(x)[1])
y = as.matrix(y)
x = as.factor(x)
man = manova(y ~ x)
E = summary(man)$SS[2] #Matrix E
S = as.matrix(E$Residuals)/man$df.residual
InvS = solve(S)
mds = matrix(unlist(by(y, x, mean)), byrow=T, ncol=ncol(y))
nObjects = nrow(mds)
### changed part is next two statements
f <- function(a,b) mapply(function(a,b)
mahalanobis(mds[a,],mds[b,],InvS,TRUE), a, b)
D2 <- outer(seq(nObjects), seq(nObjects), f)
}
#
# test
#
D2M3 = D2Mah3(iris[,1:4], iris[,5])
On 7/10/05, Gabor Grothendieck <ggrothendieck at gmail.com>
wrote:> I think you could simplify this by replacing everything after the
> nObjects = nrow(mds) line with just these two statements.
>
> f <- function(a,b) mapply(function(a,b)
> (mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,]), a,b)
>
> D2 <- outer(seq(nObjects), seq(nObjects), f)
>
> This also eliminates dependence on gtools and the complexity
> of dealing with triangular matrices.
>
> Regards.
>
> Here it is in full:
>
> D2Mah2 = function(y, x) {
>
> stopifnot(is.data.frame(y), !missing(x))
> stopifnot(dim(y)[1] != dim(x)[1])
> y = as.matrix(y)
> x = as.factor(x)
> man = manova(y ~ x)
> E = summary(man)$SS[2] #Matrix E
> S = as.matrix(E$Residuals)/man$df.residual
> InvS = solve(S)
> mds = matrix(unlist(by(y, x, mean)), byrow=T, ncol=ncol(y))
>
> library(gtools)
> nObjects = nrow(mds)
>
> ### changed part is next two statements
> f <- function(a,b) mapply(function(a,b)
> (mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,]), a,b)
>
> D2 <- outer(seq(nObjects), seq(nObjects), f)
> }
>
> #
> # test
> #
> D2M2 = D2Mah2(iris[,1:4], iris[,5])
> print(D2M2)
>
>
>
>
> On 7/10/05, Jose Claudio Faria <joseclaudio.faria at terra.com.br>
wrote:
> > Well, as I did not get a satisfactory reply to the original question I
tried to
> > make a basic function that, I find, solve the question.
> >
> > I think it is not the better function, but it is working.
> >
> > So, perhaps it can be useful to other people.
> >
> > #
> > # Calculate the matrix of Mahalanobis Distances between groups
> > # from data.frames
> > #
> > # by: Jos?? Cl??udio Faria
> > # date: 10/7/05 13:23:48
> > #
> >
> > D2Mah = function(y, x) {
> >
> > stopifnot(is.data.frame(y), !missing(x))
> > stopifnot(dim(y)[1] != dim(x)[1])
> > y = as.matrix(y)
> > x = as.factor(x)
> > man = manova(y ~ x)
> > E = summary(man)$SS[2] #Matrix E
> > S = as.matrix(E$Residuals)/man$df.residual
> > InvS = solve(S)
> > mds = matrix(unlist(by(y, x, mean)), byrow=T, ncol=ncol(y))
> >
> > colnames(mds) = names(y)
> > Objects = levels(x)
> > rownames(mds) = Objects
> >
> > library(gtools)
> > nObjects = nrow(mds)
> > comb = combinations(nObjects, 2)
> >
> > tmpD2 = numeric()
> > for (i in 1:dim(comb)[1]){
> > a = comb[i,1]
> > b = comb[i,2]
> > tmpD2[i] = (mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,])
> > }
> >
> > # Thanks Gabor for the below
> > tmpMah = matrix(0, nObjects, nObjects, dimnames=list(Objects,
Objects))
> > tmpMah[lower.tri(tmpMah)] = tmpD2
> > D2 = tmpMah + t(tmpMah)
> > return(D2)
> > }
> >
> > #
> > # To try
> > #
> > D2M = D2Mah(iris[,1:4], iris[,5])
> > print(D2M)
> >
> > Thanks all for the complementary aid (specially to Gabor).
> >
> > Regards,
> > --
> > Jose Claudio Faria
> > Brasil/Bahia/UESC/DCET
> > Estatistica Experimental/Prof. Adjunto
> > mails:
> > joseclaudio.faria at terra.com.br
> > jc_faria at uesc.br
> > jc_faria at uol.com.br
> > tel: 73-3634.2779
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
> >
>
Hello, a proposed solution of Bill Venables is archieved on the S-News mailing list: http://www.biostat.wustl.edu/archives/html/s-news/2001-07/msg00035.html and if I remember it correctly (and if the variance matrix is estimated from the data), another similar way is simply to use the Euclidean distance of rescaled scores of a pricipal component analysis, e.g.: data(iris) dat <- iris[1:4] # without the species names z <- svd(scale(dat, scale=FALSE))$u cl <- hclust(dist(z), method="ward") plot(cl, labels=iris$Species) #### or alternatively: #### pc <- princomp(dat, cor=FALSE) pcdata <- as.data.frame(scale(pc$scores)) cl <- hclust(dist(pcdata), method="ward") plot(cl, labels=iris$Species) Hope it helps! Thomas P.