thr3ads.net - R help - [R] Help with Mahalanobis [Jul 2005]

If this information is useful, please help other people find it:
Share via:

Jose Claudio Faria

2005-Jul-08 17:57 UTC

[R] Help with Mahalanobis

Dear R list,

I'm trying to calculate Mahalanobis distances for 'Species' of
'iris' data
as obtained below:

Squared Distance to Species From Species:

               Setosa Versicolor Virginica
Setosa 	           0   89.86419 179.38471
Versicolor  89.86419          0  17.20107
Virginica  179.38471   17.20107         0

These distances were obtained with proc 'CANDISC' of SAS, please,
see Output 21.1.2: Iris Data: Squared Mahalanobis Distances from
http://www.id.unizh.ch/software/unix/statmath/sas/sasdoc/stat/chap21/sect19.htm

 From these distances my intention is to make a cluster analysis as below, using
the package 'mclust':

In prior mail, my basic question was: how to obtain this matrix with R
from 'iris' data?

Well, I think that the basic soluction to calculate this distances is:

#
# --- Begin R script 1 ---
#
x   = as.matrix(iris[,1:4])
tra = iris[,5]

man = manova(x ~ tra)

# Mahalanobis
E    = summary(man)$SS[2] #Matrix E
S    = as.matrix(E$Residuals)/man$df.residual
InvS = solve(S)
ms = matrix(unlist(by(x, tra, mean)), byrow=T, ncol=ncol(x))
colnames(ms) = names(iris[1:4])
rownames(ms) = c('Set', 'Ver', 'Vir')
D2.12 = (ms[1,] - ms[2,])%*%InvS%*%(ms[1,] - ms[2,])
print(D2.12)
D2.13 = (ms[1,] - ms[3,])%*%InvS%*%(ms[1,] - ms[3,])
print(D2.13)
D2.23 = (ms[2,] - ms[3,])%*%InvS%*%(ms[2,] - ms[3,])
print(D2.23)
#
# --- End R script 1 ---
#

Well, I would like to generalize a soluction to obtain
the matrices like 'Mah' (below) or a complete matrix like in the
Output 21.1.2. Somebody could help me?

#
# --- Begin R script 2 ---
#

Mah = c(        0,
          89.86419,        0,
         179.38471, 17.20107, 0)

n = 3
D = matrix(0, n, n)

nam = c('Set', 'Ver', 'Vir')
rownames(D) = nam
colnames(D) = nam

k = 0
for (i in 1:n) {
    for (j in 1:i) {
       k      = k+1
       D[i,j] = Mah[k]
       D[j,i] = Mah[k]
    }
}

D=sqrt(D) #D2 -> D

library(mclust)
dendroS = hclust(as.dist(D), method='single')
dendroC = hclust(as.dist(D), method='complete')

win.graph(w = 3.5, h = 6)
split.screen(c(2, 1))
screen(1)
plot(dendroS, main='Single', sub='', xlab='',
ylab='', col='blue')

screen(2)
plot(dendroC, main='Complete', sub='', xlab='',
col='red')
#
# --- End R script 2 ---
#

I always need of this type of analysis and I'm not founding how to make it
in
the CRAN documentation (Archives, packages: mclust, cluster, fpc and mva).

Regards,
--
Jose Claudio Faria
Brasil/Bahia/UESC/DCET
Estatistica Experimental/Prof. Adjunto
mails:
  joseclaudio.faria at terra.com.br

Christian Hennig

2005-Jul-08 18:14 UTC

head link

[R] Help with Mahalanobis

Dear Jose,

normal mixture clustering (mclust) operates on points times variables data
and not on a distance matrix. Therefore
it doesn't make sense to compute Mahalanobis distances before using
mclust.
Furthermore, cluster analysis based on distance matrices (hclust or pam,
say) operates on a point by point distance matrix (be it Mahalanobis or
something else). You show a group by group matrix below, for which I don't
see any purpose in cluster analysis.
Have you looked at function mahalanobis?

Christian


On Fri, 8 Jul 2005, Jose Claudio Faria wrote:
> Dear R list,
>
> I'm trying to calculate Mahalanobis distances for 'Species' of
'iris' data
> as obtained below:
>
> Squared Distance to Species From Species:
>
>                Setosa Versicolor Virginica
> Setosa 	           0   89.86419 179.38471
> Versicolor  89.86419          0  17.20107
> Virginica  179.38471   17.20107         0
>
> These distances were obtained with proc 'CANDISC' of SAS, please,
> see Output 21.1.2: Iris Data: Squared Mahalanobis Distances from
>
http://www.id.unizh.ch/software/unix/statmath/sas/sasdoc/stat/chap21/sect19.htm
>
>  From these distances my intention is to make a cluster analysis as below,
using
> the package 'mclust':
>
> In prior mail, my basic question was: how to obtain this matrix with R
> from 'iris' data?
>
> Well, I think that the basic soluction to calculate this distances is:
>
> #
> # --- Begin R script 1 ---
> #
> x   = as.matrix(iris[,1:4])
> tra = iris[,5]
>
> man = manova(x ~ tra)
>
> # Mahalanobis
> E    = summary(man)$SS[2] #Matrix E
> S    = as.matrix(E$Residuals)/man$df.residual
> InvS = solve(S)
> ms = matrix(unlist(by(x, tra, mean)), byrow=T, ncol=ncol(x))
> colnames(ms) = names(iris[1:4])
> rownames(ms) = c('Set', 'Ver', 'Vir')
> D2.12 = (ms[1,] - ms[2,])%*%InvS%*%(ms[1,] - ms[2,])
> print(D2.12)
> D2.13 = (ms[1,] - ms[3,])%*%InvS%*%(ms[1,] - ms[3,])
> print(D2.13)
> D2.23 = (ms[2,] - ms[3,])%*%InvS%*%(ms[2,] - ms[3,])
> print(D2.23)
> #
> # --- End R script 1 ---
> #
>
> Well, I would like to generalize a soluction to obtain
> the matrices like 'Mah' (below) or a complete matrix like in the
> Output 21.1.2. Somebody could help me?
>
> #
> # --- Begin R script 2 ---
> #
>
> Mah = c(        0,
>           89.86419,        0,
>          179.38471, 17.20107, 0)
>
> n = 3
> D = matrix(0, n, n)
>
> nam = c('Set', 'Ver', 'Vir')
> rownames(D) = nam
> colnames(D) = nam
>
> k = 0
> for (i in 1:n) {
>     for (j in 1:i) {
>        k      = k+1
>        D[i,j] = Mah[k]
>        D[j,i] = Mah[k]
>     }
> }
>
> D=sqrt(D) #D2 -> D
>
> library(mclust)
> dendroS = hclust(as.dist(D), method='single')
> dendroC = hclust(as.dist(D), method='complete')
>
> win.graph(w = 3.5, h = 6)
> split.screen(c(2, 1))
> screen(1)
> plot(dendroS, main='Single', sub='', xlab='',
ylab='', col='blue')
>
> screen(2)
> plot(dendroC, main='Complete', sub='', xlab='',
col='red')
> #
> # --- End R script 2 ---
> #
>
> I always need of this type of analysis and I'm not founding how to make
it in
> the CRAN documentation (Archives, packages: mclust, cluster, fpc and mva).
>
> Regards,
> --
> Jose Claudio Faria
> Brasil/Bahia/UESC/DCET
> Estatistica Experimental/Prof. Adjunto
> mails:
>   joseclaudio.faria at terra.com.br
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
>
*** NEW ADDRESS! ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche

Jose Claudio Faria

2005-Jul-10 16:24 UTC

head link

[R] Help with Mahalanobis

Well, as I did not get a satisfactory reply to the original question I tried to 
make a basic function that, I find, solve the question.

I think it is not the better function, but it is working.

So, perhaps it can be useful to other people.

#
# Calculate the matrix of Mahalanobis Distances between groups
# from data.frames
#
# by: Jos?? Cl??udio Faria
# date: 10/7/05 13:23:48
#

D2Mah = function(y, x) {

   stopifnot(is.data.frame(y), !missing(x))
   stopifnot(dim(y)[1] != dim(x)[1])
   y    = as.matrix(y)
   x    = as.factor(x)
   man  = manova(y ~ x)
   E    = summary(man)$SS[2] #Matrix E
   S    = as.matrix(E$Residuals)/man$df.residual
   InvS = solve(S)
   mds  = matrix(unlist(by(y, x, mean)), byrow=T, ncol=ncol(y))

   colnames(mds) = names(y)
   Objects       = levels(x)
   rownames(mds) = Objects

   library(gtools)
   nObjects = nrow(mds)
   comb     = combinations(nObjects, 2)

   tmpD2 = numeric()
   for (i in 1:dim(comb)[1]){
     a = comb[i,1]
     b = comb[i,2]
     tmpD2[i] = (mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,])
   }

   # Thanks Gabor for the below
   tmpMah = matrix(0, nObjects, nObjects, dimnames=list(Objects, Objects))
   tmpMah[lower.tri(tmpMah)] = tmpD2
   D2 = tmpMah + t(tmpMah)
   return(D2)
}

#
# To try
#
D2M = D2Mah(iris[,1:4], iris[,5])
print(D2M)

Thanks all for the complementary aid (specially to Gabor).

Regards,
-- 
Jose Claudio Faria
Brasil/Bahia/UESC/DCET
Estatistica Experimental/Prof. Adjunto
mails:
  joseclaudio.faria at terra.com.br
  jc_faria at uesc.br
  jc_faria at uol.com.br
tel: 73-3634.2779

Gabor Grothendieck

2005-Jul-10 17:11 UTC

head link

[R] Help with Mahalanobis

I think you could simplify this by replacing everything after the
nObjects = nrow(mds) line with just these two statements.

  f <- function(a,b) mapply(function(a,b)
    (mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,]), a,b)

  D2 <- outer(seq(nObjects), seq(nObjects), f)

This also eliminates dependence on gtools and the complexity
of dealing with triangular matrices.

Regards.

Here it is in full:

D2Mah2 = function(y, x) {

  stopifnot(is.data.frame(y), !missing(x))
  stopifnot(dim(y)[1] != dim(x)[1])
  y    = as.matrix(y)
  x    = as.factor(x)
  man  = manova(y ~ x)
  E    = summary(man)$SS[2] #Matrix E
  S    = as.matrix(E$Residuals)/man$df.residual
  InvS = solve(S)
  mds  = matrix(unlist(by(y, x, mean)), byrow=T, ncol=ncol(y))

  library(gtools)
  nObjects = nrow(mds)

  ### changed part is next two statements
  f <- function(a,b) mapply(function(a,b)
    (mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,]), a,b)

  D2 <- outer(seq(nObjects), seq(nObjects), f)
}

#
# test
#
D2M2 = D2Mah2(iris[,1:4], iris[,5])
print(D2M2)




On 7/10/05, Jose Claudio Faria <joseclaudio.faria at terra.com.br>
wrote:> Well, as I did not get a satisfactory reply to the original question I
tried to
> make a basic function that, I find, solve the question.
> 
> I think it is not the better function, but it is working.
> 
> So, perhaps it can be useful to other people.
> 
> #
> # Calculate the matrix of Mahalanobis Distances between groups
> # from data.frames
> #
> # by: Jos?? Cl??udio Faria
> # date: 10/7/05 13:23:48
> #
> 
> D2Mah = function(y, x) {
> 
>   stopifnot(is.data.frame(y), !missing(x))
>   stopifnot(dim(y)[1] != dim(x)[1])
>   y    = as.matrix(y)
>   x    = as.factor(x)
>   man  = manova(y ~ x)
>   E    = summary(man)$SS[2] #Matrix E
>   S    = as.matrix(E$Residuals)/man$df.residual
>   InvS = solve(S)
>   mds  = matrix(unlist(by(y, x, mean)), byrow=T, ncol=ncol(y))
> 
>   colnames(mds) = names(y)
>   Objects       = levels(x)
>   rownames(mds) = Objects
> 
>   library(gtools)
>   nObjects = nrow(mds)
>   comb     = combinations(nObjects, 2)
> 
>   tmpD2 = numeric()
>   for (i in 1:dim(comb)[1]){
>     a = comb[i,1]
>     b = comb[i,2]
>     tmpD2[i] = (mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,])
>   }
> 
>   # Thanks Gabor for the below
>   tmpMah = matrix(0, nObjects, nObjects, dimnames=list(Objects, Objects))
>   tmpMah[lower.tri(tmpMah)] = tmpD2
>   D2 = tmpMah + t(tmpMah)
>   return(D2)
> }
> 
> #
> # To try
> #
> D2M = D2Mah(iris[,1:4], iris[,5])
> print(D2M)
> 
> Thanks all for the complementary aid (specially to Gabor).
> 
> Regards,
> --
> Jose Claudio Faria
> Brasil/Bahia/UESC/DCET
> Estatistica Experimental/Prof. Adjunto
> mails:
>  joseclaudio.faria at terra.com.br
>  jc_faria at uesc.br
>  jc_faria at uol.com.br
> tel: 73-3634.2779
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
>

Gabor Grothendieck

2005-Jul-10 23:08 UTC

head link

[R] Help with Mahalanobis

And here is one more simplification using the buildin mahalanobis
function:

D2Mah3 = function(y, x) {

 stopifnot(is.data.frame(y), !missing(x))
 stopifnot(dim(y)[1] != dim(x)[1])
 y    = as.matrix(y)
 x    = as.factor(x)
 man  = manova(y ~ x)
 E    = summary(man)$SS[2] #Matrix E
 S    = as.matrix(E$Residuals)/man$df.residual
 InvS = solve(S)
 mds  = matrix(unlist(by(y, x, mean)), byrow=T, ncol=ncol(y))

 nObjects = nrow(mds)

 ### changed part is next two statements
 f <- function(a,b) mapply(function(a,b)
   mahalanobis(mds[a,],mds[b,],InvS,TRUE), a, b)

 D2 <- outer(seq(nObjects), seq(nObjects), f)
}

#
# test
#
D2M3 = D2Mah3(iris[,1:4], iris[,5])


On 7/10/05, Gabor Grothendieck <ggrothendieck at gmail.com>
wrote:> I think you could simplify this by replacing everything after the
> nObjects = nrow(mds) line with just these two statements.
> 
>  f <- function(a,b) mapply(function(a,b)
>    (mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,]), a,b)
> 
>  D2 <- outer(seq(nObjects), seq(nObjects), f)
> 
> This also eliminates dependence on gtools and the complexity
> of dealing with triangular matrices.
> 
> Regards.
> 
> Here it is in full:
> 
> D2Mah2 = function(y, x) {
> 
>  stopifnot(is.data.frame(y), !missing(x))
>  stopifnot(dim(y)[1] != dim(x)[1])
>  y    = as.matrix(y)
>  x    = as.factor(x)
>  man  = manova(y ~ x)
>  E    = summary(man)$SS[2] #Matrix E
>  S    = as.matrix(E$Residuals)/man$df.residual
>  InvS = solve(S)
>  mds  = matrix(unlist(by(y, x, mean)), byrow=T, ncol=ncol(y))
> 
>  library(gtools)
>  nObjects = nrow(mds)
> 
>  ### changed part is next two statements
>  f <- function(a,b) mapply(function(a,b)
>    (mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,]), a,b)
> 
>  D2 <- outer(seq(nObjects), seq(nObjects), f)
> }
> 
> #
> # test
> #
> D2M2 = D2Mah2(iris[,1:4], iris[,5])
> print(D2M2)
> 
> 
> 
> 
> On 7/10/05, Jose Claudio Faria <joseclaudio.faria at terra.com.br>
wrote:
> > Well, as I did not get a satisfactory reply to the original question I
tried to
> > make a basic function that, I find, solve the question.
> >
> > I think it is not the better function, but it is working.
> >
> > So, perhaps it can be useful to other people.
> >
> > #
> > # Calculate the matrix of Mahalanobis Distances between groups
> > # from data.frames
> > #
> > # by: Jos?? Cl??udio Faria
> > # date: 10/7/05 13:23:48
> > #
> >
> > D2Mah = function(y, x) {
> >
> >   stopifnot(is.data.frame(y), !missing(x))
> >   stopifnot(dim(y)[1] != dim(x)[1])
> >   y    = as.matrix(y)
> >   x    = as.factor(x)
> >   man  = manova(y ~ x)
> >   E    = summary(man)$SS[2] #Matrix E
> >   S    = as.matrix(E$Residuals)/man$df.residual
> >   InvS = solve(S)
> >   mds  = matrix(unlist(by(y, x, mean)), byrow=T, ncol=ncol(y))
> >
> >   colnames(mds) = names(y)
> >   Objects       = levels(x)
> >   rownames(mds) = Objects
> >
> >   library(gtools)
> >   nObjects = nrow(mds)
> >   comb     = combinations(nObjects, 2)
> >
> >   tmpD2 = numeric()
> >   for (i in 1:dim(comb)[1]){
> >     a = comb[i,1]
> >     b = comb[i,2]
> >     tmpD2[i] = (mds[a,] - mds[b,])%*%InvS%*%(mds[a,] - mds[b,])
> >   }
> >
> >   # Thanks Gabor for the below
> >   tmpMah = matrix(0, nObjects, nObjects, dimnames=list(Objects,
Objects))
> >   tmpMah[lower.tri(tmpMah)] = tmpD2
> >   D2 = tmpMah + t(tmpMah)
> >   return(D2)
> > }
> >
> > #
> > # To try
> > #
> > D2M = D2Mah(iris[,1:4], iris[,5])
> > print(D2M)
> >
> > Thanks all for the complementary aid (specially to Gabor).
> >
> > Regards,
> > --
> > Jose Claudio Faria
> > Brasil/Bahia/UESC/DCET
> > Estatistica Experimental/Prof. Adjunto
> > mails:
> >  joseclaudio.faria at terra.com.br
> >  jc_faria at uesc.br
> >  jc_faria at uol.com.br
> > tel: 73-3634.2779
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
> >
>

Thomas Petzoldt

2005-Jul-13 17:11 UTC

head link

[R] Help with Mahalanobis

Hello,

a proposed solution of Bill Venables is archieved on the S-News mailing
list:

http://www.biostat.wustl.edu/archives/html/s-news/2001-07/msg00035.html

and if I remember it correctly (and if the variance matrix is estimated
from the data), another similar way is simply to use the Euclidean
distance of rescaled scores of a pricipal component analysis, e.g.:

data(iris)

dat <- iris[1:4] # without the species names

z <- svd(scale(dat, scale=FALSE))$u
cl <- hclust(dist(z), method="ward")
plot(cl, labels=iris$Species)

#### or alternatively: ####

pc <- princomp(dat, cor=FALSE)

pcdata <- as.data.frame(scale(pc$scores))
cl <- hclust(dist(pcdata), method="ward")
plot(cl, labels=iris$Species)


Hope it helps!

Thomas P.

Possibly Parallel Threads

Search for more apparently analagous threads

R help - Jul 2005 - Help with Mahalanobis

[R] Help with Mahalanobis

[R] Help with Mahalanobis

[R] Help with Mahalanobis

[R] Help with Mahalanobis

[R] Help with Mahalanobis

[R] Help with Mahalanobis

Possibly Parallel Threads