thr3ads.net - R help - [R] Similarity matrix [Apr 2001]

Frank E Harrell Jr

2001-Apr-10 20:55 UTC

[R] Similarity matrix

I frequently use hclust on a similarity matrix.  In R only a
distance matrix is allowed.  Is there a simple reliable
transformation of a similarity matrix that will result
in a distance matrix making hclust work the same as
S-Plus with a similarity matrix?  Venables & Ripley 3rd
edition implies that a simple reversal of values
will suffice.  Thanks -Frank
-- 
Frank E Harrell Jr              Prof. of Biostatistics & Statistics
Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
U. Virginia School of Medicine  hesweb1.med.virginia.edu/biostat
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at
stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

Prof Brian D Ripley

2001-Apr-11 07:04 UTC

[R] Similarity matrix

On Tue, 10 Apr 2001, Frank E Harrell Jr wrote:
> I frequently use hclust on a similarity matrix.  In R only a
> distance matrix is allowed.  Is there a simple reliable
> transformation of a similarity matrix that will result
> in a distance matrix making hclust work the same as
> S-Plus with a similarity matrix?  Venables & Ripley 3rd

We'd have to know how S-PLUS works!

> edition implies that a simple reversal of values
> will suffice.

Not quite.  We say the scale is reversed, but not that it is linearly
reversed, because I don't know.  Of course it only matters for
average-link clustering (in hclust).

The usual way to do this is to scale similarities to [0, 1] and take
D = sqrt(1-S) I believe, but I don't know why.
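
One way to see why the choice of transformation can matter for average
linkage but not for single or complete linkage: the latter two depend only
on the rank order of the dissimilarities, which any monotone transformation
preserves. The following is a minimal R sketch, assuming a made-up random
similarity matrix `S` (not data from this thread); `same_merges` is a
hypothetical helper name.

```r
set.seed(1)
n <- 10

# Made-up symmetric similarity matrix with entries in [0, 1] and unit diagonal
S <- matrix(0, n, n)
S[lower.tri(S)] <- runif(n * (n - 1) / 2)
S <- S + t(S)
diag(S) <- 1

d_lin  <- as.dist(1 - S)        # D = 1 - S
d_sqrt <- as.dist(sqrt(1 - S))  # D = sqrt(1 - S)

# Do both conversions produce the same sequence of merges for a given linkage?
same_merges <- function(method) {
  identical(hclust(d_lin,  method = method)$merge,
            hclust(d_sqrt, method = method)$merge)
}

res <- sapply(c("single", "complete", "average"), same_merges)
print(res)
```

With "average" the merges may or may not agree, depending on the data,
because a group average is not preserved by a nonlinear transformation.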

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  stats.ox.ac.uk/~ripley
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Kaspar Pflugshaupt

2001-Apr-11 08:04 UTC

[R] Similarity matrix

On Tuesday 10 April 2001 22:55, Frank E Harrell Jr wrote:
> I frequently use hclust on a similarity matrix.  In R only a
> distance matrix is allowed.  Is there a simple reliable
> transformation of a similarity matrix that will result
> in a distance matrix making hclust work the same as
> S-Plus with a similarity matrix?  Venables & Ripley 3rd
> edition implies that a simple reversal of values
> will suffice.  Thanks -Frank

Legendre & Legendre (Numerical Ecology, 2nd ed., Elsevier) give a choice of

  D=1-S, D= sqrt(1-S), or D=sqrt(1-S^2)      (p. 252)

and list the respective properties of the first two (Table 7.2, p. 275). 
Basically, the properties of the resulting distance coefficient will depend 
on the kind of similarity coefficient you used (of which the book offers an 
amazing variety). 
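
For orientation, the three conversions quoted above can be tabulated side
by side. This is only a sketch; the list names `linear`, `sqrt` and
`sqrt_sq` are labels chosen here, not terminology from the book.

```r
# The three similarity-to-distance conversions, as functions of S in [0, 1]
conversions <- list(
  linear  = function(S) 1 - S,
  sqrt    = function(S) sqrt(1 - S),
  sqrt_sq = function(S) sqrt(1 - S^2)
)

S <- seq(0, 1, by = 0.25)

# One column per conversion, one row per similarity value
tab <- sapply(conversions, function(f) f(S))
rownames(tab) <- paste0("S=", S)
print(round(tab, 3))
```

All three map S = 1 to D = 0 and S = 0 to D = 1 and decrease in between;
they differ only in how the intermediate values are spaced.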

Cheers

Kaspar Pflugshaupt



-- 

Kaspar Pflugshaupt
Geobotanical Institute
ETH Zurich, Switzerland

Frank E Harrell Jr

2001-Apr-11 11:53 UTC

[R] Similarity matrix

Thanks very much to Brian Ripley, Kaspar Pflugshaupt, and Jari Oksanen
for addressing this issue.

The S-Plus online help sheds no light on the issue.  The S-Plus
statistics manual has a lot of information on clustering, but
only focuses on distance measures, as similarity measures
are only allowed in a minority of the clustering functions.

Brian Ripley did the test that I should have done to show
that hclust is using a simple translation from similarity
to distance.

The kinds of similarities I routinely use are
- pairwise squared Spearman rank correlation coefficients
- pairwise proportion of the time that two variables are
  missing on the same observation
- Hoeffding D nonparametric dependence index 
  (the scaling of which may be more problematic than the other two)
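
As an illustration of the first two similarities in the list above, here is
a rough base-R sketch on made-up data, using D = 1 - S and average linkage.
The data frame `X` and all variable names are assumptions for the example;
Hoeffding's D is left out because it is not available in base R.

```r
set.seed(2)
n <- 100

# Made-up data: a and b are strongly related, c and d share missing rows
X <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
X$b <- X$a + rnorm(n, sd = 0.3)
miss <- sample(n, 30)
X$c[miss] <- NA
X$d[miss[1:20]] <- NA

# Similarity 1: pairwise squared Spearman rank correlations
S_rho <- cor(X, method = "spearman", use = "pairwise.complete.obs")^2

# Similarity 2: proportion of observations on which both variables are missing
M <- is.na(as.matrix(X)) * 1      # 0/1 missingness indicators
S_na <- crossprod(M) / n

# Convert to distances with D = 1 - S and cluster the variables
hc_rho <- hclust(as.dist(1 - S_rho), method = "average")
hc_na  <- hclust(as.dist(1 - S_na),  method = "average")

print(hc_rho$merge)
print(round(S_na, 2))
```

Only the off-diagonal entries are used by `as.dist`, so it does not matter
that the diagonal of `S_na` holds each variable's own missing fraction
rather than 1.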

Thank you all,

Frank Harrell

-- 
Frank E Harrell Jr              Prof. of Biostatistics & Statistics
Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
U. Virginia School of Medicine  hesweb1.med.virginia.edu/biostat

Prof Brian Ripley

2001-Apr-11 12:11 UTC

[R] Similarity matrix

On Tue, 10 Apr 2001, Frank E Harrell Jr wrote:
> I frequently use hclust on a similarity matrix.  In R only a
> distance matrix is allowed.  Is there a simple reliable
> transformation of a similarity matrix that will result
> in a distance matrix making hclust work the same as
> S-Plus with a similarity matrix?  Venables & Ripley 3rd
> edition implies that a simple reversal of values
> will suffice.  Thanks -Frank

Testing with S-PLUS 6.0 shows that dist = 1 - sim is used there, so the
simple assumption is correct.

d <- dist(longley.y)
d <- d/max(d)
hclust(d, "ave")
$merge:
      [,1] [,2]
 [1,]   -2   -4
 [2,]   -6   -8
 [3,]   -1   -3
 [4,]  -14  -15
 [5,]  -10  -11
 [6,]   -5    2
 [7,]   -9  -12
 [8,]  -13    5
 [9,]    1    3
[10,]  -16    4
[11,]   -7    7
[12,]    8   10
[13,]    6   11
[14,]    9   13
[15,]   12   14

$height:
 [1] 0.006262043 0.011753372 0.014643545 0.022447014 0.030057803 0.046146438
 [7] 0.047591522 0.061849713 0.087427750 0.106310219 0.123025045 0.153018638
[13] 0.221579969 0.384352922 0.570969820

$order:
 [1] 13 10 11 16 14 15  2  4  1  3  5  6  8  7  9 12

hclust(sim=1-d, method="ave")
$merge:
      [,1] [,2]
 [1,]   -2   -4
 [2,]   -6   -8
 [3,]   -1   -3
 [4,]  -14  -15
 [5,]  -10  -11
 [6,]   -5    2
 [7,]   -9  -12
 [8,]  -13    5
 [9,]    3    1
[10,]  -16    4
[11,]   -7    7
[12,]   10    8
[13,]   11    6
[14,]   13    9
[15,]   14   12

$height:
 [1] 0.9937379 0.9882466 0.9853565 0.9775530 0.9699422 0.9538536 0.9524085
 [8] 0.9381503 0.9125723 0.8936898 0.8769749 0.8469813 0.7784200 0.6156471
[15] 0.4290302

$order:
 [1]  7  9 12  5  6  8  1  3  2  4 16 14 15 13 10 11

which is the same but expressed in similarities.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  stats.ox.ac.uk/~ripley
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Kaspar Pflugshaupt

2001-Apr-11 12:25 UTC

[R] Similarity matrix

On Wednesday 11 April 2001 10:23, Prof Brian Ripley wrote:

> And what does S-PLUS use? (Which is the point here?)

I've never done cluster analysis with S-Plus. But let's see:

The statistical manual for S-Plus 5.1/Unix fails to even mention similarity 
matrices.

help(hclust) (in S-Plus 5.1/Unix and 3.4/Unix) says 

  USAGE:                                                            

  hclust(dist, method = "compact", sim =)

  [...]         

   sim=                                                  
          structure giving similarities rather than distances. This can
          either be a symmetric matrix or a vector with a "Size"
          attribute. Missing values are not allowed.

The help text does not explain how the conversion to distances is done, 
though. And the source is not available...

> I guess we have to experiment?

Well, I've taken the time to do it for you (S-Plus 3.4/Unix):

  mat <- matrix(runif(100), nrow=10)
  # 1 - ...: S-Plus seems to mirror the tree's y scale
  # when given a similarity matrix
  print(1 - plclust(hclust( sim=mat ))$yn)

gives the same values as

  print(plclust(hclust( 1-mat ))$yn)

but different values from

  print(plclust(hclust( sqrt(1-mat) ))$yn)

The grouping structure is constant, anyway.

So, S-Plus seems to use D=1-S rather than D=sqrt(1-S) internally.

For R, it might be a good idea to let the user choose the conversion method 
via an additional parameter, making D=1-S the default.
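
A sketch of what such an interface could look like in R follows. The
function `hclust_sim` and its `conversion` argument are hypothetical,
written here only to illustrate the suggestion; they are not part of R.

```r
# Hypothetical wrapper (not part of R): cluster a similarity matrix scaled
# to [0, 1], letting the caller choose the similarity-to-distance conversion.
# D = 1 - S is the default, matching what S-Plus appears to do.
hclust_sim <- function(sim, method = "average",
                       conversion = c("linear", "sqrt", "sqrt2")) {
  conversion <- match.arg(conversion)
  sim <- as.matrix(sim)
  stopifnot(nrow(sim) == ncol(sim),
            isSymmetric(unname(sim)),
            !anyNA(sim),
            all(sim >= 0 & sim <= 1))
  d <- switch(conversion,
              linear = 1 - sim,           # D = 1 - S
              sqrt   = sqrt(1 - sim),     # D = sqrt(1 - S)
              sqrt2  = sqrt(1 - sim^2))   # D = sqrt(1 - S^2)
  hclust(as.dist(d), method = method)
}

# Usage on a made-up symmetric similarity matrix
set.seed(3)
n <- 10
S <- matrix(0, n, n)
S[lower.tri(S)] <- runif(n * (n - 1) / 2)
S <- S + t(S)
diag(S) <- 1

hc  <- hclust_sim(S)                       # default conversion, D = 1 - S
hc2 <- hclust_sim(S, conversion = "sqrt")  # D = sqrt(1 - S)
print(hc$height)
```

Unlike the S-Plus output shown in this thread, the heights returned here
stay on the distance scale; mirroring them back to similarities would be
`1 - hc$height` for the default conversion.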

According to Legendre & Legendre, the choice of similarity coefficient
_does_ make a difference as to which conversion should be preferred. For some
"species" of similarity coefficients, the resulting distance would be metric
and Euclidean with one method but not with the other; for others, vice versa.
I don't know if this matters for cluster analysis, but I think that it might,
especially when clustering with a Euclidean metric.
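
To make the metric and Euclidean properties concrete, here is a small R
sketch on a made-up 3 x 3 similarity matrix (the values are chosen for
illustration, not taken from the book). The helpers `is_metric` and
`is_euclidean` are hypothetical names: the first checks the triangle
inequality directly, the second checks that the doubly centred matrix of
squared distances has no negative eigenvalues.

```r
# Made-up similarities among three items a, b, c
S <- matrix(c(1.00, 0.75, 0.25,
              0.75, 1.00, 0.75,
              0.25, 0.75, 1.00), nrow = 3,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# TRUE if d(i,k) <= d(i,j) + d(j,k) holds for every triple
is_metric <- function(D, tol = 1e-12) {
  for (j in seq_len(nrow(D))) {
    # outer(D[, j], D[j, ], "+")[i, k] equals d(i,j) + d(j,k)
    if (any(D > outer(D[, j], D[j, ], "+") + tol)) return(FALSE)
  }
  TRUE
}

# TRUE if the items can be placed in Euclidean space with these distances
is_euclidean <- function(D, tol = 1e-8) {
  n <- nrow(D)
  J <- diag(n) - 1 / n                 # centring matrix I - 11'/n
  B <- -0.5 * J %*% D^2 %*% J          # doubly centred squared distances
  ev <- eigen((B + t(B)) / 2, symmetric = TRUE, only.values = TRUE)$values
  min(ev) >= -tol * max(abs(ev))
}

D_lin  <- 1 - S        # D = 1 - S
D_sqrt <- sqrt(1 - S)  # D = sqrt(1 - S)

print(c(metric = is_metric(D_lin),  euclidean = is_euclidean(D_lin)))
print(c(metric = is_metric(D_sqrt), euclidean = is_euclidean(D_sqrt)))
```

For this particular matrix, D = 1 - S gives d(a,c) = 0.75 but
d(a,b) + d(b,c) = 0.5, so the triangle inequality fails, whereas
D = sqrt(1 - S) passes both checks. Other similarity matrices can behave
differently.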

Cheers (hoping this was to the point :-)

Kaspar Pflugshaupt

-- 

Kaspar Pflugshaupt
Geobotanical Institute
ETH Zurich, Switzerland