I frequently use hclust on a similarity matrix. In R only a
distance matrix is allowed. Is there a simple reliable
transformation of a similarity matrix that will result
in a distance matrix making hclust work the same as
S-Plus with a similarity matrix? Venables & Ripley 3rd
edition implies that a simple reversal of values
will suffice. Thanks -Frank

--
Frank E Harrell Jr              Prof. of Biostatistics & Statistics
Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
U. Virginia School of Medicine  http://hesweb1.med.virginia.edu/biostat
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
On Tue, 10 Apr 2001, Frank E Harrell Jr wrote:

> I frequently use hclust on a similarity matrix. In R only a
> distance matrix is allowed. Is there a simple reliable
> transformation of a similarity matrix that will result
> in a distance matrix making hclust work the same as
> S-Plus with a similarity matrix? Venables & Ripley 3rd

We'd have to know how S-PLUS works!

> edition implies that a simple reversal of values
> will suffice.

Not quite. We say the scale is reversed, but not that it is linearly
reversed, because I don't know. Of course it only matters for average-link
clustering (in hclust).

The usual way to do this is to scale similarities to [0, 1] and take
D = sqrt(1-S) I believe, but I don't know why.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
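
The remark that the choice of reversal matters for average-link clustering can be explored in R. The following is a minimal sketch, not a proof: the random similarity matrix, the seed, and the helper name same_tree are assumptions made up for illustration. It compares the merge sequence under D = 1 - S with that under D = sqrt(1 - S). Single and complete linkage depend only on the rank order of the dissimilarities, so (barring ties) a monotone transformation should leave their merges unchanged; for average linkage the result depends on the data.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
n <- 10
S <- matrix(runif(n * n), n, n)   # made-up similarities in [0, 1]
S <- (S + t(S)) / 2               # make the matrix symmetric
diag(S) <- 1                      # each item is fully similar to itself

d_lin  <- as.dist(1 - S)          # D = 1 - S
d_sqrt <- as.dist(sqrt(1 - S))    # D = sqrt(1 - S)

# Hypothetical helper: TRUE if both distances yield the same merge sequence
same_tree <- function(method) {
  identical(hclust(d_lin, method)$merge, hclust(d_sqrt, method)$merge)
}

# Named logical vector, one entry per linkage method
sapply(c("single", "complete", "average"), same_tree)
```

Only the merge structure is compared here; the merge heights differ between the two conversions in every case.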
On Tuesday 10 April 2001 22:55, Frank E Harrell Jr wrote:

> I frequently use hclust on a similarity matrix. In R only a
> distance matrix is allowed. Is there a simple reliable
> transformation of a similarity matrix that will result
> in a distance matrix making hclust work the same as
> S-Plus with a similarity matrix? Venables & Ripley 3rd
> edition implies that a simple reversal of values
> will suffice. Thanks -Frank

Legendre & Legendre (Numerical Ecology, 2nd ed., Elsevier) give a choice of
D=1-S, D=sqrt(1-S), or D=sqrt(1-S^2) (p. 252) and list the respective
properties of the first two (Table 7.2, p. 275). Basically, the properties
of the resulting distance coefficient will depend on the kind of similarity
coefficient you used (of which the book offers an amazing variety).

Cheers

Kaspar Pflugshaupt

--
Kaspar Pflugshaupt
Geobotanical Institute
ETH Zurich, Switzerland
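
The three conversions cited from Legendre & Legendre are easy to write down in R. The sketch below is only illustrative: the list name conversions, the helper is_metric, and the use of squared Pearson correlations from the built-in longley data as the similarity are all assumptions, not anything taken from the book. It applies each conversion and checks the triangle inequality numerically, which is one of the properties a distance coefficient may or may not have.

```r
# The three similarity-to-distance conversions named above
conversions <- list(
  linear  = function(S) 1 - S,          # D = 1 - S
  sqrt    = function(S) sqrt(1 - S),    # D = sqrt(1 - S)
  sqrt_sq = function(S) sqrt(1 - S^2)   # D = sqrt(1 - S^2)
)

# Hypothetical helper: TRUE if D[i, k] <= D[i, j] + D[j, k] for all i, j, k
is_metric <- function(D, tol = 1e-12) {
  for (j in seq_len(nrow(D))) {
    # outer(...)[i, k] equals D[i, j] + D[j, k] for the current j
    if (any(D > outer(D[, j], D[j, ], "+") + tol)) return(FALSE)
  }
  TRUE
}

# Stand-in similarity: squared Pearson correlations among the longley variables
S <- cor(longley)^2

# One logical per conversion; the answer depends on the similarity used
sapply(conversions, function(f) is_metric(f(S)))
```

The outcome for one data set says nothing general about a similarity coefficient; it does not reproduce the table in the book.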
Thanks very much to Brian Ripley, Kaspar Pflugshaupt, and Jari Oksanen
for addressing this issue. The S-Plus online help sheds no light on
the issue. The S-Plus statistics manual has a lot of information on
clustering, but only focuses on distance measures, as similarity
measures are only allowed in a minority of the clustering functions.
Brian Ripley did the test that I should have done to show that
hclust is using a simple translation from similarity to distance.

The kinds of similarities I routinely use are

- pairwise squared Spearman rank correlation coefficients
- pairwise proportion of the time that two variables are missing
  on the same observation
- Hoeffding D nonparametric dependence index (the scaling of
  which may be more problematic than the other two)

Thank you all,

Frank Harrell

Prof Brian Ripley wrote:
>
> On Tue, 10 Apr 2001, Frank E Harrell Jr wrote:
>
> > I frequently use hclust on a similarity matrix. In R only a
> > distance matrix is allowed. Is there a simple reliable
> > transformation of a similarity matrix that will result
> > in a distance matrix making hclust work the same as
> > S-Plus with a similarity matrix? Venables & Ripley 3rd
> > edition implies that a simple reversal of values
> > will suffice. Thanks -Frank
>
> Testing with Splus 6.0 shows that dist = 1 - sim is used there, so the
> simple assumption is correct.

--
Frank E Harrell Jr              Prof. of Biostatistics & Statistics
Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
U. Virginia School of Medicine  http://hesweb1.med.virginia.edu/biostat
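
For the first two kinds of similarity listed in the message above, a rough R sketch follows. It assumes the built-in airquality data as a stand-in (any numeric data frame with missing values would do), uses D = 1 - S as the conversion, and picks average linkage arbitrarily; all object names are made up for illustration. Hoeffding's D is left out because base R has no function for it.

```r
x <- airquality   # built-in data with missing values, used only as a stand-in

# (1) Squared Spearman rank correlations on pairwise-complete observations
S_rho  <- cor(x, method = "spearman", use = "pairwise.complete.obs")^2
hc_rho <- hclust(as.dist(1 - S_rho), method = "average")

# (2) Proportion of observations on which both variables are missing
na     <- is.na(x) * 1               # 0/1 missingness indicators
S_na   <- crossprod(na) / nrow(na)   # entry [i, j] = fraction jointly missing
hc_na  <- hclust(as.dist(1 - S_na), method = "average")

# Merge heights are on the 1 - similarity scale
hc_rho$height
hc_na$height
```

Calling plot(hc_rho) or plot(hc_na) draws the corresponding dendrogram, with the vertical axis in units of 1 - similarity.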
On Tue, 10 Apr 2001, Frank E Harrell Jr wrote:

> I frequently use hclust on a similarity matrix. In R only a
> distance matrix is allowed. Is there a simple reliable
> transformation of a similarity matrix that will result
> in a distance matrix making hclust work the same as
> S-Plus with a similarity matrix? Venables & Ripley 3rd
> edition implies that a simple reversal of values
> will suffice. Thanks -Frank

Testing with Splus 6.0 shows that dist = 1 - sim is used there, so the
simple assumption is correct.

d <- dist(longley.y)
d <- d/max(d)
hclust(d, "ave")
$merge:
      [,1] [,2]
 [1,]   -2   -4
 [2,]   -6   -8
 [3,]   -1   -3
 [4,]  -14  -15
 [5,]  -10  -11
 [6,]   -5    2
 [7,]   -9  -12
 [8,]  -13    5
 [9,]    1    3
[10,]  -16    4
[11,]   -7    7
[12,]    8   10
[13,]    6   11
[14,]    9   13
[15,]   12   14

$height:
 [1] 0.006262043 0.011753372 0.014643545 0.022447014 0.030057803 0.046146438
 [7] 0.047591522 0.061849713 0.087427750 0.106310219 0.123025045 0.153018638
[13] 0.221579969 0.384352922 0.570969820

$order:
 [1] 13 10 11 16 14 15  2  4  1  3  5  6  8  7  9 12

hclust(sim=1-d, method="ave")
$merge:
      [,1] [,2]
 [1,]   -2   -4
 [2,]   -6   -8
 [3,]   -1   -3
 [4,]  -14  -15
 [5,]  -10  -11
 [6,]   -5    2
 [7,]   -9  -12
 [8,]  -13    5
 [9,]    3    1
[10,]  -16    4
[11,]   -7    7
[12,]   10    8
[13,]   11    6
[14,]   13    9
[15,]   14   12

$height:
 [1] 0.9937379 0.9882466 0.9853565 0.9775530 0.9699422 0.9538536 0.9524085
 [8] 0.9381503 0.9125723 0.8936898 0.8769749 0.8469813 0.7784200 0.6156471
[15] 0.4290302

$order:
 [1]  7  9 12  5  6  8  1  3  2  4 16 14 15 13 10 11

which is the same but expressed in similarities.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
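
The claim that the second run is "the same but expressed in similarities" can be checked by arithmetic on the two height vectors printed in the S-PLUS session above. The short R sketch below copies those numbers verbatim (the variable names are made up) and confirms that each similarity-scale height equals one minus the corresponding distance-scale height, up to the rounding of the printed output.

```r
# Merge heights copied from the two S-PLUS runs in the message above
h_dist <- c(0.006262043, 0.011753372, 0.014643545, 0.022447014, 0.030057803,
            0.046146438, 0.047591522, 0.061849713, 0.087427750, 0.106310219,
            0.123025045, 0.153018638, 0.221579969, 0.384352922, 0.570969820)
h_sim  <- c(0.9937379, 0.9882466, 0.9853565, 0.9775530, 0.9699422,
            0.9538536, 0.9524085, 0.9381503, 0.9125723, 0.8936898,
            0.8769749, 0.8469813, 0.7784200, 0.6156471, 0.4290302)

# Largest discrepancy between 1 - (distance height) and (similarity height);
# the printed values carry about seven significant digits
max_diff <- max(abs((1 - h_dist) - h_sim))
max_diff < 1e-6   # TRUE
```

The merge matrices agree as well, apart from the left/right order within rows 9 and 12 to 15, which does not change the grouping.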
On Wednesday 11 April 2001 10:23, Prof Brian Ripley wrote:

> And what does S-PLUS use? (Which is the point here?)

I've never done cluster analysis with S-Plus. But let's see:

The statistical manual for S-Plus 5.1/Unix fails to even mention similarity
matrices.

help(hclust) (in S-Plus 5.1/Unix and 3.4/Unix) says

USAGE:
    hclust(dist, method = "compact", sim =)
[...]
sim=    structure giving similarities rather than distances. This can
        either be a symmetric matrix or a vector with a "Size" attribute.
        Missing values are not allowed.

The help text does not explain how the conversion to distances is done,
though. And the source is not available...

> I guess we have to experiment?

Well, I've taken the time to do it for you (S-Plus 3.4/Unix):

    mat <- matrix(runif(100), nrow=10)
    print(1 - plclust(hclust( sim=mat ))$yn)
    # 1 - ...: S-Plus seems to mirror the tree's y scale
    # when given a similarity matrix

gives the same values as

    print(plclust(hclust( 1-mat ))$yn)

but different values from

    print(plclust(hclust( sqrt(1-mat) ))$yn)

The grouping structure is constant, anyway.

So, S-Plus seems to use D=1-S rather than D=sqrt(1-S) internally. For R, it
might be a good idea to let the user choose the conversion method via an
additional parameter, making D=1-S the default.

According to Legendre & Legendre, the choice of similarity coefficient
_does_ make a difference as to which conversion should be preferred. For
some "species" of similarity coefficients, the resulting distance would be
metric and Euclidean with one method but not with the other, for others
vice versa. I don't know if this matters for cluster analysis, but I think
that it might, especially when clustering with a Euclidean metric.
Cheers (hoping this was to the point :-)

Kaspar Pflugshaupt

--
Kaspar Pflugshaupt
Geobotanical Institute
ETH Zurich, Switzerland
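
The suggestion of an extra parameter for the conversion could look roughly like the wrapper below. This is only a sketch: hclust_sim and its conversion argument are hypothetical and not part of R, the similarities are assumed to lie in [0, 1], and the test matrix is random (symmetrized here, since as.dist reads only the lower triangle).

```r
# Hypothetical wrapper (not part of R): cluster from a similarity matrix,
# letting the caller choose the similarity-to-distance conversion.
hclust_sim <- function(sim, method = "complete",
                       conversion = c("linear", "sqrt")) {
  conversion <- match.arg(conversion)      # default is the first choice, "linear"
  sim <- as.matrix(sim)
  stopifnot(nrow(sim) == ncol(sim), !anyNA(sim),
            all(sim >= 0 & sim <= 1))      # assumption: similarities in [0, 1]
  d <- switch(conversion,
              linear = 1 - sim,            # D = 1 - S
              sqrt   = sqrt(1 - sim))      # D = sqrt(1 - S)
  hclust(as.dist(d), method = method)
}

set.seed(42)                               # arbitrary seed
mat <- matrix(runif(100), nrow = 10)
mat <- (mat + t(mat)) / 2                  # symmetrize the random similarities
diag(mat) <- 1

h_lin  <- hclust_sim(mat)                       # D = 1 - S
h_sqrt <- hclust_sim(mat, conversion = "sqrt")  # D = sqrt(1 - S)

identical(h_lin$merge, h_sqrt$merge)           # same grouping structure?
all.equal(sqrt(h_lin$height), h_sqrt$height)   # heights related by sqrt?
```

With complete linkage each merge height is a maximum of pairwise distances, so the two trees should share their grouping and differ only in the height scale, in line with the observation that the grouping structure stayed constant. Other linkage methods need not behave this way.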