Hi, I'm clustering objects defined by categorical variables with a hierarchical algorithm - average linkage. My distance matrix (general dissimilarity coefficient) includes several distances with exactly the same values. As I see, a standard agglomerative procedure ignores this problems, simply selecting, above equal distances, the one that comes first. For this reason the analysis in output depends strongly on the orderings of the objects within the raw data matrix. Is there a standard procedure to deal with this? Thanks Bruno
Bruno - Many people add a tiny random number to each of the distances, or deliberately randomize the input order. This means that any clustering is not reproducible, unless you go back to the original randoms, but it forces you not to pay attention to minor differences. Ah, I think you're asking about bootstrap confidence intervals for the set of descendants from each interior vertex. This is certainly routine procedure when inferring evolutionary trees, but I'm not sure any of that code has been re-implemented in R or Splus. - tom blackwell - u michigan medical school - ann arbor - On Wed, 3 Dec 2003, Bruno Giordano wrote:> Hi, > I'm clustering objects defined by categorical variables with a hierarchical > algorithm - average linkage. > My distance matrix (general dissimilarity coefficient) includes several > distances with exactly the same values. > As I see, a standard agglomerative procedure ignores this problems, simply > selecting, above equal distances, the one that comes first. > For this reason the analysis in output depends strongly on the orderings of > the objects within the raw data matrix. > Is there a standard procedure to deal with this? > Thanks > Bruno
On Wed, 3 Dec 2003, Bruno Giordano wrote:> Hi, > I'm clustering objects defined by categorical variables with a hierarchical > algorithm - average linkage. > My distance matrix (general dissimilarity coefficient) includes several > distances with exactly the same values. > As I see, a standard agglomerative procedure ignores this problems, simply > selecting, above equal distances, the one that comes first. > For this reason the analysis in output depends strongly on the orderings of > the objects within the raw data matrix. > Is there a standard procedure to deal with this?Don't use average linkage! -- Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
Hi, Brian Ripley already replied "don't use average linkage"... You may think about k-medoid (pam) in package cluster instead. However, often average linkage is not such a bad choice, and if you really want to use it for your data, you may try the following: Among the hierarchical methods, single linkage has the smallest problem with equal distances, because possible agglomerations based on equal distances between clusters are all carried out regardless of the order. If at some step the smallest between cluster-distance is d(a,b)=d(a,c)<d(b,c), it may happen that a and b are merged first, or a and c are merged first, but before merging anything else with distance larger than d(a,b), a, b *and* c are merged. Thus, you have order dependence only between the steps where you merge clusters with the same distance, but not afterwards. If your problem occurs only at a low level of agglomeration (and you don't have situations where d(a,b) and d(a,c) are small and d(b,c) is very large; I do not know if the triangle inequality holds for your data), you may do some first steps with Single Linkage and then continue with average linkage (I haven't thought about if this can be done in R without extra effort). But if you have already observed that the averarge linkage outcome depends critically (from the viewpoint of interpretation) on the order of points, then it seems that you are in an unstable situation, if you are able to define a unique clustering or not. Christian On Wed, 3 Dec 2003, Bruno Giordano wrote:> Hi, > I'm clustering objects defined by categorical variables with a hierarchical > algorithm - average linkage. > My distance matrix (general dissimilarity coefficient) includes several > distances with exactly the same values. > As I see, a standard agglomerative procedure ignores this problems, simply > selecting, above equal distances, the one that comes first. > For this reason the analysis in output depends strongly on the orderings of > the objects within the raw data matrix. > Is there a standard procedure to deal with this? > Thanks > Bruno > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://www.stat.math.ethz.ch/mailman/listinfo/r-help >*********************************************************************** Christian Hennig Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/ ####################################################################### ich empfehle www.boag-online.de