Hello,

I have a data set on which I run the sammon algorithm as follows:

    library(MASS)
    data = read.table('problemforr.dat')
    y = cmdscale(data, add=TRUE)
    s = sammon(data, y$points)

(In case it should be relevant, I make the data available at http://idi.ntnu.no/~edsberg/problemforr.dat)

With R 2.2.1 on Debian Sid I always get one of two solutions (stress 1.74288 after 10 iterations or stress 1.33629 after 9 iterations). I always get the same result within the same R session, even if I read the data again. With R 2.2.0 on SunOS 5.9 I always get the same result (stress 0.13186 after 74 iterations).

I understand that the sammon algorithm is very sensitive to even tiny variations in the starting point, but the observed behaviour seems strange to me. Differences between machines could perhaps be explained by floating point portability issues, but not differences on the same machine, and not the fact that I get the same result within the same R session.

I read in the documentation (http://stat.ethz.ch/R-manual/R-patched/library/MASS/html/sammon.html) that "Further, since the configuration is only determined up to rotations and reflections (by convention the centroid is at the origin), the result can vary considerably from machine to machine." This doesn't make sense to me. If the data and the algorithm are the same, the result should be the same. What differences between machines do they refer to here? Floating point issues?

I must admit that I am a beginner, both in R and in statistics. I'm very curious about the cause of this strangeness. Does anybody have an explanation?

Best Regards,
Ole Edsberg
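One way to check whether two runs really differ, rather than differing only by a rotation or reflection, is to compare the final stress and the inter-point distances of the two configurations, since both are unchanged by rotating or reflecting the map. The following is only a sketch under stated assumptions: it uses small synthetic data (the matrix `x`, the seed and the angle `theta` are made up for illustration and are not the poster's file), and it relies on the documented `points` and `stress` components of the value returned by `sammon()`.

```r
library(MASS)

# Hypothetical synthetic data standing in for the poster's file.
set.seed(1)
x <- matrix(rnorm(50 * 5), ncol = 5)
d <- dist(x)                      # Euclidean distances between 50 points

init <- cmdscale(d, k = 2)        # classical MDS starting configuration

# Two runs from the same start, in the same session.
s1 <- sammon(d, y = init, trace = FALSE)
s2 <- sammon(d, y = init, trace = FALSE)
print(identical(s1$stress, s2$stress))

# Rotate and reflect the first solution: the coordinates change,
# but the inter-point distances (and hence the stress) do not.
theta <- pi / 3
rot <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2)
moved <- s1$points %*% rot %*% diag(c(1, -1))
print(isTRUE(all.equal(as.vector(dist(s1$points)),
                       as.vector(dist(moved)))))
```

If two machines report clearly different stress values, as in the post above, the solutions are different local minima and not merely rotated copies of one another.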
Prof Brian Ripley
2006-Jan-30 13:06 UTC
[R] Varying results of sammon(), for the same data set
On Mon, 30 Jan 2006, Ole Edsberg wrote:

> Hello,
>
> I have a data set on which I run the sammon algorithm as follows:
>
> library(MASS)
> data = read.table('problemforr.dat')

Hmm. This is a data frame of 387 rows and 387 columns and Euclidean distance is used. Squeezing 387 dims (and PCA shows these points as well spread in almost all those dimensions) to 2 is not a well-posed problem, and you should welcome the plurality of answers found.

> y = cmdscale(data, add=TRUE)
> s = sammon(data, y$points)
>
> (In case it should be relevant, I make the data available at
> http://idi.ntnu.no/~edsberg/problemforr.dat)
>
> With R 2.2.1 on Debian Sid I always get one of two solutions (stress
> 1.74288 after 10 iterations or stress 1.33629 after 9 iterations). I
> always get the same result within the same R session, even if I read
> the data again. With R 2.2.0 on SunOS 5.9 I always get the same result
> (stress 0.13186 after 74 iterations).

Note that your subject line attributes this to sammon, but it could also be due to cmdscale. On AMD64 Linux I get

> s = sammon(data, y$points)
Initial stress        : 2.21024
stress after  10 iters: 1.22268, magic = 0.092
stress after  20 iters: 0.48801, magic = 0.009
stress after  30 iters: 0.35007, magic = 0.020
stress after  40 iters: 0.24377, magic = 0.045
stress after  50 iters: 0.17343, magic = 0.021
stress after  60 iters: 0.14944, magic = 0.048
stress after  70 iters: 0.12810, magic = 0.022
stress after  80 iters: 0.12423, magic = 0.010
stress after  90 iters: 0.12191, magic = 0.118
stress after 100 iters: 0.11986, magic = 0.500

That large reduction in `magic' indicates the algorithm is having problems. Without optimization (used for valgrind) I got the solution you quoted for Solaris 9. However, on all four systems I tried (AMD64 FC3 Linux, i686 FC3 Linux, Solaris and Windows), the results differed between systems but were repeatable on each system. I even ran under valgrind to be sure that no uninitialized areas were used (on FC3).

> I understand that the sammon algorithm is very sensitive to even tiny
> variations in the starting point, but the observed behaviour seems
> strange to me. Difference between machines could perhaps be explained
> by floating point portability issues, but not difference on the same
> machine, and not the fact that I get the same result within the same R
> session.

No, but then that is not reproducible, and has never been reported before. If for example different BLAS libraries get selected on different runs this would explain it. Or it could be a Debian-Sid-specific bug in a shared library or compiler.

> I read in the documentation
> (http://stat.ethz.ch/R-manual/R-patched/library/MASS/html/sammon.html)
> that "Further, since the configuration is only determined up to
> rotations and reflections (by convention the centroid is at the origin),
> the result can vary considerably from machine to machine." This doesn't
> make sense to me.

Note that this is addressing a separate issue. For a given minimized stress there are multiple solutions which can be transformed into each other, and the help file is warning you of that. There are also (in general) multiple local minima.

> If the data and the algorithm is the same, the result should be the
> same.

Depending on what you mean by 'algorithm', this is what the subject of numerical analysis is about. I take it you are familiar with J. H. Wilkinson's classic work on the Algebraic Eigenvalue Problem?

> What differences between machines do they refer to here? Floating
> point issues?

Any difference in the CPU/FPU or compiler or run-time environment (including all the dynamically linked support libraries). Just changing the optimization level of the compiler changes the assembler-level algorithm used, and can often affect the answer of e.g. an eigenvalue calculation. Rounding errors depend on whether (and when) extended-precision registers are used and the exact order of the calculations since computer arithmetic is not distributive.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
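The remark about the order of calculations can be seen in a few lines of R. This is only an illustrative sketch: the values 0.1, 0.2 and 0.3 are chosen for the demonstration and have nothing to do with the sammon data. Floating-point arithmetic is in general neither associative nor distributive, so a compiler or library that regroups a sum can change the last bits of the result, and an iterative method such as sammon may then follow a different path.

```r
# Minimal sketch: regrouping a floating-point sum changes the rounded result.
x <- 0.1
y <- 0.2
z <- 0.3

left  <- (x + y) + z   # x + y is rounded first, then z is added
right <- x + (y + z)   # y + z is rounded first, then x is added

# Exact comparison: typically FALSE with IEEE 754 double precision.
print(left == right)

# Show the two values to 17 significant digits.
print(format(c(left, right), digits = 17))

# Comparison within a numerical tolerance treats them as equal.
print(isTRUE(all.equal(left, right)))
```

For comparing results across machines, a tolerance-based check such as `all.equal()` is therefore usually more meaningful than `==` or `identical()`.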