Jeoffrey Gaspard
2010-Apr-26 12:37 UTC
[R] Cluster analysis: dissimilar results between R and SPSS
Hello everyone!

My data is composed of 277 individuals measured on 8 binary variables (1=yes, 2=no).

I did two similar cluster analyses, one in SPSS 18.0 and one in R 2.9.2. The objective is to have the means for each variable per retained cluster.

1) The R analysis ran as follows:

> call data
> dist=dist(data,method="euclidean")
> cluster=hclust(dist,method="ward")
> cluster

Call:
hclust(d = dist, method = "ward")

Cluster method   : ward
Distance         : euclidean
Number of objects: 277

> plot(cluster)
> rect.hclust(cluster, k=4, border="red")
> x=rect.hclust(cluster, k=4, border="red")
> sapply(x, function(i) colMeans(data[i,]))
> round(sapply(x, function(i) colMeans(data[i,])),2)

2) The SPSS analysis ran as follows:

Analysis --> Classify --> Hierarchical cluster analysis --> Cluster method: Ward's method, and Distance measure = Interval: Squared Euclidean distance. After that, I computed the means of each variable for each cluster.

The problem is that I get different results from the two analyses (different clusters and means).

However, when I use the "Euclidean distance" (unsquared) in SPSS, I get the same results as in R!

I thought the R "euclidean" method meant the "usual square distance between the two vectors (2 norm)" as specified in the documentation, not the unsquared distance. Does it not?

Thanks for the comment!

Jeffrey
Tal Galili
2010-Apr-26 12:41 UTC
[R] Cluster analysis: dissimilar results between R and SPSS
Hi Jeoffrey,

How stable are the results in general? If you repeat the analysis in R several times, does it yield the same results?

Tal
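The stability question can be checked directly. The following is a minimal sketch on simulated data (the matrix `dat` is a hypothetical stand-in, since the original data was not posted). hclust() uses no random initialization, so repeated runs on the same distance object should return identical trees. The sketch uses the method name "ward.D", which is how R 3.1.0 and later spell the method called "ward" in this thread. One caveat, stated as an assumption rather than a finding: binary data produces many tied distances, and with ties the resulting tree may depend on the order of the rows.

```r
set.seed(42)
# Hypothetical stand-in data: 277 individuals, 8 binary variables coded 1/2
dat <- matrix(sample(1:2, 277 * 8, replace = TRUE), ncol = 8)
d <- dist(dat, method = "euclidean")

# Run the same clustering twice on the same distance object
run1 <- hclust(d, method = "ward.D")
run2 <- hclust(d, method = "ward.D")

# Compare the merge history, the merge heights, and the 4-cluster memberships
same_merge <- identical(run1$merge, run2$merge)
same_height <- identical(run1$height, run2$height)
same_groups <- identical(cutree(run1, k = 4), cutree(run2, k = 4))
c(same_merge, same_height, same_groups)
```

If all three comparisons agree, run-to-run instability within R is unlikely to explain a difference between the R and SPSS results.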
Sarah Goslee
2010-Apr-26 13:00 UTC
[R] Cluster analysis: dissimilar results between R and SPSS
I'm not sure why you'd expect Euclidean distance and squared Euclidean distance to give the same results. Euclidean distance is the square root of the sum of the squared differences across variables, and that's exactly what dist() returns.

http://en.wikipedia.org/wiki/Euclidean_distance

On a map, it's the length of the hypotenuse, and you can measure it with a ruler and get the same number. Euclidean distance has a specific geometric meaning. Squared Euclidean distance is not the same thing, and not the standard definition you seem to be expecting. If that's what you want, then square the output of dist() before you perform the clustering.

Sarah

--
Sarah Goslee
http://www.functionaldiversity.org
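The suggestion to square the output of dist() before clustering can be sketched as follows. This is an illustration on simulated data (the matrix `dat` is a hypothetical stand-in for the 277 x 8 data set, which was not posted), not a reproduction of the original analysis. It assumes a current R version, where the method called "ward" in this thread is spelled "ward.D"; R 3.1.0 and later also offer "ward.D2", which squares the dissimilarities internally before updating clusters.

```r
set.seed(1)
# Hypothetical stand-in data: 277 individuals, 8 binary variables coded 1/2
dat <- matrix(sample(1:2, 277 * 8, replace = TRUE), ncol = 8)

d <- dist(dat, method = "euclidean")  # unsquared Euclidean distances
d2 <- d^2                             # squared distances; still a "dist" object

# Cluster on each version of the distances
hc_unsq <- hclust(d, method = "ward.D")
hc_sq <- hclust(d2, method = "ward.D")

# Memberships for k = 4, then the mean of each variable within each cluster
groups <- cutree(hc_sq, k = 4)
cluster_means <- round(sapply(split(as.data.frame(dat), groups), colMeans), 2)
cluster_means
```

To see how far the two solutions differ on a given data set, the memberships can be cross-tabulated with table(cutree(hc_unsq, k = 4), cutree(hc_sq, k = 4)).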