Bryan Hanson
2010-Apr-19 23:31 UTC
[R] What is mclust up to? Different clusters found if x and y interchanged
Hello All... I gave a task to my students that involved using mclust to look for clusters in some bivariate data of isotopes vs various mining locations. They discovered something I didn?t expect; the data (called tur) is appended below. p <- qplot(x = dD, y = dCu65, data = tur, color = mine) print(p) # simple bivariate plot of the data; looks fine mod1 <- Mclust(tur[,2:3]) mod1$G mod2 <- Mclust(tur[,3:2]) mod2$G One can use coordProj to see the clusters found, but the basic result is that mclust found 3 clusters in mod1, but if you interchange the x and y columns it finds 7 clusters (mod2). Since this is bivariate data, I find this result pretty strange. The only thing I can think of is that it has something to do with how the algorithm is seeded, but that isn't too convincing. Since I couldn't believe my eyes, I made up some bivariate data with known clusters (not given here) and interchanged the x and y columns and got the same clusters. I'm at a loss. Hopefully someone can set my understanding on the right path. Thanks for any insight! Bryan> sessionInfo()R version 2.10.1 (2009-12-14) x86_64-apple-darwin9.8.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] datasets tools grid graphics grDevices utils stats methods base other attached packages: [1] faraway_1.0.4 GGally_0.1 xtable_1.5-6 mvbutils_2.5.1 ggplot2_0.8.7 [6] digest_0.4.2 reshape_0.8.3 proto_0.3-8 ChemoSpec_1.43 R.utils_1.4.0 [11] R.oo_1.7.1 R.methodsS3_1.2.0 rgl_0.91 lattice_0.18-3 mvoutlier_1.4 [16] plyr_0.1.9 RColorBrewer_1.0-2 chemometrics_0.8 som_0.3-5 robustbase_0.5-0-1 [21] rpart_3.1-46 pls_2.1-0 pcaPP_1.8 mvtnorm_0.9-9 nnet_7.3-1 [26] mclust_3.4.4 MASS_7.3-5 lars_0.9-7 e1071_1.5-23 class_7.3-2 And the data:> dput(tur)structure(list(mine = structure(c(13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 6L, 6L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 9L, 9L, 9L, 9L, 12L, 12L, 12L, 12L, 12L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 4L, 4L, 4L, 4L, 4L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 5L, 5L), .Label = c("Cas", "CH", "CR", "FM", "GU", "KG", "KM", "LV", "MC", "MZ", "N8", "OR", "SB", "TF"), class = "factor"), dCu65 = c(6.1, 6.5, 5.7, 6.4, 6.1, 5.9, 6.5, 5.7, 6, 6.4, 6.8, 6.9, 6.7, 6.7, 5.8, 5.3, 6.7, 5.7, 6.7, 5.6, 6.4, 6.4, 6.5, 7.6, 6.7, 6.4, 6.4, 7, 7, 5.9, 5.3, 6.1, 11.8, 12.1, 12.4, 1, 1.8, 1.4, 1.2, 1.2, 1.2, 1.1, 1.2, 1.8, 1, 0.8, 0.7, 1.2, 1.5, 1, 0.2, 1.4, 0.3, 1.1, 0, -0.5, -1.4, -0.2, 1.1, 0.9, 0.1, 1.4, -1.2, -0.1, -0.4, -0.8, 3, 3.3, 3.9, 3.9, 3.7, 9.5, 9.6, 9.6, 8.3, 10.3, 8.4, 9.4, 0.8, 0.9, 1.3, 0.2, 2.2, 0.7, 1.8, 4.5, 4.5, 4.6, 4.5, 4.6, 3.9, 4.2, 4.9, 4.9, 4.5, 16.7, 17.4, 17.5, 17.5, 17.4, 1.9, 2.4, 1.7, 1.2, 1.5, 1.1, 1.8, 3.1, 3.1, 3, 4.1, 4.2, 2.7, 4.5, 0.4, 1.2, 3, 3.5, 1.2, 0.8, 0.9, 2, -0.8, 5.8, 0.6, 1, 0.9, 4.4, 2.3, -0.1), dD = c(-66L, -81L, -81L, -71L, -68L, -79L, -73L, -77L, -74L, -71L, -73L, -80L, -76L, -80L, -74L, -77L, -74L, -77L, -74L, -82L, -79L, -72L, -74L, -71L, -76L, -72L, -76L, -73L, -75L, -78L, -81L, -77L, -77L, -73L, -79L, -104L, -98L, -98L, -109L, -106L, -99L, -107L, -100L, -98L, -95L, -95L, -95L, -97L, -98L, -121L, -119L, -118L, -121L, -121L, -106L, -116L, -122L, -112L, -125L, -109L, -114L, -118L, -102L, -109L, -101L, -101L, -74L, -78L, -63L, -67L, -75L, -65L, -60L, -62L, -67L, -59L, -62L, -68L, -61L, -58L, -69L, -57L, -59L, -67L, -73L, -81L, -90L, -87L, -82L, -80L, -80L, -81L, -83L, -90L, -89L, -107L, -122L, -128L, -113L, -124L, -93L, -89L, -92L, -85L, -79L, -84L, -83L, -148L, -142L, -127L, -130L, -123L, -131L, -85L, -87L, -90L, -81L, -65L, -81L, -119L, -121L, -94L, -90L, -117L, -95L, -69L, -122L, -103L, -85L, -106L)), .Names = c("mine", "dCu65", "dD"), class = "data.frame", row.names = c(NA, -130L))
Bryan Hanson
2010-Apr-20 17:36 UTC
[R] What is mclust up to? Different clusters found if x and y interchanged
Thanks Prof. Fraley. This is very helpful. I had looked at the way mclust gets its start and that looked like a possible source of the differences, but had not had time to get up to speed and figure it out. I was thinking it was more of a set.seed sort of thing. Now I know more. Thanks so much. Bryan On 4/20/10 11:41 AM, "Chris Fraley" <fraley at u.washington.edu> wrote:> > > This is a really interesting case, but it is probably not that unusual. > > mod23 <- Mclust(tur[,2:3]) > mod32 <- Mclust(tur[,3:2]) > > mod23[c("modelName","G")] > mod32[c("modelName","G")] > > The 2:3 case is choosing the VVV model (more parameters per cluster) with 3 > clusters. > The 3:2 case is choosing the EEV model (more parsimonious), but with 7 > clusters. > > If you look at the BIC curves, you can see that mclust is indeed choosing the > model > corresponding to the maximum BIC in each case. > > plot( mclustBIC( tur[,2:3], modelNames = c("EEV", "VVV")) > plot( mclustBIC( tur[,3:2], modelNames = c("EEV", "VVV")) > > The default initialization for Mclust is via hierarchical clustering, and > that's > why the results differ > > hc23 <- hcVVV( tur[,2:3]) # default initialization for mod23 > hc32 <- hcVVV( tur[,3:2]) # default initialization for mod32 > > all.equal(hc32,hc23) # they differ! > > The hierarchical clustering merges two groups at each stage. > If there is more than one choice at any particular stage for > which the merge criteria are equal or within roundoff error, > a different pair could be chosen for merging if columns (or rows) > are permuted. This will affect later merges, and could ultimately > affect the mclust results. > > If you initialize both col permutations in the same way, you'll get > the same results. > > mod23mod <- Mclust(tur[,2:3], initialization = list( hcPairs = hc32)) # uses > mod32 initialization > mod32mod <- Mclust(tur[,3:2], initialization = list( hcPairs = hc23)) # uses > mod23 initialization > > mod23mod[c("modelName","G")] # same as mod32 > mod32mod[c("modelName","G")] # same as mod23 > > Hope this helps, > > Chris Fraley > > > On Tue, 20 Apr 2010, Bryan Hanson wrote: > >> Prof. Fraley, I wonder if you would have a couple of moments to respond to >> this question I put to the R help list. Thanks in advance for your time. >> >> Bryan >> ************* >> Bryan Hanson >> Acting Chair >> Professor of Chemistry & Biochemistry >> DePauw University, Greencastle IN USA >> >> >> ------ Forwarded Message >> From: Bryan Hanson <hanson at depauw.edu> >> Date: Mon, 19 Apr 2010 19:31:03 -0400 >> To: <hanson at gapps.depauw.edu>, R Help <r-help at stat.math.ethz.ch> >> Subject: [R] What is mclust up to? Different clusters found if x and y >> interchanged >> >> Hello All... >> >> I gave a task to my students that involved using mclust to look for clusters >> in some bivariate data of isotopes vs various mining locations. They >> discovered something I didn?t expect; the data (called tur) is appended >> below. >> >> p <- qplot(x = dD, y = dCu65, data = tur, color = mine) >> print(p) # simple bivariate plot of the data; looks fine >> >> mod1 <- Mclust(tur[,2:3]) >> mod1$G >> mod2 <- Mclust(tur[,3:2]) >> mod2$G >> >> One can use coordProj to see the clusters found, but the basic result is >> that mclust found 3 clusters in mod1, but if you interchange the x and y >> columns it finds 7 clusters (mod2). Since this is bivariate data, I find >> this result pretty strange. The only thing I can think of is that it has >> something to do with how the algorithm is seeded, but that isn't too >> convincing. Since I couldn't believe my eyes, I made up some bivariate data >> with known clusters (not given here) and interchanged the x and y columns >> and got the same clusters. I'm at a loss. Hopefully someone can set my >> understanding on the right path. >> >> Thanks for any insight! Bryan >> >>> sessionInfo() >> R version 2.10.1 (2009-12-14) >> x86_64-apple-darwin9.8.0 >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] datasets tools grid graphics grDevices utils stats >> methods base >> >> other attached packages: >> [1] faraway_1.0.4 GGally_0.1 xtable_1.5-6 mvbutils_2.5.1 >> ggplot2_0.8.7 >> [6] digest_0.4.2 reshape_0.8.3 proto_0.3-8 ChemoSpec_1.43 >> R.utils_1.4.0 >> [11] R.oo_1.7.1 R.methodsS3_1.2.0 rgl_0.91 lattice_0.18-3 >> mvoutlier_1.4 >> [16] plyr_0.1.9 RColorBrewer_1.0-2 chemometrics_0.8 som_0.3-5 >> robustbase_0.5-0-1 >> [21] rpart_3.1-46 pls_2.1-0 pcaPP_1.8 mvtnorm_0.9-9 >> nnet_7.3-1 >> [26] mclust_3.4.4 MASS_7.3-5 lars_0.9-7 e1071_1.5-23 >> class_7.3-2 >> >> And the data: >> >>> dput(tur) >> structure(list(mine = structure(c(13L, 13L, 13L, 13L, 13L, 13L, >> 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, >> 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, >> 6L, 6L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, >> 1L, 1L, 1L, 1L, 1L, 1L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, >> 9L, 9L, 9L, 9L, 12L, 12L, 12L, 12L, 12L, 8L, 8L, 8L, 8L, 8L, >> 8L, 8L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 10L, 10L, 10L, 10L, 10L, >> 10L, 10L, 10L, 10L, 10L, 4L, 4L, 4L, 4L, 4L, 11L, 11L, 11L, 11L, >> 11L, 11L, 11L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, >> 2L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 5L, 5L), .Label = c("Cas", >> "CH", "CR", "FM", "GU", "KG", "KM", "LV", "MC", "MZ", "N8", "OR", >> "SB", "TF"), class = "factor"), dCu65 = c(6.1, 6.5, 5.7, 6.4, >> 6.1, 5.9, 6.5, 5.7, 6, 6.4, 6.8, 6.9, 6.7, 6.7, 5.8, 5.3, 6.7, >> 5.7, 6.7, 5.6, 6.4, 6.4, 6.5, 7.6, 6.7, 6.4, 6.4, 7, 7, 5.9, >> 5.3, 6.1, 11.8, 12.1, 12.4, 1, 1.8, 1.4, 1.2, 1.2, 1.2, 1.1, >> 1.2, 1.8, 1, 0.8, 0.7, 1.2, 1.5, 1, 0.2, 1.4, 0.3, 1.1, 0, -0.5, >> -1.4, -0.2, 1.1, 0.9, 0.1, 1.4, -1.2, -0.1, -0.4, -0.8, 3, 3.3, >> 3.9, 3.9, 3.7, 9.5, 9.6, 9.6, 8.3, 10.3, 8.4, 9.4, 0.8, 0.9, >> 1.3, 0.2, 2.2, 0.7, 1.8, 4.5, 4.5, 4.6, 4.5, 4.6, 3.9, 4.2, 4.9, >> 4.9, 4.5, 16.7, 17.4, 17.5, 17.5, 17.4, 1.9, 2.4, 1.7, 1.2, 1.5, >> 1.1, 1.8, 3.1, 3.1, 3, 4.1, 4.2, 2.7, 4.5, 0.4, 1.2, 3, 3.5, >> 1.2, 0.8, 0.9, 2, -0.8, 5.8, 0.6, 1, 0.9, 4.4, 2.3, -0.1), dD = c(-66L, >> -81L, -81L, -71L, -68L, -79L, -73L, -77L, -74L, -71L, -73L, -80L, >> -76L, -80L, -74L, -77L, -74L, -77L, -74L, -82L, -79L, -72L, -74L, >> -71L, -76L, -72L, -76L, -73L, -75L, -78L, -81L, -77L, -77L, -73L, >> -79L, -104L, -98L, -98L, -109L, -106L, -99L, -107L, -100L, -98L, >> -95L, -95L, -95L, -97L, -98L, -121L, -119L, -118L, -121L, -121L, >> -106L, -116L, -122L, -112L, -125L, -109L, -114L, -118L, -102L, >> -109L, -101L, -101L, -74L, -78L, -63L, -67L, -75L, -65L, -60L, >> -62L, -67L, -59L, -62L, -68L, -61L, -58L, -69L, -57L, -59L, -67L, >> -73L, -81L, -90L, -87L, -82L, -80L, -80L, -81L, -83L, -90L, -89L, >> -107L, -122L, -128L, -113L, -124L, -93L, -89L, -92L, -85L, -79L, >> -84L, -83L, -148L, -142L, -127L, -130L, -123L, -131L, -85L, -87L, >> -90L, -81L, -65L, -81L, -119L, -121L, -94L, -90L, -117L, -95L, >> -69L, -122L, -103L, -85L, -106L)), .Names = c("mine", "dCu65", >> "dD"), class = "data.frame", row.names = c(NA, -130L)) >> >> ______________________________________________ >> R-help at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> >> ------ End of Forwarded Message >> >> >> >>