thr3ads.net - R help - [R] What is mclust up to? Different clusters found if x and y interchanged [Apr 2010]

If this information is useful, please help other people find it:
Share via:

Bryan Hanson

2010-Apr-19 23:31 UTC

[R] What is mclust up to? Different clusters found if x and y interchanged

Hello All...

I gave a task to my students that involved using mclust to look for clusters
in some bivariate data of isotopes vs various mining locations.  They
discovered something I didn?t expect; the data (called tur) is appended
below.

p <- qplot(x = dD, y = dCu65, data = tur, color = mine)
print(p) # simple bivariate plot of the data; looks fine

mod1 <- Mclust(tur[,2:3])
mod1$G
mod2 <- Mclust(tur[,3:2])
mod2$G

One can use coordProj to see the clusters found, but the basic result is
that mclust found 3 clusters in mod1, but if you interchange the x and y
columns it finds 7 clusters (mod2).  Since this is bivariate data, I find
this result pretty strange.  The only thing I can think of is that it has
something to do with how the algorithm is seeded, but that isn't too
convincing.  Since I couldn't believe my eyes, I made up some bivariate data
with known clusters (not given here) and interchanged the x and y columns
and got the same clusters.  I'm at a loss.  Hopefully someone can set my
understanding on the right path.

Thanks for any insight!  Bryan
> sessionInfo()R version 2.10.1 (2009-12-14)
x86_64-apple-darwin9.8.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] datasets  tools     grid      graphics  grDevices utils     stats
methods   base     

other attached packages:
 [1] faraway_1.0.4      GGally_0.1         xtable_1.5-6       mvbutils_2.5.1
ggplot2_0.8.7     
 [6] digest_0.4.2       reshape_0.8.3      proto_0.3-8        ChemoSpec_1.43
R.utils_1.4.0     
[11] R.oo_1.7.1         R.methodsS3_1.2.0  rgl_0.91           lattice_0.18-3
mvoutlier_1.4     
[16] plyr_0.1.9         RColorBrewer_1.0-2 chemometrics_0.8   som_0.3-5
robustbase_0.5-0-1
[21] rpart_3.1-46       pls_2.1-0          pcaPP_1.8          mvtnorm_0.9-9
nnet_7.3-1        
[26] mclust_3.4.4       MASS_7.3-5         lars_0.9-7         e1071_1.5-23
class_7.3-2       

And the data:
> dput(tur)structure(list(mine = structure(c(13L, 13L, 13L, 13L, 13L, 13L,
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L,
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L,
6L, 6L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L,
9L, 9L, 9L, 9L, 12L, 12L, 12L, 12L, 12L, 8L, 8L, 8L, 8L, 8L,
8L, 8L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 10L, 10L, 4L, 4L, 4L, 4L, 4L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 5L, 5L), .Label = c("Cas",
"CH", "CR", "FM", "GU", "KG",
"KM", "LV", "MC", "MZ", "N8",
"OR",
"SB", "TF"), class = "factor"), dCu65 = c(6.1,
6.5, 5.7, 6.4,
6.1, 5.9, 6.5, 5.7, 6, 6.4, 6.8, 6.9, 6.7, 6.7, 5.8, 5.3, 6.7,
5.7, 6.7, 5.6, 6.4, 6.4, 6.5, 7.6, 6.7, 6.4, 6.4, 7, 7, 5.9,
5.3, 6.1, 11.8, 12.1, 12.4, 1, 1.8, 1.4, 1.2, 1.2, 1.2, 1.1,
1.2, 1.8, 1, 0.8, 0.7, 1.2, 1.5, 1, 0.2, 1.4, 0.3, 1.1, 0, -0.5,
-1.4, -0.2, 1.1, 0.9, 0.1, 1.4, -1.2, -0.1, -0.4, -0.8, 3, 3.3,
3.9, 3.9, 3.7, 9.5, 9.6, 9.6, 8.3, 10.3, 8.4, 9.4, 0.8, 0.9,
1.3, 0.2, 2.2, 0.7, 1.8, 4.5, 4.5, 4.6, 4.5, 4.6, 3.9, 4.2, 4.9,
4.9, 4.5, 16.7, 17.4, 17.5, 17.5, 17.4, 1.9, 2.4, 1.7, 1.2, 1.5,
1.1, 1.8, 3.1, 3.1, 3, 4.1, 4.2, 2.7, 4.5, 0.4, 1.2, 3, 3.5,
1.2, 0.8, 0.9, 2, -0.8, 5.8, 0.6, 1, 0.9, 4.4, 2.3, -0.1), dD = c(-66L,
-81L, -81L, -71L, -68L, -79L, -73L, -77L, -74L, -71L, -73L, -80L,
-76L, -80L, -74L, -77L, -74L, -77L, -74L, -82L, -79L, -72L, -74L,
-71L, -76L, -72L, -76L, -73L, -75L, -78L, -81L, -77L, -77L, -73L,
-79L, -104L, -98L, -98L, -109L, -106L, -99L, -107L, -100L, -98L,
-95L, -95L, -95L, -97L, -98L, -121L, -119L, -118L, -121L, -121L,
-106L, -116L, -122L, -112L, -125L, -109L, -114L, -118L, -102L,
-109L, -101L, -101L, -74L, -78L, -63L, -67L, -75L, -65L, -60L,
-62L, -67L, -59L, -62L, -68L, -61L, -58L, -69L, -57L, -59L, -67L,
-73L, -81L, -90L, -87L, -82L, -80L, -80L, -81L, -83L, -90L, -89L,
-107L, -122L, -128L, -113L, -124L, -93L, -89L, -92L, -85L, -79L,
-84L, -83L, -148L, -142L, -127L, -130L, -123L, -131L, -85L, -87L,
-90L, -81L, -65L, -81L, -119L, -121L, -94L, -90L, -117L, -95L,
-69L, -122L, -103L, -85L, -106L)), .Names = c("mine",
"dCu65",
"dD"), class = "data.frame", row.names = c(NA, -130L))

Bryan Hanson

2010-Apr-20 17:36 UTC

head link

[R] What is mclust up to? Different clusters found if x and y interchanged

Thanks Prof. Fraley.  This is very helpful.  I had looked at the way mclust
gets its start and that looked like a possible source of the differences,
but had not had time to get up to speed and figure it out.  I was thinking
it was more of a set.seed sort of thing.  Now I know more.  Thanks so much.
Bryan


On 4/20/10 11:41 AM, "Chris Fraley" <fraley at u.washington.edu>
wrote:
> 
> 
> This is a really interesting case, but it is probably not that unusual.
> 
> mod23 <- Mclust(tur[,2:3])
> mod32 <- Mclust(tur[,3:2])
> 
> mod23[c("modelName","G")]
> mod32[c("modelName","G")]
> 
> The 2:3 case is choosing the VVV model (more parameters per cluster) with 3
> clusters.
> The 3:2 case is choosing the EEV model (more parsimonious), but with 7
> clusters.
> 
> If you look at the BIC curves, you can see that mclust is indeed choosing
the
> model
> corresponding to the maximum BIC in each case.
> 
> plot( mclustBIC( tur[,2:3], modelNames = c("EEV",
"VVV"))
> plot( mclustBIC( tur[,3:2], modelNames = c("EEV",
"VVV"))
> 
> The default initialization for Mclust is via hierarchical clustering, and
> that's
> why the results differ
> 
> hc23 <- hcVVV( tur[,2:3]) # default initialization for mod23
> hc32 <- hcVVV( tur[,3:2]) # default initialization for mod32
> 
> all.equal(hc32,hc23) # they differ!
> 
> The hierarchical clustering merges two groups at each stage.
> If there is more than one choice at any particular stage for
> which the merge criteria are equal or within roundoff error,
> a different pair could be chosen for merging if columns (or rows)
> are permuted. This will affect later merges, and could ultimately
> affect the mclust results.
> 
> If you initialize both col permutations in the same way, you'll get
> the same results.
> 
> mod23mod <- Mclust(tur[,2:3], initialization = list( hcPairs = hc32)) #
uses
> mod32 initialization
> mod32mod <- Mclust(tur[,3:2], initialization = list( hcPairs = hc23)) #
uses
> mod23 initialization
> 
> mod23mod[c("modelName","G")] # same as mod32
> mod32mod[c("modelName","G")] # same as mod23
> 
> Hope this helps,
> 
> Chris Fraley
> 
> 
> On Tue, 20 Apr 2010, Bryan Hanson wrote:
> 
>> Prof. Fraley, I wonder if you would have a couple of moments to respond
to
>> this question I put to the R help list.  Thanks in advance for your
time.
>> 
>> Bryan
>> *************
>> Bryan Hanson
>> Acting Chair
>> Professor of Chemistry & Biochemistry
>> DePauw University, Greencastle IN USA
>> 
>> 
>> ------ Forwarded Message
>> From: Bryan Hanson <hanson at depauw.edu>
>> Date: Mon, 19 Apr 2010 19:31:03 -0400
>> To: <hanson at gapps.depauw.edu>, R Help <r-help at
stat.math.ethz.ch>
>> Subject: [R] What is mclust up to? Different clusters found if x and y
>> interchanged
>> 
>> Hello All...
>> 
>> I gave a task to my students that involved using mclust to look for
clusters
>> in some bivariate data of isotopes vs various mining locations.  They
>> discovered something I didn?t expect; the data (called tur) is appended
>> below.
>> 
>> p <- qplot(x = dD, y = dCu65, data = tur, color = mine)
>> print(p) # simple bivariate plot of the data; looks fine
>> 
>> mod1 <- Mclust(tur[,2:3])
>> mod1$G
>> mod2 <- Mclust(tur[,3:2])
>> mod2$G
>> 
>> One can use coordProj to see the clusters found, but the basic result
is
>> that mclust found 3 clusters in mod1, but if you interchange the x and
y
>> columns it finds 7 clusters (mod2).  Since this is bivariate data, I
find
>> this result pretty strange.  The only thing I can think of is that it
has
>> something to do with how the algorithm is seeded, but that isn't
too
>> convincing.  Since I couldn't believe my eyes, I made up some
bivariate data
>> with known clusters (not given here) and interchanged the x and y
columns
>> and got the same clusters.  I'm at a loss.  Hopefully someone can
set my
>> understanding on the right path.
>> 
>> Thanks for any insight!  Bryan
>> 
>>> sessionInfo()
>> R version 2.10.1 (2009-12-14)
>> x86_64-apple-darwin9.8.0
>> 
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>> 
>> attached base packages:
>> [1] datasets  tools     grid      graphics  grDevices utils     stats
>> methods   base
>> 
>> other attached packages:
>> [1] faraway_1.0.4      GGally_0.1         xtable_1.5-6      
mvbutils_2.5.1
>> ggplot2_0.8.7
>> [6] digest_0.4.2       reshape_0.8.3      proto_0.3-8       
ChemoSpec_1.43
>> R.utils_1.4.0
>> [11] R.oo_1.7.1         R.methodsS3_1.2.0  rgl_0.91          
lattice_0.18-3
>> mvoutlier_1.4
>> [16] plyr_0.1.9         RColorBrewer_1.0-2 chemometrics_0.8   som_0.3-5
>> robustbase_0.5-0-1
>> [21] rpart_3.1-46       pls_2.1-0          pcaPP_1.8         
mvtnorm_0.9-9
>> nnet_7.3-1
>> [26] mclust_3.4.4       MASS_7.3-5         lars_0.9-7        
e1071_1.5-23
>> class_7.3-2
>> 
>> And the data:
>> 
>>> dput(tur)
>> structure(list(mine = structure(c(13L, 13L, 13L, 13L, 13L, 13L,
>> 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L,
>> 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L,
>> 6L, 6L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
>> 1L, 1L, 1L, 1L, 1L, 1L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L,
>> 9L, 9L, 9L, 9L, 12L, 12L, 12L, 12L, 12L, 8L, 8L, 8L, 8L, 8L,
>> 8L, 8L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 10L, 10L, 10L, 10L, 10L,
>> 10L, 10L, 10L, 10L, 10L, 4L, 4L, 4L, 4L, 4L, 11L, 11L, 11L, 11L,
>> 11L, 11L, 11L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L,
>> 2L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 5L, 5L), .Label =
c("Cas",
>> "CH", "CR", "FM", "GU",
"KG", "KM", "LV", "MC", "MZ",
"N8", "OR",
>> "SB", "TF"), class = "factor"), dCu65 =
c(6.1, 6.5, 5.7, 6.4,
>> 6.1, 5.9, 6.5, 5.7, 6, 6.4, 6.8, 6.9, 6.7, 6.7, 5.8, 5.3, 6.7,
>> 5.7, 6.7, 5.6, 6.4, 6.4, 6.5, 7.6, 6.7, 6.4, 6.4, 7, 7, 5.9,
>> 5.3, 6.1, 11.8, 12.1, 12.4, 1, 1.8, 1.4, 1.2, 1.2, 1.2, 1.1,
>> 1.2, 1.8, 1, 0.8, 0.7, 1.2, 1.5, 1, 0.2, 1.4, 0.3, 1.1, 0, -0.5,
>> -1.4, -0.2, 1.1, 0.9, 0.1, 1.4, -1.2, -0.1, -0.4, -0.8, 3, 3.3,
>> 3.9, 3.9, 3.7, 9.5, 9.6, 9.6, 8.3, 10.3, 8.4, 9.4, 0.8, 0.9,
>> 1.3, 0.2, 2.2, 0.7, 1.8, 4.5, 4.5, 4.6, 4.5, 4.6, 3.9, 4.2, 4.9,
>> 4.9, 4.5, 16.7, 17.4, 17.5, 17.5, 17.4, 1.9, 2.4, 1.7, 1.2, 1.5,
>> 1.1, 1.8, 3.1, 3.1, 3, 4.1, 4.2, 2.7, 4.5, 0.4, 1.2, 3, 3.5,
>> 1.2, 0.8, 0.9, 2, -0.8, 5.8, 0.6, 1, 0.9, 4.4, 2.3, -0.1), dD = c(-66L,
>> -81L, -81L, -71L, -68L, -79L, -73L, -77L, -74L, -71L, -73L, -80L,
>> -76L, -80L, -74L, -77L, -74L, -77L, -74L, -82L, -79L, -72L, -74L,
>> -71L, -76L, -72L, -76L, -73L, -75L, -78L, -81L, -77L, -77L, -73L,
>> -79L, -104L, -98L, -98L, -109L, -106L, -99L, -107L, -100L, -98L,
>> -95L, -95L, -95L, -97L, -98L, -121L, -119L, -118L, -121L, -121L,
>> -106L, -116L, -122L, -112L, -125L, -109L, -114L, -118L, -102L,
>> -109L, -101L, -101L, -74L, -78L, -63L, -67L, -75L, -65L, -60L,
>> -62L, -67L, -59L, -62L, -68L, -61L, -58L, -69L, -57L, -59L, -67L,
>> -73L, -81L, -90L, -87L, -82L, -80L, -80L, -81L, -83L, -90L, -89L,
>> -107L, -122L, -128L, -113L, -124L, -93L, -89L, -92L, -85L, -79L,
>> -84L, -83L, -148L, -142L, -127L, -130L, -123L, -131L, -85L, -87L,
>> -90L, -81L, -65L, -81L, -119L, -121L, -94L, -90L, -117L, -95L,
>> -69L, -122L, -103L, -85L, -106L)), .Names = c("mine",
"dCu65",
>> "dD"), class = "data.frame", row.names = c(NA,
-130L))
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>> 
>> ------ End of Forwarded Message
>> 
>> 
>> 
>>

Apparently Analagous Threads

Search for more possibly parallel threads

R help - Apr 2010 - What is mclust up to? Different clusters found if x and y interchanged

[R] What is mclust up to? Different clusters found if x and y interchanged

[R] What is mclust up to? Different clusters found if x and y interchanged

Apparently Analagous Threads