Jonathan Qiang Li
2001-Jul-31 16:59 UTC
[R] fitting mixture of gaussians using emclust() of mclust package
Hi,

Has someone tried to use the mclust package function emclust() to fit a mixture-of-Gaussians model to a relatively large dataset? By "large" I specifically have in mind a data set with 50,000 observations and 23 dimensions. My machine has 750M memory and 500M swap space. When I try to use emclust() on this dataset, I consistently get messages such as "Error: cannot allocate vector of size 1991669 Kb". Does this mean that R is trying to allocate almost 2000Mb of space? Should this be considered abnormal?

Thanks,

--
Jonathan Q. Li, PhD
Agilent Technologies Laboratory
Palo Alto, California, USA

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject!) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Christian Hennig
2001-Aug-01 09:16 UTC
[R] fitting mixture of gaussians using emclust() of mclust package
On Tue, 31 Jul 2001, Jonathan Qiang Li wrote:

> Has someone tried to use mclust package function emclust() to fit a
> mixture of gaussian model for a relatively large dataset? By "large",
> I specifically have in mind a data set with 50,000 observations and 23
> dimensions. [...] I consistently get messages such as "Error: cannot
> allocate vector of size 1991669 Kb". [...] Should this be considered
> abnormal?

No. I recently talked to A. E. Raftery, one of the designers of the S-Plus original, and he said that there are indeed problems with datasets of more than, say, 10,000 observations. He said that it is the number of observations that matters, not the dimension. The main problem, according to him, is the hierarchical clustering routine that produces the initial partition. He suggests taking a random subsample of size 100-1000 and generating the initial starting parameters from that subsample.

I cannot tell you the details, because I have not tried this myself. But the principle is that you can tell emclust/mclust how the starting values are generated, and the default, the memory-intensive hierarchical clustering, can be replaced by a fixed starting configuration obtained from a subsample.

Another hint: for high dimensions it is not advisable to fit the "VVV" model, because of the high probability of spurious local maxima of the likelihood.
Hope that helps,
Christian

***********************************************************************
Christian Hennig
University of Hamburg, Faculty of Mathematics - SPST/ZMS
(Schwerpunkt Mathematische Statistik und Stochastische Prozesse,
Zentrum fuer Modellierung und Simulation)
Bundesstrasse 55, D-20146 Hamburg, Germany
Tel: x40/42838 4907, privat x40/631 62 79
hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
#######################################################################
I recommend www.boag.de
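[Editor's note: the subsample-initialization idea above can be sketched roughly as follows. Argument names follow the current mclust documentation and may differ from the 2001 version; the data, model name "EEE", and subsample size are illustrative assumptions, not part of the original advice.]

```r
library(mclust)

set.seed(1)
X <- matrix(rnorm(5000 * 5), ncol = 5)   # stand-in for the large data set

## 1. Run the memory-hungry hierarchical initialization on a small
##    random subsample only, not on the full data.
sub <- sample(nrow(X), 500)
hcTree <- hc(X[sub, ])
cl <- hclass(hcTree, G = 3)              # hard classification into 3 groups

## 2. Turn the subsample classification into starting parameters
##    via one M-step on the subsample.
z0 <- unmap(cl)                          # 0/1 indicator matrix
ms <- mstep(data = X[sub, ], modelName = "EEE", z = z0)

## 3. One E-step on the full data with those parameters, then run EM
##    to convergence from there.
es <- estep(data = X, modelName = "EEE", parameters = ms$parameters)
fit <- me(data = X, modelName = "EEE", z = es$z)
```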
Jonathan Qiang Li
2001-Aug-01 15:34 UTC
[R] fitting mixture of gaussians using emclust() of mclust package
Thanks for the help.

Rather than using emclust(), using me() directly with kmeans-induced initial starting parameters seems to work better (though I am not sure by how much, since to get results I have to sample the data pretty aggressively).

But I still find that when I have data with more than 10,000 observations, the routine takes a painfully long time to converge. I understand that the speed of convergence of the EM algorithm is data-dependent and in general slow. But do people have, from their experience, a rough benchmark for the relationship between sample size and computation time in R? Can someone also point out some references/packages for speeding up EM, especially when sample size and dimension are non-trivial? (Not exactly an R-related question, but I thought people on this list would be interested in such problems.)

Regards,
Jonathan

Christian Hennig wrote:
> No. I recently talked to A.E.Raftery, one of the designers of the Splus
> original, and he said that there are indeed problems with datasets of
> more than, say, 10000 observations. [...] He suggests to take a random
> subsample of size 100-1000 and to generate initial starting parameters
> from the subsample. [...]
>
> Another hint is that for high dimensions it is not advisable to
> calculate the "VVV"-model because of the high probability for spurious
> local maxima of the likelihood.

--
Jonathan Q. Li, PhD
Agilent Technologies Laboratory
Palo Alto, California, USA
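[Editor's note: the "me() with kmeans-induced starting parameters" approach reads roughly like the sketch below. Argument names follow the current mclust documentation and may differ from the 2001 version; the data, number of clusters, and model name are illustrative assumptions.]

```r
library(mclust)

set.seed(2)
X <- matrix(rnorm(20000 * 4), ncol = 4)    # stand-in data

## Cheap hard clustering to induce starting values; this avoids the
## memory-intensive hierarchical initialization entirely.
km <- kmeans(X, centers = 3, nstart = 5)

## Convert the kmeans labels into an n x G indicator matrix and hand
## it to me() as the initial conditional membership probabilities.
z0 <- unmap(km$cluster)
fit <- me(data = X, modelName = "EEE", z = z0)

fit$parameters$mean                        # fitted component means
```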
Murray Jorgensen
2001-Aug-02 04:22 UTC
[R] fitting mixture of gaussians using emclust() of mclust package
Jonathan,

I have no direct experience with mclust, but I have used Multimix, a similar program whose design I was involved in.

I believe that one way to speed up EM for mixtures is to allocate all observations to clusters according to an unconverged set of parameters, and then to restart the algorithm using this as the initial clustering. Another way might be to slip in a stochastic EM step, where observations are allocated probabilistically to clusters using the current values of the observation-specific cluster membership probabilities.

If clustering is your main purpose in fitting the mixture model, then complete convergence may not be all that important, as the clusters tend to stabilize quite a while before the parameter estimates do.

Multimix is written in Fortran 77 and also copes with categorical variables in addition to Gaussian ones. You may download it from my home page along with some documentation. I would appreciate feedback on how well it copes with your data if you use it. You will need to adjust some array bounds and recompile to tune it to your data set.

Murray Jorgensen

At 08:34 AM 1-08-01 -0700, you wrote:
> But I still found that when I have data with more than 10,000 obs,
> it takes the routine painfully long time to converge. [...] Can also
> some one point out some references/packages for speeding up EM,
> especially when sample size and dimension are not trivial?

Dr Murray Jorgensen    http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
*Applications Editor, Australian and New Zealand Journal of Statistics*
maj at waikato.ac.nz    Phone +64-7 838 4773    home phone 856 6705    Fax 838 4155
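[Editor's note: the stochastic allocation step described above can be illustrated in base R as below. This is a generic sketch of the idea, not Multimix's actual implementation; the function name `stochastic_e_step` is invented for illustration.]

```r
## z is an n x G matrix of current posterior cluster membership
## probabilities (rows sum to 1). Instead of carrying the soft
## probabilities into the M-step, draw one cluster label per
## observation from its posterior and return a hard assignment.
stochastic_e_step <- function(z) {
  n <- nrow(z)
  G <- ncol(z)
  labels <- apply(z, 1, function(p) sample.int(G, 1, prob = p))
  ind <- matrix(0, n, G)
  ind[cbind(seq_len(n), labels)] <- 1
  ind
}

set.seed(3)
z <- matrix(runif(10 * 2), 10, 2)
z <- z / rowSums(z)          # normalize rows into probabilities
stochastic_e_step(z)         # one 1 per row, drawn from the posterior
```

The extra randomness can help the algorithm escape the slow, nearly flat approach to a local maximum that plain EM exhibits, at the cost of no longer increasing the likelihood monotonically.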
Christian Hennig
2001-Aug-02 09:38 UTC
[R] fitting mixture of gaussians using emclust() of mclust package
Dear Jonathan,

Chapter 12 of G. McLachlan and D. Peel, "Finite Mixture Models", Wiley, New York, 2000, is devoted to this topic and contains lots of further references.

Regards,
Christian

On Wed, 1 Aug 2001, Jonathan Qiang Li wrote:
> [...] Can also some one point out some references/packages for speeding
> up EM, especially when sample size and dimension are not trivial?
Jonathan Qiang Li
2001-Aug-02 10:18 UTC
[R] fitting mixture of gaussians using emclust() of mclust package
Thanks for the pointer.

BTW, does your package implement some of these variations (such as stochastic EM, restarting EM, etc.), or is it a straight EM?

Regards,
Jonathan

Christian Hennig wrote:
> Chapter 12 of G. McLachlan, D. Peel "Finite Mixture Models", Wiley, NY
> 2000 is devoted to this topic and contains lots of further references.

--
Jonathan Q. Li, PhD
Agilent Technologies Laboratory
Palo Alto, California, USA
Jonathan Qiang Li
2001-Aug-02 10:53 UTC
[R] fitting mixture of gaussians using emclust() of mclust package
Sorry Christian... the following question in my previous message was meant for Murray Jorgensen:

"BTW, does your package implement some of these variations (such as stochastic EM, restarting EM, etc.), or is it a straight EM?"

--
Jonathan Q. Li, PhD
Agilent Technologies Laboratory
Palo Alto, California, USA