Jonathan Qiang Li
2001-Jul-31 16:59 UTC
[R] fitting mixture of gaussians using emclust() of mclust package
Hi,

Has someone tried to use the mclust package function emclust() to fit a mixture-of-Gaussians model to a relatively large dataset? By "large" I specifically have in mind a data set with 50,000 observations and 23 dimensions. My machine has 750M memory and 500M swap space. When I try to use emclust() on this dataset, I consistently get messages such as "Error: cannot allocate vector of size 1991669 Kb". Does this mean that R is trying to allocate almost 2000Mb of space? Should this be considered abnormal?

Thanks,

--
Jonathan Q. Li, PhD
Agilent Technologies Laboratory
Palo Alto, California, USA

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject!) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Christian Hennig
2001-Aug-01 09:16 UTC
[R] fitting mixture of gaussians using emclust() of mclust package
On Tue, 31 Jul 2001, Jonathan Qiang Li wrote:

> Has someone tried to use mclust package function emclust() to fit a
> mixture of gaussian model for a relatively large dataset? By "large",
> I specifically have in mind a data set with 50,000 observations and 23
> dimensions. [...] I consistently get messages such as "Error: cannot
> allocate vector of size 1991669 Kb". [...] Should this be considered
> abnormal?

No. I recently talked to A. E. Raftery, one of the designers of the S-Plus original, and he said that there are indeed problems with datasets of more than, say, 10,000 observations. He said that it is the number of observations that matters, not the dimension. The main problem, according to him, is the hierarchical clustering routine that produces the initial partition. He suggests taking a random subsample of size 100-1000 and generating the initial starting parameters from that subsample.

I cannot tell you the details, because I have not tried this myself. But the principle is that you can tell emclust/mclust how the starting values are generated, and the default, the memory-intensive hierarchical clustering, can be replaced by a fixed starting configuration obtained from a subsample.

Another hint: for high dimensions it is not advisable to fit the "VVV" model, because of the high probability of spurious local maxima of the likelihood.
Hope that helps,
Christian

***********************************************************************
Christian Hennig
University of Hamburg, Faculty of Mathematics - SPST/ZMS
(Schwerpunkt Mathematische Statistik und Stochastische Prozesse,
Zentrum fuer Modellierung und Simulation)
Bundesstrasse 55, D-20146 Hamburg, Germany
Tel: x40/42838 4907, privat x40/631 62 79
hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
#######################################################################
I recommend www.boag.de
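[Editor's note: the subsample-initialization idea above can be sketched roughly as follows. Argument names follow the current mclust documentation and may differ from the 2001 version; the data, model name "EEE", and subsample size are illustrative assumptions, not part of the original advice.]

```r
library(mclust)

set.seed(1)
X <- matrix(rnorm(5000 * 5), ncol = 5)   # stand-in for the large data set

## 1. Run the memory-hungry hierarchical initialization on a small
##    random subsample only, not on the full data.
sub <- sample(nrow(X), 500)
hcTree <- hc(X[sub, ])
cl <- hclass(hcTree, G = 3)              # hard classification into 3 groups

## 2. Turn the subsample classification into starting parameters
##    via one M-step on the subsample.
z0 <- unmap(cl)                          # 0/1 indicator matrix
ms <- mstep(data = X[sub, ], modelName = "EEE", z = z0)

## 3. One E-step on the full data with those parameters, then run EM
##    to convergence from there.
es <- estep(data = X, modelName = "EEE", parameters = ms$parameters)
fit <- me(data = X, modelName = "EEE", z = es$z)
```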
Jonathan Qiang Li
2001-Aug-01 15:34 UTC
[R] fitting mixture of gaussians using emclust() of mclust package
Thanks for the help.

Rather than using emclust(), using me() directly with kmeans-induced initial starting parameters seems to work better (though I am not sure by how much, since to get results I have to sample the data pretty aggressively).

But I still find that when I have data with more than 10,000 observations, the routine takes a painfully long time to converge. I understand that the speed of convergence of the EM algorithm is data-dependent and in general slow. But do people have, from their experience, a rough benchmark for the relationship between sample size and computation time in R? Can someone also point out some references/packages for speeding up EM, especially when sample size and dimension are non-trivial? (Not exactly an R-related question, but I thought people on this list would be interested in such problems.)

Regards,
Jonathan

Christian Hennig wrote:
> No. I recently talked to A.E.Raftery, one of the designers of the Splus
> original, and he said that there are indeed problems with datasets of
> more than, say, 10000 observations. [...] He suggests to take a random
> subsample of size 100-1000 and to generate initial starting parameters
> from the subsample. [...]
>
> Another hint is that for high dimensions it is not advisable to
> calculate the "VVV"-model because of the high probability for spurious
> local maxima of the likelihood.

--
Jonathan Q. Li, PhD
Agilent Technologies Laboratory
Palo Alto, California, USA
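[Editor's note: the "me() with kmeans-induced starting parameters" approach reads roughly like the sketch below. Argument names follow the current mclust documentation and may differ from the 2001 version; the data, number of clusters, and model name are illustrative assumptions.]

```r
library(mclust)

set.seed(2)
X <- matrix(rnorm(20000 * 4), ncol = 4)    # stand-in data

## Cheap hard clustering to induce starting values; this avoids the
## memory-intensive hierarchical initialization entirely.
km <- kmeans(X, centers = 3, nstart = 5)

## Convert the kmeans labels into an n x G indicator matrix and hand
## it to me() as the initial conditional membership probabilities.
z0 <- unmap(km$cluster)
fit <- me(data = X, modelName = "EEE", z = z0)

fit$parameters$mean                        # fitted component means
```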
Murray Jorgensen
2001-Aug-02 04:22 UTC
[R] fitting mixture of gaussians using emclust() of mclust package
Jonathan,

I have no direct experience with mclust, but I have used Multimix, a similar program whose design I was involved in.

I believe that one way to speed up EM for mixtures is to allocate all observations to clusters according to an unconverged set of parameters, and then to restart the algorithm using this as the initial clustering. Another way might be to slip in a stochastic EM step, where observations are allocated probabilistically to clusters using the current values of the observation-specific cluster membership probabilities.

If clustering is your main purpose in fitting the mixture model, then complete convergence may not be all that important, as the clusters tend to stabilize quite a while before the parameter estimates do.

Multimix is written in Fortran 77 and also copes with categorical variables in addition to Gaussian ones. You may download it from my home page along with some documentation. I would appreciate feedback on how well it copes with your data if you use it. You will need to adjust some array bounds and recompile to tune it to your data set.

Murray Jorgensen

At 08:34 AM 1-08-01 -0700, you wrote:
> But I still found that when I have data with more than 10,000 obs,
> it takes the routine painfully long time to converge. [...] Can also
> some one point out some references/packages for speeding up EM,
> especially when sample size and dimension are not trivial?

Dr Murray Jorgensen    http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
*Applications Editor, Australian and New Zealand Journal of Statistics*
maj at waikato.ac.nz    Phone +64-7 838 4773    home phone 856 6705    Fax 838 4155
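[Editor's note: the stochastic allocation step described above can be illustrated in base R as below. This is a generic sketch of the idea, not Multimix's actual implementation; the function name `stochastic_e_step` is invented for illustration.]

```r
## z is an n x G matrix of current posterior cluster membership
## probabilities (rows sum to 1). Instead of carrying the soft
## probabilities into the M-step, draw one cluster label per
## observation from its posterior and return a hard assignment.
stochastic_e_step <- function(z) {
  n <- nrow(z)
  G <- ncol(z)
  labels <- apply(z, 1, function(p) sample.int(G, 1, prob = p))
  ind <- matrix(0, n, G)
  ind[cbind(seq_len(n), labels)] <- 1
  ind
}

set.seed(3)
z <- matrix(runif(10 * 2), 10, 2)
z <- z / rowSums(z)          # normalize rows into probabilities
stochastic_e_step(z)         # one 1 per row, drawn from the posterior
```

The extra randomness can help the algorithm escape the slow, nearly flat approach to a local maximum that plain EM exhibits, at the cost of no longer increasing the likelihood monotonically.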
Christian Hennig
2001-Aug-02 09:38 UTC
[R] fitting mixture of gaussians using emclust() of mclust package
Dear Jonathan,

Chapter 12 of G. McLachlan and D. Peel, "Finite Mixture Models", Wiley, New York, 2000, is devoted to this topic and contains lots of further references.

Regards,
Christian

On Wed, 1 Aug 2001, Jonathan Qiang Li wrote:
> [...] Can also some one point out some references/packages for speeding
> up EM, especially when sample size and dimension are not trivial?
Jonathan Qiang Li
2001-Aug-02 10:18 UTC
[R] fitting mixture of gaussians using emclust() of mclust package
Thanks for the pointer.

BTW, does your package implement some of these variations (such as stochastic EM, restarting EM, etc.), or is it a straight EM?

Regards,
Jonathan

Christian Hennig wrote:
> Chapter 12 of G. McLachlan, D. Peel "Finite Mixture Models", Wiley, NY
> 2000 is devoted to this topic and contains lots of further references.

--
Jonathan Q. Li, PhD
Agilent Technologies Laboratory
Palo Alto, California, USA
Jonathan Qiang Li
2001-Aug-02 10:53 UTC
[R] fitting mixture of gaussians using emclust() of mclust package
Sorry Christian... the following question in my previous message was meant for Murray Jorgensen:

"BTW, does your package implement some of these variations (such as stochastic EM, restarting EM, etc.), or is it a straight EM?"

--
Jonathan Q. Li, PhD
Agilent Technologies Laboratory
Palo Alto, California, USA