Nothing makes you find the answer by yourself like asking for help...
Page 47 of the technical report (tr504.pdf) deals with exactly this
problem of large datasets.
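If I read it right, the trick is to run the expensive hierarchical
initialization on a random subset only; Mclust supports this directly
through its 'initialization' argument. A minimal sketch (the subset
size of 2000 is just my guess, not a recommendation from the report):

    library(mclust)
    # x: the full vector of ~4.6M values
    # Run the costly hierarchical initialization on a random subset;
    # EM then refines the model on the complete data
    sub <- sample(seq_along(x), 2000)
    fit <- Mclust(x, initialization = list(subset = sub))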
Also, I found that mclust does too much for my problem: the optimal
number of Gaussians it suggests is way too high. For example, for one
dataset (downsampled to 1/10) it suggests 9 Gaussians, but the central
7 sum, to a good approximation, to a single Gaussian, so the dataset
is better decomposed into only 3 Gaussians.
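For completeness, a sketch of what this looks like; the 1/10
downsampling factor and the 1:3 component range are my choices, not
anything mclust derives:

    library(mclust)
    # x: the full ordered data vector; keeping every 10th value gives
    # the 1/10 downsample mentioned above
    x10 <- x[seq(1, length(x), by = 10)]
    # Search only 1 to 3 components instead of the default 1:9 range
    fit <- Mclust(x10, G = 1:3)
    summary(fit, parameters = TRUE)  # means and variances of the fit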
I admit I'm not rigorous at all...
Bye!
mario
Mario Valle wrote:
> Hi all!
>
> I have an ordered vector of values. The distribution of these values
> can be modeled by a sum of Gaussians.
> So I'm using the package 'mclust' to get the Gaussians'
> parameters for this 1D distribution. It works very well but, for
> input sizes above 100,000 values, it starts taking forever.
> Unfortunately my dataset has around 4.6M values...
>
> My question: is it correct to subsample my dataset, taking one value
> every N, to make mclust happy? Or do I have no alternative except
> using the complete dataset?
>
> Excuse my profound ignorance and thanks for your help!
>
> mario
>
--
Ing. Mario Valle
Data Analysis and Visualization Group | http://www.cscs.ch/~mvalle
Swiss National Supercomputing Centre (CSCS) | Tel: +41 (91) 610.82.60
v. Cantonale Galleria 2, 6928 Manno, Switzerland | Fax: +41 (91) 610.82.82