Nothing makes you find the answer by yourself like asking for help...
Page 47 of the technical report (tr504.pdf) deals with exactly this
problem of large datasets.
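If I read it right, the trick is to run the expensive hierarchical
initialization on a random subset only; Mclust supports this directly
through its 'initialization' argument. A minimal sketch (the subset
size of 2000 is just my guess, not a recommendation from the report):

    library(mclust)
    # x: the full vector of ~4.6M values
    # Run the costly hierarchical initialization on a random subset;
    # EM then refines the model on the complete data
    sub <- sample(seq_along(x), 2000)
    fit <- Mclust(x, initialization = list(subset = sub))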
Also, I found that mclust does too much for my problem: the optimal
number of Gaussians it suggests is way too high. For example, for one
dataset (downsampled to 1/10) it suggests 9 Gaussians, but the central
7 sum, to a good approximation, to a single Gaussian, so the dataset
is better decomposed into only 3 Gaussians.
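For completeness, a sketch of what this looks like; the 1/10
downsampling factor and the 1:3 component range are my choices, not
anything mclust derives:

    library(mclust)
    # x: the full ordered data vector; keeping every 10th value gives
    # the 1/10 downsample mentioned above
    x10 <- x[seq(1, length(x), by = 10)]
    # Search only 1 to 3 components instead of the default 1:9 range
    fit <- Mclust(x10, G = 1:3)
    summary(fit, parameters = TRUE)  # means and variances of the fit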
I admit I'm not rigorous at all...
Bye!
mario
Mario Valle wrote:
> Hi all!
>
> I have an ordered vector of values. The distribution of these values
> can be modeled by a sum of Gaussians.
> So I'm using the package 'mclust' to get the Gaussians'
> parameters for this 1D distribution. It works very well but, for
> input sizes above 100,000 values, it starts taking forever.
> Unfortunately my dataset has around 4.6M values...
>
> My question: is it correct to subsample my dataset, taking one value
> every N, to make mclust happy? Or do I have no alternative except
> using the complete dataset?
>
> Excuse my profound ignorance and thanks for your help!
>
> mario
>
--
Ing. Mario Valle
Data Analysis and Visualization Group | http://www.cscs.ch/~mvalle
Swiss National Supercomputing Centre (CSCS) | Tel: +41 (91) 610.82.60
v. Cantonale Galleria 2, 6928 Manno, Switzerland | Fax: +41 (91) 610.82.82