Khamenia, Valery
2003-Apr-24 14:30 UTC
AW: AW: [R] estimating number of clusters ("Null or more")
> > It would be nice not only for me.
>
> I agree totally.

If you belong to the R contributors group, then thanks a lot in advance!

> The problem is that you have to formalize what a cluster is,
> and this is not a well defined notion.
> It has different meanings in different applications.

You are right: if one pursues a full formalization of the notion, it will most likely fail. But should one really go to that extreme? Let's take a small analogy with statistical tests. Statistical tests never answer "yes" or "no". One has to interpret the p-values on one's own. Well-constructed statistics thus just help us to focus on particular properties of a given distribution. Now back to our case. Why not build some statistics (in the cclust package they are called `indices') to help focus our attention on properties of the given distribution?

> My interpretation of the normal mixture/BIC
> approach is that it should work well if *your* concept of
> a cluster is that it looks normal-shaped
> (and the clusters do not need to be separated
> too strongly).

Fine. I'd like to emphasize here that, for as long as possible, one should avoid making any decision about how many clusters we have. Just like with those p-values.

> Normal mixtures (sometimes with lots of components) are reasonable
> approximations to a wide class of distributions, so the
> validity of the approach is rather a question of your
> cluster concept than of the distribution of the data.

I do agree that a multimodal normal mixture is a very powerful approximation basis for a wider class of distributions. But as a criterion for data homogeneity it is a rather weak basis. Indeed, a simple lognormal distribution will be adequately approximated only with more than one normal component. That pushes us automatically to the false conclusion that a lognormal distribution is not a homogeneous one.
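The lognormal point can be checked numerically. What follows is only a minimal sketch under stated assumptions, not code from cclust or from any package mentioned in this thread: it is written in plain Python rather than R, it uses a hand-written two-component EM fit, and all function names (bic_one_normal, bic_two_normals, and so on) are hypothetical. BIC is taken in the "smaller is better" convention, -2 logL + k log n. The sample size, seed, and lognormal parameters are arbitrary choices for illustration.

```python
import math
import random


def normal_logpdf(x, mu, var):
    """Log density of N(mu, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)


def bic(loglik, n_params, n):
    """BIC in the 'smaller is better' convention: -2 logL + k log n."""
    return -2.0 * loglik + n_params * math.log(n)


def mean_var(xs):
    """Sample mean and maximum-likelihood variance."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)


def bic_one_normal(xs):
    """BIC of a single normal fitted by maximum likelihood (2 parameters)."""
    mu, var = mean_var(xs)
    loglik = sum(normal_logpdf(x, mu, var) for x in xs)
    return bic(loglik, 2, len(xs))


def responsibilities(xs, w, mu, var):
    """E-step: posterior component probabilities and total log-likelihood."""
    resp, loglik = [], 0.0
    for x in xs:
        logs = [math.log(w[k]) + normal_logpdf(x, mu[k], var[k]) for k in range(2)]
        top = max(logs)
        lse = top + math.log(sum(math.exp(v - top) for v in logs))
        loglik += lse
        resp.append([math.exp(v - lse) for v in logs])
    return resp, loglik


def bic_two_normals(xs, n_iter=200, floor=1e-6):
    """BIC of a two-component normal mixture fitted by EM (5 parameters)."""
    n = len(xs)
    s = sorted(xs)
    halves = [s[: n // 2], s[n // 2:]]  # initialise from a median split
    w = [0.5, 0.5]
    mu = [mean_var(h)[0] for h in halves]
    var = [max(mean_var(h)[1], floor) for h in halves]
    for _ in range(n_iter):
        resp, _loglik = responsibilities(xs, w, mu, var)
        for k in range(2):
            # M-step, with guards against empty components and zero variance
            nk = max(sum(r[k] for r in resp), 1e-12)
            w[k] = nk / n
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(
                sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk, floor
            )
    _, loglik = responsibilities(xs, w, mu, var)
    return bic(loglik, 5, n)


random.seed(0)  # arbitrary seed, for reproducibility only
sample = [random.lognormvariate(0.0, 1.0) for _ in range(500)]
bic_one = bic_one_normal(sample)
bic_two = bic_two_normals(sample)
print("BIC, one normal: ", round(bic_one, 1))
print("BIC, two normals:", round(bic_two, 1))
```

If the two-component fit comes out with the smaller BIC, a BIC-based count would report two "clusters" for data drawn from a single homogeneous lognormal population, which is exactly the false conclusion described above.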
I regard the very idea of using entropy as quite adequate for describing the homogeneity of a set, and therefore as good enough to be a basis for deciding whether there are clusters or no clusters.

> Some material about my own point of view is given in "What
> clusters are generated by Normal mixtures?" on
> http://www.math.uni-hamburg.de/home/hennig/ -> Papers/publications
> with associated R-software (fixed point clusters) on the same
> website.

I am reading it.

> This means: Do not use N(0,1) as null distribution for
> homogeneous data if your
> ...

It is a bit clearer now. Thank you.

Well, could I ask for your own opinion about statistics (the so-called cluster indices) that focus on whether the data are homogeneously spread or attracted to some clusters? In particular, do you believe that entropy-based statistics would be adequate according to *your* own understanding of what clusters are?

And there is still an open question for me: whether one could calculate BIC based on the ECDF.

kind regards,
Valery A.Khamenya