thr3ads.net - R help - AW: [R] estimating number of clusters ("Null or more") [Apr 2003]

Khamenia, Valery

2003-Apr-24 13:11 UTC

AW: [R] estimating number of clusters ("Null or more")

Dear Christian,

  first of all, thank you for your answer. I am going to read through
  the pages you pointed me to. Meanwhile, I'd like to note that it is
  probably a good idea to put 2-3 lines of R code demonstrating such a
  simple need somewhere in the docs of the `cluster' package. E.g.

  x<-rnorm(500)
  ... # output suggests there is 1 cluster

  x<-c(rnorm(500), rnorm(500)+5)
  ... # output suggests there are 2 or more clusters

  It would be nice not only for me.
> EMclust of library mclust decides about an optimal number of mixture
> components using the BIC.
It is not clear to me whether one can use BIC without a
statement about the family of distributions. Indeed, BIC is based
on the likelihood, and what should the likelihood be if the only
adequate statement about the distribution is the ECDF itself?
> As far as I know, there is no direct answer to the problem of testing
> homogeneity vs. clustering in R. There are lots of
> theoretical difficulties and there is no "standard routine" to
> do this, neither in R, nor elsewhere.
I am not looking for the Holy Grail, or so I hope :-)

In particular, I believe some entropy-based criteria would
fully satisfy me here. BIC might also be good if it could be
applied to an ECDF.
> I would suggest to invent a null model for your data modelled as
> homogeneous and to estimate the distribution of a suitable clustering
> statistics (such as the silhouette avg.width in pam, BIC, average
> distance of the points to kth nearest neighbor or ratio between 25%
> largest and smallest distances in the dataset) by Monte
> Carlo/parametric bootstrap. Perhaps I say this too quickly;
a bit compressed, but something is clear anyway :-)
> it's non-trivial and at least you have to design the 
> simulation so that rejection/acceptance is not a 
> consequence of different scaling of data and null model. 
not clear here :-)

thanks again
Valery A.Khamenya

Christian Hennig

2003-Apr-24 13:30 UTC

AW: [R] estimating number of clusters ("Null or more")

Dear Valery,

On Thu, 24 Apr 2003, Khamenia, Valery wrote:
>   Meanwhile, I'd like to note that it is
>   probably a good idea to put 2-3 lines of R code demonstrating such a
>   simple need somewhere in the docs of the `cluster' package. E.g.
> 
>   x<-rnorm(500)
>   ... # output suggests there is 1 cluster
> 
>   x<-c(rnorm(500), rnorm(500)+5)
>   ... # output suggests there are 2 or more clusters
> 
>   It would be nice not only for me.
I agree totally.
> > EMclust of library mclust decides about an optimal number of mixture
> > components using the BIC.
> 
> It is not clear to me whether one can use BIC without a
> statement about the family of distributions. Indeed, BIC is based
> on the likelihood, and what should the likelihood be if the only
> adequate statement about the distribution is the ECDF itself?
The problem is that you have to formalize what a cluster is, and this is
not a well defined notion. It has different meanings in different
applications. My interpretation of the normal mixture/BIC approach is that
it should work well if *your* concept of a cluster is that it looks
normal-shaped (and the clusters do not need to be separated too strongly).
Normal mixtures (sometimes with lots of components) are reasonable
approximations to a wide class of distributions, so the validity of the
approach is rather a question of your cluster concept than of the
distribution of the data. (However, if your concept of "homogeneity" does
not look normal, BIC may often decide for more than one component for
data that are homogeneous *in your sense*.)
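
[Editorial illustration] The normal mixture/BIC idea discussed above can be sketched in base R on the two toy data sets from the original question. This is a hand-rolled EM fit for an equal-variance two-component normal mixture, not the mclust implementation; the function names bic_one_normal and bic_two_normals are made up for this sketch, the EM loop runs a fixed number of iterations without a convergence check, and BIC is taken as -2*logLik + p*log(n), so smaller is better (mclust reports BIC with the opposite sign).

```r
# Hand-rolled illustration in base R; this is NOT how mclust is implemented.
# BIC here is -2*logLik + (number of parameters)*log(n); smaller is better.

bic_one_normal <- function(x) {
  n <- length(x)
  mu <- mean(x)
  sigma <- sqrt(mean((x - mu)^2))            # maximum-likelihood sd
  loglik <- sum(dnorm(x, mu, sigma, log = TRUE))
  -2 * loglik + 2 * log(n)                   # parameters: mean, sd
}

bic_two_normals <- function(x, iter = 200) {
  n <- length(x)
  lower <- x <= median(x)                    # crude start: split at the median
  p <- 0.5                                   # mixing proportion of component 1
  mu <- c(mean(x[lower]), mean(x[!lower]))
  sigma <- sd(x)                             # common sd for both components
  for (i in seq_len(iter)) {
    # E-step: posterior probability that each point belongs to component 1
    d1 <- p * dnorm(x, mu[1], sigma)
    d2 <- (1 - p) * dnorm(x, mu[2], sigma)
    z <- d1 / (d1 + d2)
    # M-step: update proportion, means and the common sd
    p <- mean(z)
    mu <- c(sum(z * x) / sum(z), sum((1 - z) * x) / sum(1 - z))
    sigma <- sqrt(sum(z * (x - mu[1])^2 + (1 - z) * (x - mu[2])^2) / n)
  }
  loglik <- sum(log(p * dnorm(x, mu[1], sigma) +
                    (1 - p) * dnorm(x, mu[2], sigma)))
  -2 * loglik + 4 * log(n)                   # parameters: p, 2 means, sd
}

set.seed(1)
x1 <- rnorm(500)                             # homogeneous toy data
x2 <- c(rnorm(500), rnorm(500) + 5)          # two well-separated groups
bic <- rbind(x1 = c(one = bic_one_normal(x1), two = bic_two_normals(x1)),
             x2 = c(one = bic_one_normal(x2), two = bic_two_normals(x2)))
print(bic)
```

For each row, the column with the smaller value is the number of components this sketch would pick. As noted above, a preference for two components only says that the data are better approximated by two normals than by one, which need not match your own notion of a cluster.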

Some material about my own point of view is given in "What clusters are
generated by Normal mixtures?" on
http://www.math.uni-hamburg.de/home/hennig/ -> Papers/publications
with associated R-software (fixed point clusters) on the same website. 
> > it's non-trivial and at least you have to design the 
> > simulation so that rejection/acceptance is not a 
> > consequence of different scaling of data and null model. 
> 
> not clear here :-)
This means: do not use N(0,1) as the null distribution for homogeneous data if your
data has variance 5 and the test statistic is not scale invariant (as with
k-nearest neighbors and others). More generally, you have to think about
which features of your data should enter into your homogeneous null model
(which makes the procedure a parametric bootstrap with non-guaranteed
validity of p-values).
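
[Editorial illustration] A minimal sketch of such a parametric bootstrap in base R, under stated assumptions: the statistic is the average distance to the k-th nearest neighbor, the homogeneous null model is a single normal with the same mean and standard deviation as the data, and small values of the statistic are counted as evidence of clustering. The names knn_stat and homogeneity_test, and the choices k = 5 and B = 99, are hypothetical and only serve this example.

```r
# Sketch of a parametric-bootstrap homogeneity test (illustrative names).
# Statistic: average distance from each point to its k-th nearest neighbor.
knn_stat <- function(x, k = 5) {
  d <- as.matrix(dist(x))
  # position 1 of each sorted row is the point itself (distance 0),
  # so the k-th nearest neighbor sits at position k + 1
  mean(apply(d, 1, function(r) sort(r)[k + 1]))
}

homogeneity_test <- function(x, k = 5, B = 99) {
  obs <- knn_stat(x, k)
  # Null model: one normal with the SAME mean and sd as the data, so that
  # a difference in scale alone cannot drive rejection or acceptance.
  null <- replicate(B, knn_stat(rnorm(length(x), mean(x), sd(x)), k))
  # Assumption: clustered data are locally denser than the null model,
  # so small values of the statistic count against homogeneity.
  p <- (1 + sum(null <= obs)) / (B + 1)
  list(statistic = obs, null = null, p.value = p)
}

set.seed(42)
x1 <- rnorm(200)                       # homogeneous toy data
x2 <- c(rnorm(100), rnorm(100) + 5)    # two groups
print(homogeneity_test(x1)$p.value)
print(homogeneity_test(x2)$p.value)
```

Because the null sample inherits the mean and sd of the data, multiplying the data by a constant rescales the observed and simulated statistics alike and leaves the p-value unchanged. Simulating from N(0,1) instead would make the result depend on the units of the data. The caveat about non-guaranteed validity of p-values applies to this sketch as well.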

Best,
Christian

-- 
***********************************************************************
Christian Hennig
Seminar fuer Statistik, ETH-Zentrum (LEO), CH-8092 Zuerich (currently)
and Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
hennig at stat.math.ethz.ch, http://stat.ethz.ch/~hennig/
hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
#######################################################################
I recommend www.boag.de
