thr3ads.net - R help - [R] Dynamic clustering? [May 2010]

Ralf B

2010-May-05 21:18 UTC

[R] Dynamic clustering?

Are there R packages that allow for dynamic clustering, i.e. where the
number of clusters is not predefined? I have a list of numbers that
falls into either two clusters or just one. Here is an example of one
that should be split into two clusters:

two <- c(1,2,3,2,3,1,2,3,400,300,400)

and here is one that contains only one cluster and would therefore not
need to be clustered at all:

one <- c(400,402,405, 401,410,415, 407,412)

Given a sufficiently large amount of data, a statistical test or an
effect size should be able to determine whether it makes sense to
divide a data set, i.e. whether there are two groups that differ
clearly enough. I am not familiar with the underlying techniques in
kmeans, but I know that it blindly divides both data sets based on the
predefined number of clusters. Are there any more sophisticated methods
that allow me to determine the number of clusters in a data set based
on statistical tests or effect sizes?

Is it possible that this is not a clustering problem but a
classification problem?

Ralf

Erik Iverson

2010-May-05 21:32 UTC

Hello,

Ralf B wrote:
> Are there R packages that allow for dynamic clustering, i.e. where the
> number of clusters are not predefined? I have a list of numbers that
> falls in either 2 or just 1 cluster. Here an example of one that
> should be clustered into two clusters:
> 
> two <- c(1,2,3,2,3,1,2,3,400,300,400)
> 
> and here one that only contains one cluster and would therefore not
> need to be clustered at all.
> 
> one <- c(400,402,405, 401,410,415, 407,412)
> 
> Given a sufficiently large amount of data, a statistical test or an
> effect size should be able to determined if a data set makes sense to
> be divided i.e. if there are two groups that differ well enough. I am
> not familiar with the underlying techniques in kmeans, but I know that
> it blindly divides both data sets based on the predefined number of
> clusters. Are there any more sophisticated methods that allow me to
> determine the number of clusters in a data set based on statistical
> tests or effect sizes ?
Caveat: I have very little experience with clustering methods, but maybe 
this could get you started:

http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set

If you only want to make 2 clusters when the means of the data are an 
order of magnitude apart or more, that's easy enough to do without a 
statistical test.
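
A minimal sketch of that order-of-magnitude rule in base R, with no
statistical test involved. The function name split_by_magnitude, the
split at the largest gap between sorted values, and the factor of 10
are all assumptions made for illustration, and the ratio of means only
makes sense for positive data.

```r
# Hypothetical helper: split the sorted values at their largest gap and
# report whether the two group means differ by at least `factor`.
# Assumes positive values and at least two observations.
split_by_magnitude <- function(x, factor = 10) {
  s <- sort(x)
  cut_at <- which.max(diff(s))   # position of the largest gap
  low  <- s[seq_len(cut_at)]     # values below the gap
  high <- s[-seq_len(cut_at)]    # values above the gap
  mean(high) / mean(low) >= factor
}

two <- c(1,2,3,2,3,1,2,3,400,300,400)
one <- c(400,402,405, 401,410,415, 407,412)

split_by_magnitude(two)  # TRUE
split_by_magnitude(one)  # FALSE
```

The factor is the part that would need to be chosen for the data at
hand; 10 is only the "order of magnitude" reading of the rule.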

For your examples above, I naively tried some functions in the mclust 
package, which I've never used before:

mclustModel(one, (mclustBIC(one, G=1:2)))$G # gives 1
mclustModel(two, (mclustBIC(two, G=1:2)))$G # gives 2

You'll have to decide for yourself to determine if this is appropriate 
for your data...or if I'm even using these functions correctly. :)

Achim Zeileis

2010-May-05 21:52 UTC

On Wed, 5 May 2010, Ralf B wrote:
> Are there R packages that allow for dynamic clustering, i.e. where the
> number of clusters are not predefined?
Yes.
> I have a list of numbers that
> falls in either 2 or just 1 cluster. Here an example of one that
> should be clustered into two clusters:
>
> two <- c(1,2,3,2,3,1,2,3,400,300,400)
>
> and here one that only contains one cluster and would therefore not
> need to be clustered at all.
>
> one <- c(400,402,405, 401,410,415, 407,412)
>
> Given a sufficiently large amount of data, a statistical test or an
> effect size should be able to determined if a data set makes sense to
> be divided i.e. if there are two groups that differ well enough. I am
> not familiar with the underlying techniques in kmeans, but I know that
> it blindly divides both data sets based on the predefined number of
> clusters. Are there any more sophisticated methods that allow me to
> determine the number of clusters in a data set based on statistical
> tests or effect sizes ?
There are loads of techniques, e.g., cluster indices, or information 
criteria, etc.

Inference is more difficult but there are also certain tools available.

In any case, there is a multitude of methods and many of them are 
discussed in standard textbooks about clustering and/or multivariate 
analysis etc.
> Is it possible that this is not a clustering problem but a
> classification problem?
That depends on the terminology. "Clustering" is rather unambiguous
while "classification" can have different meanings.

   - In statistical learning, for example, one often distinguishes between
     "supervised" learning (a response variable is modeled using certain
     explanatory variables) versus "unsupervised" learning (there is no
     response). In this terminology: clustering would be unsupervised
     learning (i.e., what you are trying to do). Supervised learning would
     encompass "regression" (numeric response) and "classification"
     (categorical response).

   - In other statistical communities "classification" is used as a term
     that encompasses "clustering". For example, Gordon's textbook
     (see ?hclust) is called "Classification".

So in the latter terminology the answer to your question is: Yes, it is 
classification (= clustering).

In the former terminology the answer is: No, it's unsupervised learning
(= clustering), not supervised learning (= regression/classification).

Best,
Z

Nordlund, Dan (DSHS/RDA)

2010-May-05 21:52 UTC

Ralf,

There is no procedure in R or any other stat package that can make these kinds
of decisions without a whole lot more specification of the problem.  You give
two examples above.  What would you want done with

c(380, 400, 402, 405, 401, 410, 415, 407, 412), or
c(350, 400, 402, 405, 401, 410, 415, 407, 412), or
c(300, 400, 402, 405, 401, 410, 415, 407, 412), or
c(100, 400, 402, 405, 401, 410, 415, 407, 412), or
...

i.e. what difference counts as big enough or variable enough or ...? 
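
To make the question concrete, here is a small base R sketch that
computes one possible separation measure for each of the four vectors
above. The choice of measure (the largest gap between neighbouring
sorted values) and the variable names are assumptions for illustration
only; the point is that the number grows gradually, with no natural
place to draw the line.

```r
# The four example vectors share the same tail and differ only in the
# first value.
tail_values  <- c(400, 402, 405, 401, 410, 415, 407, 412)
first_values <- c(380, 350, 300, 100)

# Largest gap between neighbouring sorted values: one assumed measure of
# how far the first value sits from the rest.
largest_gap <- sapply(first_values,
                      function(v) max(diff(sort(c(v, tail_values)))))
names(largest_gap) <- first_values
largest_gap
# 380 350 300 100
#  20  50 100 300
```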

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204

Ralf B

2010-May-06 15:28 UTC

The problem here is that the distances between the two cases change
from one set to the next, and I have 100 such sets. I guess there is no
better solution than deriving an empirical cutoff from a training set,
is there?

Ralf

On Wed, May 5, 2010 at 6:04 PM, Phil Spector <spector at stat.berkeley.edu> wrote:
> Ralf -
>    I think you're making things more complicated than they
> need to be.  All clustering methods are based on the distances
> between observations.  If the observations are all close
> together, the distances between them won't be very large.
> If some are farther away than others, then the distances will
> be larger.   The first case would suggest just one cluster,
> while the second case would suggest more than one.  For your
> example:
>
> > two <- c(1,2,3,2,3,1,2,3,400,300,400)
> > one <- c(400,402,405, 401,410,415, 407,412)
> > max(dist(one))
> [1] 15
> > max(dist(two))
> [1] 399
>
> A little experimentation should provide you with a cutoff
> that should reliably tell you whether there are 1 or 2 clusters in your
> data.
>
>                     - Phil Spector
>                       Statistical Computing Facility
>                       Department of Statistics
>                       UC Berkeley
>                       spector at stat.berkeley.edu
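
The cutoff idea in the quoted message can be sketched in base R as
follows. The function name has_two_clusters and the default cutoff of
50 are hypothetical; 50 is an arbitrary placeholder that happens to
separate the two example vectors, not a recommended value.

```r
# Hypothetical helper: flag a set as having more than one cluster when
# its largest pairwise distance exceeds `cutoff`.
has_two_clusters <- function(x, cutoff = 50) {
  max(dist(x)) > cutoff
}

two <- c(1,2,3,2,3,1,2,3,400,300,400)
one <- c(400,402,405, 401,410,415, 407,412)

has_two_clusters(two)  # TRUE  (largest distance is 399)
has_two_clusters(one)  # FALSE (largest distance is 15)
```

If the scale changes from set to set, a fixed cutoff like this would
have to be tuned on sets whose answer is already known, which is the
training-set approach mentioned at the top of this message.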

Greg Snow

2010-May-06 19:28 UTC

You could do a hierarchical clustering, then look at the height of the last
merge relative to the other heights. For your data:
> tmp <- hclust( dist( c(1,2,3,2,3,1,2,3,400,300,400) ) )
> tmp2 <- hclust( dist( c(400,402,405, 401,410,415, 407,412) ) )
> tmp$height
 [1]   0   0   0   0   0   0   1   2 100 399
> tmp2$height
[1]  1  2  2  2  5  7 15

You still need to make some assumptions and come up with a method for choosing a
cutoff, but this may help get you started.
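
One way this comparison might be turned into a single number, as a
rough sketch only: divide the final merge height by the mean of all
earlier merge heights. The function name last_merge_ratio and the
choice of the mean as the reference are assumptions, not part of
hclust, and any threshold applied to the ratio still has to be chosen
by hand.

```r
# Hypothetical helper: ratio of the final merge height to the mean of
# all earlier merge heights. Needs at least three observations, and
# returns Inf if every earlier merge happened at height 0.
last_merge_ratio <- function(x) {
  h <- hclust(dist(x))$height   # complete linkage is the default
  n <- length(h)
  h[n] / mean(h[-n])
}

two <- c(1,2,3,2,3,1,2,3,400,300,400)
one <- c(400,402,405, 401,410,415, 407,412)

last_merge_ratio(two)  # about 34.9
last_merge_ratio(one)  # about 4.7

# If the ratio is judged large enough, cut the tree into two groups.
groups <- cutree(hclust(dist(two)), k = 2)
table(groups)          # 8 observations in group 1, 3 in group 2
```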

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111