Are there R packages that allow for dynamic clustering, i.e. where the number of clusters is not predefined? I have lists of numbers, each of which falls into either two clusters or just one. Here is an example of one that should be split into two clusters:

two <- c(1,2,3,2,3,1,2,3,400,300,400)

and here is one that contains only one cluster and therefore does not need to be clustered at all:

one <- c(400,402,405, 401,410,415, 407,412)

Given a sufficiently large amount of data, a statistical test or an effect size should be able to determine whether it makes sense to divide a data set, i.e. whether there are two groups that differ enough. I am not familiar with the underlying techniques in kmeans, but I know that it blindly divides both data sets based on the predefined number of clusters. Are there any more sophisticated methods that allow me to determine the number of clusters in a data set based on statistical tests or effect sizes?

Is it possible that this is not a clustering problem but a classification problem?

Ralf
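To illustrate the remark about kmeans, here is a minimal sketch in base R. It only shows that kmeans() returns as many groups as it is asked for; the seed value is an arbitrary assumption made so the run is repeatable.

```r
# The single-cluster example from the question.
one <- c(400, 402, 405, 401, 410, 415, 407, 412)

set.seed(1)                     # kmeans() picks random starting centers
km <- kmeans(one, centers = 2)  # ask for two clusters

# Two non-empty groups come back, although the values look like one cluster.
table(km$cluster)
```

The point is not that kmeans is wrong, only that the decision "one cluster or two" has to be made outside of it.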
Hello,

Ralf B wrote:
> Are there R packages that allow for dynamic clustering, i.e. where the
> number of clusters is not predefined?
<<<snip>>>

Caveat: I have very little experience with clustering methods, but maybe this could get you started:

http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set

If you only want to make 2 clusters when the means of the data are an order of magnitude apart or more, that's easy enough to do without a statistical test.

For your examples above, I naively tried some functions in the mclust package, which I've never used before:

library(mclust)
mclustModel(one, mclustBIC(one, G=1:2))$G   # gives 1
mclustModel(two, mclustBIC(two, G=1:2))$G   # gives 2

You'll have to decide for yourself whether this is appropriate for your data... or whether I'm even using these functions correctly. :)
On Wed, 5 May 2010, Ralf B wrote:

> Are there R packages that allow for dynamic clustering, i.e. where the
> number of clusters is not predefined?

Yes.

<<<snip>>>
> Are there any more sophisticated methods that allow me to
> determine the number of clusters in a data set based on statistical
> tests or effect sizes?

There are loads of techniques, e.g., cluster indices or information criteria. Inference is more difficult, but there are also certain tools available. In any case, there is a multitude of methods, and many of them are discussed in standard textbooks about clustering and/or multivariate analysis.

> Is it possible that this is not a clustering problem but a
> classification problem?

That depends on the terminology. "Clustering" is rather unambiguous, while "classification" can have different meanings.

- In statistical learning, for example, one often distinguishes between "supervised" learning (a response variable is modeled using certain explanatory variables) and "unsupervised" learning (there is no response). In this terminology, clustering is unsupervised learning (i.e., what you are trying to do). Supervised learning encompasses "regression" (numeric response) and "classification" (categorical response).
- In other statistical communities, "classification" is used as a term that encompasses "clustering". For example, Gordon's textbook (see ?hclust) is called "Classification".

So in the latter terminology the answer to your question is: Yes, it is classification (= clustering). In the former terminology the answer is: No, it's unsupervised learning (= clustering), not supervised learning (= regression/classification).

Best,
Z
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Erik Iverson
> Sent: Wednesday, May 05, 2010 2:33 PM
> To: Ralf B
> Cc: r-help at r-project.org
> Subject: Re: [R] Dynamic clustering?
>
> Ralf B wrote:
> > Are there R packages that allow for dynamic clustering, i.e. where the
> > number of clusters is not predefined? I have a list of numbers that
> > falls in either 2 or just 1 cluster.
<<<snip>>>

Ralf,

There is no procedure in R or any other stat package that can make these kinds of decisions without a whole lot more specification of the problem. You give two examples above. What would you want done with

c(380, 400, 402, 405, 401, 410, 415, 407, 412), or
c(350, 400, 402, 405, 401, 410, 415, 407, 412), or
c(300, 400, 402, 405, 401, 410, 415, 407, 412), or
c(100, 400, 402, 405, 401, 410, 415, 407, 412), or ...

i.e. what difference counts as big enough or variable enough or ...?

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
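Dan's sequence of examples can be made concrete with a small base-R sketch. The helper largest_gap() is a hypothetical name introduced here, not something from the thread or from any package; it simply reports the biggest jump between neighbouring sorted values.

```r
# Hypothetical helper: largest jump between neighbouring sorted values.
largest_gap <- function(x) max(diff(sort(x)))

# The eight values shared by all of Dan's examples.
base <- c(400, 402, 405, 401, 410, 415, 407, 412)

# His four candidate sets differ only in the extra leading value.
candidates <- list(c(380, base), c(350, base), c(300, base), c(100, base))

gaps <- sapply(candidates, largest_gap)
gaps   # 20 50 100 300
```

The gap grows steadily from 20 to 300 with no natural break, which is the point of the question: some threshold has to be supplied by the analyst, since the data alone do not provide one.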
The problem here is that the distances between the two cases change from one set to the next, and I have 100 such sets. I guess there is no better solution than deriving an empirical cutoff from a training set, is there?

Ralf

On Wed, May 5, 2010 at 6:04 PM, Phil Spector <spector at stat.berkeley.edu> wrote:
> Ralf -
>    I think you're making things more complicated than they
> need to be.  All clustering methods are based on the distances
> between observations.  If the observations are all close
> together, the distances between them won't be very large.
> If some are farther away than others, then the distances will
> be larger.  The first case would suggest just one cluster,
> while the second case would suggest more than one.  For your
> example:
>
> > two <- c(1,2,3,2,3,1,2,3,400,300,400)
> > one <- c(400,402,405, 401,410,415, 407,412)
> > max(dist(one))
> [1] 15
> > max(dist(two))
> [1] 399
>
> A little experimentation should provide you with a cutoff
> that should reliably tell you whether there are 1 or 2 clusters in your
> data.
>
>                                        - Phil Spector
>                                          Statistical Computing Facility
>                                          Department of Statistics
>                                          UC Berkeley
>                                          spector at stat.berkeley.edu
>
> On Wed, 5 May 2010, Ralf B wrote:
>
>> Are there R packages that allow for dynamic clustering, i.e. where the
>> number of clusters is not predefined?
<<<snip>>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
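Phil's max(dist()) suggestion, applied to several sets at once, might look like the sketch below. The cutoff of 100 and the names spread(), sets and n_clusters are assumptions made for illustration; a usable cutoff would have to be tuned on sets whose cluster count is already known, as discussed above.

```r
# Example vectors from the thread.
two <- c(1, 2, 3, 2, 3, 1, 2, 3, 400, 300, 400)
one <- c(400, 402, 405, 401, 410, 415, 407, 412)

# Largest pairwise distance within a set (for 1-D data this is the range).
spread <- function(x) max(dist(x))

# Hypothetical cutoff, chosen only because it separates these two examples.
cutoff <- 100

# Apply the same rule to every set in a named list.
sets <- list(one = one, two = two)
n_clusters <- sapply(sets, function(x) if (spread(x) > cutoff) 2L else 1L)
n_clusters
```

Because the scale differs between sets, an absolute cutoff like this may not carry over from one set to another; replacing spread() with a relative statistic is one possible variation, but that choice is again an assumption about the data.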
You could do a hierarchical clustering, then look at the height of the last merge relative to the other heights. For your data:

> tmp <- hclust( dist( c(1,2,3,2,3,1,2,3,400,300,400) ) )
> tmp2 <- hclust( dist( c(400,402,405, 401,410,415, 407,412) ) )
> tmp$height
[1] 0 0 0 0 0 0 1 2 100 399
> tmp2$height
[1] 1 2 2 2 5 7 15

You still need to make some assumptions and come up with a method for choosing a cutoff, but this may help get you started.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Ralf B
> Sent: Wednesday, May 05, 2010 3:18 PM
> To: r-help at r-project.org
> Subject: [R] Dynamic clustering?
>
> Are there R packages that allow for dynamic clustering, i.e. where the
> number of clusters is not predefined?
<<<snip>>>
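One way to turn the merge heights into a decision rule is sketched below. The ratio statistic, the name last_merge_ratio() and the cutoff of 3 are all assumptions for illustration; the thread only suggests comparing the last height with the others and leaves the method open.

```r
# Example vectors from the thread.
two <- c(1, 2, 3, 2, 3, 1, 2, 3, 400, 300, 400)
one <- c(400, 402, 405, 401, 410, 415, 407, 412)

# Ratio of the final merge height to the one before it
# (complete linkage, the hclust() default).
last_merge_ratio <- function(x) {
  h <- hclust(dist(x))$height
  n <- length(h)
  h[n] / h[n - 1]   # Inf when the previous merge height is 0
}

ratio_cutoff <- 3   # hypothetical cutoff, not derived from any theory

ratio_two <- last_merge_ratio(two)   # 399 / 100
ratio_one <- last_merge_ratio(one)   # 15 / 7

# When the ratio exceeds the cutoff, cutree() gives the two-group split.
groups_two <- cutree(hclust(dist(two)), k = 2)
groups_two
```

The margin between the two ratios is narrow here because the value 300 sits apart from the two 400s, so a cutoff chosen on these two examples alone should not be expected to generalise.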