Hello all,

I am using the clustering functions in R to work with large volumes of binary time series data, but they do not seem able to handle a practical problem of this size.

'hclust' is attractive because I do not want to make assumptions about the number of clusters present, but given my computational resources and time it is not fast enough for data of this size. k-means works fine if the number of clusters in the data is assumed, which is not realistic here. The silhouette functions for 'pam' and 'clara' (in package 'cluster', if I remember correctly) performed poorly in very thorough experiments on generated data with known clusters.

That leaves me with either theoretical abstractions, such as pruning hclust trees with minimal spanning trees, or hand-rolling a hierarchical k-medoids that works efficiently and makes no assumption about the number of clusters.

Does anybody have suggestions for libraries I have missed, or suggestions in general? Note: this is not a question for 'Bigkmeans' unless a 'findbigkmeansnumberofclusters' function also exists.

Thank you in advance for your assistance,
Ken
Try the flow cytometry clustering functions in Bioconductor.

-thomas

On Thu, Aug 11, 2011 at 7:07 AM, Ken Hutchison <vicvoncastle at gmail.com> wrote:
> Hello all,
> I am using the clustering functions in R in order to work with large
> masses of binary time series data, however the clustering functions do not
> seem able to fit this size of practical problem.
> [...]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Thomas Lumley
Professor of Biostatistics
University of Auckland
On Wed, Aug 10, 2011 at 12:07 PM, Ken Hutchison <vicvoncastle at gmail.com> wrote:
> Hello all,
> I am using the clustering functions in R in order to work with large
> masses of binary time series data, however the clustering functions do not
> seem able to fit this size of practical problem. Library 'hclust' is good
> (though it may be sub par for this size of problem, thus doubly poor for
> this application) in that I do not want to make assumptions about the number
> of clusters present, also due to computational resources and time hclust is
> not functionally good enough;

How big is your problem? If your distance (dissimilarity) matrix fits in the memory of your machine, the packages flashClust and fastcluster provide much faster implementations of hierarchical clustering than the stock R function hclust.

Peter
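Whether the dissimilarity matrix fits in memory can be estimated with a little arithmetic before any clustering is attempted. The following is a minimal sketch and not part of the original thread: the helper `dist_gib`, the simulated data, and the choice of `method = "binary"` with average linkage are illustrative assumptions. It uses `fastcluster::hclust` when that package is installed and falls back to `stats::hclust` otherwise.

```r
# Hypothetical helper: approximate size in GiB of a "dist" object for n
# observations (n * (n - 1) / 2 doubles at 8 bytes each).
dist_gib <- function(n) n * (n - 1) / 2 * 8 / 2^30

# Simulated stand-in for binary time series: 200 series, 50 time points,
# two groups with different probabilities of observing a 1.
set.seed(1)
n_series <- 200
n_time <- 50
p <- rep(c(0.2, 0.8), each = n_series / 2)
x <- matrix(rbinom(n_series * n_time, size = 1, prob = p), nrow = n_series)

# Asymmetric binary (Jaccard-type) dissimilarity from base R.
d <- dist(x, method = "binary")

# Use the faster drop-in replacement when it is available.
hc_fun <- if (requireNamespace("fastcluster", quietly = TRUE)) {
  fastcluster::hclust
} else {
  stats::hclust
}
tree <- hc_fun(d, method = "average")

# Memory needed for the full dissimilarity at larger problem sizes.
sizes <- c(1e3, 1e4, 1e5)
print(data.frame(n = sizes, GiB = round(dist_gib(sizes), 2)))
```

By this estimate, 100,000 series would need about 37 GiB for the dissimilarities alone, so a faster hclust implementation helps only up to the point where the distance object itself still fits in memory.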
There are a number of methods in the literature for deciding the number of clusters for k-means. Probably the most popular one is the Calinski and Harabasz index, implemented as calinhara in package fpc. A distance-based version (along with several other indexes for the same purpose) is available in the function cluster.stats in the same package.

Christian

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
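As a rough illustration of the Calinski and Harabasz index, it can be computed directly from the sums of squares that kmeans returns: CH(k) = [B / (k - 1)] / [W / (n - k)], where B is the between-cluster and W the total within-cluster sum of squares. The sketch below runs on simulated data; the helper `ch_index`, the data, and the range 2:6 are hypothetical choices made for the example, not something from the thread. The calinhara function in fpc is the packaged implementation of this index and is probably the better choice in practice.

```r
# Hypothetical helper: Calinski-Harabasz index from a kmeans fit on n points.
ch_index <- function(fit, n) {
  k <- length(fit$size)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

# Simulated data: three well-separated groups of 100 points in two dimensions.
set.seed(42)
centers <- rbind(c(0, 0), c(4, 4), c(0, 4))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(100, centers[i, 1], 0.3), rnorm(100, centers[i, 2], 0.3))
}))

# Fit k-means for each candidate k and keep the index value.
ks <- 2:6
ch <- sapply(ks, function(k) {
  ch_index(kmeans(x, centers = k, nstart = 20), nrow(x))
})

# The candidate with the largest index is taken as the number of clusters.
best_k <- ks[which.max(ch)]
print(data.frame(k = ks, CH = round(ch, 1)))
cat("k with the largest index:", best_k, "\n")
```

Because the index needs only the cluster sums of squares, this variant avoids building a full distance matrix, which may matter for large data sets.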
PS to my previous posting: Also have a look at kmeansruns in fpc. It runs kmeans for several numbers of clusters and chooses the number of clusters by either the Calinski and Harabasz index or the average silhouette width.

Christian
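A call to kmeansruns might look like the sketch below. It assumes the fpc package is installed and uses simulated data; the range 2:6 and `runs = 10` are arbitrary values picked for the example. The `criterion` argument switches between the Calinski and Harabasz index ("ch") and the average silhouette width ("asw"), and the chosen number of clusters is returned in the `bestk` component.

```r
# Simulated data: two well-separated groups of 100 points in two dimensions.
set.seed(7)
x <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(200, mean = 4, sd = 0.3), ncol = 2)
)

if (requireNamespace("fpc", quietly = TRUE)) {
  # Same candidate range, two different selection criteria.
  fit_ch <- fpc::kmeansruns(x, krange = 2:6, criterion = "ch", runs = 10)
  fit_asw <- fpc::kmeansruns(x, krange = 2:6, criterion = "asw", runs = 10)
  cat("Calinski-Harabasz choice:", fit_ch$bestk, "\n")
  cat("Average silhouette width choice:", fit_asw$bestk, "\n")
} else {
  message("Package 'fpc' is not installed; skipping the kmeansruns example.")
}
```

The silhouette criterion depends on pairwise distances between observations, so for very large data sets the Calinski and Harabasz criterion is likely the more practical of the two.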