thr3ads.net - R help - [R] Clustering Large Applications..sort of [Aug 2011]


Ken Hutchison

2011-Aug-10 19:07 UTC

[R] Clustering Large Applications..sort of

Hello all,
   I am using the clustering functions in R to work with large volumes of
binary time series data, but they do not seem able to handle a practical
problem of this size. hclust is good in that I do not have to make
assumptions about the number of clusters present, but given my computational
resources and time it is not fast enough (and it may be sub par for this size
of problem anyway, thus doubly poor for this application). k-means works fine
provided the number of clusters in the data is known, which is not realistic
here. The silhouette functions for 'pam' and 'clara' in (if I remember
correctly) package 'cluster' performed really badly in very thorough
experiments on generated data with known clusters. I am left, then, with
either theoretical abstractions such as pruning hclust trees with minimal
spanning trees, or perhaps hand-rolling a hierarchical k-medoids that works
efficiently and without assumptions about the number of clusters. Does
anybody have suggestions for libraries I have missed, or suggestions in
general? Note: this is not a question for 'Bigkmeans' unless there also
exists a 'findbigkmeansnumberofclusters' function.

Thank you in advance for your assistance,
Ken


Thomas Lumley

2011-Aug-10 20:51 UTC


Try the flow cytometry clustering functions in Bioconductor.

     -thomas



-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland

Peter Langfelder

2011-Aug-10 21:18 UTC


On Wed, Aug 10, 2011 at 12:07 PM, Ken Hutchison <vicvoncastle at gmail.com> wrote:
> Hello all,
>   I am using the clustering functions in R in order to work with large
> masses of binary time series data, however the clustering functions do not
> seem able to fit this size of practical problem. Library 'hclust' is good
> (though it may be sub par for this size of problem, thus doubly poor for
> this application) in that I do not want to make assumptions about the number
> of clusters present, also due to computational resources and time hclust is
> not functionally good enough;

How big is your problem? If your distance (dissimilarity) matrix fits in
the memory of your machine, the packages flashClust and fastcluster provide
much faster implementations of hierarchical clustering than the stock
R function hclust.
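
A minimal sketch of that suggestion, under stated assumptions: the data are
simulated, the base R "binary" distance is used only as a plausible choice
for 0/1 series, and "average" linkage is an arbitrary pick. Neither choice
comes from the thread. fastcluster::hclust is used as a drop-in replacement
when the package is installed, with stats::hclust as the fallback. The helper
dist_bytes is hypothetical and only illustrates the "does it fit in memory"
check.

```r
# Simulated stand-in for binary time series: 200 series x 50 time points,
# drawn from two groups with different probabilities of a 1 (assumption).
set.seed(42)
n_per_group <- 100
n_time <- 50
g1 <- matrix(rbinom(n_per_group * n_time, 1, 0.3), nrow = n_per_group)
g2 <- matrix(rbinom(n_per_group * n_time, 1, 0.7), nrow = n_per_group)
x <- rbind(g1, g2)

# Base R "binary" distance: among positions where at least one series is 1,
# the proportion where only one of them is 1.
d <- dist(x, method = "binary")

# Use the faster drop-in implementation if available, else the stock one.
hclust_fun <- if (requireNamespace("fastcluster", quietly = TRUE)) {
  fastcluster::hclust
} else {
  stats::hclust
}
hc <- hclust_fun(d, method = "average")

# Hypothetical helper: approximate size in bytes of a dist object for n rows
# (n * (n - 1) / 2 doubles at 8 bytes each, ignoring attributes).
dist_bytes <- function(n) n * (n - 1) / 2 * 8
dist_bytes(nrow(x))   # this toy problem
dist_bytes(1e5)       # about 40 GB for 100,000 series
```

The quadratic growth of the distance object is usually the limiting factor,
so it is worth running the size check before calling dist at all.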

Peter

Christian Hennig

2011-Aug-10 23:13 UTC


There are a number of methods in the literature for deciding the number of
clusters for k-means. Probably the most popular one is the Calinski and
Harabasz index, implemented as calinhara in package fpc. A distance-based
version (and several other indexes for this purpose) is available in the
function cluster.stats in the same package.
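
To make the idea concrete, here is a rough sketch that computes the Calinski
and Harabasz index by hand from base R kmeans output and picks the k that
maximises it. The simulated data, the range 2:8 and the helper name ch_index
are assumptions for illustration; fpc::calinhara is the packaged version
mentioned above.

```r
# Simulated data with three well-separated groups (assumption, for illustration).
set.seed(1)
x <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(200, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(200, mean = 8, sd = 0.5), ncol = 2)
)

# Calinski-Harabasz index for a kmeans fit on n observations:
# (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k)).
ch_index <- function(fit, n) {
  k <- length(fit$size)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

# Fit k-means over a range of k and keep the index for each.
ks <- 2:8
ch <- sapply(ks, function(k) {
  fit <- kmeans(x, centers = k, nstart = 20)
  ch_index(fit, nrow(x))
})

# The candidate number of clusters is the k with the largest index.
best_k <- ks[which.max(ch)]
print(data.frame(k = ks, ch = ch))
print(best_k)
```

The index is undefined for k = 1, so this approach cannot tell you that the
data contain no cluster structure at all.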

Christian

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche

Christian Hennig

2011-Aug-10 23:15 UTC


PS to my previous posting: also have a look at kmeansruns in fpc. It runs
kmeans for several numbers of clusters and chooses the number of clusters
by either the Calinski and Harabasz index or the average silhouette width.
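
A short usage sketch, assuming fpc is installed. The simulated data, the
range 2:6 and runs = 10 are illustrative choices, not recommendations from
this thread. The call is guarded so the script still runs without the
package, in which case best_k stays NA.

```r
# Simulated data with three well-separated groups (assumption, for illustration).
set.seed(1)
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2)
)

best_k <- NA_integer_
if (requireNamespace("fpc", quietly = TRUE)) {
  # criterion = "asw" selects k by average silhouette width;
  # criterion = "ch" would use the Calinski and Harabasz index instead.
  fit <- fpc::kmeansruns(x, krange = 2:6, criterion = "asw", runs = 10)
  best_k <- fit$bestk   # chosen number of clusters
  print(fit$crit)       # criterion value for each candidate k
} else {
  message("Package 'fpc' is not installed; skipping kmeansruns.")
}
print(best_k)
```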

Christian

