Hi there,
I have noticed that some of the clustering methods in R are not
suitable for large data sets. Below is a list I put together of which
methods are appropriate for large data sets and which are not. Could
you please take a look and check whether it is right? I need this
information to decide which methods to use.
Thank you!
P.S.: List:
Appropriate for large data sets:
clara: k-medoids (PAM) applied to subsamples of the data
mclust: fits mixtures of Gaussians using the EM algorithm
clue: implements ensemble methods for both hierarchical and
partitioning cluster methods.
cmeans: Fuzzy clustering
bclust: bagged clustering
hopach: a hybrid between hierarchical methods and PAM that builds a
tree by recursively partitioning a data set.
som: self-organizing maps
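
For reference, this is roughly how I would call clara on a large data
set. It is only a minimal sketch: the synthetic data, k = 2, and the
samples/sampsize values are my own assumptions for illustration, not
recommended settings.

```r
library(cluster)  # recommended package shipped with standard R installations

set.seed(1)
# Synthetic data (assumed for illustration): 20,000 points in two groups
n <- 20000
x <- rbind(matrix(rnorm(n, mean = 0), ncol = 2),
           matrix(rnorm(n, mean = 5), ncol = 2))

# clara() runs PAM on `samples` random subsamples of size `sampsize`,
# so it never builds the full n x n dissimilarity matrix
fit <- clara(x, k = 2, samples = 10, sampsize = 200)

# Cluster sizes over all observations
table(fit$clustering)
```

Larger samples/sampsize values should give more stable medoids at the
cost of more computation.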
Not appropriate for large data sets:
(a) Hierarchical clustering: not appropriate for large data sets
because both execution time and storage space grow quadratically with
the number of observations.
(b) pam: implements partitioning around medoids and can work with
arbitrary distances.
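
To make the quadratic storage cost concrete, here is a back-of-the-envelope
sketch of how much memory the pairwise distance matrix alone would need.
The helper name dist_memory_gb is hypothetical (my own), and it assumes
the distances are stored as a lower triangle of 8-byte doubles, as a
dist object does.

```r
# Approximate memory (in GB) for the lower-triangle distance matrix of n
# observations: n * (n - 1) / 2 doubles at 8 bytes each
dist_memory_gb <- function(n) {
  n * (n - 1) / 2 * 8 / 1024^3
}

# Compare a few data set sizes
sizes <- c(1e3, 1e4, 1e5, 1e6)
data.frame(n = sizes, dist_gb = round(dist_memory_gb(sizes), 3))
```

With 100,000 observations this already comes to about 37 GB, which is
why I put the methods that need all pairwise distances in this group.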