Basic question:
Is it correct to assume that when using cutree to set the number of
clusters (say k=4), cutree determines the clusters by the largest
distances among all potential clusters?
I've read the R help for cutree and am using it to define the number
of groups to obtain Dunn Index scores (using the clValid library) for
cluster analysis (using Euclidean distance and Ward's method).
More specific (if helpful):
I understand that cutree is used to set the number of clusters on
which the Dunn Index will base its score. But the R help doesn't
explain how the groups are determined. Prior to measuring the Dunn
Index, the cluster hierarchy formed using Euclidean distance and
Ward's method provides a certain number of connected pairs of samples.
For example:
Say at the 1st iteration (hierarchy level 1), my n=68 samples are
connected into k=32 groups. The next iteration connects these 32 into
k=16 groups (hierarchy level 2). 3rd iteration = 8; 4th iteration = 4,
and 5th iteration = 2. The distances from one hierarchy level to the
next will differ for each group.
Is it correct to assume that I could cut the tree into anywhere from
k=2 to k=32+16+8+4+2=62 groups? That is, cutree(data,k=2) through
cutree(data,k=62) is valid, whereas anything outside those values is
not?
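One way I could check the valid range of k myself is something like
the sketch below. The data are random stand-ins for my 68 samples (not
my real data), the variable names are made up for illustration, and
method = "ward.D2" is an assumption about which Ward variant applies.

```r
# Made-up data standing in for the 68 samples (NOT the real data)
set.seed(1)
x <- matrix(rnorm(68 * 5), nrow = 68)

d <- dist(x, method = "euclidean")
# "ward.D2" is an assumption; the variant actually used may differ
tree <- hclust(d, method = "ward.D2")

# hclust stores one row per merge step in tree$merge
n_merges <- nrow(tree$merge)

# Ask cutree for every k from 1 to 68; each column is one labelling
all_cuts <- cutree(tree, k = 1:68)
groups_per_k <- apply(all_cuts, 2, function(g) length(unique(g)))

# See whether a k larger than the number of samples is rejected
k_too_big <- tryCatch(cutree(tree, k = 69),
                      error = function(e) conditionMessage(e))

print(n_merges)
print(range(groups_per_k))
print(k_too_big)
```

Comparing n_merges and groups_per_k against the 32/16/8/4/2 picture
above should show whether my assumed range of k=2 to k=62 holds.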
Now say I use cutree(data,k=3) to define 3 clusters. Will cutree look
back at the cluster tree created by Ward's method and then take the 3
largest distance values from among these 62 potential groups, so that
when I use the Dunn Index, those will be the only distances
considered?
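To make the k=3 case concrete, here is a rough sketch of how I might
compare a cut by k with a cut by height. Again the data are random
stand-ins, "ward.D2" is an assumption, and names such as h_cut are
just for illustration. The Dunn Index step only runs if clValid is
installed.

```r
# Made-up data standing in for the 68 samples (NOT the real data)
set.seed(1)
x <- matrix(rnorm(68 * 5), nrow = 68)
n <- nrow(x)

d <- dist(x, method = "euclidean")
tree <- hclust(d, method = "ward.D2")  # Ward variant is an assumption

# Cut by number of groups
groups_k3 <- cutree(tree, k = 3)

# tree$height holds the merge distances, one per merge step.
# Pick a height halfway between the merge that leaves 3 groups
# and the merge that leaves 2 groups, then cut there instead.
h_cut <- (tree$height[n - 3] + tree$height[n - 2]) / 2
groups_h <- cutree(tree, h = h_cut)

# Do the two ways of cutting give the same partition?
same_partition <- identical(groups_k3, groups_h)
print(same_partition)
print(table(groups_k3))

# Dunn Index for the 3-group cut, if clValid is available
if (requireNamespace("clValid", quietly = TRUE)) {
  dunn_k3 <- clValid::dunn(distance = d, clusters = groups_k3)
  print(dunn_k3)
}
```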
I can post code and/or data if helpful.
Thanks,
kbrownk