Hi David,
The topic you flagged is unusual, to say the least, compared with what I have
read in the coverage of k-means.
I have been using k-means for many years and have never come across this before
(maybe out of ignorance, or from not keeping abreast of all the issues associated
with this algorithm). Using Google, I came across this journal article title
and am assuming this is what sphericity refers to in this case:
"K-means cluster analysis is known for its tendency to produce spherical
and equally sized clusters". Sphericity was used in this context. This
has not been my experience at all. I assume here that they mean that individual
cases are spread equally around the centroid/exemplar that forms the
'centre' point of the cluster.
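For what it is worth, here is a rough sketch of how I read that quoted claim. The toy data and function names are made up for illustration and are not taken from the article. K-means minimises the within-cluster sum of squared Euclidean distances, which weights every direction equally. On two long, thin strips, that criterion scores a compact left/right split better than a split along the strips themselves.

```python
# Toy data (an assumption for illustration): two horizontal strips,
# long in x (-10..10) and only 1 unit apart in y.
strip_a = [(float(x), 0.0) for x in range(-10, 11)]
strip_b = [(float(x), 1.0) for x in range(-10, 11)]
points = strip_a + strip_b

def centroid(cluster):
    """Mean of each coordinate."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def within_ss(cluster):
    """Sum of squared Euclidean distances from the cluster's centroid."""
    cx, cy = centroid(cluster)
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in cluster)

def total_within_ss(clusters):
    """The quantity k-means tries to minimise for a given partition."""
    return sum(within_ss(c) for c in clusters)

# Partition 1: the two elongated strips (the "natural" groups here).
by_strip = [strip_a, strip_b]

# Partition 2: a compact left/right split that cuts across both strips.
left = [p for p in points if p[0] < 0]
right = [p for p in points if p[0] >= 0]
by_side = [left, right]

wss_by_strip = total_within_ss(by_strip)
wss_by_side = total_within_ss(by_side)
print("within-cluster SS, split by strip:", wss_by_strip)
print("within-cluster SS, split by side: ", wss_by_side)
```

The left/right split has the smaller within-cluster sum of squares, so the criterion favours it even though the strips are the more natural groups. Whether this matters on real data is a separate question.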
As k-means is a distance-based algorithm, I would have thought that the treatment
of outliers is very important: they can distort the positioning of cases within
a set of solutions and draw centroids into unusual locations. Assessing
violations of sphericity and then dealing with them, however, is an interesting
prospect to say the least.
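A small sketch of the outlier point, in plain Python with hypothetical numbers: because a k-means centroid is the arithmetic mean of its cases, a single distant case can drag it well away from the bulk of the cluster.

```python
import math

# Hypothetical toy cluster: four tightly grouped cases plus one outlier.
cases = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0)]
outlier = (10.0, 10.0)

def centroid(points):
    """Arithmetic mean of each coordinate, which is what k-means uses."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

clean = centroid(cases)
contaminated = centroid(cases + [outlier])
shift = math.dist(clean, contaminated)  # Euclidean distance (Python 3.8+)

print("centroid without outlier:", clean)
print("centroid with outlier:   ", contaminated)
print("distance moved:          ", round(shift, 3))
```

Here the four clean cases all lie within about 0.2 units of their centroid, yet one outlier moves the centroid by more than 2 units.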
Thinking this through, the only way I have ever visualised clusters for
distributional patterns is in 2-dimensional space, and when doing so I have
never seen spherical and similarly sized solutions from k-means. Admittedly, I
use Clustan graphics, but I know that R also has a lot of great
algorithms/options available.
I would be more interested in having a rigorous process around the following
than assessing the sphericity of my solutions (order here does not imply
importance):
1. determining the optimal number of clusters (using transitional matrices for
2-5 solutions etc. to determine how the cases move between solutions and
split as you form sub-clusters)
2. multiple seeding implementations (randomise the seeding at the start) and
then using some method to assess the reproducibility of the solutions (global
versus local solutions) obtained from these multiple seeding points
3. assessing whether the algorithm has actually converged or has simply stopped
after a set number of iterations, prior to convergence
4. meaningfulness of solutions (a number of criteria can be applied here)
5. meaningfulness of input variables
6. a robust way to generate matrices to deal with different variable types
included in the matrix (e.g. nominal, ordinal, continuous)
7. ensuring that the variables driving the solution are not swamped by noisy
variables (e.g. a way to assess/downweight the influence of noisy variables)
8. treatment of missing data to ensure that bias is mitigated
9. treatment and exclusion of outliers that can pull the centroids into a less
meaningful relationship
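To make items 1 to 3 a little more concrete, here is a minimal sketch in plain Python. It is not how Clustan or R does it. The data, the function names (`kmeans`, `partition`, `multi_start`) and the choice of 10 seeds are all assumptions for illustration. It runs a basic Lloyd-style k-means from several random seedings, records whether each run converged or hit the iteration cap, counts how many runs reproduce the best solution, and cross-tabulates the 2- and 3-cluster solutions.

```python
import random
from collections import Counter

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed, max_iter=100):
    """Plain Lloyd's k-means. Returns (labels, wss, converged)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # item 2: random seeding from the cases
    labels, converged = None, False
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:           # item 3: no case changed cluster
            converged = True
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                    # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, wss, converged

def partition(labels):
    """Label-free form of a solution, so runs can be compared directly."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, set()).add(i)
    return frozenset(frozenset(g) for g in groups.values())

def multi_start(points, k, seeds):
    return [kmeans(points, k, s) for s in seeds]

# Made-up data: three Gaussian blobs of 20 cases each.
data_rng = random.Random(42)
centres = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
points = [(data_rng.gauss(cx, 1.0), data_rng.gauss(cy, 1.0))
          for cx, cy in centres for _ in range(20)]

# Item 2: several seedings; keep the lowest within-cluster SS and count agreement.
runs3 = multi_start(points, 3, range(10))
best3 = min(runs3, key=lambda r: r[1])
n_same = sum(partition(r[0]) == partition(best3[0]) for r in runs3)
n_converged = sum(r[2] for r in runs3)
print(f"k=3: best WSS {best3[1]:.1f}; {n_same}/10 runs reproduce it; "
      f"{n_converged}/10 converged")

# Item 3: a run cut off by the iteration cap is flagged, not silently accepted.
capped = kmeans(points, 3, seed=0, max_iter=1)
print("converged with max_iter=1:", capped[2])

# Item 1: cross-tabulate the best 2- and 3-cluster solutions to see how cases split.
best2 = min(multi_start(points, 2, range(10)), key=lambda r: r[1])
crosstab = Counter(zip(best2[0], best3[0]))
for (lab2, lab3), n in sorted(crosstab.items()):
    print(f"k=2 cluster {lab2} -> k=3 cluster {lab3}: {n} cases")
```

Runs that end with a higher within-cluster sum of squares than the best one are local solutions. If only a few seedings reproduce the best partition, I would treat the solution with some suspicion.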
Cheers Paul
> darteta001@ikasle.ehu.es wrote:
>
> Dear list, first, apologies, as this is not strictly an R question but
> a theoretical one.
>
> I have read that the use of k-means clustering assumes sphericity of the data
> distribution. Can anyone explain to me what this means? My statistical
> background is too poor. Is it another kind of distribution, like
> Gaussian or binomial? What happens if the distribution is not
> spherical? Could you give me an example or a link to information about
> this?
>
> Thanks for your help
>
> David