thr3ads.net - R help - [R] How a clustering algorithm in R can end up with negative silhouette values? [Feb 2016]

ABABAEI, Behnam

2016-Feb-19 19:55 UTC

[R] How a clustering algorithm in R can end up with negative silhouette values?

Hi Sarah,

Thank you for the response. But it is said in its description that after each
run (sample), each observation in the whole dataset is assigned to the closest
cluster. So how is it possible for one observation to be wrongly allocated, even
with clara?

Behnam

On Fri, Feb 19, 2016 at 11:48 AM -0800, "Sarah Goslee"
<sarah.goslee at gmail.com> wrote:

That means that points have been assigned to the wrong groups. This
may readily happen with a clustering method like cluster::clara() that
uses a subset of the data to cluster a dataset too large to analyze as
a unit. Negative silhouette numbers strongly suggest that your
clustering parameters should be changed.

Sarah

On Fri, Feb 19, 2016 at 6:33 AM, ABABAEI, Behnam
<Behnam.ABABAEI at limagrain.com> wrote:
> Hi,
>
>
> We know that clustering methods in R assign observations to the closest
> medoids. Hence, each observation is supposed to be in the closest cluster
> it can have. So, I wonder how it is possible to have negative silhouette
> values, when we are supposedly assigning each observation to the closest
> cluster and the formula in the silhouette method cannot go negative?
>
>
> Behnam.
>

Sarah Goslee

2016-Feb-19 19:58 UTC

[R] How a clustering algorithm in R can end up with negative silhouette values?

You need to think more carefully about the details of the clara() method.

The algorithm draws repeated samples of sampsize from the larger
dataset, as specified by the arguments to the function.
It clusters each sample in turn, and saves the best one.
It uses the medoids from the best one to assign all of the points to a cluster.

But because the clustering is based on a subsample, it may not be
representative of the dataset as a whole, and may not provide a good
clustering overall. Just because it clusters the subsample well,
doesn't mean it clusters the entirety. The Details section of the help
describes this, and the book references go into more detail.
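
The steps described above can be tried out directly. What follows is only a sketch on made-up data: the three overlapping Gaussian clouds and the choices of `samples` and `sampsize` are assumptions for illustration, not anything taken from the original question. It runs clara() on small subsamples, checks whether every observation ends up with its nearest medoid, and then counts how many silhouette widths over the full dataset are negative.

```r
library(cluster)

set.seed(1)
# Made-up data: three overlapping Gaussian clouds, 200 points each
x <- rbind(matrix(rnorm(400, mean = 0),   ncol = 2),
           matrix(rnorm(400, mean = 2.5), ncol = 2),
           matrix(rnorm(400, mean = 5),   ncol = 2))

# clara() picks medoids from small subsamples, then assigns every row
# of x to the nearest of the retained medoids
cl <- clara(x, k = 3, samples = 5, sampsize = 40)

# Check the premise: Euclidean distance from every point to every medoid,
# then compare the nearest medoid with the cluster clara() reports
d_med <- sapply(seq_len(nrow(cl$medoids)), function(j)
  sqrt(rowSums(sweep(x, 2, cl$medoids[j, ])^2)))
all(apply(d_med, 1, which.min) == cl$clustering)

# Silhouette widths over the full dataset, using all pairwise distances
sil <- silhouette(cl$clustering, dist(x))
summary(as.numeric(sil[, "sil_width"]))
sum(sil[, "sil_width"] < 0)   # number of observations with a negative width
```

The nearest-medoid check and the silhouette widths are computed from different quantities, so one can hold while the other still reports negative values.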

Sarah

Sarah Goslee

2016-Feb-19 20:22 UTC

[R] How a clustering algorithm in R can end up with negative silhouette values?

Ah, my guess about the confusion was wrong, then. You're
misunderstanding silhouette() instead.
From ?silhouette:
     Observations with a large s(i) (almost 1) are very well clustered,
     a small s(i) (around 0) means that the observation lies between
     two clusters, and observations with a negative s(i) are probably
     placed in the wrong cluster.


In more detail, they're looking at different things.
clara() assigns each point to a cluster based on the distance to the
nearest medoid.

silhouette() does something different: instead of comparing the
distances to the closest medoid and the next closest medoid, which is
what you seem to be assuming, silhouette() compares a(i), the mean
distance from a point to ALL other points assigned to its own cluster,
with b(i), the smallest mean distance to the points of any other
cluster. The distance to the medoid is irrelevant, except that the
medoid is one of the points in that cluster.

So a negative silhouette value is entirely possible, and means that
the clustering produced doesn't represent the dataset very well.
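
To make the difference concrete, here is a hedged sketch that computes the silhouette width from first principles and compares it with cluster::silhouette(). The helper name `sil_by_hand` is made up for this illustration; it assumes the usual definition s(i) = (b(i) - a(i)) / max(a(i), b(i)) and that no cluster is a singleton. Note that no medoid appears anywhere in the calculation.

```r
library(cluster)

# Silhouette width of observation i, computed by hand (hypothetical helper).
# d: full distance matrix, cl: vector of cluster labels.
sil_by_hand <- function(i, d, cl) {
  own <- cl == cl[i]
  own[i] <- FALSE                      # exclude the point itself
  a <- mean(d[i, own])                 # mean distance within its own cluster
  others <- setdiff(unique(cl), cl[i])
  # smallest mean distance to the members of any other cluster
  b <- min(sapply(others, function(k) mean(d[i, cl == k])))
  (b - a) / max(a, b)
}

data(ruspini)
fit <- pam(ruspini, k = 4)
d <- as.matrix(dist(ruspini))

by_hand <- sapply(seq_len(nrow(ruspini)), sil_by_hand,
                  d = d, cl = fit$clustering)

# Reference values in the original observation order
ref <- silhouette(fit$clustering, dist(ruspini))[, "sil_width"]
all.equal(unname(by_hand), as.numeric(ref))
```

The reference is built from `fit$clustering` rather than `silhouette(fit)` because the latter returns the rows sorted by cluster and width, not in the original observation order.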



On Fri, Feb 19, 2016 at 3:04 PM, ABABAEI, Behnam
<Behnam.ABABAEI at limagrain.com> wrote:
> Sarah,
> sorry for taking up your time.
>
> I totally agree with you about how it works. But please let's take a
> look at this part of the description:
>
> "Once k representative objects have been selected from the
> sub-dataset, each observation of the entire dataset is assigned to the
> nearest medoid. The mean (equivalent to the sum) of the dissimilarities
> of the observations to their closest medoid is used as a measure of the
> quality of the clustering. The sub-dataset for which the mean (or sum)
> is minimal, is retained. A further analysis is carried out on the final
> partition."
>
> It says each observation is finally assigned to the closest medoid. The
> whole clustering process may be imperfect in terms of isolation of
> clusters, but each observation is already assigned to the closest one
> and, according to the silhouette formula, the silhouette value cannot be
> negative, as a must always be less than b.
>
> Regards,
> Behnam.

David L Carlson

2016-Feb-21 16:37 UTC

[R] How a clustering algorithm in R can end up with negative silhouette values?

Each observation is assigned to the closest medoid, which is a single
observation. An observation that lies between two medoids will be assigned to
the closer one even if its distances to the members of the other cluster are
smaller on average (because the medoid of that cluster is slightly farther
away). If the clusters are not well separated, this can happen easily.
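
A tiny one-dimensional case shows the effect. This is only a sketch: the data and the two medoids are made up and chosen by hand, which mimics medoids taken from a subsample rather than the result of an actual pam() or clara() run.

```r
library(cluster)

x <- c(0, 1, 2, 3, 10, 14, 19, 20, 21)   # made-up 1-D data
medoids <- c(10, 20)                      # hypothetical medoids, chosen by hand

# Assign each point to its nearest medoid
d_med <- abs(outer(x, medoids, "-"))
cl <- apply(d_med, 1, which.min)

sil <- silhouette(cl, dist(x))
data.frame(x = x, cluster = cl,
           sil_width = round(as.numeric(sil[, "sil_width"]), 3))

# The point at 14 is nearer to medoid 10 (distance 4) than to medoid 20
# (distance 6), so it joins cluster 1. Its mean distance to the other
# members of cluster 1 is a = 10.8, while its mean distance to cluster 2
# is b = 6, giving s = (6 - 10.8) / 10.8, about -0.44.
```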

You could always change the cluster assignment vector to see what happens to the
silhouette plot. That will affect more than just the single observation since
silhouette values of all of the points in those two clusters will change
slightly (very slightly if there are lots of observations in those two
clusters).
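
The suggestion above can be sketched as follows, again on made-up data with a hand-written assignment vector (both are assumptions for illustration). Points with a negative width are moved to the neighbouring cluster that silhouette() reports, and the widths are then recomputed.

```r
library(cluster)

x  <- c(0, 1, 2, 3, 10, 14, 19, 20, 21)   # made-up 1-D data
cl <- rep(1:2, times = c(6, 3))           # hypothetical cluster assignment
d  <- dist(x)

sil_before <- silhouette(cl, d)
before <- as.numeric(sil_before[, "sil_width"])

# Move every point with a negative width to its neighbouring cluster
neg <- which(before < 0)
cl2 <- cl
cl2[neg] <- sil_before[neg, "neighbor"]

after <- as.numeric(silhouette(cl2, d)[, "sil_width"])
round(cbind(x, before, after), 3)
```

Only the point at 14 is moved here, but the widths of other points in both clusters shift as well, since their within-cluster and between-cluster means now include or exclude that point.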

-------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Martin Maechler

2016-Feb-22 15:48 UTC

[R] How a clustering algorithm in R can end up with negative silhouette values?

>>>>> Sarah Goslee <sarah.goslee at gmail.com>
>>>>>     on Fri, 19 Feb 2016 15:22:22 -0500 writes:
    > Ah, my guess about the confusion was wrong, then. You're
    > misunderstanding silhouette() instead.

    > From ?silhouette:

    >      Observations with a large s(i) (almost 1) are very
    > well clustered, a small s(i) (around 0) means that the
    > observation lies between two clusters, and observations
    > with a negative s(i) are probably placed in the wrong
    > cluster.


    > In more detail, they're looking at different things.
    > clara() assigns each point to a cluster based on the
    > distance to the nearest medoid.

    > silhouette() does something different: instead of
    > comparing the distances to the closest medoid and the next
    > closest medoid, which is what you seem to be assuming,
    > silhouette() compares a(i), the mean distance from a point
    > to ALL other points assigned to its own cluster, with b(i),
    > the smallest mean distance to the points of any other
    > cluster. The distance to the medoid is irrelevant, except
    > that the medoid is one of the points in that cluster.

    > So a negative silhouette value is entirely possible, and
    > means that the clustering produced doesn't represent the
    > dataset very well.

Indeed ... and this extends to pam(), even; as you say above,
 " silhouette() does something different " :

If you look at the plots of

    example(silhouette)

where the silhouettes of   pam(ruspini, k = k')  ,  k' = 2,..,6
are displayed, or if you directly look at

   plot( silhouette(pam(ruspini, k = 6)) )

you will notice that pam() itself can easily lead to negative
silhouette values.
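
For readers who prefer numbers to the plot, a short sketch along the same lines: it fits pam() on ruspini with k = 6, draws the silhouette plot, and prints the per-cluster average widths and the number of negative widths. The count is printed rather than assumed here.

```r
library(cluster)

data(ruspini)
fit <- pam(ruspini, k = 6)
sil <- silhouette(fit)

plot(sil)                           # the silhouette plot referred to above

summary(sil)$clus.avg.widths        # average width per cluster
sum(sil[, "sil_width"] < 0)         # how many observations have s(i) < 0
```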

Martin Maechler  [  == maintainer("cluster")  ]

    
