ABABAEI, Behnam
2016-Feb-19 19:55 UTC
[R] How a clustering algorithm in R can end up with negative silhouette values?
Hi Sarah,

Thank you for the response. But the description says that after each run (sample), each observation in the whole dataset is assigned to the closest cluster. So how is it possible for an observation to be wrongly allocated, even with clara()?

Behnam

On Fri, Feb 19, 2016 at 11:48 AM -0800, "Sarah Goslee" <sarah.goslee at gmail.com> wrote:
> That means that points have been assigned to the wrong groups. This may readily happen with a clustering method like cluster::clara() that uses a subset of the data to cluster a dataset too large to analyze as a unit. Negative silhouette numbers strongly suggest that your clustering parameters should be changed.
>
> Sarah
>
> On Fri, Feb 19, 2016 at 6:33 AM, ABABAEI, Behnam <Behnam.ABABAEI at limagrain.com> wrote:
>> Hi,
>>
>> We know that clustering methods in R assign observations to the closest medoids. Hence, each observation is supposed to end up in its closest cluster. So I wonder how it is possible to get negative silhouette values, when we are supposedly assigning each observation to the closest cluster and the silhouette formula cannot be negative?
>>
>> Behnam.
Sarah Goslee
2016-Feb-19 19:58 UTC
[R] How a clustering algorithm in R can end up with negative silhouette values?
You need to think more carefully about the details of the clara() method.

The algorithm draws repeated samples of size sampsize from the larger dataset, as specified by the arguments to the function. It clusters each sample in turn and saves the best one. It then uses the medoids from the best sample to assign all of the points to a cluster.

But because the clustering is based on a subsample, it may not be representative of the dataset as a whole, and may not provide a good clustering overall. Just because it clusters the subsample well doesn't mean it clusters the entire dataset well. The Details section of the help describes this, and the book reference goes into more detail.

Sarah
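The procedure described above can be tried out with a minimal sketch. The simulated data, the seed, and the values of k, samples and sampsize are illustrative assumptions, not taken from the thread; only clara(), silhouette() and dist() are real functions from the cluster and stats packages.

```r
library(cluster)

# Simulated data (an assumption for illustration): two overlapping 2-D groups,
# 200 rows each, 400 rows in total.
set.seed(1)
x <- rbind(matrix(rnorm(400, mean = 0), ncol = 2),
           matrix(rnorm(400, mean = 2), ncol = 2))

# clara() clusters 'samples' subsamples of size 'sampsize' and keeps the best one.
cl <- clara(x, k = 2, samples = 5, sampsize = 50)

cl$medoids            # medoids found on the best subsample
table(cl$clustering)  # every row of x, assigned to its nearest medoid

# Silhouette widths for the full dataset, computed from the full distance matrix.
sil <- silhouette(cl$clustering, dist(x))
summary(sil[, "sil_width"])
sum(sil[, "sil_width"] < 0)  # how many observations have a negative width
```

With groups that overlap like this, some widths may well come out negative even though every point sits with its nearest medoid; the count depends on the data and the subsamples drawn.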
Sarah Goslee
2016-Feb-19 20:22 UTC
[R] How a clustering algorithm in R can end up with negative silhouette values?
Ah, my guess about the confusion was wrong, then. You're misunderstanding silhouette() instead.

From ?silhouette:

    Observations with a large s(i) (almost 1) are very well clustered, a small s(i) (around 0) means that the observation lies between two clusters, and observations with a negative s(i) are probably placed in the wrong cluster.

In more detail, the two functions are looking at different things. clara() assigns each point to a cluster based on the distance to the nearest medoid.

silhouette() does something different: instead of comparing the distances to the closest medoid and the next closest medoid, which is what you seem to be assuming, silhouette() compares the mean distance to ALL other points assigned to the same cluster with the mean distance to all points in the nearest other cluster. The distance to the medoid is irrelevant, except that the medoid is one of the points in that cluster.

So a negative silhouette value is entirely possible, and means that the clustering produced doesn't represent the dataset very well.

On Fri, Feb 19, 2016 at 3:04 PM, ABABAEI, Behnam <Behnam.ABABAEI at limagrain.com> wrote:
> Sarah,
> sorry for taking up your time.
>
> I totally agree with you about how it works. But please let's take a look at this part of the description:
>
> "Once k representative objects have been selected from the sub-dataset, each observation of the entire dataset is assigned to the nearest medoid. The mean (equivalent to the sum) of the dissimilarities of the observations to their closest medoid is used as a measure of the quality of the clustering. The sub-dataset for which the mean (or sum) is minimal, is retained. A further analysis is carried out on the final partition."
>
> It says each observation is finally assigned to the closest medoid. The whole clustering process may be imperfect in terms of isolation of clusters, but each observation is already assigned to the closest one and, according to the silhouette formula, the silhouette value cannot be negative, as a must always be less than b.
>
> Regards,
> Behnam.
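For reference, the definition in the Details section of ?silhouette can be written out as below. The notation is chosen here for illustration: d(i, j) is the dissimilarity between observations i and j, and C(i) is the cluster to which i has been assigned.

```latex
a(i) = \frac{1}{|C(i)| - 1} \sum_{j \in C(i),\, j \neq i} d(i, j), \qquad
b(i) = \min_{C \neq C(i)} \frac{1}{|C|} \sum_{j \in C} d(i, j), \qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
```

Neither a(i) nor b(i) refers to the medoids, so being closest to one's own medoid does not guarantee a(i) < b(i). Whenever a(i) > b(i), the width s(i) is negative.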
David L Carlson
2016-Feb-21 16:37 UTC
[R] How a clustering algorithm in R can end up with negative silhouette values?
Each observation is assigned to the closest medoid, which is a single observation. An observation that lies between two medoids will be assigned to the closer one even if it is closer, on average, to the members of the other cluster (whose medoid is slightly farther away). If the clusters are not well separated, this can happen easily.

You could always change the cluster assignment vector to see what happens to the silhouette plot. That will affect more than just the single observation, since the silhouette values of all of the points in those two clusters will change slightly (very slightly if there are lots of observations in those two clusters).

-------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
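The situation described in this message can be reproduced on a toy example. This is only a sketch under stated assumptions: the six one-dimensional points and the two medoids (0 and 10) are picked by hand for illustration and are not the output of pam() or clara(). The point 5.2 is nearer to the medoid 10, yet on average nearer to the members of the other cluster.

```r
library(cluster)

# Hand-picked 1-D data and medoids (assumptions for illustration only).
x <- c(-1, 0, 1, 5.2, 10, 20)
medoids <- c(0, 10)

# Assign every point to its nearest medoid.
d_to_med <- abs(outer(x, medoids, "-"))
cl <- apply(d_to_med, 1, which.min)
cl  # 1 1 1 2 2 2: the point 5.2 goes to the cluster of medoid 10

# Silhouette widths use mean distances to cluster members, not to medoids.
sil <- silhouette(cl, dist(x))
sil[4, "sil_width"]  # a = 9.8, b = 5.2, so s = (5.2 - 9.8) / 9.8, about -0.47

# Change the assignment vector by hand, as suggested, and recompute.
cl2 <- cl
cl2[4] <- 1
sil2 <- silhouette(cl2, dist(x))
sil2[4, "sil_width"]  # a = 5.2, b = 9.8, so s is now about +0.47
```

Comparing sil and sil2 row by row also shows the widths of the other points in the two clusters shifting once the point is moved.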
Martin Maechler
2016-Feb-22 15:48 UTC
[R] How a clustering algorithm in R can end up with negative silhouette values?
Indeed ... and this extends to pam(), even; as you say above, "silhouette() does something different":

If you look at the plots of example(silhouette), where the silhouettes of pam(ruspini, k = k'), k' = 2,..,6 are displayed, or if you directly look at

    plot(silhouette(pam(ruspini, k = 6)))

you will notice that pam() itself can easily lead to negative silhouette values.

Martin Maechler [ == maintainer("cluster") ]
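The check suggested here can also be done numerically rather than by eye. The sketch below uses only the ruspini data and functions shipped with the cluster package; the variable names sil and neg_rows are made up for illustration.

```r
library(cluster)
data(ruspini)

# Silhouette of a pam() fit with six clusters on the full ruspini data.
sil <- silhouette(pam(ruspini, k = 6))

# Rows of the silhouette matrix whose width is below zero, if any.
neg_rows <- which(sil[, "sil_width"] < 0)
sil[neg_rows, "sil_width"]

plot(sil)  # the same information as a silhouette plot
```

Because pam() works on the whole dataset, any negative widths found this way cannot be blamed on subsampling.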