Alexander Sirotkin [at Yahoo]
2004-Mar-23 23:27 UTC
[R] statistical significance test for cluster agreement
I was wondering whether there is a way to have a statistical significance test for cluster agreement. I know that I can use the classAgreement() function to get the Rand index, which will give me some indication of whether the clusters agree or not, but it would be interesting to have a formal test. Thanks.
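For readers unfamiliar with the Rand index mentioned in the question: it is the fraction of pairs of observations on which two partitions agree (both put the pair in the same group, or both put it in different groups). A minimal base-R sketch follows. Note that `rand_index` is a hypothetical helper written for illustration, not the code behind classAgreement(), and the toy labels are made up.

```r
# Rand index: share of point pairs on which two partitions agree.
# Hypothetical helper for illustration; assumes a and b are label
# vectors of equal length (numeric, character or factor).
rand_index <- function(a, b) {
  stopifnot(length(a) == length(b), length(a) >= 2)
  same_a <- outer(a, a, "==")   # TRUE where a pair shares a label in a
  same_b <- outer(b, b, "==")   # TRUE where a pair shares a label in b
  pairs  <- upper.tri(same_a)   # count each unordered pair once
  mean(same_a[pairs] == same_b[pairs])
}

# Toy labelings of six observations (made up)
x <- c(1, 1, 1, 2, 2, 2)
y <- c("a", "a", "b", "b", "c", "c")

rand_index(x, y)   # partial agreement
rand_index(x, x)   # identical partitions give 1
```

Only the grouping matters, not the label names, so relabelling the clusters of either partition leaves the value unchanged.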
Duncan Murdoch
2004-Mar-24 02:30 UTC
[R] statistical significance test for cluster agreement
On Tue, 23 Mar 2004 15:27:14 -0800 (PST), you wrote:

>I was wondering, whether there is a way to have
>statistical significance test for cluster agreement.
>
>I know that I can use classAgreement() function to get
>Rand index, which will give me some indication whether
>the clusters agree or not, but it would be interesting
>to have a formal test.

Why not simulate data from your hypothesized null distribution, cluster it, and see how your dataset's index value compares to the simulated ones?

Duncan Murdoch
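A rough sketch of how the simulation suggested above might look in base R. Everything specific here is an assumption made for illustration: the two clusterings being compared (k-means and complete-linkage hierarchical clustering), the null model (a single Gaussian with the observed column means and standard deviations, i.e. no cluster structure), the synthetic "observed" data, and the helper names `rand_index`, `agreement` and `sim_null`. A real analysis would substitute its own data, clustering methods and null distribution.

```r
set.seed(1)

# Rand index from the contingency table of two labelings
# (hypothetical helper; same quantity as the pair-counting definition).
rand_index <- function(a, b) {
  tab       <- table(a, b)
  total     <- choose(length(a), 2)
  same_both <- sum(choose(tab, 2))            # pairs together in both
  same_a    <- sum(choose(rowSums(tab), 2))   # pairs together in a
  same_b    <- sum(choose(colSums(tab), 2))   # pairs together in b
  (total + 2 * same_both - same_a - same_b) / total
}

# Agreement between two clusterings of the same data matrix
agreement <- function(x, k) {
  km <- kmeans(x, centers = k, nstart = 5)$cluster
  hc <- cutree(hclust(dist(x)), k = k)
  rand_index(km, hc)
}

# Synthetic "observed" data: two shifted Gaussian groups, 30 points each
x_obs <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
               matrix(rnorm(60, mean = 4), ncol = 2))
obs <- agreement(x_obs, k = 2)

# Assumed null model: one Gaussian per column, no cluster structure
sim_null <- function(x) {
  sapply(seq_len(ncol(x)),
         function(j) rnorm(nrow(x), mean(x[, j]), sd(x[, j])))
}

# Index values under the null, then a Monte Carlo p-value:
# the share of simulated values at least as large as the observed one
null_idx <- replicate(200, agreement(sim_null(x_obs), k = 2))
p_value  <- (1 + sum(null_idx >= obs)) / (1 + length(null_idx))
p_value
```

The "+1" in numerator and denominator counts the observed data set as one draw, which keeps the estimated p-value away from zero. What the resulting number means depends entirely on the chosen null model.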
But what would such a test do that the Rand index does not? Would you interpret the p-value from such a test, if it exists, to have the meaning that a real test of hypothesis has? AFAIK you basically need to have the hypotheses pinned down even before you see any data, for the inference to be valid. Is that possible with clustering?

Just my $0.02...
Andy
[Apology to the list for the off-topic rant...]

As it turned out, I also have a problem with LOF/GOF/etc. tests: I'd bet that most of the time when such a test is carried out, it is _not_ the only test being done, but the p-values in the downstream analysis are almost never adjusted for this. How valid would the p-values be? IMHO, it's bad enough that users of statistical methods do things like this, but it's quite something else that statisticians do just the same, or even promote such tests. It's not a crime to do analysis like that, but to treat the p-values as if they actually are meaningful probably ought to be outlawed.

OK, I better run for cover now...
Andy

> From: Alexander Sirotkin [at Yahoo]
>
> Like you said, such a test will not give me
> anything that the Rand index does not, except for a p-value.
>
> The null hypothesis, in my case, is that the clustering
> results do not match a different clustering that
> someone else did on the same data.
>
> And I do believe that this hypothesis is valid.
> Basically, it's not that different from the chi-squared
> goodness-of-fit test, which checks whether or not my
> data come from a particular distribution. With the
> exception that I don't know how to do a chi-squared test
> in this case :)
> From: Alexander Sirotkin [at Yahoo]
>
> Christian,
>
> I think I understand your point, but I do not
> completely agree with you. I also did not describe
> my problem clearly enough.
>
> > If you see two clusterings on the same
> > data, they are identical, if they are 100%
> > identical, and if not, then not.
>
> What you are actually saying is that all values of the
> Rand index for cluster agreement other than 1
> indicate that clusters do not agree. I believe
> that many people would disagree with this statement.
>
> Let me explain my problem in a little bit more detail.
>
> I have some classified data set. These classes were
> obtained using non-statistical methods. What I'm
> trying to do is run some clustering algorithm and compare
> its results to this known classification.
>
> I think that this is not very different from
> calculating a mean and comparing it to some known value.

AFAICS they are most definitely not the same. The hypotheses in statistical tests are about the `true', unknown, population mean, not the sample mean observed in the data. What exactly would be the hypotheses you intend to test? If you are testing whether the clustering algorithm produces something that disagrees with the non-statistical classification, then one disagreement would have settled it, no? Before you think about what statistic to use, do try to figure out how you would write the null and alternative hypotheses, mathematically.

Andy

> I think that it should be theoretically possible to
> use the Rand index as a test statistic.
>
> Or maybe I'm missing something...