Dear Sir,

This is Ms. Setsuko Kinoshita writing from Japan.

I have a question about missing values in hierarchical clustering. In earlier versions of R, hierarchical clustering could not be applied to data with missing values. I used Euclidean distance and the complete linkage method, via "plot(hclust(dist()),hang=-1)".

How are missing values treated in hierarchical clustering in the latest version, R 1.7.1? For example, are they replaced by an average?

Yours Sincerely,

-----
Setsuko Kinoshita
Social and Environmental Medicine,
Graduate School of Comprehensive Human Sciences,
University of Tsukuba
1-1-1, Tennoudai, Tsukuba,
Ibaraki, 305-8575, Japan
Tel&Fax: +81-29-853-3489
E-mail:setsuko at epidemiology.md.tsukuba.ac.jp(office)
E-mail:setsuko at mbj.ocn.ne.jp(private)
kjetil brinchmann halvorsen
2003-Sep-27 06:16 UTC
[R] Enquiry about Hierarchical Clustering
On 27 Sep 2003 at 13:30, Setsuko Kinoshita wrote:

> How are missing values treated for Hierarchical Clustering in the latest "R 1.7.1" program?
> e.g. : Is an average replaced ?

Try package cluster:

    library(cluster)
    ?daisy   # computes dissimilarity matrix with missing data
    ?agnes   # agglomerative nesting

Kjetil Halvorsen

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
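The suggestion above could be tried along the following lines. This is only a minimal sketch: the data frame `x` is made up for illustration, and the choice of complete linkage simply mirrors the original question. It assumes the cluster package is installed.

```r
library(cluster)

# Toy data with scattered missing values (made up for illustration).
# Every pair of rows still shares at least one observed column.
x <- data.frame(
  a = c(1.0, 2.0,  NA, 8.0, 9.0),
  b = c(1.5,  NA, 2.5, 8.5, 9.5),
  c = c( NA, 1.0, 2.0, 9.0, 8.0)
)

# Dissimilarities; each pair of rows is compared on the columns
# that are observed in both rows.
d <- daisy(x, metric = "euclidean")

# Agglomerative nesting with complete linkage.
ag <- agnes(d, method = "complete")

# Cut the tree into two groups to inspect the result.
groups <- cutree(as.hclust(ag), k = 2)
```

Calling `plot(ag, which.plots = 2)` afterwards should draw the dendrogram.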
hclust() cannot handle missing values in the output of dist(). The output of dist() will contain missing values if

1. all elements in a row are missing, or
2. for some pair of rows, every column has a missing value in at least one of the two rows, so that no complete pair of values is left to compare.

In the former case, it is better to remove the row with all values missing, as it is completely uninformative. The latter is harder to detect and I am not sure how to deal with it.

Here is how dist() calculates its output for the following data:

    NA 3 5
     2 4 6

    dist( rbind( c(NA, 3, 5), c(2, 4, 6) ) ) = 1.732051 = sqrt( [ (6-5)^2 + (4-3)^2 ] x 3/2 )

The factor 3/2 scales up the sum of squared differences to account for the missing pair: 3 columns in total, of which 2 are observed in both rows.

Hope this helps.

--
Adaikalavan Ramasamy
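The arithmetic above can be checked directly, and the second case can be reproduced with a small example. This is a sketch only: the matrix `y` is made up for illustration and is not from the original question.

```r
# Two rows; the first column is missing in row 1.
x <- rbind(c(NA, 3, 5), c(2, 4, 6))
d <- dist(x)  # Euclidean distance is the default

# Manual version: sum of squares over the 2 complete pairs, scaled by 3/2.
manual <- sqrt(((4 - 3)^2 + (6 - 5)^2) * 3 / 2)
# d and manual are both about 1.732051

# Case 2: rows 1 and 2 of y have no column observed in both,
# so their distance cannot be computed and comes out as NA.
y <- rbind(c(NA, 3, NA), c(2, NA, 6), c(1, 2, 3))
dy <- dist(y)
has_na <- any(is.na(dy))  # TRUE: check this before calling hclust()
```

Testing the dist() result with `any(is.na(...))` before clustering is one simple way to detect the second case.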
>>>>> "Adaikalavan" == Adaikalavan RAMASAMY <ramasamya at gis.a-star.edu.sg>
>>>>> on Sat, 27 Sep 2003 17:05:43 +0800 writes:

Adaikalavan> Hclust is unable to handle missing values in
Adaikalavan> dist(). There will be missing values in dist()
Adaikalavan> function if 1. all elements in a row are
Adaikalavan> missing 2. all pairs between any two rows have
Adaikalavan> at least one missing values.

As Kjetil Halvorsen said, use daisy() from the cluster package instead of dist(). The daisy() function has two advantages over dist():

1. Handling of missing values
2. Handling of data with continuous *and* categorical variables.

[Btw, this does not really have anything to do with the clustering method used *after* the distances have been computed. You can use hclust() on a daisy result if you want.]

Regards,
Martin Maechler <maechler at stat.math.ethz.ch> http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum LEO C16 Leonhardstr. 27
ETH (Federal Inst. Technology) 8092 Zurich SWITZERLAND
phone: x-41-1-632-3408 fax: ...-1228 <><
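As a rough illustration of both points, daisy() can be applied to mixed data with missing values and its result passed to hclust(). This is a sketch under assumptions: the data frame `df` and its column names are made up, and it assumes the cluster package is installed.

```r
library(cluster)

# Made-up data: two numeric columns and one categorical column,
# each with a missing value. Every pair of rows still shares
# at least one observed column.
df <- data.frame(
  height = c(150, 160,  NA, 180, 175),
  weight = c( 50,  55,  60,  80,  NA),
  group  = factor(c("a", "a", "b", NA, "b"))
)

# Gower's coefficient copes with mixed variable types.
d <- daisy(df, metric = "gower")

# The daisy result inherits from class "dist", so hclust() accepts it.
hc <- hclust(d, method = "complete")
```

With this in place, `plot(hc, hang = -1)` should give the same kind of dendrogram as in the original question.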