Adrian Johnson
2016-Dec-05 03:52 UTC
[R] Clustering methods for data that has bimodal distribution
Dear group,

Pardon me for a naive question. I have a data matrix (11K rows, 4K columns). The values range from -1 to 1. They are not integers but real numbers, with precision to at least the millionths place.

The data distribution is peculiar: if I do plot(density(myMatrix)), I get a clean bimodal curve (one roughly bell-shaped mode between -1 and 0 and another between 0 and 1).

I am interested in clustering the data using consensus clustering (which uses K-means).

My questions are:

1. Given that my data range between -1 and 1, is K-means an appropriate method, considering that the data might have ties?

2. Although K-means is non-parametric, would bimodally distributed data be acceptable as input to K-means?

I would appreciate any suggestions.

Thanks,
Adrian
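To make the setup above concrete, here is a minimal sketch in R on simulated data. Everything in it is an assumption for illustration: the small matrix is only a hypothetical stand-in for myMatrix (200 x 50 rather than 11K x 4K), and the two modes are drawn from normal distributions centred at -0.5 and 0.5. It shows how one might pool all entries for the density estimate and count exactly duplicated rows, which is one way to look for ties.

```r
set.seed(1)

# Hypothetical stand-in for myMatrix: values in [-1, 1] with two modes.
n <- 200
p <- 50
vals <- c(rnorm(n * p / 2, mean = -0.5, sd = 0.15),
          rnorm(n * p / 2, mean =  0.5, sd = 0.15))
vals <- pmin(pmax(vals, -1), 1)          # clip to the stated range
myMatrix <- matrix(sample(vals), nrow = n, ncol = p)

# Pooled density over all entries (this ignores the row/column structure).
d <- density(as.vector(myMatrix))
# plot(d)                                # uncomment to view the bimodal curve

# Ties at the row level: number of rows that exactly duplicate an earlier row.
n_dup_rows <- sum(duplicated(myMatrix))
n_dup_rows
```

Note that the pooled density describes the individual entries, not the rows that K-means would actually group, so a bimodal curve here does not by itself say how many row clusters exist.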
Ranjan Maitra
2016-Dec-05 05:53 UTC
[R] Clustering methods for data that has bimodal distribution
Hello Adrian,

It all depends on the structure of the dataset. For instance, you said that all your values are between -1 and 1. Do the squared entries of each row sum to 1? How about the means: are they zero? All of this depends on the application, on how the data were processed, and on what question is to be answered.

Even if Euclidean space is the most apt, you then need to figure out what sort of structure you would like in your derived groups/clusters. For example, k-means has an underlying philosophy: homogeneous spherical clusters of roughly equal sizes. Is this what you want?

HTH,
Ranjan
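The checks suggested in this reply can be sketched as follows. This is only an illustration on simulated data, not the original poster's matrix: the object X, the group sizes, and the group means are all assumptions. It computes the row sums of squares and row means, then runs stats::kmeans and looks at cluster sizes and within-cluster sums of squares to judge whether the "roughly equal, spherical clusters" picture is plausible.

```r
set.seed(42)

# Hypothetical data: two groups of rows with different mean profiles,
# values kept within [-1, 1].
n_per <- 50
p <- 20
g1 <- matrix(rnorm(n_per * p, mean = -0.5, sd = 0.1), nrow = n_per)
g2 <- matrix(rnorm(n_per * p, mean =  0.5, sd = 0.1), nrow = n_per)
X <- pmin(pmax(rbind(g1, g2), -1), 1)

# Row diagnostics: do the squared entries of each row sum to 1?
# Are the row means zero?
row_ss   <- rowSums(X^2)
row_mean <- rowMeans(X)
summary(row_ss)
summary(row_mean)

# K-means with several random starts; inspect sizes and within-cluster
# sums of squares as a rough check on the equal-size, similar-spread idea.
fit <- kmeans(X, centers = 2, nstart = 25)
fit$size
fit$withinss

# Cross-tabulate against the simulated group labels.
table(truth = rep(1:2, each = n_per), cluster = fit$cluster)
```

If the rows turned out to have unit sum of squares, a direction-based (for example correlation or cosine) dissimilarity might be worth considering instead of plain Euclidean distance; that choice depends on the application, as noted above.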