Dear R-users,

I am trying to run kmeans on a set comprising 100 observations. But R somehow cannot figure out the true underlying groups, although other software such as JMP and MINITAB produce the desired result.

Following is a brief example of what I am doing.

library(stringdist)
test=c('hematolgy','hemtology','oncology','onclogy',
'oncolgy','dermatolgy','dermatoloy','dematology',
'neurolog','nerology','neurolgy','nerology')

dis=stringdistmatrix(test,test, method = "lv")

set.seed(123)
cl=kmeans(dis,4)

grp_cl=vector('list',4)

for(i in 1:4)
{
grp_cl[[i]]=test[which(cl$cluster==i)]
}
grp_cl

[[1]]
[1] "oncology" "onclogy"

[[2]]
[1] "neurolog" "nerology" "neurolgy" "nerology"

[[3]]
[1] "oncolgy"

[[4]]
[1] "hematolgy" "hemtology" "dermatolgy" "dermatoloy" "dematology"

In the above example, the 'test' variable consists of a set of terms with various typos, and I am trying to group similar words based on their string distance. Unfortunately, kmeans is not able to replicate the following result, which the other software is able to produce.

[[1]]
[1] "oncology" "onclogy" "oncolgy"

[[2]]
[1] "neurolog" "nerology" "neurolgy" "nerology"

[[3]]
[1] "dermatolgy" "dermatoloy" "dematology"

[[4]]
[1] "hematolgy" "hemtology"

Does anyone know if there is a way out? I have heard from a lot of people that multivariate analysis in R does not produce the desired result most of the time. Any help is really appreciated.

Thanks in advance.

Cassie
Cassie,

I am sorry, but do you even know what k-means does? It is a locally optimal algorithm, and different software packages implement the same algorithm differently. FYI, R uses the Hartigan-Wong (1979) algorithm by default, which is probably the most efficient out there.

I suggest you first take a multivariate statistics class before making such sweeping statements. (Btw, did these same "some people" tell you that most other software does not provide the kind of broad abilities that R provides, and is therefore not even comparable?)

And then, please read the help page for kmeans to see how to "improve" your run of k-means in R.

HTH,
Ranjan

On Tue, 29 Apr 2014 09:45:18 +0530 cassie jones <cassiejones26 at gmail.com> wrote:

> Does anyone know if there is a way out, I have heard from a lot of people
> that multivariate analysis in R does not produce the desired result most of
> the time. Any help is really appreciated.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
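
The advice above about "improving" a k-means run can be sketched as follows. This is a minimal illustration, not a guaranteed fix: it assumes the `nstart` argument of `kmeans` is the improvement meant, uses an arbitrary value of 25 restarts, and swaps in base R's `adist()` (unit-cost Levenshtein distance) for `stringdistmatrix(test, test, method = "lv")` so that no extra package is needed.

```r
# Same misspelled terms as in the original post.
test <- c("hematolgy", "hemtology", "oncology", "onclogy",
          "oncolgy", "dermatolgy", "dermatoloy", "dematology",
          "neurolog", "nerology", "neurolgy", "nerology")

# Assumption: adist() stands in for stringdistmatrix(..., method = "lv").
dis <- adist(test)

set.seed(123)
# nstart = 25 runs the algorithm from 25 random sets of starting centers
# and keeps the solution with the smallest total within-cluster sum of squares.
cl <- kmeans(dis, centers = 4, nstart = 25)

# split() replaces the explicit for loop over cluster labels.
grp_cl <- split(test, cl$cluster)
print(grp_cl)
print(cl$tot.withinss)
```

Multiple restarts only reduce the chance of stopping in a poor local optimum. k-means still treats each row of `dis` as a point in Euclidean space, so the grouping it finds need not match the one expected from the string distances themselves.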
You are using the wrong algorithm. You want Partitioning Around Medoids (PAM, function pam), not k-means. PAM is also known as k-medoids, which is where the confusion may come from. k-means treats each row of its input as a point in Euclidean space, whereas PAM can work directly on a dissimilarity matrix.

Use

library(cluster)
cl = pam(as.dist(dis), 4)  # as.dist() so that pam treats dis as dissimilarities, not as raw data

and see if you get what you want.

HTH,
Peter

On Mon, Apr 28, 2014 at 9:15 PM, cassie jones <cassiejones26 at gmail.com> wrote:

> I am trying to run kmeans on a set comprising of 100 observations. But R
> somehow can not figure out the true underlying groups, although other
> software such as Jmp, MINITAB are producing the desired result.
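
A self-contained sketch of the PAM suggestion, under two stated assumptions: base R's `adist()` is used as a stand-in for `stringdistmatrix(test, test, method = "lv")`, and the full matrix is wrapped in `as.dist()` because `pam` only treats its input as dissimilarities when it is a `dist` object (or when `diss = TRUE` is passed). The variable names `fit` and `grp_pam` are illustrative only.

```r
library(cluster)  # provides pam()

test <- c("hematolgy", "hemtology", "oncology", "onclogy",
          "oncolgy", "dermatolgy", "dermatoloy", "dematology",
          "neurolog", "nerology", "neurolgy", "nerology")

# Assumption: adist() stands in for stringdistmatrix(..., method = "lv").
dis <- adist(test)

# as.dist() marks the matrix as dissimilarities, so pam clusters on the
# edit distances themselves rather than on rows taken as coordinates.
fit <- pam(as.dist(dis), k = 4)

# Group the terms by cluster label and show the medoid of each cluster.
grp_pam <- split(test, fit$clustering)
print(grp_pam)
print(test[fit$id.med])
```

Each medoid is one of the input strings, so every cluster is represented by an actual term from `test` rather than by an averaged row of the distance matrix.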