Miha Staut
2008-Jun-03 11:44 UTC
[R] Cluster analysis with numeric and categorical variables
Dear all,

I would like to perform a cluster analysis on a data frame with two coordinate variables (X and Y) and a categorical variable for which only a != b can be established (that is, the categories are unordered). As far as I understand classification methods, they are not an option, as they only assign observations to k classes already defined by a training set. Searching through the book "Modern Applied Statistics with S", I did not find a satisfactory solution.

I would be grateful for any suggestions.

Best regards,
Miha
Christian Hennig
2008-Jun-03 11:58 UTC
[R] Cluster analysis with numeric and categorical variables
Dear Miha,

a general way to do this is as follows: Define a distance measure by aggregating the Euclidean distance on the (X,Y)-space and the trivial 0-1 distance (0 if the category is the same, 1 otherwise) on the categorical variable. Perform a cluster analysis (whichever method you want) on the resulting distance matrix.

Note that there is more than one way to do this. The 0-1 distance could be incorporated in the definition of the Euclidean distance (in place of a squared difference term (x_i-y_i)^2), or a weighted average of the distances in X-, Y- and categorical space could be computed. Weights of the variables (including possibly rescaling) have to be decided. How to do this precisely should depend on the subject matter and prior information about variable importance etc. In the absence of such information, you may standardise the variablewise sums of squared pairwise distances to be equal.

Hope this helps (and you can figure out the relevant R code yourself).

Christian
*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
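For readers of the archive, the approach described in the reply might be sketched in base R roughly as follows. This is only an illustration under stated assumptions, not code from the thread: the toy data, the column name `cat`, the helper `rescale`, the weight `w`, the choice of average linkage and k = 2 are all made up here. The (X, Y) pair is treated as one block when standardising (so the two coordinates keep a common scale), which is one possible reading of "variablewise".

```r
# Toy data (made up): two coordinates and one unordered categorical variable.
set.seed(1)
dat <- data.frame(
  X   = c(rnorm(10, mean = 0), rnorm(10, mean = 5)),
  Y   = c(rnorm(10, mean = 0), rnorm(10, mean = 5)),
  cat = factor(rep(c("a", "b", "c", "a"), times = 5))
)

# Euclidean distance on the (X, Y) space, as a full symmetric matrix.
d_xy <- as.matrix(dist(dat[, c("X", "Y")]))

# Trivial 0-1 distance on the categorical variable:
# 0 if two observations share a category, 1 otherwise.
lab   <- as.character(dat$cat)
d_cat <- outer(lab, lab, FUN = "!=") * 1

# Hypothetical helper: rescale a distance matrix so that the sum of its
# squared pairwise distances (upper triangle) equals 1.
rescale <- function(d) d / sqrt(sum(d[upper.tri(d)]^2))
d_xy_s  <- rescale(d_xy)
d_cat_s <- rescale(d_cat)

# w is an assumed weight for the spatial part; it should come from
# subject-matter knowledge, 0.5 is just a placeholder.
w <- 0.5

# Variant 1: weighted average of the two distances.
d_avg <- w * d_xy_s + (1 - w) * d_cat_s

# Variant 2: 0-1 distance folded into a Euclidean-type distance.
d_euc <- sqrt(w * d_xy_s^2 + (1 - w) * d_cat_s^2)

# Any method that accepts a distance matrix can be used from here on;
# average-linkage hierarchical clustering is just one choice.
hc       <- hclust(as.dist(d_avg), method = "average")
clusters <- cutree(hc, k = 2)
table(clusters, dat$cat)
```

Changing `w` shifts the balance between spatial proximity and category agreement, so it is worth checking how stable the clusters are across a few values. The `daisy()` function in the `cluster` package offers a ready-made alternative for mixed-type data (Gower dissimilarity), though its default weighting differs from the standardisation used above.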