Johan Jackson
2008-Apr-23 00:26 UTC
[R] k-means: should columns in dataset be in same scale?
Hi all,

Simple question re k-means. If I have a data set with columns that are on different scales (say col 1 has var=100 and col2 var=2), will this make a difference to the k-means algorithm? It seems as though it does. If so, should we first standardize the columns of the dataset so that each column is given equal weight?

JJ
Prof Brian Ripley
2008-Apr-23 05:46 UTC
[R] k-means: should columns in dataset be in same scale?
k-means uses Euclidean distance, so scaling of the variables does matter. Whether you want to standardize depends on the example (as it does in most multivariate analysis problems, e.g. PCA has the same issues).

On Tue, 22 Apr 2008, Johan Jackson wrote:

> Hi all,
>
> Simple question re k-means. If I have a data set with columns that are on
> different scales (say col 1 has var=100 and col2 var=2), will this make a
> difference to the k-means algorithm? It seems as though it does. If so,
> should we first standardize the columns of the dataset so that each column
> is given equal weight?
>
> JJ

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
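
To make the point about Euclidean distance concrete, here is a minimal sketch (not from the thread) that measures how much of the total pairwise squared distance each column contributes, before and after standardizing. The five data rows are made up so that the columns have sample variances 100 and 2, as in the question; the helper names `standardize` and `share_by_column` are hypothetical. It is written in plain Python with only the standard library; in R the analogous step would be calling `scale()` on the data before `kmeans()`.

```python
import statistics

# Made-up data: column 1 has sample variance 100, column 2 has sample variance 2.
rows = [(-10, 0), (-10, 2), (0, 0), (10, -2), (10, 0)]


def standardize(rows):
    """Rescale each column to mean 0 and sample standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, sds))
        for row in rows
    ]


def share_by_column(rows):
    """Fraction of the total pairwise squared Euclidean distance due to each column."""
    n_cols = len(rows[0])
    totals = [0.0] * n_cols
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            for k in range(n_cols):
                totals[k] += (rows[i][k] - rows[j][k]) ** 2
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


raw_share = share_by_column(rows)
std_share = share_by_column(standardize(rows))

print("raw:         ", [f"{s:.3f}" for s in raw_share])   # ['0.980', '0.020']
print("standardized:", [f"{s:.3f}" for s in std_share])   # ['0.500', '0.500']
```

On the raw data, about 98% of the squared distance comes from the high-variance column, so any distance-based clustering is driven almost entirely by it. After standardizing, both columns contribute equally. Whether equal weight is what you want is still the judgment call described above.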