Dear All,

I will soon be faced with the following problem: given a set of known statistical indicators {s_i}, i=1,2,...,N, for N countries, I would like to do some data clustering, i.e. determine the best way to partition the N countries according to their known properties, as encoded by the {s_i} set of indicators for those countries.

Some properties of these countries may be categorical or otherwise non-numerical variables (e.g. belonging or not belonging to a certain group, joining or not joining a certain treaty, etc.). I have seen some data clustering examples, but none with categorical variables, and I wonder whether this is an inherent limitation of the methodology (off the top of my head, I would not know how to define the distance between non-numerical variables). Any suggestions about the general methodology, R packages or code snippets are much appreciated.

Also: do the units in which I express a statistical indicator play a role? For instance, for 2 given countries I could have the average age of the population, the average life expectancy and the average income per year in thousands of dollars. This would give e.g. (40,72,26) and (44,75,36), but if I measure the average income in dollars, I would get (40,72,26000) and (44,75,36000). Would the units that I choose for an indicator affect the clustering results? In my view they should not, since the income does not change however I express it, but I am not sure about the algorithm results.

Many thanks

Lorenzo
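The units question can be checked directly on the numbers above. The following is a minimal sketch, assuming plain Euclidean distance via `dist()` and standardization via `scale()`; the third country `C` and the variable names are made up for illustration and are not part of the original example.

```r
# Countries A and B use the values from the example; C is hypothetical.
x_thousands <- rbind(A = c(age = 40, life = 72, income = 26),
                     B = c(age = 44, life = 75, income = 36),
                     C = c(age = 35, life = 68, income = 18))

# Same data with income expressed in dollars instead of thousands of dollars.
x_dollars <- x_thousands
x_dollars[, "income"] <- x_dollars[, "income"] * 1000

# Raw Euclidean distances: income in dollars dominates the other variables.
d_raw_thousands <- as.numeric(dist(x_thousands))
d_raw_dollars   <- as.numeric(dist(x_dollars))

# Standardizing each column (mean 0, sd 1) removes the dependence on units.
d_std_thousands <- as.numeric(dist(scale(x_thousands)))
d_std_dollars   <- as.numeric(dist(scale(x_dollars)))

print(rbind(d_raw_thousands, d_raw_dollars))
print(all.equal(d_std_thousands, d_std_dollars))
```

On raw data the distances differ between the two unit choices, so a distance-based clustering can differ too; after standardization the two sets of distances agree up to floating-point error.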
Look at the function daisy in the package cluster.

    require(cluster)
    ?daisy

Jean

Lorenzo Isella wrote on 09/02/2011 11:50:04 AM:
> [...]
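As a rough illustration of the suggestion above, here is a minimal sketch of how `daisy` might be used on mixed numeric and categorical indicators. It assumes Gower's dissimilarity (`metric = "gower"`) followed by `pam` from the same package; the data frame, the `treaty` column, countries `C` and `D`, and the choice of `k = 2` are all hypothetical.

```r
library(cluster)

# Hypothetical data: A and B reuse the numbers from the question,
# C, D and the treaty column are made up for illustration.
countries <- data.frame(
  age    = c(40, 44, 35, 47),
  life   = c(72, 75, 68, 80),
  income = c(26, 36, 18, 41),                    # thousands of dollars
  treaty = factor(c("yes", "no", "yes", "no")),  # categorical indicator
  row.names = c("A", "B", "C", "D")
)

# Gower dissimilarity combines numeric and categorical columns;
# numeric columns are divided by their range, so units cancel out.
d <- daisy(countries, metric = "gower")

# Same data with income in dollars, to compare the dissimilarities.
countries_usd <- countries
countries_usd$income <- countries_usd$income * 1000
d_usd <- daisy(countries_usd, metric = "gower")
print(all.equal(as.numeric(d), as.numeric(d_usd)))

# Partitioning around medoids on the dissimilarities (k = 2 is arbitrary).
fit <- pam(d, k = 2)
print(fit$clustering)
```

Because the dissimilarities are computed once and passed on, any method that accepts a dissimilarity object (for example `hclust` or `agnes`) could be used in place of `pam`.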