Emma Gibson
2013-Mar-19 14:55 UTC
[R] Cluster analysis on weighted survey data with continuous and categorical variables
I am trying to perform cluster analysis on survey data where each respondent has answered several questions, some of which have categorical answers ("blue", "pink", "green", etc.) and some of which have scale answers (a rating from 1 to 10, etc.).

My problem is that certain age groups were over-sampled, and I need to weight the data collected in order to accurately reflect the current population.

Will it make a difference if I do the cluster analysis on the weighted data, and if so, how do I do cluster analysis on the weighted data?

Any advice would be much appreciated!

Thanks
Emma
Thomas Lumley
2013-Mar-19 19:39 UTC
[R] Cluster analysis on weighted survey data with continuous and categorical variables
On Wed, Mar 20, 2013 at 3:55 AM, Emma Gibson <waterbabysa@hotmail.com> wrote:

> I am trying to perform cluster analysis on survey data where each
> respondent has answered several questions, some of which have categorical
> answers ("blue" "pink" "green" etc) and some of which have scale answers
> (rating from 1 to 10 etc).
>
> My problem is that certain age groups were over-sampled and I need to
> weight the data collected in order to accurately reflect the current
> population.
>
> Will it make a difference if I do the cluster analysis on the weighted
> data, and if so, how do I do cluster analysis on the weighted data?
>
> Any advice would be much appreciated!
>
> Thanks
> Emma

The unequal sampling will have some effect on most clustering methods (e.g. k-means or average-linkage, though not single-linkage). Whether this matters depends on whether you have genuinely separate clusters in the population or a general mush that you are trying to segment in some convenient way.

If you have genuine, well-separated clusters, then ignoring the oversampling is likely to do well. If you don't, you will get a segmentation into clusters that partitions the over-sampled people too finely and the under-sampled people too coarsely.

I don't know of any R functions that cluster with sampling weights. If your data set is fairly small, you could expand it by making duplicates (perhaps jittered) of some points, and cluster the expanded data set. On the other hand, if it is very large, you can thin it out to a uniform sample by sampling from it with probability inversely proportional to the original sampling probability.

   -thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland
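
For readers who want to try the two workarounds described in the reply above, here is a minimal sketch in R. It is not from the original posters. The toy data frame `survey`, its columns `colour`, `rating` and `w`, and the weight values are all hypothetical, and `w` is assumed to be an inverse-probability sampling weight. The final clustering step uses `daisy()` (Gower dissimilarity) and `pam()` from the `cluster` package as one plausible way to handle mixed categorical and numeric variables; that choice is an assumption of this sketch, not something the reply recommends.

```r
library(cluster)  # for daisy() and pam(); a recommended package shipped with R

set.seed(1)

# Hypothetical toy survey; all names and values are made up for illustration.
n <- 200
survey <- data.frame(
  colour = factor(sample(c("blue", "pink", "green"), n, replace = TRUE)),
  rating = sample(1:10, n, replace = TRUE),
  # Assumed sampling weight (1 / sampling probability):
  # over-sampled respondents get 1, under-sampled respondents get 4.
  w = sample(c(1, 4), n, replace = TRUE)
)

# Workaround 1 (fairly small data): expand by duplication.
# Replicate each row in proportion to its weight; non-integer weights
# are rounded here, which is only an approximation.
reps <- round(survey$w / min(survey$w))
idx <- rep(seq_len(n), times = reps)
expanded <- survey[idx, ]

# Jitter the numeric variable on the extra copies so they are not exact ties.
is_copy <- duplicated(idx)
expanded$rating[is_copy] <- jitter(expanded$rating[is_copy], amount = 0.25)

# Workaround 2 (very large data): thin to a roughly uniform sample.
# Keep each row with probability proportional to its weight, i.e. inversely
# proportional to its original sampling probability.
keep_prob <- survey$w / max(survey$w)
thinned <- survey[runif(n) < keep_prob, ]

# Cluster the expanded data on the mixed-type variables.
# Gower dissimilarity handles factors and numeric columns together.
d <- daisy(expanded[, c("colour", "rating")], metric = "gower")
fit <- pam(d, k = 3)  # k = 3 is an arbitrary choice for this sketch
table(fit$clustering)
```

The same `daisy()` and `pam()` calls can be run on `thinned` instead of `expanded`. Comparing either result with a clustering of the unweighted `survey` data gives a rough sense of whether the over-sampling matters for a particular data set.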