Dylan Beaudette
2006-May-23 00:33 UTC
[R] standardization of values before call to pam() or clara()
Greetings,

I have been experimenting with the cluster package and am starting to scratch my head regarding the *best* way to standardize my data. Both functions can pre-standardize the columns of a data frame. According to the manual:

    Measurements are standardized for each variable (column), by
    subtracting the variable's mean value and dividing by the
    variable's mean absolute deviation.

This works well when the input variables are all in the same units. When I include new variables with a different intrinsic range, the ones with the largest relative values tend to be _weighted_ more heavily. This is certainly not surprising, but it complicates things.

Is there a robust technique to re-scale each variable, regardless of its intrinsic range, to some set range, say [0,1]? I have tried dividing each variable by its maximum value, but I am not sure whether this is statistically sound.

Any ideas or thoughts would be greatly appreciated.

Cheers,

--
Dylan Beaudette
Soils and Biogeochemistry Graduate Group
University of California at Davis
530.754.7341
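For readers following along, the standardization quoted from the manual can be written out by hand. This is only a minimal sketch: the data frame `x`, its column names, and the helper `mad_standardize()` are made up for illustration and are not part of the cluster package. In pam() and clara() the built-in version is requested with the `stand = TRUE` argument.

```r
# Toy data: two variables on very different scales (made-up values).
x <- data.frame(
  elev_m = c(120, 340, 560, 610, 905),
  ph     = c(5.1, 6.3, 6.8, 7.2, 7.9)
)

# Hypothetical helper: subtract the mean, then divide by the
# mean absolute deviation, as described in the quoted manual text.
mad_standardize <- function(v) {
  centered <- v - mean(v)
  centered / mean(abs(centered))
}

# Apply the helper column by column.
z <- as.data.frame(lapply(x, mad_standardize))
print(z)
```

After this step every column has mean 0 and mean absolute deviation 1, so no variable dominates the distance calculation simply because of its units.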
Martin Maechler
2006-Jun-03 12:19 UTC
[R] standardization of values before call to pam() or clara()
>>>>> "Dylan" == Dylan Beaudette <dylan.beaudette at gmail.com>
>>>>>     on Mon, 22 May 2006 17:33:47 -0700 writes:

    Dylan> Greetings, Experimenting with the cluster package,
    Dylan> and am starting to scratch my head in regards to the
    Dylan> *best* way to standardize my data. Both functions can
    Dylan> pre-standardize columns in a dataframe. according to
    Dylan> the manual:

    Dylan> Measurements are standardized for each variable
    Dylan> (column), by subtracting the variable's mean value
    Dylan> and dividing by the variable's mean absolute
    Dylan> deviation.

    Dylan> This works well when input variables are all in the
    Dylan> same units. When I include new variables with a
    Dylan> different intrinsic range, the ones with the largest
    Dylan> relative values tend to be _weighted_ . this is
    Dylan> certainly not surprising, but complicates things.

    Dylan> Does there exist a robust technique to effectively
    Dylan> re-scale each of the variables, regardless of their
    Dylan> intrinsic range to some set range, say from {0,1} ?

    Dylan> I have tried dividing a variable by the maximum value
    Dylan> of that variable, but I am not sure if this is
    Dylan> statistically correct.

A more usual scaling standardization is accomplished by the function -- guess what? -- scale().
It defaults to standardize to mean 0 and std. 1. But you can use it as well to do a [0,1] scaling.

Note that you are very wise to think about the importance of variable scaling / weighting for cluster analysis. But people have been "here" before, and invented the much more general notion of a distance/dissimilarity between observational units.

--> function daisy() {in "cluster"} or dist() {from "stats"} provide such dissimilarity objects. These can be used as input for pam() or clara() as well, and in constructing them you are much more flexible than trying to find a proper scaling of your x-matrix.

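The two suggestions above might be sketched as follows. This is an illustration under stated assumptions, not code from the thread: the data frame `x`, its columns, and the choice of `k = 2` are invented. The [0,1] scaling uses the `center` and `scale` arguments of scale() with the column minima and ranges, and the result is turned into a dissimilarity object with dist() before being passed to pam().

```r
library(cluster)  # provides pam(); a recommended package in standard R installs

# Toy data: two variables on very different scales (made-up values).
x <- data.frame(
  elev_m = c(120, 340, 560, 610, 905),
  ph     = c(5.1, 6.3, 6.8, 7.2, 7.9)
)

# Default behaviour of scale(): each column gets mean 0 and sd 1.
z <- scale(x)

# [0,1] scaling: subtract each column's minimum, divide by its range.
mins   <- apply(x, 2, min)
ranges <- apply(x, 2, max) - mins
x01    <- scale(x, center = mins, scale = ranges)

# Build a dissimilarity object and cluster on it instead of on raw data.
d   <- dist(x01)        # Euclidean distances on the rescaled columns
fit <- pam(d, k = 2)    # k = 2 is an arbitrary choice for this sketch
print(fit$clustering)
```

Building the dissimilarity first keeps the scaling decision separate from the clustering call, so a different scaling or metric can be swapped in without touching the pam() step.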
Note that daisy() in particular has been designed for computing sensible dissimilarities for the case when the X-matrix has a collection of continuous {e.g. "interval scaled"} and of categorical (e.g. binary) variables.

I recommend you get a textbook on clustering, to read up more on the subject.

Regards,
Martin Maechler, ETH Zurich

    Dylan> Any ideas, thoughts would be greatly appreciated.

    Dylan> Cheers,

    Dylan> -- Dylan Beaudette Soils and Biogeochemistry Graduate
    Dylan> Group University of California at Davis 530.754.7341
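The remark about daisy() and mixed variable types could look like this in practice. Again a sketch only: the `soil` data frame and its columns are invented for illustration, and `k = 2` is arbitrary. With `metric = "gower"`, daisy() rescales each numeric column by its range and treats the factor as a nominal variable, so the resulting dissimilarities fall between 0 and 1.

```r
library(cluster)  # provides daisy() and pam()

# Toy data mixing continuous and categorical variables (made-up values).
soil <- data.frame(
  clay_pct = c(12, 35, 28, 8, 41),
  ph       = c(5.1, 6.3, 6.8, 7.2, 7.9),
  drainage = factor(c("well", "poor", "poor", "well", "poor"))
)

# Gower dissimilarities handle numeric and factor columns together.
d <- daisy(soil, metric = "gower")

# pam() accepts the dissimilarity object directly.
fit <- pam(d, k = 2)
print(fit$clustering)
```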