thr3ads.net - R help - [R] standardization of values before call to pam() or clara() [May 2006]

Dylan Beaudette

2006-May-23 00:33 UTC

[R] standardization of values before call to pam() or clara()

Greetings,

I am experimenting with the cluster package and am starting to scratch my head
regarding the *best* way to standardize my data. Both functions can
pre-standardize the columns of a data frame. According to the manual:

Measurements are standardized for each variable (column), by subtracting the 
variable's mean value and dividing by the variable's mean absolute
deviation.
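
For reference, the standardization quoted above can be reproduced by hand. This is only a sketch on made-up data: the data frame, its column names, and the helper standardize_mad() are hypothetical and not part of the cluster package.

```r
# Hypothetical toy data: two variables on very different scales
x <- data.frame(depth_cm = c(10, 20, 30, 40, 50),
                clay_pct = c(12, 15, 11, 30, 22))

# Subtract the column mean, then divide by the mean absolute deviation
# (the average of |v - mean(v)|; this is neither sd() nor stats::mad())
standardize_mad <- function(v) {
  m <- mean(v)
  s <- mean(abs(v - m))
  (v - m) / s
}

z <- as.data.frame(lapply(x, standardize_mad))

colMeans(z)                           # approximately 0 for every column
sapply(z, function(v) mean(abs(v)))   # 1 for every column
```

After this step every column has mean 0 and mean absolute deviation 1, whatever its original units were.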

This works well when the input variables are all in the same units. When I
include new variables with a different intrinsic range, the ones with the
largest relative values tend to be weighted more heavily. This is certainly
not surprising, but it complicates things.

Is there a robust technique to rescale each of the variables, regardless of
its intrinsic range, to some fixed range, say [0,1]?

I have tried dividing a variable by the maximum value of that variable, but I 
am not sure if this is statistically correct. 

Any ideas or thoughts would be greatly appreciated.

Cheers,

-- 
Dylan Beaudette
Soils and Biogeochemistry Graduate Group
University of California at Davis
530.754.7341

Martin Maechler

2006-Jun-03 12:19 UTC

[R] standardization of values before call to pam() or clara()

>>>>> "Dylan" == Dylan Beaudette <dylan.beaudette at gmail.com>
>>>>>     on Mon, 22 May 2006 17:33:47 -0700 writes:
    Dylan> Greetings, Experimenting with the cluster package,
    Dylan> and am starting to scratch my head in regards to the
    Dylan> *best* way to standardize my data. Both functions can
    Dylan> pre-standardize columns in a dataframe. according to
    Dylan> the manual:

    Dylan> Measurements are standardized for each variable
    Dylan> (column), by subtracting the variable's mean value
    Dylan> and dividing by the variable's mean absolute
    Dylan> deviation.

    Dylan> This works well when input variables are all in the
    Dylan> same units. When I include new variables with a
    Dylan> different intrinsic range, the ones with the largest
    Dylan> relative values tend to be _weighted_ . this is
    Dylan> certainly not surprising, but complicates things.

    Dylan> Does there exist a robust technique to effectively
    Dylan> re-scale each of the variables, regardless of their
    Dylan> intrinsic range to some set range, say from {0,1} ?

    Dylan> I have tried dividing a variable by the maximum value
    Dylan> of that variable, but I am not sure if this is
    Dylan> statistically correct.

A more usual standardization is accomplished by the
function -- guess what? -- scale()

By default it standardizes each column to mean 0 and standard deviation 1.
But you can use it as well to do a [0,1] scaling, by passing the column
minima as 'center' and the column ranges as 'scale'.
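
As a rough illustration of both uses of scale(), here is a minimal sketch on made-up data (the data frame and its column names are hypothetical). The [0,1] variant supplies the column minima as 'center' and the column ranges as 'scale'; it assumes no column is constant, since a zero range would mean dividing by zero.

```r
# Hypothetical toy data: two variables on very different scales
x <- data.frame(depth_cm = c(10, 20, 30, 40, 50),
                clay_pct = c(12, 15, 11, 30, 22))

# Default: centre on the column means, divide by the column standard deviations
z <- scale(x)

# [0,1] scaling: centre on the column minima, divide by the column ranges
mins   <- sapply(x, min)
ranges <- sapply(x, function(v) diff(range(v)))
x01    <- scale(x, center = mins, scale = ranges)

apply(x01, 2, range)   # every column now runs from 0 to 1
```

Both results are numeric matrices and can be passed on to pam() or clara() in place of the raw data.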

Note that you are very wise to think about the importance of
variable scaling / weighting for cluster analysis.
But people have been "here" before, and invented the much more
general notion of a distance/dissimilarity between observational
units.
--> The functions daisy() {in "cluster"} and dist() {in "stats"}
provide such dissimilarity objects.
These can be used as input for  pam() or clara() as well,
and in constructing them you are much more flexible than trying
to find a proper scaling of your x-matrix.
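
A minimal sketch of the dissimilarity route, on made-up data (the data frame, its column names, and k = 2 are assumptions chosen only for illustration). It uses pam() only; as far as I can tell from its documented interface, clara() expects a data matrix rather than a dissimilarity object.

```r
library(cluster)

# Hypothetical toy data: two variables on very different scales
x <- data.frame(depth_cm = c(10, 20, 30, 40, 50, 60),
                clay_pct = c(12, 15, 11, 30, 22, 28))

# daisy() with stand = TRUE standardizes each column (mean and mean
# absolute deviation) before computing Euclidean dissimilarities
d1 <- daisy(x, metric = "euclidean", stand = TRUE)

# An alternative: stats::dist() on a matrix scaled with scale()
d2 <- dist(scale(x))

# pam() detects that d1 is a dissimilarity object and clusters on it
fit <- pam(d1, k = 2)
fit$clustering
```

Because the clustering only ever sees the dissimilarities, any weighting or rescaling decision is made once, when the dissimilarity object is built.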

Note that daisy() in particular has been designed for computing
sensible dissimilarities when the X-matrix has a mix of
continuous (e.g. "interval scaled") and categorical (e.g. binary)
variables.
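
For mixed data, a sketch along these lines may help. The data frame and its column names are hypothetical; metric = "gower" is the daisy() option intended for mixed variable types, and it rescales each variable's contribution to [0,1].

```r
library(cluster)

# Hypothetical toy data: one continuous and two categorical variables
x <- data.frame(
  clay_pct = c(12, 15, 11, 30, 22, 28),
  drained  = factor(c("yes", "yes", "no", "no", "yes", "no")),
  texture  = factor(c("loam", "sand", "loam", "clay", "clay", "sand"))
)

# Gower dissimilarity: numeric columns are divided by their range,
# factor columns contribute 0 (same level) or 1 (different level)
d <- daisy(x, metric = "gower")
summary(d)

# Cluster directly on the dissimilarities
fit <- pam(d, k = 2)
fit$clustering
```

Here no manual scaling of the X-matrix is needed, since all dissimilarities already lie between 0 and 1.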

I recommend you get a textbook on clustering, to read up more on
the subject.

Regards, 
Martin Maechler, ETH Zurich


    Dylan> Any ideas, thoughts would be greatly appreciated.

    Dylan> Cheers,

    Dylan> -- Dylan Beaudette Soils and Biogeochemistry Graduate
    Dylan> Group University of California at Davis 530.754.7341
