thr3ads.net - R help - [R] Help with clustering [Jan 2009]


mauede at alice.it

2009-Jan-26 09:41 UTC

[R] Help with clustering

I am going to try a tentative clustering of some feature vectors.
The three items that make up each feature vector span quite different ranges
of values:

Item-1 goes roughly from 70 to 525 (integer numbers only)
Item-2 lies between 0 and 1 (real numbers)
Item-3 goes from 1 to 10 (integer numbers only)

To spread out Item-2 even further I might replace Item-2 with
log10(Item-2).

My concern is that, regardless of the distance measure used, the item with the
largest order of magnitude may carry the most weight when the similarity
matrix is calculated, and so fade out the influence of the items with smaller
variation on the resulting clusters.
Should I normalize all feature vector elements to 1 before generating the
similarity matrix?
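The concern can be checked numerically. The following is a minimal sketch in Python (standard library only, not R); the two feature vectors are invented for illustration and merely fall inside the ranges quoted above. It computes how much each item contributes to the squared Euclidean distance between the two vectors.

```python
# Two hypothetical feature vectors (Item-1, Item-2, Item-3).
# The values are made up; they only respect the ranges given in the post.
a = (100, 0.10, 2)
b = (400, 0.90, 9)

# Squared difference per item, i.e. each item's contribution
# to the squared Euclidean distance between a and b.
squared_diffs = [(x - y) ** 2 for x, y in zip(a, b)]
total = sum(squared_diffs)

# Fraction of the squared distance that each item accounts for.
shares = [d / total for d in squared_diffs]

for name, share in zip(("Item-1", "Item-2", "Item-3"), shares):
    print(f"{name}: {share:.6f}")
```

With these made-up values almost all of the squared distance comes from Item-1, even though Item-2 and Item-3 differ across most of their own ranges.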

Thank you so much.
Maura

Christian Hennig

2009-Jan-26 13:09 UTC

[R] Help with clustering

Generally, how to scale different variables when aggregating them in a
dissimilarity measure depends strongly on the subject matter, the aim of
the clustering, and your "cluster concept". This cannot be answered
properly on a mailing list such as this one.

A standard transformation before computing dissimilarities is to scale
all variables to variance 1 by dividing each by its standard deviation.
This gives all variables the same weight in a well-defined sense (which may
be somewhat affected by outliers, heavy tails, and skewness; note, however,
that normalising to the same range shares these problems, more severely).
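A minimal sketch of the two scalings mentioned above, written in Python with only the standard library (not R); the data rows and the function names `standardize` and `range_scale` are hypothetical, chosen for illustration. In R itself, `scale()` performs the division by the standard deviation and by default also centres each column, which leaves distances between observations unchanged.

```python
import statistics

# Hypothetical observations (Item-1, Item-2, Item-3); values are invented.
rows = [
    (70, 0.05, 1),
    (150, 0.20, 3),
    (300, 0.55, 5),
    (420, 0.80, 8),
    (525, 0.95, 10),
]

def standardize(rows):
    """Divide every column by its sample standard deviation (variance becomes 1)."""
    sds = [statistics.stdev(col) for col in zip(*rows)]
    return [tuple(x / sd for x, sd in zip(row, sds)) for row in rows]

def range_scale(rows):
    """Map every column linearly onto [0, 1] using its minimum and maximum."""
    cols = list(zip(*rows))
    lows = [min(col) for col in cols]
    spans = [max(col) - min(col) for col in cols]
    return [
        tuple((x - lo) / span for x, lo, span in zip(row, lows, spans))
        for row in rows
    ]

scaled = standardize(rows)
ranged = range_scale(rows)

# After standardizing, each column has sample variance 1 (up to rounding).
column_variances = [statistics.variance(col) for col in zip(*scaled)]
print(column_variances)
```

Range scaling depends only on the two most extreme values in each column, which is one way to see why a single outlier affects it more than it affects the standard deviation.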

Regards,
Christian

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche

Darin A. England

2009-Jan-27 21:20 UTC

[R] Help with clustering

Have you tried using the cosine of the angle between two
observations as the similarity measure? If you want to account for
magnitudes, there is something called the Jaccard coefficient (if I
remember correctly) that can be used.
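A minimal sketch of both measures in Python (standard library only, not R), using invented vectors. The second function is an assumption about which coefficient is meant: it implements the real-valued generalisation of the Jaccard coefficient often called the Tanimoto or extended Jaccard coefficient, dot(u, v) / (|u|^2 + |v|^2 - dot(u, v)).

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; ignores vector length."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def tanimoto(u, v):
    """Extended Jaccard (Tanimoto) coefficient; sensitive to vector length."""
    d = dot(u, v)
    return d / (dot(u, u) + dot(v, v) - d)

# Hypothetical observations: b points in the same direction as a
# but has twice the magnitude.
a = (100, 0.10, 2)
b = (200, 0.20, 4)

cos_ab = cosine_similarity(a, b)
tan_ab = tanimoto(a, b)
print(cos_ab, tan_ab)
```

For these two vectors the cosine is 1 up to rounding, because only direction matters, while the Tanimoto coefficient is 2/3, because the lengths differ. Both measures are still computed from the raw item values, so the scaling question raised in the original post applies to them as well.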

Darin
