thr3ads.net - R help - [R] k-means with euclidian distance but no coordinates [Dec 2001]

Corrin Lakeland

2001-Dec-13 20:42 UTC

[R] k-means with euclidian distance but no coordinates

Hi,

I'm trying to build a thesaurus that will give sensible values for rare
words.  I suspect the best algorithm to use is k-means, although I'm not
sure about that -- I would have preferred a k-dimensional space with a
binary cluster in each dimension so a word can belong to 0..k clusters,
but I digress...

I can measure the strength of correlation between words fairly easily by
counting co-occurrences divided by the frequency of each word, giving a
euclidian distance, although this doesn't work especially well for rare
words.  However, I don't have coordinates as such, and deriving them from
the distances is non-trivial.

Now, as I understand k-means, it uses euclidian distance rather than
coordinates; the first step given in texts is to derive the distances from
the coordinates.  But I can't find a way to call the built-in function
without coordinates.  I had a look at R-1.3.1/src/library/mva/src/kmns.f,
but my Fortran isn't good and I had enough trouble following the code that
I'm not up to making major changes.

Any help or ideas would be appreciated.

Corrin
--
Corrin Lakeland <lakeland at cs.otago.ac.nz> 
Department of Computer Science
University of Otago, New Zealand


-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at
stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

Prof Brian Ripley

2001-Dec-13 21:39 UTC

On Fri, 14 Dec 2001, Corrin Lakeland wrote:
> Hi,
>
> I'm trying to build a thesaurus that will give sensible values for rare
> words.  I suspect the best algorithm to use is k-means, although I'm not
> sure about that -- I would have preferred a k-dimensional space with a
> binary cluster in each dimension so a word can belong to 0..k clusters,
> but I digress...
>
> I can measure the strength of correlation between words fairly easily by
> counting co-occurrences divided by the frequency of each word, giving a
> euclidian distance, although this doesn't work especially well for rare
> words.  However, I don't have coordinates as such, and deriving them from
> the distances is non-trivial.
>
> Now, as I understand k-means, it uses euclidian distance rather than
> coordinates; the first step given in texts is to derive the distances from
> the coordinates.  But I can't find a way to call the built-in function
> without coordinates.  I had a look at R-1.3.1/src/library/mva/src/kmns.f,
> but my Fortran isn't good and I had enough trouble following the code that
> I'm not up to making major changes.

By definition K-means needs coordinates!  Try pam/clara in library
cluster.
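
As a rough illustration of the pam suggestion, here is a minimal sketch that clusters directly from a dissimilarity matrix. The six "words" and their dissimilarities are invented for the example; only pam() and as.dist() are real functions, and the modern call library(cluster) is assumed.

```r
library(cluster)  # recommended package, normally installed alongside R

# Hypothetical symmetric dissimilarities for six made-up words:
# (a, b, c) are close to each other, as are (x, y, z).
words <- c("a", "b", "c", "x", "y", "z")
d <- matrix(0.9, nrow = 6, ncol = 6, dimnames = list(words, words))
d[1:3, 1:3] <- 0.1
d[4:6, 4:6] <- 0.2
diag(d) <- 0

# pam() accepts a "dist" object, so no coordinates are needed.
fit <- pam(as.dist(d), k = 2)

fit$clustering  # cluster label for each word
fit$medoids     # the word chosen to represent each cluster
```

Because each cluster is represented by one of the observed items (a medoid) rather than a mean, nothing beyond the pairwise dissimilarities is required.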

I think you have a distance, but not a Euclidean (sic) one.  If you did
have a Euclidean distance, cmdscale would (easily) give you coordinates.
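
One way to see this point in practice is sketched below, on invented data: cmdscale() recovers a configuration that reproduces a genuinely Euclidean distance, while a dissimilarity that breaks the triangle inequality shows up as a negative eigenvalue. The matrices here are assumptions made up for the illustration.

```r
# Known coordinates, used only to build a genuinely Euclidean distance.
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)
d_euclid <- dist(x)

mds <- cmdscale(d_euclid, k = 2, eig = TRUE)

# The recovered points reproduce the distances (up to rotation/reflection).
recon_err <- max(abs(as.numeric(dist(mds$points)) - as.numeric(d_euclid)))
recon_err  # close to zero

# A dissimilarity violating the triangle inequality (5 > 1 + 1)
# cannot come from any Euclidean configuration.
d_bad <- as.dist(matrix(c(0, 1, 5,
                          1, 0, 1,
                          5, 1, 0), nrow = 3))
eig_bad <- cmdscale(d_bad, k = 1, eig = TRUE)$eig
min(eig_bad)  # clearly negative
```

Inspecting the eigenvalues returned with eig = TRUE is a cheap check before trusting coordinates derived from a word-similarity measure.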

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Huntsinger, Reid

2001-Dec-14 15:15 UTC

K-means uses coordinates to actually calculate the k within-cluster means
after classifying points based on distance to the previous iteration's means
(centroids). The mean is used as it minimizes the sum of squared distances
to cluster points. You could try to find this minimizer another way. You
would probably restrict to minimizers from your data set as you can't
calculate distance for other words...
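
The idea of restricting the minimizer to points in the data set can be sketched as follows. This is only an illustrative alternating scheme, not the algorithm used by any particular R function; kmedoids_sketch and the toy matrix are hypothetical names and data.

```r
# Like k-means, but each "centre" must be one of the data points,
# so only the distance matrix is needed.
kmedoids_sketch <- function(d, k, max_iter = 100) {
  d <- as.matrix(d)
  n <- nrow(d)
  assign_to <- function(m) unname(apply(d[, m, drop = FALSE], 1, which.min))
  medoids <- sample(n, k)  # random initial medoids
  for (iter in seq_len(max_iter)) {
    cluster <- assign_to(medoids)  # each point joins its nearest medoid
    new_medoids <- medoids
    for (j in seq_len(k)) {
      members <- which(cluster == j)
      if (length(members) == 0) next  # keep the old medoid for an empty cluster
      # Choose the member minimising the sum of squared distances to the rest.
      cost <- colSums(d[members, members, drop = FALSE]^2)
      new_medoids[j] <- members[which.min(cost)]
    }
    if (identical(new_medoids, medoids)) break
    medoids <- new_medoids
  }
  list(medoids = medoids, cluster = assign_to(medoids))
}

# Toy dissimilarities: items 1-3 form one group, items 4-6 another.
d <- matrix(0.9, nrow = 6, ncol = 6)
d[1:3, 1:3] <- 0.1
d[4:6, 4:6] <- 0.2
diag(d) <- 0

set.seed(123)
res <- kmedoids_sketch(d, k = 2)
res$cluster
```

In practice pam() in the cluster package is a more careful implementation of the same medoid-based idea.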

You could also try to get a low-dimensional representation with
multidimensional scaling (MDS). It takes a distance matrix as input and
provides for each input point a point in a low-dimensional Euclidean space.
One option is to do this for a sample, then approximate the mapping, e.g.
with a flexible regression approach. I've seen this work well in some
perhaps similar cases.
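
A minimal sketch of the MDS route, assuming classical scaling via cmdscale() and made-up data: the distance matrix goes in, coordinates come out, and ordinary kmeans() runs on those coordinates. The sample-then-regress step for extending the mapping to further points is not shown.

```r
# Hypothetical dissimilarities for ten items in two groups, built from
# hidden coordinates purely so the example has something to cluster.
set.seed(42)
hidden <- rbind(matrix(rnorm(10, mean = 0, sd = 0.3), ncol = 2),
                matrix(rnorm(10, mean = 5, sd = 0.3), ncol = 2))
d <- dist(hidden)

# Classical MDS: distance matrix in, low-dimensional coordinates out.
coords <- cmdscale(d, k = 2)

# Stock k-means now applies, because coordinates are available.
km <- kmeans(coords, centers = 2, nstart = 10)
km$cluster
```

The choice of k = 2 dimensions is arbitrary here; with real word data the number of dimensions to keep would need to be judged from the eigenvalues.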

There are a lot of approaches to mapping into a low-dimensional Euclidean
space based essentially on principal components of the co-occurrence matrix.
Are you looking for alternatives to these? These or the MDS approach above
would let you use stock k-means, and both can be done in R.
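
For the principal-components route, a small sketch under stated assumptions: the word-by-context count matrix below is invented, and prcomp() on the centred counts stands in for whichever principal-components variant is actually preferred.

```r
# Hypothetical co-occurrence counts: rows are words, columns are contexts.
cooc <- matrix(c(8, 7, 0, 1,
                 9, 6, 1, 0,
                 7, 8, 0, 0,
                 0, 1, 9, 8,
                 1, 0, 7, 9,
                 0, 0, 8, 7),
               ncol = 4, byrow = TRUE,
               dimnames = list(c("cat", "dog", "horse", "tax", "loan", "bank"),
                               paste0("ctx", 1:4)))

# Principal components of the centred matrix give each word coordinates
# in a low-dimensional Euclidean space.
pc <- prcomp(cooc, center = TRUE, scale. = FALSE)
coords <- pc$x[, 1:2]

set.seed(1)
km <- kmeans(coords, centers = 2, nstart = 10)
groups <- split(rownames(cooc), km$cluster)
groups
```

Raw counts are used here for brevity; some normalisation for word frequency would probably be wanted on real data, especially for rare words.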

Reid Huntsinger