Corrin Lakeland
2001-Dec-13 20:42 UTC
[R] k-means with euclidian distance but no coordinates
Hi,

I'm trying to build a thesaurus that will give sensible values for rare words. I suspect the best algorithm to use is k-means although I'm not sure about that -- I would have preferred a k-dimensional space with a binary cluster in each dimension, so that a word can belong to 0..k clusters, but I digress...

I can measure the strength of correlation between words fairly easily by counting co-occurrences divided by the frequency of each word, giving a Euclidean distance, although this doesn't work especially well for rare words. However, I don't have coordinates as such, and deriving them from the distances is non-trivial.

Now, as I understand k-means, it uses Euclidean distance rather than coordinates; the first step given in texts is to derive the distance from the coordinates. But I can't find a way to call the built-in function without coordinates. I had a look at R-1.3.1/src/library/mva/src/kmns.f, but my Fortran isn't good and I had enough trouble following the code, so I'm not up to making major changes.

Any help or ideas would be appreciated.

Corrin

--
Corrin Lakeland <lakeland at cs.otago.ac.nz>
Department of Computer Science
University of Otago, New Zealand
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Prof Brian Ripley
2001-Dec-13 21:39 UTC
[R] k-means with euclidian distance but no coordinates
By definition K-means needs coordinates! Try pam/clara in library cluster.

I think you have a distance, but not a Euclidean (sic) one. If you did have a Euclidean distance, cmdscale would (easily) give you coordinates.

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595
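The pam suggestion works directly on dissimilarities because each cluster is represented by one of the data items (a medoid) rather than by a mean, so no coordinates are needed. As a rough illustration only, here is a minimal sketch in plain Python rather than R, so it does not call the cluster library. It uses a simple alternating assign/update scheme, not the build/swap procedure of PAM itself, and the six-item dissimilarity matrix and the function names `assign` and `k_medoids` are made up for the example.

```python
def assign(dist, medoids):
    """Label each item with the position of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, max_iter=100):
    """Cluster n items given only an n x n dissimilarity matrix.

    Simple alternating scheme (an assumption of this sketch, not PAM):
    assign items to the nearest medoid, then re-pick each medoid as the
    cluster member with the smallest total dissimilarity to its cluster.
    """
    medoids = list(range(k))  # naive deterministic start: the first k items
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(len(dist)) if labels[i] == c]
            if not members:  # keep the old medoid if its cluster is empty
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:  # converged
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Hypothetical dissimilarities between six words: items 0-2 are mutually
# close, items 3-5 are mutually close, and the two groups are far apart.
dist = [
    [0.0, 0.1, 0.2, 0.9, 0.9, 0.9],
    [0.1, 0.0, 0.1, 0.9, 0.9, 0.9],
    [0.2, 0.1, 0.0, 0.9, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.0, 0.1, 0.2],
    [0.9, 0.9, 0.9, 0.1, 0.0, 0.1],
    [0.9, 0.9, 0.9, 0.2, 0.1, 0.0],
]

medoids, labels = k_medoids(dist, 2)
print(medoids, labels)
```

Nothing in the sketch requires the dissimilarity to be Euclidean, which is why this family of methods fits the case where only pairwise distances are available.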
Huntsinger, Reid
2001-Dec-14 15:15 UTC
[R] k-means with euclidian distance but no coordinates
K-means uses coordinates to actually calculate the k within-cluster means after classifying points based on distance to the previous iteration's means (centroids). The mean is used because it minimizes the sum of squared distances to the cluster's points. You could try to find this minimizer another way. You would probably restrict to minimizers from your data set, as you can't calculate distance for other words...

You could also try to get a low-dimensional representation with multidimensional scaling (MDS). It takes a distance matrix as input and provides, for each input point, a point in a low-dimensional Euclidean space. One option is to do this for a sample, then approximate the mapping, e.g. with a flexible regression approach. I've seen this work well in some perhaps similar cases.

There are a lot of approaches to mapping into a low-dimensional Euclidean space based essentially on principal components of the co-occurrence matrix. Are you looking for alternatives to these? These or the MDS approach above would let you use stock k-means, and both can be done in R.

Reid Huntsinger
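The MDS route can be sketched as follows. Classical (metric) scaling, which is what cmdscale computes in R, double-centres the squared distances and takes the leading eigenvectors, scaled by the square roots of their eigenvalues, as coordinates. This is a minimal sketch in plain Python, assuming a symmetric distance matrix; the function name `classical_mds`, the power-iteration eigen-solver and the four test points are illustrative choices, not how R implements it.

```python
import math


def classical_mds(dist, dims=2, iters=500):
    """Return (coords, eigenvalues) for a symmetric n x n distance matrix."""
    n = len(dist)
    # Double-centre the squared distances: b = -1/2 * J D^2 J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]  # equals column means by symmetry
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    coords = [[0.0] * dims for _ in range(n)]
    eigenvalues = []
    for k in range(dims):
        # Power iteration for the dominant eigenvector of b
        # (an assumption of this sketch; any symmetric eigen-solver would do).
        v = [math.sin(i + k + 1.0) for i in range(n)]  # arbitrary start
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the matching eigenvalue.
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                  for i in range(n))
        eigenvalues.append(lam)
        scale = math.sqrt(max(lam, 0.0))  # negative eigenvalue: not Euclidean
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)]
             for i in range(n)]
    return coords, eigenvalues


# Hypothetical check: distances between the corners of a 3 x 4 rectangle.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
dist = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in points]
        for p in points]

coords, eigenvalues = classical_mds(dist, dims=2)
print(eigenvalues)
```

When the input really is Euclidean, the recovered coordinates reproduce the distances up to rotation and reflection, and they can be passed to any stock k-means routine. If the centred matrix has negative eigenvalues, the distances are not Euclidean and no set of coordinates reproduces them exactly.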