I'm working on a project related to document clustering. I know that R has clustering algorithms such as clara, but it supports only two distance metrics, Euclidean and Manhattan, which are not very useful for clustering documents. I was wondering how easy it would be to extend the clustering package in R to support other distance metrics, such as cosine distance, or whether there is an API for custom distance metrics.

Best regards,
Raymond Pon
pon3 at llnl.gov
x43062
I searched the help for "cosine distance" and this was the first hit: http://finzi.psych.upenn.edu/R/Rhelp02a/archive/3946.html

Tom

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch On Behalf Of Raymond K Pon
> Sent: Tuesday, 13 September 2005 3:48 AM
> To: r-help at stat.math.ethz.ch
> Subject: [R] Document clustering for R
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
If you are able to implement the computation of the distance matrix, you can use methods such as pam, agnes and hclust, which operate on dissimilarity matrices of any kind. You may also perform a multidimensional scaling with isoMDS, sammon or cmdscale and use any clustering algorithm for n*p data on the outcome.

Best,
Christian

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
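A minimal sketch of the first route (compute the dissimilarity matrix yourself, then hand it to hclust or pam). The document-term matrix is made up for illustration, and cosine_dist is a hypothetical helper, not a function from any package.

```r
library(cluster)  # provides pam(); a recommended package shipped with R

# Toy document-term matrix (rows = documents); the counts are invented
dtm <- rbind(
  doc1 = c(2, 1, 0, 0),
  doc2 = c(4, 2, 0, 0),
  doc3 = c(0, 0, 3, 1),
  doc4 = c(0, 1, 2, 1)
)

# Hypothetical helper: cosine distance (1 - cosine similarity) between rows.
# Assumes no row is all zeros; a zero row would give NaN.
cosine_dist <- function(x) {
  norms <- sqrt(rowSums(x^2))
  sim <- tcrossprod(x) / outer(norms, norms)
  as.dist(1 - sim)  # convert to the standard "dist" class
}

d <- cosine_dist(dtm)

# Both functions accept a "dist" object directly
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
pm <- pam(d, k = 2)

print(groups)
print(pm$clustering)
```

The same d could be passed to agnes in the cluster package. Documents with no terms at all would need to be dropped or handled separately before calling the helper.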
Hi,

We discovered that the package "amap" contains a distance calculation function called Dist, which can calculate the distance according to a method called "pearson". This is in fact the "not centered Pearson", which seems to be the cosine distance. Could you tell me what you think of that?

Best regards,
David
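One way to check this is to compute both quantities by hand on a pair of vectors. The sketch below uses invented vectors; the comparison with amap::Dist runs only if amap happens to be installed, and its result is printed rather than assumed.

```r
# Two made-up term-count vectors
x <- c(2, 1, 0, 3)
y <- c(1, 0, 4, 1)

# "Not centered" Pearson: the correlation formula without subtracting means
uncentred <- sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Cosine similarity: dot product divided by the product of the vector norms
cosine <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Ordinary (centred) Pearson correlation, for contrast
centred <- cor(x, y)

print(all.equal(uncentred, cosine))  # the two formulas are algebraically identical
print(c(cosine = cosine, centred = centred))

# Optional: compare with amap, if available
if (requireNamespace("amap", quietly = TRUE)) {
  d_amap <- amap::Dist(rbind(x, y), method = "pearson")
  print(all.equal(as.numeric(d_amap), 1 - cosine))
}
```

The centred and uncentred versions agree only when both vectors already have mean zero, which term counts generally do not.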
On Mon, 2005-09-12 at 12:47 -0700, Raymond K Pon wrote:
> I was wondering how easy it would be to extend the
> clustering package in R to support other distance metrics, such as
> cosine distance, or if there was an API for custom distance metrics.

You don't have to extend the "clustering package in R to support other distance metrics", but you should take care to produce your dissimilarities (or distances) in the standard format, so that they can be used in the "clustering package", in cmdscale, in isoMDS, or in any other function accepting a "dist" object. The "clustering package" will support new dissimilarities if they are written in a standard-conforming way.

There are several packages that offer alternative dissimilarities (and some even distances) that can be used in clustering functions. Search for "distances" or "dissimilarities" on the R site. One of these could be the one for you. I would be surprised if the cosine index were missing (and if needed, I could write it for you in C, but I don't think that is necessary).

cheers, jari oksanen
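To illustrate what "standard format" means here, the sketch below wraps an arbitrary pairwise function into a "dist" object with as.dist. The helper name pairwise_dist, the cosine function and the data are all assumptions made for the example; a plain double loop is used for clarity, not speed.

```r
# Hypothetical helper: apply any function f(a, b) to every pair of rows
# (assumes x has at least two rows) and return the result as a "dist" object.
pairwise_dist <- function(x, f) {
  n <- nrow(x)
  m <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      m[i, j] <- m[j, i] <- f(x[i, ], x[j, ])
    }
  }
  as.dist(m)  # keeps only the lower triangle, plus Size and Labels attributes
}

# Example dissimilarity: cosine distance between two vectors
cos_d <- function(a, b) 1 - sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Invented data: three documents, three terms
docs <- rbind(a = c(1, 0, 2), b = c(2, 0, 4), c = c(0, 3, 1))
d <- pairwise_dist(docs, cos_d)

print(class(d))            # "dist"
print(attr(d, "Labels"))   # row names carried over
hc <- hclust(d)            # any function accepting a "dist" object will take it
```

For large collections a vectorised or compiled implementation would be preferable, since the loop evaluates f once per pair of documents.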