Hi,

I have ~40,000 rows in a database, each of which contains an id column and 20 additional columns of count data. I want to cluster the rows based on these count vectors.

There are ~1.6 billion possible 'distances' between pairs of vectors (cells in my distance matrix), so I need to do something smart. Can R somehow handle this?

My first thought was to index the database with something that makes nearest neighbour lookup more efficient, and then use single linkage clustering. Is this kind of index implemented in R (by default when using single linkage)?

Also, 'grouping' identical vectors is very easy. I tried making the groups fuzzier by using a hashing function over the count vectors, but my hash was too crude. Is there any way to do fuzzy grouping in R which scales well?

For example, removing identical vectors gives me ~30,000 rows (and ~900 million pairs of distances). As an example of how fast I can group, the above query took 0.13 seconds in MySQL (using an index over every element in the vector). However, if I tried to calculate a distance between every pair of non-identical vectors (let's say I can calculate ~1,000 Euclidean distances per second), it would take me ~10 days just to compute the distance matrix.

Sorry for all the information. Any suggestions on how to cluster such a huge dataset (using R) would be appreciated.

Cheers,
Dan.
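[Editor's note: the exact grouping of identical vectors that Dan describes can be done entirely in R, without a database index. A minimal sketch, assuming the 20 count columns are in a matrix; the object name 'counts' and the simulated Poisson data are placeholders, not Dan's actual table:

    # Simulated stand-in for the real 40,000 x 20 count table.
    set.seed(1)
    counts <- matrix(rpois(40000 * 20, lambda = 3), ncol = 20)

    # Build a text key per row, then group identical vectors.
    key    <- apply(counts, 1, paste, collapse = ",")
    groups <- split(seq_len(nrow(counts)), key)  # row indices per unique vector
    uniq   <- counts[!duplicated(key), , drop = FALSE]
    nrow(uniq)                                   # distinct vectors left to cluster

This is exact grouping only; fuzzy grouping would need a different approach, such as the sampling-based methods suggested in the replies below.]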
Dear Dan,

I would think about transforming your columns (square root? log?) so that methods which operate directly on the n*p data matrix and assume roughly elliptical within-cluster distributions, such as kmeans or clara, or, after dimension reduction, EMclust or fixmahal, can be applied. Maybe you can even do that on the untransformed data (take a look at the variable-wise distributions or 2-d scatterplots). You do not need a distance matrix then.

Christian

On Wed, 15 Dec 2004, Dan Bolser wrote:

[original message snipped]

***********************************************************************
Christian Hennig
Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
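[Editor's note: a sketch of this suggestion. The square-root transform and the choice of 10 centres are illustrative guesses, and the simulated counts stand in for the real data:

    set.seed(1)
    counts <- matrix(rpois(40000 * 20, lambda = 3), ncol = 20)

    # A variance-stabilising transform is often reasonable for counts.
    x <- sqrt(counts)

    # kmeans works on the n*p matrix directly, so the
    # 40,000 x 40,000 distance matrix is never formed.
    km <- kmeans(x, centers = 10, nstart = 5, iter.max = 50)
    table(km$cluster)                # cluster sizes

The key point is that the cost grows with n*p rather than with the n^2 entries of a distance matrix.]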
I did not find this in the archive (hope it isn't there...): the current release of R (2.0.1) for MacOS (10.3.6) seems not to handle German special characters like 'ü' correctly:

    > f <- 'ü'

can be entered at the prompt, but echoing the variable yields

    [1] "\303\274"

(the octal escapes of the character's UTF-8 bytes), and inserting, for instance, text(1, 2, f) in some plot seems to draw two characters ('Ã¼', probably the two UTF-8 bytes interpreted as separate single-byte characters). I believe this is an R problem, or is there a simple configuration switch?

thanks,
joerg
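[Editor's note: symptoms like this usually come down to a mismatch between the terminal's encoding and the locale R assumes. A hedged sketch of what to check; whether this helps on that particular MacOS build is uncertain, and the locale name "de_DE.UTF-8" is only an example and platform-dependent:

    # Which character encoding does R currently assume?
    Sys.getlocale("LC_CTYPE")

    # If the terminal sends UTF-8 but R assumes a single-byte
    # encoding (or the reverse), multi-byte characters show up
    # as escape codes such as "\303\274".  Aligning the two:
    Sys.setlocale(category = "LC_CTYPE", locale = "de_DE.UTF-8")
]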
It sounds like "clara" in package cluster might help.

Regards,
Matt Wiener

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Dan Bolser
Sent: Wednesday, December 15, 2004 6:37 AM
To: R mailing list
Subject: [R] Massive clustering job?

[original message snipped]
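[Editor's note: a hedged sketch of clara() on data of this shape; the simulated counts, the choice of k = 10, and the sampling settings are placeholders:

    library(cluster)

    set.seed(1)
    counts <- matrix(rpois(40000 * 20, lambda = 3), ncol = 20)

    # clara clusters around medoids but only computes distances
    # within modest subsamples, so the full 40,000 x 40,000
    # distance matrix is never built.
    cl <- clara(sqrt(counts), k = 10, samples = 20, sampsize = 200)
    table(cl$clustering)             # cluster sizes

Memory and time scale with sampsize rather than with n, which is what makes clara practical at this scale.]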