Hi,

I have ~40,000 rows in a database, each of which contains an id column and 20 additional columns of count data. I want to cluster the rows based on these count vectors.

There are ~1.6 billion possible 'distances' between pairs of vectors (cells in my distance matrix), so I need to do something smart. Can R somehow handle this?

My first thought was to index the database with something that makes nearest neighbour lookup more efficient, and then use single linkage clustering. Is this kind of index implemented in R (by default when using single linkage)?

Also, 'grouping' identical vectors is very easy. I tried making the groups fuzzier by using a hashing function over the count vectors, but my hash was too crude. Is there any way to do fuzzy grouping in R which scales well?

For example, removing identical vectors gives me ~30,000 rows (and ~900 million pairs of distances). As an example of how fast I can group, the above query took 0.13 seconds in MySQL (using an index over every element in the vector). However, if I tried to calculate a distance between every pair of non-identical vectors (let's say I can calculate ~1,000 Euclidean distances per second), it would take me ~10 days just to compute the distance matrix.

Sorry for all the information. Any suggestions on how to cluster such a huge dataset (using R) would be appreciated.

Cheers,
Dan.
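[Editor's note: the exact grouping of identical vectors that Dan describes can be done entirely in R, without a database index. A minimal sketch, assuming the 20 count columns are in a matrix; the object name 'counts' and the simulated Poisson data are placeholders, not Dan's actual table:

    # Simulated stand-in for the real 40,000 x 20 count table.
    set.seed(1)
    counts <- matrix(rpois(40000 * 20, lambda = 3), ncol = 20)

    # Build a text key per row, then group identical vectors.
    key    <- apply(counts, 1, paste, collapse = ",")
    groups <- split(seq_len(nrow(counts)), key)  # row indices per unique vector
    uniq   <- counts[!duplicated(key), , drop = FALSE]
    nrow(uniq)                                   # distinct vectors left to cluster

This is exact grouping only; fuzzy grouping would need a different approach, such as the sampling-based methods suggested in the replies below.]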
Dear Dan,

I would think about transforming your columns (square root? log?) so that methods which operate directly on the n*p data matrix and assume roughly elliptical within-cluster distributions, such as kmeans or clara, or, after dimension reduction, EMclust or fixmahal, can be applied. Maybe you can even do that on the untransformed data (take a look at the variable-wise distributions or 2-d scatterplots). You do not need a distance matrix then.

Christian

On Wed, 15 Dec 2004, Dan Bolser wrote:

[original message snipped]

***********************************************************************
Christian Hennig
Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
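[Editor's note: a sketch of this suggestion. The square-root transform and the choice of 10 centres are illustrative guesses, and the simulated counts stand in for the real data:

    set.seed(1)
    counts <- matrix(rpois(40000 * 20, lambda = 3), ncol = 20)

    # A variance-stabilising transform is often reasonable for counts.
    x <- sqrt(counts)

    # kmeans works on the n*p matrix directly, so the
    # 40,000 x 40,000 distance matrix is never formed.
    km <- kmeans(x, centers = 10, nstart = 5, iter.max = 50)
    table(km$cluster)                # cluster sizes

The key point is that the cost grows with n*p rather than with the n^2 entries of a distance matrix.]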
I did not find this in the archive (hope it isn't there...): the current release of R (2.0.1) for MacOS (10.3.6) seems not to handle German special characters like 'ü' correctly:

    > f <- 'ü'

can be entered at the prompt, but echoing the variable yields

    [1] "\303\274"

(the octal escapes of the character's UTF-8 bytes), and inserting, for instance, text(1, 2, f) in some plot seems to draw two characters ('Ã¼', probably the two UTF-8 bytes interpreted as separate single-byte characters). I believe this is an R problem, or is there a simple configuration switch?

thanks,
joerg
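[Editor's note: symptoms like this usually come down to a mismatch between the terminal's encoding and the locale R assumes. A hedged sketch of what to check; whether this helps on that particular MacOS build is uncertain, and the locale name "de_DE.UTF-8" is only an example and platform-dependent:

    # Which character encoding does R currently assume?
    Sys.getlocale("LC_CTYPE")

    # If the terminal sends UTF-8 but R assumes a single-byte
    # encoding (or the reverse), multi-byte characters show up
    # as escape codes such as "\303\274".  Aligning the two:
    Sys.setlocale(category = "LC_CTYPE", locale = "de_DE.UTF-8")
]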
It sounds like "clara" in package cluster might help.

Regards,
Matt Wiener

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Dan Bolser
Sent: Wednesday, December 15, 2004 6:37 AM
To: R mailing list
Subject: [R] Massive clustering job?

[original message snipped]
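[Editor's note: a hedged sketch of clara() on data of this shape; the simulated counts, the choice of k = 10, and the sampling settings are placeholders:

    library(cluster)

    set.seed(1)
    counts <- matrix(rpois(40000 * 20, lambda = 3), ncol = 20)

    # clara clusters around medoids but only computes distances
    # within modest subsamples, so the full 40,000 x 40,000
    # distance matrix is never built.
    cl <- clara(sqrt(counts), k = 10, samples = 20, sampsize = 200)
    table(cl$clustering)             # cluster sizes

Memory and time scale with sampsize rather than with n, which is what makes clara practical at this scale.]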