thr3ads.net - R help - [R] hierarchical clustering of large dataset [Mar 2012]


Massimo Di Stefano

2012-Mar-08 12:41 UTC

[R] hierarchical clustering of large dataset

Hello All,

I have a set of observations in this form:

a,    b,    c,    d,    e,    f
67.12,    4.28,    1.7825,    30,    3,    16001
67.12,    4.28,    1.7825,    30,    3,    16001
66.57,    4.28,    1.355,    30,    3,    16001
66.2,    4.28,    1.3459,    13,    3,    16001
66.2,    4.28,    1.3459,    13,    3,    16001
66.2,    4.28,    1.3459,    13,    3,    16001
66.2,    4.28,    1.3459,    13,    3,    16001
66.2,    4.28,    1.3459,    13,    3,    16001
66.2,    4.28,    1.3459,    13,    3,    16001
63.64,    9.726,    1.3004,    6,    3,    11012
63.28,    9.725,    1.2755,    6,    3,    11012
63.28,    9.725,    1.2755,    6,    3,    11012
63.28,    9.725,    1.2755,    6,    3,    11012
63.28,    9.725,    1.2755,    6,    3,    11012
63.28,    9.725,    1.2755,    6,    3,    11012
...

55,000 observations in total.

where a, b, c, d, e are environmental parameters
and f is a label.

As you can see, some rows are duplicated;
this means that the observation occurred more than once.

(In my use case an observation is the presence of a specific biological
species in a photo; if a photo contains more than one individual of the
same species, I have a duplicated row.)


I'm trying to learn how to use R to build a dendrogram
that will help me 'group' several species into communities, based on
the similarity of the environmental parameters.

I tried:

d <- diet(as.matrix(my data))
hc <- hclust(d)

but it doesn't work.

Is the 'redundancy' of my data (multiple rows with the same information) a
problem?
Should I remove all the rows that are exactly the same?
If so, how do I account for the fact that I have multiple observations for
the same environmental parameters?
Or is this information not relevant for building the dendrogram?

Could you please suggest a valid approach for clustering such a dataset?
Forgive me, I have an evident lack of statistical knowledge. Thank you very
much for your help!

Sarah Goslee

2012-Mar-08 14:02 UTC


See inline:

On Thu, Mar 8, 2012 at 7:41 AM, Massimo Di Stefano
<massimodisasha at gmail.com> wrote:
> Hello All,
>
> i've a set of observations that is in the form :
>
> a,    b,    c,    d,    e,    f
> 67.12,    4.28,    1.7825,    30,    3,    16001
> 67.12,    4.28,    1.7825,    30,    3,    16001
> 66.57,    4.28,    1.355,    30,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 63.64,    9.726,    1.3004,    6,    3,    11012
> 63.28,    9.725,    1.2755,    6,    3,    11012
> 63.28,    9.725,    1.2755,    6,    3,    11012
> 63.28,    9.725,    1.2755,    6,    3,    11012
> 63.28,    9.725,    1.2755,    6,    3,    11012
> 63.28,    9.725,    1.2755,    6,    3,    11012
> ...
>
> 55.000 observation in total.
>
> where :
>
> a,    b,    c,    d,    e
> are environmental parameters
> and f is a label.
>
> as you can see some rows are duplicated,
> this means that the observation occurred more times

If you use dput() for the first 10 or 20 rows of your data, then you will
have provided the requested reproducible example.
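
A minimal sketch of that suggestion, assuming the data sit in a data frame called `mydata` (a hypothetical name; here it is filled with a few rows copied from the post):

```r
# Hypothetical data frame holding a few of the rows shown in the post.
mydata <- data.frame(
  a = c(67.12, 67.12, 66.57, 66.2, 63.64, 63.28),
  b = c(4.28, 4.28, 4.28, 4.28, 9.726, 9.725),
  c = c(1.7825, 1.7825, 1.355, 1.3459, 1.3004, 1.2755),
  d = c(30, 30, 30, 13, 6, 6),
  e = c(3, 3, 3, 3, 3, 3),
  f = c(16001, 16001, 16001, 16001, 11012, 11012)
)

# dput() prints R code that recreates the object;
# paste that output into the email so others can rebuild the data.
dput(head(mydata, 20))

# Round trip: the text printed by dput() can be parsed back into a data frame.
txt     <- capture.output(dput(head(mydata, 20)))
rebuilt <- eval(parse(text = txt))
```

Readers can then assign the pasted text to a variable and work with the same rows, without needing the original file.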

> (in my use cases the observation is the presence of a specific biological
> specie in a photo,
> if in the photo there are more than one individual of the same species i
> have a duplicated row )
>
>
> i'm trying to learn how to use R in order to build a dendrogram
> that will help me to 'group' several species in communities, based
> on the similarity of the env. parameters.
>
> i tried with
>
> d <- diet(as.matrix(my data))
> hc <- hclust(d)
>
> but it doesn't works.

I'm assuming you mean dist() instead of diet()? I don't know of any
function named diet().

What "doesn't work"? We can't answer your question unless we know
what it is.
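
For reference, the call with dist() would look roughly like the sketch below. The object name `env` is an assumption (the posted code used `my data`, which is not a valid R name because of the space), and the rows are a small sample copied from the post.

```r
# A few environmental rows from the post (columns a..e only).
# `env` is a hypothetical object name.
env <- rbind(
  c(67.12, 4.28,  1.7825, 30, 3),
  c(66.57, 4.28,  1.355,  30, 3),
  c(66.2,  4.28,  1.3459, 13, 3),
  c(63.64, 9.726, 1.3004,  6, 3),
  c(63.28, 9.725, 1.2755,  6, 3)
)
colnames(env) <- c("a", "b", "c", "d", "e")

d  <- dist(env)   # Euclidean distances between rows (the default method)
hc <- hclust(d)   # complete linkage is the default
plot(hc)          # draw the dendrogram
```

If a call like this fails on the full data, the exact error message is the piece of information to report.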

> is the 'redundancy' of my data (multiple rows with same
> information) a problem?
> should i remove all the rows that are exactly the same ?

Yes. Identical rows have a distance of 0, so they're clustered together
immediately, so a dendrogram that includes them is identical to one that
has only unique rows.
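
The zero-distance point can be checked on a small sample. This is a sketch only: `env` is a hypothetical name, the comparison uses the default complete linkage, and the `count` column is one possible way to keep the number of repeats, not something proposed in the thread.

```r
# Sample rows from the post, with duplicates left in.
env <- data.frame(
  a = c(66.2, 66.2, 66.2, 63.28, 63.28, 67.12),
  b = c(4.28, 4.28, 4.28, 9.725, 9.725, 4.28),
  c = c(1.3459, 1.3459, 1.3459, 1.2755, 1.2755, 1.7825),
  d = c(13, 13, 13, 6, 6, 30),
  e = c(3, 3, 3, 3, 3, 3)
)

# Identical rows are at distance 0, so they merge at height 0.
hc_all <- hclust(dist(env))
print(hc_all$height)

# Keep one copy of each row before clustering.
env_unique <- unique(env)
hc_unique  <- hclust(dist(env_unique))
print(hc_unique$height)

# Optionally record how often each distinct row occurred.
counts <- aggregate(count ~ ., data = cbind(env, count = 1), FUN = sum)
print(counts)
```

The non-zero merge heights of the two trees match here; the duplicates only add merges at height 0.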

> in this way how to take care about the fact that for the same environmental
> parameters i've multiple observation ?
> maybe this information is not relevant in order to build the dendrogram ?
>
> Please, can you suggest me a valid approach in order to cluster a such
> dataset ?
> forgive me, i've an evident lack of statistic knowledge, thank you very
> mach for you help!

Perhaps some reading in one of the many excellent ecologically-based
multivariate statistics books is called for?

Sarah



-- 
Sarah Goslee
http://www.functionaldiversity.org

Peter Langfelder

2012-Mar-09 20:54 UTC


On Thu, Mar 8, 2012 at 4:41 AM, Massimo Di Stefano
<massimodisasha at gmail.com> wrote:
> Hello All,
>
> i've a set of observations that is in the form :
>
> a,    b,    c,    d,    e,    f
> 67.12,    4.28,    1.7825,    30,    3,    16001
> 67.12,    4.28,    1.7825,    30,    3,    16001
> 66.57,    4.28,    1.355,    30,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 66.2,    4.28,    1.3459,    13,    3,    16001
> 63.64,    9.726,    1.3004,    6,    3,    11012
> 63.28,    9.725,    1.2755,    6,    3,    11012
> 63.28,    9.725,    1.2755,    6,    3,    11012
> 63.28,    9.725,    1.2755,    6,    3,    11012
> 63.28,    9.725,    1.2755,    6,    3,    11012
> 63.28,    9.725,    1.2755,    6,    3,    11012
> ...
>
> 55.000 observation in total.

Hi Massimo,

you don't want to use the entire matrix to calculate the distance. You
will want to select the environmental columns and you may want to
standardize them to prevent one of them having more influence than
others.
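
A sketch of the column selection and standardization described above, assuming a data frame `mydata` with the columns a to f from the post (the object names are hypothetical). Note that column e is constant in the posted sample, and scale() would turn a zero-variance column into NaN, so constant columns are dropped first.

```r
# Sample rows copied from the post; `mydata` is a hypothetical name.
mydata <- data.frame(
  a = c(67.12, 66.57, 66.2, 63.64, 63.28),
  b = c(4.28, 4.28, 4.28, 9.726, 9.725),
  c = c(1.7825, 1.355, 1.3459, 1.3004, 1.2755),
  d = c(30, 30, 13, 6, 6),
  e = c(3, 3, 3, 3, 3),
  f = c(16001, 16001, 16001, 11012, 11012)
)

# Keep only the environmental columns; f is a label, not a measurement.
env <- mydata[, c("a", "b", "c", "d", "e")]

# Drop columns with zero variance (scale() would produce NaN for them).
keep <- sapply(env, sd) > 0
env  <- env[, keep, drop = FALSE]

# Centre each column to mean 0 and rescale to standard deviation 1,
# so that no single variable dominates the Euclidean distance.
env_std <- scale(env)

d <- dist(env_std)
```

Without the scaling step, column d (values from 6 to 30) would contribute far more to the distances than column c (values near 1.3).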

Second, if you want to cluster such a huge data set using hierarchical
clustering, you need a lot of memory, at least 32GB but preferably
64GB. If you don't have that much, you cannot use hierarchical
clustering.
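
A back-of-the-envelope check of the memory requirement, assuming distances are stored as 8-byte doubles in the lower-triangle form that dist() uses. This counts a single copy of the distance vector only; the larger figures quoted above presumably leave room for working copies made during clustering.

```r
n <- 55000                      # number of observations in the post

n_pairs <- n * (n - 1) / 2      # pairwise distances stored by dist()
bytes   <- n_pairs * 8          # one double per distance
gib     <- bytes / 1024^3

cat(sprintf("%.0f distances, about %.1f GiB for one copy\n", n_pairs, gib))

# Removing duplicated rows first reduces n, and the cost falls
# roughly with the square of n.
n_half <- n / 2
cat(sprintf("With %d rows: about %.1f GiB\n",
            as.integer(n_half), n_half * (n_half - 1) / 2 * 8 / 1024^3))
```

Because the cost grows quadratically, halving the number of rows cuts the memory needed to roughly a quarter.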

Third, if you do have enough memory, use package flashClust or
fastcluster (I am the maintainer of flashClust.)
For flashClust, you can install it using
install.packages("flashClust") and load it using library(flashClust).
The standard R implementation of hclust is unnecessarily slow (order
n^3). flashClust provides a replacement (function hclust) that is
approximately n^2. I have clustered data sets of 30000 variables in a
minute or two, so 55000 shouldn't take more than 4-5 minutes, again
assuming your computer has enough memory.
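
A sketch of how the faster implementations can be swapped in. The data here are synthetic stand-ins, the choice of average linkage and of four groups is arbitrary, and the code falls back to stats::hclust() when neither package is installed.

```r
set.seed(1)
x <- matrix(rnorm(200 * 5), ncol = 5)   # synthetic stand-in for the real data
d <- dist(scale(x))

# Both packages provide an hclust() that takes the same main arguments
# as stats::hclust() and returns an object of class "hclust".
if (requireNamespace("fastcluster", quietly = TRUE)) {
  hc <- fastcluster::hclust(d, method = "average")
} else if (requireNamespace("flashClust", quietly = TRUE)) {
  hc <- flashClust::hclust(d, method = "average")
} else {
  hc <- stats::hclust(d, method = "average")
}

# Cut the tree into a fixed number of groups and count their sizes.
groups <- cutree(hc, k = 4)
print(table(groups))
```

Since the result has the usual "hclust" class, plot() and cutree() work on it unchanged.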

HTH,

Peter
