thr3ads.net - R help - [R] Cluster analysis, factor variables, large data set [Mar 2011]


Hans Ekbrand

2011-Mar-31 17:46 UTC

[R] Cluster analysis, factor variables, large data set

Dear R helpers,

I have a large data set with 36 variables and about 50,000 cases. The
variables represent labour market status during 36 months; there are 8
different variable values (e.g. Full-time Employment, Student, ...).

Only cases with at least one change in labour market status are
included in the data set.

To analyse subsets of the data, I have used daisy in the
cluster package to create a distance matrix and then used pam (or pamk
in the fpc package) to get a k-medoids cluster solution. Now I want
to analyse the whole set.
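
For readers who want to reproduce the subset workflow described above, here is a minimal sketch. The data are simulated stand-ins (the status labels, the subset size of 200 and k = 4 are assumptions for illustration, not the original data or settings); only daisy and pam from the cluster package are taken from the post.

```r
library(cluster)

set.seed(1)

# Hypothetical status labels; the post only says there are 8 values
states <- c("FullTime", "PartTime", "Student", "Unemployed",
            "SickLeave", "ParentalLeave", "Retired", "Other")
n <- 200                              # a small subset, not the full 50,000
months <- paste0("m", 1:36)

# One factor column per month
d <- as.data.frame(lapply(setNames(months, months), function(m)
  factor(sample(states, n, replace = TRUE), levels = states)))

# Gower dissimilarity; for nominal factors this is the share of
# months in which two cases have different statuses
diss <- daisy(d, metric = "gower")

# k-medoids on the precomputed dissimilarities (k = 4 is arbitrary here)
fit <- pam(diss, k = 4, diss = TRUE)

table(fit$clustering)
```

This works well for a few thousand cases because the full dissimilarity object still fits in memory.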

clara is said to cope with large data sets, but the first step in the
cluster analysis, the creation of the distance matrix, must be done by
another function since clara only works with numeric data.

Is there an alternative to the daisy -> clara route that does not
require as much RAM?
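
To make the RAM concern concrete, a rough back-of-the-envelope calculation: a dissimilarity object stores one double (8 bytes) for each of the n(n-1)/2 pairs of cases. The sketch below assumes exactly 50,000 cases and ignores any working copies made during computation, so the real requirement is likely higher.

```r
n <- 50000

# Number of pairwise dissimilarities in the lower triangle
pairs <- n * (n - 1) / 2      # 1,249,975,000 pairs

# Each dissimilarity is stored as an 8-byte double
bytes <- pairs * 8

gib <- bytes / 2^30
round(gib, 1)                 # about 9.3 GiB
```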

What functions would you recommend for a cluster analysis of this kind
of data on a large data set?


regards,

Hans Ekbrand

Christian Hennig

2011-Mar-31 18:06 UTC


[R] Cluster analysis, factor variables, large data set

Dear Hans,

clara doesn't require a distance matrix as input (and therefore doesn't
require you to run daisy); it works on the raw data matrix, using
Euclidean distances implicitly.
I can't tell you whether Euclidean distances are appropriate in this
situation (this depends on the interpretation of the variables and
particularly on how they are scaled), but they may be fine, at least after
some transformation and standardisation of your variables.
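
One possible transformation, sketched here as an illustration rather than a recommendation, is to recode each monthly factor into 0/1 indicator columns and pass the resulting numeric matrix to clara. With a full set of indicators, the Manhattan distance between two cases equals twice the number of months in which their statuses differ, so it behaves like a simple matching distance. The simulated data, the status labels, k = 4, and the samples and sampsize values are all assumptions; whether this coding suits the real data depends on how the statuses should be interpreted.

```r
library(cluster)

set.seed(1)

# Hypothetical status labels and simulated data standing in for the real set
states <- c("FullTime", "PartTime", "Student", "Unemployed",
            "SickLeave", "ParentalLeave", "Retired", "Other")
n <- 2000
months <- paste0("m", 1:36)
d <- as.data.frame(lapply(setNames(months, months), function(m)
  factor(sample(states, n, replace = TRUE), levels = states)))

# One 0/1 indicator column per (month, status) pair, no intercept
X <- do.call(cbind, lapply(d, function(f) model.matrix(~ f - 1)))
colnames(X) <- paste(rep(months, each = length(states)), states, sep = ".")

# clara clusters subsamples of size sampsize, so the full n x n
# distance matrix is never built
fit <- clara(X, k = 4, metric = "manhattan", samples = 20, sampsize = 200)

table(fit$clustering)
```

The indicator matrix has 36 * 8 = 288 columns, so memory grows linearly with the number of cases instead of quadratically.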

Hope this helps,
Christian

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
