Hello,

I have a large data matrix (68 x 13112), each row corresponding to one observation (patients) and each column corresponding to a variable (points within an NMR spectrum). I would like to carry out some kind of clustering on these data to see how many clusters there are. I have tried the function clara() from the package cluster. If I use the matrix as is, I can perform the clara analysis, but when I call clusplot() I get this error:

Error in princomp.default(x, scores = TRUE, cor = ncol(x) != 2) :
  'princomp' can only be used with more units than variables

So I reduce the dimensionality with prcomp(), take the first 13 principal components (>80% of the variability) and run the clara() analysis again. When I then call clusplot(), voilà, it works. The problem is that clusplot() only represents the first two components of my prcomp() analysis, which account for only 15% of the variability.

So, my questions are: 1) is clara() a proper way to analyse such a large data set? and 2) is there an appropriate method for plotting my data that takes into account the whole variability of my data, not just the first two principal components?

Many thanks.
Best,

Dani

--
Daniel Valverde Saubí

Grup de Biologia Molecular de Llevats
Facultat de Veterinària de la Universitat Autònoma de Barcelona
Edifici V, Campus UAB
08193 Cerdanyola del Vallès - SPAIN

Centro de Investigación Biomédica en Red en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN)

Grup d'Aplicacions Biomèdiques de la RMN
Facultat de Biociències
Universitat Autònoma de Barcelona
Edifici Cs, Campus UAB
08193 Cerdanyola del Vallès - SPAIN
+34 93 5814126
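[For reference, a minimal R sketch of the workflow described above; the data object `nmr` and the number of clusters are hypothetical.]

library(cluster)

## nmr: hypothetical 68 x 13112 matrix of NMR intensities (rows = patients)
## Reduce dimensionality first, since princomp() inside clusplot() needs
## more observations than variables.
pc <- prcomp(nmr, center = TRUE, scale. = TRUE)

## Keep enough components to cover >80% of the variance (13 in the post above)
scores <- pc$x[, 1:13]

## CLARA clustering on the PCA scores; k = 3 is only an example
cl <- clara(scores, k = 3)

## clusplot() now works, but it only displays the first two principal
## components of `scores`, hence the limited variance shown in the plot.
clusplot(cl)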
Hi Dani,

If you are working with NMR data, which data pretreatment methods are you using? 13112 variables for NMR data sounds like too many; you should apply some data binning or peak-picking method for data reduction. You must also consider the multicollinearity problems inherent in spectroscopic data, so data reduction with PCA or a similar method is an essential step in your analysis. However, PCA is also quite sensitive to noise, and a supervised classification method such as PLS-DA could be more suitable. You should take a look at the pls package. The caret package has very well written routines for testing model reproducibility and stability, not only for PLS-DA but also for other methods. The mclust package could also be useful.

You can also take a look at this package and these references:

http://sourceforge.net/projects/kopls/
http://www.jstatsoft.org/v18/i06
http://cran.r-project.org/web/packages/caret/caret.pdf
http://www.jstatsoft.org/v18/i02
http://dx.doi.org/10.1002/cem.887
http://dx.doi.org/10.1186/1471-2105-9-106

Best regards

--
Andris Jankevics
Assistant
Department of Medicinal Chemistry
Latvian Institute of Organic Synthesis
Aizkraukles 21, LV-1006, Riga, Latvia
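[A rough sketch of the binning and PLS-DA route suggested here; the objects `nmr` and `group`, and the bin width of 8 points, are assumptions, and PLS-DA of course needs a known class label for each patient.]

## Equal-width binning: average blocks of 8 adjacent spectral points
bin_size <- 8
n_bins   <- ncol(nmr) %/% bin_size
binned   <- sapply(seq_len(n_bins), function(b) {
  cols <- ((b - 1) * bin_size + 1):(b * bin_size)
  rowMeans(nmr[, cols, drop = FALSE])
})
dim(binned)   # 68 x 1639

## PLS-DA via caret (which wraps the pls package); `group` is a hypothetical
## factor of class labels, one per patient
library(caret)
fit <- train(x = binned, y = group,
             method = "pls",
             preProcess = c("center", "scale"),
             tuneLength = 10,
             trControl = trainControl(method = "cv", number = 5))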
Hi there,

Whether clara is a proper way of clustering depends strongly on what your data are and, in particular, on what interpretation or use you want for your clustering. You may do better with a hierarchical method after having defined a proper distance (however, this would rather be a matter for statistical consultation than for R help).

Assuming that you use some reasonable dimension reduction and clustering method, you may get a good visualization of your clustering using the methods available via the functions plotcluster/discrproj in the package fpc.

Best,
Christian

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
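[A minimal sketch of the hierarchical + fpc route suggested above, assuming `scores` holds the PCA scores (e.g. the 68 x 13 matrix from the earlier prcomp() step) and that Euclidean distance with Ward linkage is an acceptable choice for these data.]

library(fpc)

d  <- dist(scores, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
cl <- cutree(hc, k = 3)        # k = 3 is only an example

## plotcluster() projects the data so that the separation between the given
## clusters is emphasised, rather than showing just the first two PCs
plotcluster(scores, cl)

## discrproj() returns a list; its $proj component holds the projected
## coordinates if you want to draw the plot yourself
dp <- discrproj(scores, cl)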