Hello,

I have a large data matrix (68 x 13112), each row corresponding to one observation (patients) and each column corresponding to a variable (points within an NMR spectrum). I would like to carry out some kind of clustering on these data to see how many clusters there are. I have tried the function clara() from the package cluster. If I use the matrix as is, I can perform the clara analysis, but when I call clusplot() I get this error:

Error in princomp.default(x, scores = TRUE, cor = ncol(x) != 2) :
  'princomp' can only be used with more units than variables

So I reduce the dimensionality with prcomp(), take the first 13 principal components (>80% of the variability) and run the clara() analysis again. When I then call clusplot(), voilà, it works. The problem is that clusplot() only represents the first two components of my prcomp() analysis, which account for only 15% of the variability.

So, my questions are: 1) is clara() a proper way to analyse such a large data set? and 2) is there an appropriate method for plotting my data that takes into account the whole variability of my data, not just the first two principal components?

Many thanks.
Best,

Dani

--
Daniel Valverde Saubí

Grup de Biologia Molecular de Llevats
Facultat de Veterinària de la Universitat Autònoma de Barcelona
Edifici V, Campus UAB
08193 Cerdanyola del Vallès - SPAIN

Centro de Investigación Biomédica en Red en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN)

Grup d'Aplicacions Biomèdiques de la RMN
Facultat de Biociències
Universitat Autònoma de Barcelona
Edifici Cs, Campus UAB
08193 Cerdanyola del Vallès - SPAIN
+34 93 5814126
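[For reference, a minimal R sketch of the workflow described above; the data object `nmr` and the number of clusters are hypothetical.]

library(cluster)

## nmr: hypothetical 68 x 13112 matrix of NMR intensities (rows = patients)
## Reduce dimensionality first, since princomp() inside clusplot() needs
## more observations than variables.
pc <- prcomp(nmr, center = TRUE, scale. = TRUE)

## Keep enough components to cover >80% of the variance (13 in the post above)
scores <- pc$x[, 1:13]

## CLARA clustering on the PCA scores; k = 3 is only an example
cl <- clara(scores, k = 3)

## clusplot() now works, but it only displays the first two principal
## components of `scores`, hence the limited variance shown in the plot.
clusplot(cl)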
Hi Dani,

If you are working with NMR data, which data pretreatment methods are you using? 13112 variables for NMR data sounds like too many; you should apply some data binning or peak-picking method for data reduction. You must also consider the multicollinearity problems inherent in spectroscopic data, so data reduction with PCA or a similar method is an essential step in your analysis. However, PCA is also quite sensitive to noise, and a supervised classification method such as PLS-DA could be more suitable. You should take a look at the pls package. The caret package has very well written routines for testing model reproducibility and stability, not only for PLS-DA but also for other methods. The mclust package could also be useful.

You can also take a look at this package and these references:

http://sourceforge.net/projects/kopls/
http://www.jstatsoft.org/v18/i06
http://cran.r-project.org/web/packages/caret/caret.pdf
http://www.jstatsoft.org/v18/i02
http://dx.doi.org/10.1002/cem.887
http://dx.doi.org/10.1186/1471-2105-9-106

Best regards

--
Andris Jankevics
Assistant
Department of Medicinal Chemistry
Latvian Institute of Organic Synthesis
Aizkraukles 21, LV-1006, Riga, Latvia
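[A rough sketch of the binning and PLS-DA route suggested here; the objects `nmr` and `group`, and the bin width of 8 points, are assumptions, and PLS-DA of course needs a known class label for each patient.]

## Equal-width binning: average blocks of 8 adjacent spectral points
bin_size <- 8
n_bins   <- ncol(nmr) %/% bin_size
binned   <- sapply(seq_len(n_bins), function(b) {
  cols <- ((b - 1) * bin_size + 1):(b * bin_size)
  rowMeans(nmr[, cols, drop = FALSE])
})
dim(binned)   # 68 x 1639

## PLS-DA via caret (which wraps the pls package); `group` is a hypothetical
## factor of class labels, one per patient
library(caret)
fit <- train(x = binned, y = group,
             method = "pls",
             preProcess = c("center", "scale"),
             tuneLength = 10,
             trControl = trainControl(method = "cv", number = 5))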
Hi there,

Whether clara is a proper way of clustering depends strongly on what your data are and, in particular, on what interpretation or use you want for your clustering. You may do better with a hierarchical method after having defined a proper distance (however, this would rather be a matter for statistical consultation than for R help).

Assuming that you use some reasonable dimension reduction and clustering method, you may get a good visualization of your clustering using the methods available via the functions plotcluster/discrproj in the package fpc.

Best,
Christian

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
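[A minimal sketch of the hierarchical + fpc route suggested above, assuming `scores` holds the PCA scores (e.g. the 68 x 13 matrix from the earlier prcomp() step) and that Euclidean distance with Ward linkage is an acceptable choice for these data.]

library(fpc)

d  <- dist(scores, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
cl <- cutree(hc, k = 3)        # k = 3 is only an example

## plotcluster() projects the data so that the separation between the given
## clusters is emphasised, rather than showing just the first two PCs
plotcluster(scores, cl)

## discrproj() returns a list; its $proj component holds the projected
## coordinates if you want to draw the plot yourself
dp <- discrproj(scores, cl)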