Dear R users,

I'm trying to cluster 30 gene chips using principal component analysis in
package mva.prcomp. Each chip is a point with 1,000 dimensions. PCA is
probably just one of several methods to cluster the 30 chips. However, I
don't know how to run prcomp, and I don't know how to interpret its output.

If there are 30 data points in 1,000 dimensions each, do I have to provide
the data in a 1,000x30 matrix or data frame (i.e. 1000 columns)?

> data[1:5,1:5]
  x.HU.04h.Ctr.118.01.4.ctrl x.HU.04h.010.118.04.4.0.1
1                         21                        45
2                         24                        35
3                        109                       173
4                         86                        99
5                        130                       204
  x.HU.04h.050.118.05.4.0.5 x.HU.04h.100.118.06.4.1 x.HU.24h.Ctr.118.07.24.ctrl
1                        24                      28                          22
2                        25                      25                          20
3                       107                     125                          95
4                        72                      79                          61
5                       126                     166                         128

> m <- t(data)
> m[1:5,1:5]
                             1  2   3  4   5
x.HU.04h.Ctr.118.01.4.ctrl  21 24 109 86 130
x.HU.04h.010.118.04.4.0.1   45 35 173 99 204
x.HU.04h.050.118.05.4.0.5   24 25 107 72 126
x.HU.04h.100.118.06.4.1     28 25 125 79 166
x.HU.24h.Ctr.118.07.24.ctrl 22 20  95 61 128

> pca <- prcomp(m, retx = TRUE)

There are 30 "PC"s displayed (I've truncated the output). Shouldn't there be
1000 PCs, with the 1st PC being the most discriminative PC? In a principal
component analysis, aren't there as many PCs as dimensions? On the other hand,
I thought that PCA somehow collapses dimensionality. What are the PCs for
my 30 data points? Afterwards I'd also like to display the results in a
diagram, e.g. in 2 or 3 dimensions, to visualise clusters. I'm not sure I'm
doing the right thing.

I'm happy for any comments and explanations,

kind regards,

Arne

> pca["x"]
$x
                                    PC1          PC2         PC3         PC4        PC5         PC6
x.HU.04h.Ctr.118.01.4.ctrl   -1272.1203  -249.465634 -2185.20558  1083.15814  421.67755   100.26612
x.HU.04h.010.118.04.4.0.1    -1493.8623  1483.260490 -1090.31102  -286.70562 1274.34804    37.88463
x.HU.04h.050.118.05.4.0.5    -2688.5157  2055.336930   -83.70279   154.24116 1202.58763  -604.08124
x.HU.04h.100.118.06.4.1      -2477.3271  2029.248507   -14.37922  -314.08755 1422.88800  -509.37791
x.HU.24h.Ctr.118.07.24.ctrl  -3198.7071 -2264.516725   209.04504   763.56664 -762.61481  -542.35302
x.HU.24h.010.118.10.24.0.1   -3370.0556 -2190.205040   298.17498   702.80862 -783.48849  -509.22595
x.HU.24h.050.118.11.24.0.5   -2662.8329 -1436.400955  1478.81635   129.83910  406.10451   337.88507
x.HU.24h.100.118.12.24.1     -4193.3836 -1210.594052  1844.22923   914.84373  -11.33207    11.58916
x.HU.04h.Ctr.206.13.4.ctrl    2305.5848  -180.584730 -2017.05340  1274.07436  132.14756   930.35799
x.HU.04h.010.206.14.4.0.1     1703.4976  2032.883878   -78.67578  1697.50799 -301.93647   234.25139
x.HU.04h.025.206.15.4.0.25    1294.1932  2876.862370   534.11002  1229.73355  -68.31220   226.47566
x.HU.04h.050.206.16.4.0.5     3666.8441  3520.249397  1187.37289   -45.83772 -271.06706   145.75181
x.HU.04h.100.206.17.4.1       3657.9687  3432.347857  1318.94834  -484.73817 -405.36077   349.88323
x.HU.24h.Ctr.206.18.24.ctrl   5796.1801 -2985.085353 -1052.08033  -306.45667  265.22940  -732.59152
x.HU.24h.010.206.19.24.0.1    4429.6809 -2685.801572 -1027.66157   822.76848  171.15959 -1118.12987
x.HU.24h.025.206.20.24.0.25   5672.4279 -1559.896071  1177.74742  -734.37026  336.46183  -132.25625
x.HU.24h.050.206.21.24.0.5    4855.8534  -809.112994  1825.99459  -594.09109  190.00907  -234.33254
x.HU.24h.100.206.22.24.1      4015.2594  -166.349964  1015.96643   622.86202 -267.17075   400.45741
x.HU.04h.Ctr.821.23.4.ctrl    -485.9779    91.410337 -2446.35100  -263.83351 -453.89005   491.14145
x.HU.04h.Ctr.821.24.4.ctrl     390.5580    -8.264721 -2707.56580 -1265.35762 -156.67885   555.41157
x.HU.04h.010.821.25.4.0.1    -1138.4096  1733.090222  -885.89460  -460.04065 -276.68619  -200.20132
x.HU.04h.025.821.26.4.0.25   -1622.0565  2333.333749  -297.50664  -838.12742 -783.19740  -206.76327
x.HU.04h.050.821.27.4.0.5    -1920.9992  2462.596326  -213.80507  -463.02219 -683.90138  -731.04753
x.HU.04h.100.821.28.4.1      -2288.0687  2251.971783   223.28215  -472.78173 -668.16917  -623.88411
x.HU.24h.Ctr.821.29.24.ctrl   -599.7405 -2105.800732  -792.89966  -902.43731 -158.37800   314.34868
x.HU.24h.Ctr.821.30.24.ctrl   -743.5533 -2154.937309  -350.37118  -744.69040 -479.01087   172.03340
x.HU.24h.010.821.31.24.0.1   -2240.3848 -1963.626249   306.05426  -178.59331 -166.16473   266.24216
x.HU.24h.025.821.32.24.0.25  -1840.1627 -1667.075636  1271.79029  -333.21614 -178.28014   477.06373
x.HU.24h.050.821.33.24.0.5   -1575.7248 -1431.615872  1059.90748  -531.84286  537.76332   502.46140
x.HU.24h.100.821.34.24.1     -1976.1656 -1233.258236  1492.02417  -175.17357  515.26288   590.73966
[...]

On 12/09/02 11:38, Arne.Muller at aventis.com wrote:
>Dear R users,
>
>I'm trying to cluster 30 gene chips using principal component analysis in
>package mva.prcomp. Each chip is a point with 1,000 dimensions. PCA is
>probably just one of several methods to cluster the 30 chips. However, I
>don't know how to run prcomp, and I don't know how to interpret its output.

PCA is almost certainly not what you want. kmeans might work (or other
functions designed for clustering).

The reason your output is limited to 30 components is (roughly) that,
once you have this many, all the other 970 are predictable from these,
because you have only 30 observations.

-- 
Jonathan Baron, Professor of Psychology, University of Pennsylvania
R page: http://finzi.psych.upenn.edu/

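The clustering route suggested above could look roughly like the following. This is only a sketch on simulated data: the matrix `m`, the three groups in `truth`, and the size of the group shift are all made up for illustration and are not the poster's chips. In current versions of R, kmeans, hclust and cutree are in the stats package, which is loaded by default. The number of clusters (3 here) is an assumption you would have to justify for real data.

```r
# Simulated stand-in for the chip data: 30 chips (rows) x 1000 genes (columns).
# ASSUMPTION: three groups of 10 chips each, shifted apart artificially.
set.seed(42)
n_genes <- 1000
truth <- rep(1:3, each = 10)

# The length-30 shift vector is recycled down the columns, so every
# gene of chip i is shifted by 5 * (truth[i] - 1).
m <- matrix(rnorm(30 * n_genes), nrow = 30) + 5 * (truth - 1)
rownames(m) <- paste0("chip", 1:30)

# k-means on the chips (rows); nstart restarts guard against a poor
# random initialisation.
km <- kmeans(m, centers = 3, nstart = 20)

# Hierarchical clustering on Euclidean distances between chips,
# cut into the same number of groups for comparison.
hc <- hclust(dist(m), method = "average")
hc_groups <- cutree(hc, k = 3)

# Cross-tabulate the cluster labels against the simulated grouping.
print(table(kmeans = km$cluster, truth = truth))
print(table(hclust = hc_groups, truth = truth))
```

Both functions treat rows as observations, so the chips have to be the rows of the matrix, just as for prcomp. plot(hc) draws the dendrogram if you want to look at the tree rather than a fixed cut.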
On Mon, 9 Dec 2002 Arne.Muller at aventis.com wrote:

> Dear R users,
>
> I'm trying to cluster 30 gene chips using principal component analysis in
> package mva.prcomp. Each chip is a point with 1,000 dimensions. PCA is
> probably just one of several methods to cluster the 30 chips. However, I
> don't know how to run prcomp, and I don't know how to interpret its output.
>
> If there are 30 data points in 1,000 dimensions each, do I have to provide
> the data in a 1,000x30 matrix or data frame (i.e. 1000 columns)?

None of those. A 30x1000 matrix: one row per chip, one column per dimension.

> > data[1:5,1:5]
>   x.HU.04h.Ctr.118.01.4.ctrl x.HU.04h.010.118.04.4.0.1
> 1                         21                        45
> 2                         24                        35
> 3                        109                       173
> 4                         86                        99
> 5                        130                       204
>   x.HU.04h.050.118.05.4.0.5 x.HU.04h.100.118.06.4.1 x.HU.24h.Ctr.118.07.24.ctrl
> 1                        24                      28                          22
> 2                        25                      25                          20
> 3                       107                     125                          95
> 4                        72                      79                          61
> 5                       126                     166                         128
>
> > m <- t(data)
> > m[1:5,1:5]
>                              1  2   3  4   5
> x.HU.04h.Ctr.118.01.4.ctrl  21 24 109 86 130
> x.HU.04h.010.118.04.4.0.1   45 35 173 99 204
> x.HU.04h.050.118.05.4.0.5   24 25 107 72 126
> x.HU.04h.100.118.06.4.1     28 25 125 79 166
> x.HU.24h.Ctr.118.07.24.ctrl 22 20  95 61 128
>
> > pca <- prcomp(m, retx = TRUE)
>
> there are 30 "PC"s displayed (I've truncated the output). Shouldn't there be
> 1000 PCs, with the 1st PC being the most discriminative PC? In a principal

No. 970 of them span the null space: you have massive over-fitting.

> comp. analysis, aren't there as many PCs as dimensions? On the other hand I
> thought that PCA somehow collapses dimensionality ... . What are the PCs for
> my 30 data points? Afterwards I'd also like to display the results in a
> diagram, e.g. in 2 or 3 dimensions, to visualise clusters. I'm not sure I'm
> doing the right thing.

Well, statistically neither am I. But mathematically at least, the PCs
for your 30 data points are the `x' component of the result, and you can
plot them via

    plot(pca$x[, 1:2])

in two dimensions, or use scatterplot3d (a package) or (preferably as it
is dynamic) the ggobi or xgobi interfaces in 3D.

This sort of thing *is* covered in many of the texts about S (or S-PLUS
or R).

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
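
To make the points in this thread concrete, here is a small sketch of the prcomp call on a 30x1000 matrix. The data are simulated noise, and the names `n_chips`, `n_genes`, `var_explained` and `scores` are invented for the example; only the shapes matter. In current versions of R, prcomp is in the stats package, which is loaded by default. With 30 rows, prcomp returns min(30, 1000) = 30 components, and because the columns are centred the last of those has essentially zero variance.

```r
# Simulated stand-in for the chip data: 30 chips (rows) x 1000 genes (columns).
# ASSUMPTION: pure noise, used only to show the shapes of the result.
set.seed(1)
n_chips <- 30
n_genes <- 1000
m <- matrix(rnorm(n_chips * n_genes), nrow = n_chips, ncol = n_genes)
rownames(m) <- paste0("chip", seq_len(n_chips))

pca <- prcomp(m, retx = TRUE)

print(dim(pca$x))         # scores: one row per chip, one column per PC
print(dim(pca$rotation))  # loadings: one row per gene, one column per PC

# Centring removes one further degree of freedom, so the last standard
# deviation is zero up to rounding error.
print(pca$sdev[n_chips] / pca$sdev[1])

# Proportion of the total variance carried by each PC.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Scores on the first two PCs; note the comma in the index, which
# selects columns 1 and 2 rather than the first two elements.
scores <- pca$x[, 1:2]
if (interactive()) {
  plot(scores, xlab = "PC1", ylab = "PC2")
  text(scores, labels = rownames(m), pos = 3, cex = 0.6)
}
```

summary(pca) prints the same variance proportions in tabular form, which helps when deciding whether a 2D or 3D plot of the scores is a fair picture of the data.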