Talbot Katz
2007-Feb-13 20:33 UTC
[R] Questions about results from PCAproj for robust principal component analysis
Hi.

I have been looking at the PCAproj function in package pcaPP (R 2.4.1) for robust principal components, and I'm trying to interpret the results. I started with a data matrix of dimensions RxC (R is the number of rows / observations, C the number of columns / variables). PCAproj returns a list of class princomp, similar to the output of the function princomp. In a case where I can run princomp, I would get the following from executing dmpca = princomp(datamatrix):

- the vector sdev, of length C, contains the standard deviations of the components in order of descending value; their squares are the eigenvalues of the covariance matrix
- the matrix loadings has dimension CxC, and its columns are the eigenvectors of the covariance matrix, in the same order as the sdev vector; the columns are orthonormal: sum(dmpca$loadings[,i]*dmpca$loadings[,j]) = 1 if i == j, ~ 0 if i != j
- the vector center, of length C, contains the means of the variable columns of the original data matrix, in the same order as the original columns
- the vector scale, of length C, contains the scalings applied to each variable, in the same order as the original columns
- n.obs contains the number of observations used in the computation; this number equals R when there is no missing data
- the matrix scores has dimension RxC, and it can be thought of as the projection of the (centred) original data onto the eigenvector matrix loadings; the columns of scores are the principal components. princomp removes the mean first, so the formula is dmpca$scores = t(t(datamatrix) - dmpca$center) %*% dmpca$loadings, and apply(dmpca$scores,2,mean) returns a length C vector of (effectively) zeroes; also the principal components (columns of scores) are orthogonal, but not orthonormal: sum(dmpca$scores[,i]*dmpca$scores[,j]) ~ 0 if i != j, > 0 if i == j
- call contains the function call, in this case princomp(x = datamatrix)

That is all as it should be.

In my case R < C, which produces singular results for standard PCA, but robust methods like PCAproj are designed to handle this. Also, I had "de-meaned" the data beforehand, so apply(datamatrix,2,mean) produces a length C vector of (effectively) zeroes. I ran the following:

    dmpcaprj = PCAproj(datamatrix, k=4, CalcMethod="sphere", update=TRUE)

to get the first four robust components. When I look at the princomp object returned as dmpcaprj, some of the results are just what I expect. For example,

- dmpcaprj$loadings has dimensions Cx4, as expected, and the first four eigenvectors of the (robust) covariance matrix are orthonormal: sum(dmpcaprj$loadings[,i]*dmpcaprj$loadings[,j]) = 1 if i == j, ~ 0 if i != j
- dmpcaprj$sdev contains the square roots of the four corresponding eigenvalues
- dmpcaprj$n.obs equals R
- dmpcaprj$scores has dimensions Rx4, as it should

HOWEVER, the columns of dmpcaprj$scores are neither de-meaned nor orthogonal: apply(dmpcaprj$scores,2,mean) is a non-zero vector, and sum(dmpcaprj$scores[,i]*dmpcaprj$scores[,j]) != 0 if i != j, > 0 if i == j.

ALSO:

- dmpcaprj$scale is in this case a vector of all 1's, as expected, but the length is C, not R
- dmpcaprj$center is a vector of length C, not R, and the entries are not equal to either apply(datamatrix,1,mean) or apply(datamatrix,2,mean); I can't figure out where they came from
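For concreteness, here is a small self-contained sketch of the checks above, with simulated matrices standing in for my real data (the princomp part uses a taller matrix, since my actual datamatrix has R < C and princomp cannot be run on it; the numbers will of course differ from mine):

    library(pcaPP)
    set.seed(1)

    ## Part 1: a tall matrix (R > C) where princomp runs, illustrating the identities above
    tall <- matrix(rnorm(100 * 6), nrow = 100, ncol = 6)
    dmpca <- princomp(tall)
    round(crossprod(unclass(dmpca$loadings)), 10)       # ~ identity: orthonormal loadings
    recon <- t(t(tall) - dmpca$center) %*% dmpca$loadings
    max(abs(dmpca$scores - recon))                      # effectively zero
    round(apply(dmpca$scores, 2, mean), 10)             # effectively zero column means

    ## Part 2: a wide matrix (R < C) like my real data, de-meaned by column
    wide <- matrix(rnorm(20 * 50), nrow = 20, ncol = 50)
    wide <- sweep(wide, 2, colMeans(wide))
    dmpcaprj <- PCAproj(wide, k = 4, CalcMethod = "sphere", update = TRUE)
    round(crossprod(unclass(dmpcaprj$loadings)), 10)    # ~ identity, as expected
    apply(dmpcaprj$scores, 2, mean)                     # NOT zero -- this is my puzzle
    crossprod(dmpcaprj$scores)                          # off-diagonals also not ~ zero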
One interesting thing is that the columns of the Rx4 matrix

    dmpcaprj$scores - datamatrix %*% dmpcaprj$loadings

are all identically constant vectors, such that each row equals apply(dmpcaprj$scores,2,mean) (since apply(datamatrix %*% dmpcaprj$loadings,2,mean) is a length four vector of (effectively) zeroes), but I can't interpret the values of these means of dmpcaprj$scores.

Can anyone please explain to me what is happening with the scores, scale, and center parts of the PCAproj results?

Thanks!

-- TMK --
212-460-5430 home
917-656-5351 cell
Peter Filzmoser
2007-Feb-14 19:43 UTC
[R] Questions about results from PCAproj for robust principal component analysis
Hi,

PCAproj is mainly designed for robust PCA and not for classical PCA. Therefore, applying classical estimators to the results of a robust PCA, such as taking the mean of the robust PCA scores, will usually not give zeroes. The robust PCs have been centred robustly, not classically by the mean. In your case you ran PCAproj with the default method="mad", so robust PCA was performed, maximizing the mad instead of the usual variance (standard deviation). Moreover, you used the default for centring the PCs, which is center="l1median", so the centring is robust rather than by the classical mean (which would have been the zero vector, because your data were already classically centred).

Run the procedure with the options center=mean and method="sd", which then gives classical PCA. The mean of the resulting PCA scores will be 0. However, the scores will not be exactly orthogonal (or uncorrelated), only close to it. The reason is that PCAproj is an approximate algorithm for finding the (robust) PCs. The eigen-analysis of princomp gives the exact solution (but it is not robust).

The wrong length of result$scale and result$center is definitely an error in the procedure that I will have to fix quickly. For data sets with more columns than rows we automatically apply a singular value decomposition (SVD) to reduce the dimensionality (without information loss). We then perform centring and scaling in the space of reduced dimension, but we did not back-transform these values; sorry. It will be repaired soon.

Best regards,

Peter
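As a rough sketch of the comparison (with a simulated stand-in for your datamatrix; the exact form of the center argument may differ between pcaPP versions, so please check ?PCAproj in your installation):

    library(pcaPP)
    set.seed(1)
    datamatrix <- matrix(rnorm(20 * 50), nrow = 20, ncol = 50)  # stand-in for your R < C data

    ## classical PCA via the projection-pursuit algorithm:
    ## centre by the column means, maximize the standard deviation
    pc_classical <- PCAproj(datamatrix, k = 4, method = "sd", CalcMethod = "sphere",
                            update = TRUE, center = mean)
    round(apply(pc_classical$scores, 2, mean), 10)  # (effectively) zero

    ## robust PCA (the defaults): centre by the l1-median, maximize the mad
    pc_robust <- PCAproj(datamatrix, k = 4, CalcMethod = "sphere", update = TRUE)
    apply(pc_robust$scores, 2, mean)                # in general not zero

    ## in either case the scores are the centred data projected onto the loadings,
    ## which is consistent with your observation that the columns of
    ##   pc_robust$scores - datamatrix %*% pc_robust$loadings
    ## are constant: each column is minus the centre used internally,
    ## projected onto the corresponding component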
--
-------------------------------------------------------
From: Prof. Dr. Peter Filzmoser
Dept. of Statistics & Probability Theory
Vienna University of Technology
Wiedner Hauptstrasse 8-10
A-1040 Vienna, Austria
Tel. +43 1 58801/10733
Fax. +43 1 58801/10799
E-mail: P.Filzmoser at tuwien.ac.at
Internet: http://www.statistik.tuwien.ac.at/public/filz/