My previous post about prcomp and princomp was done in some haste as I had long ago indicated to Kurt that I would try to have this ready for the June release, and it appeared that I would miss yet another release. I also need to get it out before it becomes hopelessly buried by other work. Brian Ripley kindly pointed out some errors, and also pointed out that I was suggesting replacing some functions which were already working well in R. Other than changing prcomp and princomp this was unintentional on my part. It seems that some of these functions have either appeared since I first started this, or that I was just careless (hard to imagine but it happens) and perhaps checked only for prcomp methods and not princomp methods. In any event, I decided it was necessary to check and indicate the changes I am proposing more carefully, despite my time pressures, and also, first of all, to indicate my interest and intentions more clearly. I have no special attachment to principal components and there are many people on this list who could do a better job at this than I can. I want to use PCA for another problem I am working on, and I noticed several months ago that prcomp and princomp were returning different results in R and Splus. Unless this is fixed it puts me in the difficult position of having to choose between Splus and R, which I would prefer not to do for reasons I have expressed from time to time. (Another solution is to over-ride the mva library with my own functions, but this has undesirable consequences when others use my library.) My intention at this point is to submit these changes to the mva library and then not do any more work on it. I am not suggesting that these methods are better, only that they are more consistent with Splus. There is room for improvement (in both Splus and R) and perhaps someone else would like to work on it. Of all the changes below, the only ones I consider really important are the documentation and the new version of prcomp which gives results like Splus. I have some misgivings about adding yet another principal components function (prcomponents below). However, I would like prcomp and princomp to return the same results as Splus, and yet I see that there is need for improvement. I would like this improvement to happen in a function which has a different name from the functions in Splus, and perhaps be moved into an expanded "compatibility library" at some point. Paul Gilbert _______ The code below makes the following changes relative to R Version 0.62.3 in progress-release (August 19, 1998). - prcomp is changed so that it takes similar arguments and usually reproduces Splus results. The criterion used for determining if a singular value is zero (following suggestions of Martin Maechler) is slightly different and this can occasionally give a different result. - prcomp.Rd is changed to reflect the changes to prcomp - summary.prcomp is defined to give a result more like summary.princomp. In Splus summary.prcomp does not exist and the default method is applied, but this difference seems reasonable? - screeplot is made generic - function(...) is changed to function(x, ...) for some plot methods (to be consistent with plot). - print.prcomp is changed to be more like the result from Splus (Splus just uses print.default, so this is slightly different) - plot.prcomp is renamed screeplot.prcomp to be consistent with the way this is done for princomp, but I have not worked on this function and it does not appear to do scaling or other nice things like screeplot.princomp. (Hopefully Brian Ripley will volunteer to fix it.) - plot.prcomp calls - screeplot.prcomp - princomp is still the function I submitted previously but has a few additional comments. It reproduces Splus results when the argument is a data matrix, but does not work with a covariance argument and the Splus version does. - princomp.Rd is new. - print.princomp is defined in princomp.R. This version is a bit more like the result from Splus and the function of the same name is removed from princomp-add.R. - programs in the file princomp-add.R could be included in princomp.R but I was unsure how to deal with the V&R copyright. Perhaps it should be included in each function? I've left princomp-add.R as a separate file and done only the minimal number of changes necessary to make functions work with the changes in other files. (As far as my contributions in the other files are concerned I am happy to assign copyright to "The R Development Core Team".) - a tentative and so far undocumented function prcomponents in the file prcomponents.R has been added. This is an incomplete attempt to deal with shortcomings of prcomp and princomp, the main problems with those functions being that the preferred computational method, svd, is used in prcomp, but princomp returns additional useful information. The function princomp uses eigen so that it returns results consistent with Splus, but does not yet support a covariance as an argument, which Splus does (and is the reason eigen is used). My function prcomponents also tries to give the user the ability to control what is used in the normalization (e.g. N or N-1), as suggested by Bill Venables. It does not include use="all.obs" which was in the prcomp I am proposing to replace, but it should. The difficulty is that this argument is passed to cor or cov, which are not called if svd is used. ############## prcomp.R replacement file ############## screeplot <- function(x, ...) {UseMethod("screeplot")} plot.prcomp <- function(x, ...) {screeplot(x, ...)} prcomp <- function(x, retx=TRUE, center=TRUE, scale=FALSE) { # s <- svd(scale(x, center=center, scale=scale),nu=0) # above produces warning since scale is both a function and a variable so s <- svd(get("scale",envir=.GlobalEnv) (x, center=center, scale=scale),nu=0) # rank <- sum(s$d > 0) rank <- sum(s$d > (s$d[1]*sqrt(.Machine$double.eps))) if (rank < ncol(x)) s$v <- s$v[,1:rank, drop=F] s$d <- s$d/sqrt(max(1,nrow(x) -1)) if(retx) r <- list(sdev=s$d, rotation=s$v, x=as.matrix(x) %*% s$v) else r <- list(sdev=s$d, rotation=s$v) class(r) <- "prcomp" r } print.prcomp <- function(x) { cat("Standard deviations:\n") print(x$sdev) cat("\nRotation:\n") print(x$rotation) if (!is.null(x$x)) {cat("\nRotated variables:\n") print(x$x) } invisible(x) } summary.prcomp <- function(obj, digits=3 ) { vars <- obj$sdev^2 vars <- vars/sum(vars) cat("Importance of components:\n") print(rbind("Standard deviation" = obj$sdev, "Proportion of Variance" = vars, "Cumulative Proportion" = cumsum(vars))) invisible(obj) } screeplot.prcomp <- function(x, main="Scree Plot", ylab="Variance", xlab="Component", ...) { plot(x$var, main=main, xlab=xlab, ylab=ylab, ...) } ############### prcomp.Rd replacement file ############### \name{prcomp} \title{Principal Components Analysis} \usage{ prcomp(x, retx=TRUE, center=TRUE, scale=FALSE) } \arguments{ \item{x}{a matrix (or data frame) which provides the data for the principal components analysis.} \item{retx}{a logical value indicating whether the rotated variables should be returned.} \item{center}{a logical value indicating whether the variables should be shifted to be zero centered.} \item{scale}{a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place. The default is F for consistency with S, but in general scaling is advisable.} } \description{ This function performs a principal components analysis on the given data matrix and returns the results as a \code{prcomp} object. The calculation is done with svd on the data matrix, not by using eigen on the covariance matrix. This is generally the preferred method for numerical accuracy. The print method for the these objects prints the results in a nice format and the plot method produces a scree plot. The method \code{biplot} plots two selected components against one another. } \value{ \code{prcomp} returns an list with class \code{"prcomp"} containing the following components: \item{sdev}{the standard deviation of the principal components (i.e. the eigenvalues of the cov matrix - though the calculation is actually done with the singular values of the data matrix.)} \item{rotation}{the matrix of variable loadings (i.e. a matrix whose columns contain the eigenvectors). The function princomp returns this in the element \code{loadings}.} \item{x}{if retx is true the value of the rotated data (the data multiplied by the \code{rotation} matrix) is returned.} } \references{ Mardia, K. V., J. T. Kent, J and M. Bibby (1979), \emph{Multivariate Analysis}, London: Academic Press. Venerables, W. N. and B. D. Ripley (1994). \emph{Modern Applied Statistics with S-Plus}, Springer-Verlag. } \seealso{ \code{\link{princomp}}, \code{\link{prcomponents}}, \code{\link{cor}}, \code{\link{cov}}, \code{\link{svd}}, \code{\link{eigen}}. } \examples{ # the variances of the variables in the # crimes data vary by orders of magnitude data(crimes) prcomp(crimes) prcomp(crimes,scale=TRUE) plot(prcomp(crimes)) summary(prcomp(crimes)) } ############## princomp.R replacement file ############## princomp <- function(x, cor=FALSE, scores=TRUE, subset=rep(TRUE, nrow(as.matrix(x)))) { # it is tempting to add use="all.obs" which could be passed to cov or # cor but then the calculation of N is complicated. z<- as.matrix(x)[subset,, drop=F] N <- nrow(z) if(cor) cv <- get("cor",envir=.GlobalEnv)(z) else cv <- cov(z) # (svd can be used but gives different signs for some vectors) edc <- eigen(cv) cn <- paste("Comp.", 1:ncol(cv), sep="") names(edc$values) <- cn dimnames(edc$vectors) <- list(dimnames(x)[[2]], cn) scr<- NULL if (cor) {sdev <- sqrt(edc$values) sc <- (apply(z,2,var)*(N-1)/N)^0.5 if (scores) scr<-(scale(z,center=T,scale=T) %*% edc$vectors)*sqrt(N/(N-1)) } else {sdev <- sqrt(edc$values*(N-1)/N) sc <- rep(1, ncol(z)) if (scores) scr<- (scale(z,center=T,scale=F) %*% edc$vectors) } names(sc) <- dimnames(x)[[2]] edc <-list(sdev=sdev, loadings=edc$vectors, center=apply(z,2,mean), scale=sc, n.obs=N, scores=scr) # The Splus function also return list elements factor.sdev, correlations # and coef, but these are not documented in the help. coef seems to equal # load. The Splus function also return list elements call and terms which # are not supported here. class(edc) <- "princomp" edc } print.princomp <- function(x) { cat("Standard deviations:\n") print(x$sdev) cat(length(x$scale), " variables and ", x$n.obs, "observations.\n") cat("Scale:\n") print(x$scale) invisible(x) } ############### princomp.Rd new file ############### \name{princomp} \title{Principal Components Analysis} \usage{ princomp(x, cor=FALSE, scores=TRUE, subset=rep(TRUE, nrow(as.matrix(x)))) print.princomp(obj) summary.princomp(obj) plot.princomp(obj) } \alias{print.princomp} \alias{plot.princomp} \arguments{ \item{x}{a matrix (or data frame) which provides the data for the principal components analysis.} \item{cor}{a logical value indicating whether the calculation should use the correlation matrix or the covariance matrix.} \item{score}{a logical value indicating whether the score on each principal component should be calculated.} \item{subset}{a vector used to select rows (observations) of the data matrix \code{x}.} } \description{ This function performs a principal components analysis on the given data matrix and returns the results as a \code{princomp} object. The calculation is done using \code{eigen} on the correlation or covariance matrix, as determined by \code{cor}. This is done for compatibility with the Splus result (even though alternate forms for \code{x} - e.g. a covariance matrix - are not supported as they are in Splus). A preferred method of calculation is to use svd on \code{x}, as is done in \code{prcomp}. Note that the scaling of results is affected by the setting of \code{cor}. If \code{cor} is T then the divisor in the calculation of the sdev is N-1, otherwise it is N. This has the effect that the result is slightly different depending on whether scaling is done first on the data and cor set to F, or done automatically in \code{princomp} with cor=T. The print method for the these objects prints the results in a nice format and the plot method produces a scree plot. } \value{ \code{princomp} returns an list with class \code{"princomp"} containing the following components: \item{var}{the variances of the principal components (i.e. the eigenvalues)} \item{load}{the matrix of variable loadings (i.e. a matrix whose columns contain the eigenvectors).} \item{scale}{the value of the \code{scale} argument.} } \references{ Mardia, K. V., J. T. Kent and J. M. Bibby (1979). \emph{Multivariate Analysis}, London: Academic Press. Venerables, W. N. and B. D. Ripley (1994). \emph{Modern Applied Statistics with S-Plus}, Springer-Verlag. } \seealso{ \code{\link{prcomp}}, \code{\link{prcomponents}}, \code{\link{cor}}, \code{\link{cov}}, \code{\link{eigen}}. } \examples{ # the variances of the variables in the # crimes data vary by orders of magnitude data(crimes) princomp(crimes) princomp(crimes,cor=TRUE) princomp(scale(crimes, scale=T, center=T), cor=F) plot(princomp(crimes)) biplot(princomp(crimes)) summary(princomp(crimes)) loadings(princomp(crimes)) } ############## princomp-add.R replacement file ############## # copyright (C) 1998 W. N. Venables and B. D. Ripley # predict.princomp <- function(object, newdata,...) { if (missing(newdata)) return(object$scores) scale(newdata, object$center, object$scale) %*% object$loadings } summary.princomp <- function(object, loadings = F, cutoff = 0.1, digits=3, ...) { vars <- object$sdev^2 vars <- vars/sum(vars) cat("Importance of components:\n") print(rbind("Standard deviation" = object$sdev, "Proportion of Variance" = vars, "Cumulative Proportion" = cumsum(vars))) if(loadings) { cat("\nLoadings:\n") cx <- format(round(object$loadings, digits=digits)) cx[abs(object$loadings) < cutoff] <- substring(" ", 1, nchar(cx[1,1])) print(cx, quote = F, ...) } invisible(object) } plot.princomp <- function(x, ...) {screeplot(x, ...)} screeplot.prcomp <- screeplot.princomp <- function(x, npcs=min(10, length(x$sdev)), type=c("barplot", "lines"), main = deparse(substitute(x)), ...) { eval(main) type <- match.arg(type) pcs <- x$sdev^2 xp <- seq(length=npcs) if(type=="barplot") barplot(pcs[xp], names = names(pcs[xp]), main = main, ylab = "Variances", ...) else { plot(xp, pcs[xp], type = "b", axes = F, main = main, xlab="", ylab = "Variances", ...) axis(2) axis(1, at = xp, labels = names(pcs[xp])) } invisible() } loadings <- function(x) x$loadings ############## prcomponents.R new file ############## prcomponents <- function(x, center=TRUE, scale=TRUE, N=nrow(x)-1) {if (center) center <- apply(x,2,mean) else center <- rep(0, ncol(x)) if (scale) scale <- sqrt(apply(x,2,var)) else scale <- rep(1, ncol(x)) s <- svd(sweep(sweep(as.matrix(x),2, center),2, scale, FUN="/")) # remove anything corresponding to effectively zero singular values. rank <- sum(s$d > (s$d[1]*sqrt(.Machine$double.eps))) if (rank < ncol(x)) s$v <- s$v[,1:rank, drop=FALSE] s$d <- s$d/sqrt(N) # r <- list(sdev=s$d, proj=s$v,x=x %*% s$v, center=center, scale=scale) r <- list(sdev=s$d, proj=s$v, center=center, scale=scale) r } -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._