Dana Honeycutt
2005-Mar-24 00:09 UTC
[R] Mapping actual to expected columns for princomp object
I am working with data sets in which the number and order of columns may vary, but each column is uniquely identified by its name. E.g., one data set might have columns

    MW logP Num_Rings Num_H_Donors

while another has columns

    Num_Rings Num_Atoms Num_H_Donors logP MW

I would like to be able to perform a principal component analysis (PCA) on one data set and save the PCA object to a file. In a later R session, I would like to load the object and then apply the loadings to a new data set in order to compute the principal component (PC) values for each row of new data.

I am trying to use the princomp method in R to do this. (I started with prcomp, but found that there is no predict method for objects created by prcomp.)

The problem is that when using predict on a princomp object, R ignores the names of columns and simply assumes that the column order is the same as in the original data frame used to do the PCA. (This contrasts, for example, with the behavior of a model produced by lm, which is aware of column names in a data frame.)

What I think I need to do is this:

1. After reloading the princomp object, extract the names and order of columns that it expects. (If you look at the loadings for the object, you can see that this info is there, but I would like to get at it directly somehow.)

2. Reorder the columns in the new data set to correspond to this expected order, and remove any extra columns.

3. Use the predict method to predict the PC values for the new data set.

Is this the best approach to achieve what I am attempting? If so, can anyone tell me how to accomplish steps 1 and 2 above?

Thanks,
Dana Honeycutt

P.S.
Here's a script that demonstrates the problem:

    x1 <- rnorm(10)
    x2 <- rnorm(10)
    y <- rnorm(10)
    frx <- data.frame(x1, x2)
    frxy <- data.frame(x1, x2, y)
    lm1 <- lm(y ~ x1 + x2, frxy)
    pca1 <- princomp(frx)
    rm(x1, x2, y, frx, frxy)

    z1 <- rnorm(10)
    z2 <- rnorm(10)
    frz <- data.frame(z1, z2)
    predict(lm1, frz)    # gives error: Object "x1" not found
    predict(pca1, frz)   # gives no error, indicating column names ignored

    z3 <- rnorm(10)
    fr3z <- data.frame(frz, z3)
    predict(pca1, fr3z)  # gives error due to unexpected number of columns
    loadings(pca1)       # shows linear combos of variables corresponding to PCs
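
For concreteness, steps 1 to 3 above might be sketched as follows. This is only an illustration under stated assumptions: the data are random, the column names are the ones from the example at the top of this message, and the variable names (train, new_data, expected, aligned, pca_path) are made up for the sketch. It assumes that the row names of the loadings matrix carry the expected column names in the expected order, and it uses saveRDS/readRDS on a temporary file to stand in for saving the object in one session and loading it in another.

```r
set.seed(1)

# Hypothetical training data, columns in the "original" order.
train <- data.frame(MW = rnorm(10), logP = rnorm(10),
                    Num_Rings = rnorm(10), Num_H_Donors = rnorm(10))
pca <- princomp(train)

# Save the PCA object and reload it (stand-in for a later R session).
pca_path <- tempfile(fileext = ".rds")
saveRDS(pca, pca_path)
pca <- readRDS(pca_path)

# Step 1: recover the expected column names and order
# (assumption: the row names of the loadings matrix hold them).
expected <- rownames(loadings(pca))

# Hypothetical new data: different column order plus an extra column.
new_data <- data.frame(Num_Rings = rnorm(5), Num_Atoms = rnorm(5),
                       Num_H_Donors = rnorm(5), logP = rnorm(5),
                       MW = rnorm(5))

# Step 2: fail early if a required column is absent, then reorder
# the columns by name and drop any extras.
missing_cols <- setdiff(expected, names(new_data))
if (length(missing_cols) > 0) {
  stop("new data lacks columns: ", paste(missing_cols, collapse = ", "))
}
aligned <- new_data[, expected, drop = FALSE]

# Step 3: compute the PC values for each row of the aligned data.
scores <- predict(pca, aligned)
```

Selecting columns by name in step 2 handles both the reordering and the removal of extra columns in one go, so predict always sees the layout that princomp was fitted on.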