Liaw, Andy
2005-Mar-24 14:34 UTC
[Rd] RE: [R] Mapping actual to expected columns for princomp object
[Re-directing to R-devel, as I think this needs changes to the code.]

Can I suggest a modification to stats:predict.princomp so that it will check
for column (variable) names?

In src/library/stats/R/princomp-add.R, insert the following after line 4:

    if (!is.null(cn <- names(object$center))) newdata <- newdata[, cn]

Now Dana's example looks like:

> predict(pca1, frz)
Error in "[.data.frame"(newdata, , names(object$center)) :
        undefined columns selected
> names(frz) <- c("x2", "x1")
> predict(pca1, frz)
        Comp.1      Comp.2
1  -3.29329963 -1.24675774
2   0.15760569  0.09364550
3   1.90206906  0.06292855
4  -0.92968723  0.64356801
5  -1.15298669  0.25451588
6   0.48466884 -0.87611668
7   0.98602646 -0.52156549
8  -1.53126034 -0.96259529
9  -0.79112984 -1.50831648
10  0.02997392 -0.18888807
> names(frz) <- c("x1", "x2")
> predict(pca1, frz)
        Comp.1      Comp.2
1   2.49603051 -2.42516162
2  -0.15633499  0.15754735
3  -1.77400454  0.81118427
4   1.05941012  0.23869214
5   1.11286213 -0.20669206
6  -0.83645436 -0.60720531
7  -1.15932677 -0.08488413
8   0.98526969 -1.47482877
9   0.09070675 -1.68781215
10 -0.14930067 -0.15239717

Best,
Andy

> From: Dana Honeycutt
>
> I am working with data sets in which the number and order of columns
> may vary, but each column is uniquely identified by its name. E.g.,
> one data set might have columns
>     MW logP Num_Rings Num_H_Donors
> while another has columns
>     Num_Rings Num_Atoms Num_H_Donors logP MW
>
> I would like to be able to perform a principal component analysis (PCA)
> on one data set and save the PCA object to a file. In a later R session,
> I would like to load the object and then apply the loadings to a new
> data set in order to compute the principal component (PC) values for
> each row of new data.
>
> I am trying to use the princomp method in R to do this. (I started
> with prcomp, but found that there is no predict method for objects
> created by prcomp.) The problem is that when using predict on a
> princomp object, R ignores the names of columns and simply assumes
> that the column order is the same as in the original data frame used
> to do the PCA. (This contrasts, for example, with the behavior of a
> model produced by lm, which is aware of column names in a data frame.)
>
> What I think I need to do is this:
>
> 1. After reloading the princomp object, extract the names and order
>    of columns that it expects. (If you look at the loadings for the
>    object, you can see that this info is there, but I would like to
>    get at it directly somehow.)
>
> 2. Reorder the columns in the new data set to correspond to this
>    expected order, and remove any extra columns.
>
> 3. Use the predict method to predict the PC values for the
>    new data set.
>
> Is this the best approach to achieve what I am attempting?
>
> If so, can anyone tell me how to accomplish steps 1 and 2 above?
>
> Thanks,
> Dana Honeycutt
>
> P.S. Here's a script that demonstrates the problem:
>
> x1 <- rnorm(10)
> x2 <- rnorm(10)
> y <- rnorm(10)
>
> frx <- data.frame(x1,x2)
> frxy <- data.frame(x1,x2,y)
>
> lm1 <- lm(y~x1+x2,frxy)
> pca1 <- princomp(frx)
>
> rm(x1,x2,y,frx,frxy)
>
> z1 <- rnorm(10)
> z2 <- rnorm(10)
> frz <- data.frame(z1,z2)
>
> predict(lm1, frz)  # gives error: Object "x1" not found
> predict(pca1, frz) # gives no error, indicating column names ignored
>
> z3 <- rnorm(10)
> fr3z <- data.frame(frz,z3)
> predict(pca1,fr3z) # gives error due to unexpected number of columns
>
> loadings(pca1) # shows linear combos of variables corresponding to PCs
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html
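
Steps 1 and 2 of Dana's list can also be done in user code, without changing predict.princomp. What follows is only a minimal sketch: it assumes the PCA was fitted from a data frame or matrix with column names, so that names(pca1$center) carries them (the same assumption the one-line patch above makes). The names newdat, expected, missing_cols and aligned are illustrative and do not come from the thread.

```r
# Fit on training data with named columns, as in the script above.
set.seed(1)
frx  <- data.frame(x1 = rnorm(10), x2 = rnorm(10))
pca1 <- princomp(frx)

# New data: same variables in a different order, plus one extra column.
newdat <- data.frame(extra = rnorm(10), x2 = rnorm(10), x1 = rnorm(10))

# Step 1: the variable names, in the order the fitted object expects.
expected <- names(pca1$center)

# Step 2: fail early if a variable is absent, then reorder and drop extras.
missing_cols <- setdiff(expected, names(newdat))
if (length(missing_cols) > 0)
    stop("newdata lacks columns: ", paste(missing_cols, collapse = ", "))
aligned <- newdat[, expected, drop = FALSE]

# Step 3: predict on the aligned data.
scores <- predict(pca1, aligned)
```

Using drop = FALSE keeps aligned a data frame even if the fit has a single variable, so predict still receives a two-dimensional object.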
Prof Brian Ripley
2005-Mar-24 15:00 UTC
[Rd] RE: [R] Mapping actual to expected columns for princomp object
I am currently working on this (and on the predict method for prcomp, which
does exist, BTW). It needs a bit more in the way of sanity checks.

Note that the predict method for lm is for a formula-driven fit, whereas
that for princomp is not, hence some of the differences. It is not
reasonable to apply the docs for predict.lm to predict.princomp, and they do
not work the same way.

On Thu, 24 Mar 2005, Liaw, Andy wrote:

> [Re-directing to R-devel, as I think this needs changes to the code.]
>
> Can I suggest a modification to stats:predict.princomp so that it will check
> for column (variable) names?
>
> In src/library/stats/R/princomp-add.R, insert the following after line 4:
>
>     if (!is.null(cn <- names(object$center))) newdata <- newdata[, cn]

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
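
To make the point about sanity checks concrete, here is one possible shape for a name-aware prediction with a few checks added. It is a sketch under stated assumptions, not the change that went into R: predict_princomp_checked is a hypothetical helper name, and the particular checks (two-dimensional input, missing columns reported by name, a column-count fallback for fits without names) are illustrative guesses at what "sanity checks" might cover.

```r
# Hypothetical helper, not part of the stats package.
predict_princomp_checked <- function(object, newdata) {
    if (length(dim(newdata)) != 2L)
        stop("'newdata' must be a matrix or data frame")
    cn <- names(object$center)
    if (!is.null(cn)) {
        # Named fit: match by name and report exactly which columns are absent.
        absent <- setdiff(cn, colnames(newdata))
        if (length(absent) > 0L)
            stop("'newdata' is missing columns: ",
                 paste(absent, collapse = ", "))
        newdata <- newdata[, cn, drop = FALSE]
    } else if (ncol(newdata) != length(object$center)) {
        # Unnamed fit: only the number of columns can be checked.
        stop("'newdata' does not have the correct number of columns")
    }
    # Centre and scale with the stored values, then project onto the loadings.
    scale(as.matrix(newdata), object$center, object$scale) %*% object$loadings
}

set.seed(1)
frx  <- data.frame(x1 = rnorm(10), x2 = rnorm(10))
pca1 <- princomp(frx)

# Swapping the column order of the training data should reproduce the scores.
swapped <- frx[, c("x2", "x1")]
same <- all.equal(unname(predict_princomp_checked(pca1, swapped)),
                  unname(pca1$scores))

# Columns with unrelated names are rejected instead of silently used.
bad <- try(predict_princomp_checked(pca1, data.frame(z1 = 1:3, z2 = 4:6)),
           silent = TRUE)
```

Matching on names(object$center) follows the patch proposed in the thread; the fallback branch keeps the positional behaviour for fits made from unnamed matrices, where no name matching is possible.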