Alison Callahan
2011-Apr-18 22:18 UTC
[R] Predicting with a principal component regression model: "non-conformable arguments" error
Hello all,

I have generated a principal components regression model using the pcr() function from the PLS package (R version 2.12.0). I am getting a "non-conformable arguments" error when I try to use the predict() function on new data, but only when I try to read in the new data from a separate file.

More specifically, when my data looks like this

#########training data #1#################
var1  var2  var3   response  train
1     2     type1  33        TRUE
2     23    type2  44        TRUE
.....
.......
18    11    type1  45        FALSE

and I use the predict() function from the PLS package as in the example from http://rss.acs.unt.edu/Rdoc/library/pls/html/predict.mvr.html, e.g.

###################################
mydata <- read.csv("mydata.csv", header=TRUE)
mydata <- data.frame(mydata)
pcrmodel <- pcr(response ~ var1+var2+var3, data = mydata[mydata$train,])
predict(pcrmodel, type = "response", newdata = mydata[!mydata$train,])
###################################

the code works, and the model predicts new values for the "response" variable rows where train=FALSE.

However, as soon as I put the rows where train = FALSE into a separate file and remove the "train" column so that my training data looks like this:

#########training data #2 ################
var1  var2  var3   response
1     2     type1  33
2     23    type2  44
.....

and my new test data, saved in a separate file (say "newdata.csv") looks like this

########test data in separate file, newdata.csv ###############
var1  var2  var3   response
3     5     type1  23
4     7     type2  30
.....
18    11    type1  45

if I train a PCR model using the training data #2 and try to predict with the resulting model and the data from "newdata.csv", e.g.,

##################################
trainingdata <- read.csv("mydata_without_train_column.csv", header=TRUE)
trainingdata <- data.frame(trainingdata)
testingdata <- read.csv("newdata.csv", header=TRUE)
testingdata <- data.frame(testingdata)
pcrmodel2 <- pcr(response ~ var1+var2+var3, data = trainingdata)
predict(pcrmodel, type = "response", newdata = testingdata)
##############################

I get the following error:

"Error in newX %*% B : non-conformable arguments"

I don't understand why I get this error only when I put the non-training data into a separate file from the training data and load it as a separate object. Any help is appreciated,

Alison
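For readers unfamiliar with the message itself: "non-conformable arguments" is what base R's matrix product %*% reports when the number of columns of the left operand differs from the number of rows of the right operand. The sketch below reproduces the message without the pls package. The names newX and B are borrowed from the error text above; their sizes are made up purely for illustration and say nothing about the poster's actual data.

```r
# Illustrative sizes only: 3 new observations with 4 predictor columns,
# but a coefficient matrix that expects 6 predictor columns.
newX <- matrix(1, nrow = 3, ncol = 4)
B    <- matrix(1, nrow = 6, ncol = 1)

# tryCatch() captures the error message so the script keeps running.
msg <- tryCatch(
  {
    newX %*% B
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# With matching dimensions the product is defined:
# (3 x 6) %*% (6 x 1) gives a 3 x 1 result.
ok <- matrix(1, nrow = 3, ncol = 6) %*% B
print(dim(ok))
```

So the error suggests that the predictor matrix built from the new data has a different number of columns than the one the model was fitted on.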
Alison Callahan
2011-Apr-26 18:26 UTC
[R] Predicting with a principal component regression model: "non-conformable arguments" error
Hello again all,

I am responding to my own earlier post about a "non-conformable arguments" error with the predict() function of the pls package (http://cran.r-project.org/web/packages/pls/) in R 2.13.0 (running in Ubuntu 10.10).

I believe I have narrowed down the cause of the error. My new understanding is that if the test data to be predicted using a regression model (where the test data is passed in as 'newdata' to the predict() function) does not contain all possible levels of factors in the training data then the predict() function returns a "non-conformable arguments" error.

However, this seems like an odd behaviour to me. Surely not all new data for which the dependent variable(s) are to be predicted will contain all levels of a factor present in the training data. Can someone shed some light on why the predict() function of the pls package has this behaviour? And how to avoid it if possible in a way that doesn't involve users having to insert dummy values in new data?

Thanks,

Alison
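A minimal sketch of the mechanism described in this message, using only base R. It assumes that predict() builds its predictor matrix from the model formula in roughly the way model.matrix() does; that is an assumption made for illustration, not a copy of the pls package's code. The data frames train and test and the column names x and f are hypothetical.

```r
# Hypothetical training data: the factor f has five levels.
train <- data.frame(
  x = c(1, 2, 3, 4, 5),
  f = factor(c("A", "B", "C", "D", "E"))
)

# Hypothetical test data built separately (as when read from its own file):
# the factor only knows about the three values that actually occur.
test <- data.frame(
  x = c(6, 7, 8),
  f = factor(c("B", "C", "D"))
)

print(levels(train$f))
print(levels(test$f))

# Each factor level beyond the first becomes one dummy column,
# so the two design matrices end up with different widths.
X_train <- model.matrix(~ x + f, data = train)
X_test  <- model.matrix(~ x + f, data = test)

print(colnames(X_train))
print(colnames(X_test))
```

Coefficients estimated from X_train cannot be multiplied with X_test because the column counts differ. When the test rows are instead taken as a subset of one data frame, as in the first example of the original post, the factor keeps all of its levels and the widths agree.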
Alison Callahan
2011-Apr-27 14:16 UTC
[R] Predicting with a principal component regression model: "non-conformable arguments" error
Hi Dennis,

My replies are in-line.

On Tue, Apr 26, 2011 at 9:15 PM, Dennis Murphy <djmuser@gmail.com> wrote:
> Hi:
>
> My view, which may well be narrow, is that techniques like PLS and PCR
> are useful fit procedures, but I would be very leery about using them
> as prediction machines. With new data, why should a similar set of
> principal components emerge? Why should the ordering be (close to) the
> same? Why should features present in the training data necessarily be
> present in test data? And if the PCs vary considerably from one set of
> data to another, what's the point of prediction, since the covariate
> set is variable from one iteration to the next? Thinking a little more
> mathematically, why should I believe that the same set of basis
> functions (covariates + PCs) would reasonably apply to future data?
> One problem, as I see it, is that the principal components, when used
> as basis functions, are functions of the training data; in that
> context, why is it believable that they would well predict future
> data? [If this is Greek to you (or 'Kling-on', as one of my friends
> says), the basis functions in regression are the columns of the model
> matrix X, which map to the terms in the 'linear predictor'.] One of
> the potential problems is that the effective dimension of the reduced
> PC space may well change from one data set to the next. If all PCs are
> retained, then there is a serious danger of overfitting, which is a
> serious problem in prediction.
>
> If you're going to contemplate using such models for prediction, I
> would seriously consider looking into model validation procedures;
> they should provide some clue about how well a fitted model predicts
> to new cases. One of the best treatments of the subject I know is
> Frank Harrell's Regression Modeling Strategies book (which I believe
> will have a new edition out within the next couple of months). There
> is a current thread about this topic re logistic regression validation
> where the OP has done a nice job of working through the process -
> Prof. Harrell has chimed in a few times with some nice comments and
> observations. Most of the code to do this kind of thing in R resides
> in the rms package; see ?validate and its related functions. I don't
> know if it can be applied to PLS/PCR models (rather doubtful) but the
> methodology is what is important; e.g., the estimation of optimism in
> various figures of merit (e.g., R^2, MSE) when applied over a number
> of test sets, which provides an indication of how much bias is present
> in the fitted model due to potential overfitting. The process relies
> heavily on bootstrapping, so is in some sense vulnerable to the issues
> that arise with the bootstrap (e.g., population undercoverage), but in
> very large training sets this becomes less of a problem. If you can
> validate a PCR model and provide evidence to back it up, then most
> people (present company included) would have less ammunition to attack
> your prediction model.

Thank you for these suggestions. The PLS package I am using does include methods for cross validation to evaluate the quality of PCR/PLSR models, as well as for selecting the optimal number of components to use for predicting using a given model to avoid overfitting. I will also have a look at the RMS package.

> On Tue, Apr 26, 2011 at 11:26 AM, Alison Callahan
> <alison.callahan@gmail.com> wrote:
> > However, this seems like an odd behaviour to me. Surely not all new
> > data for which the dependent variable(s) are to be predicted will
> > contain all levels of a factor present in the training data. Can
> > someone shed some light on why the predict() function of the pls
> > package has this behaviour? And how to avoid it if possible in a way
> > that doesn't involve users having to insert dummy values in new data?
>
> I don't find this odd at all; rather, I find it comforting. From an R
> programming perspective, the factors in your newdata should have
> exactly the same defined levels as those in the training data. You
> could do this with something like
>
> newdata$somefactor <- factor(newdata$somefactor,
>                              levels = levels(trainingdata$somefactor))
>
> What happens if, in future data, one or more new levels of a factor
> arise that were not in the training data used to build the prediction
> model?

I absolutely agree with you. New levels for factors in future data that didn't exist in the training data would of course be a problem for predicting. However, in my case, I am trying to use predict() on new data that has a *subset* of the factor levels present in the training data, and I am getting a "non-conformable arguments" error. For example, my training data has levels A, B, C, D and E for a given factor, while my test data contains only levels B, C and D.

Being somewhat new to R, I confused the values of the factor in the new data with the possible levels of that factor. When I specified that the levels of the factor in my test data were to be the same as in the training data, I did not get the "non-conformable arguments" error.
Thanks!

Alison

> Dennis
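The fix discussed in this exchange can be sketched in base R as follows. The helper align_levels() is hypothetical (it is not part of pls or of base R), as are the data frames and the column names x and f. It applies the factor(..., levels = levels(...)) idea to every factor column the two data frames share. The last lines touch on the question of levels that were never seen in training: such values become NA rather than raising an error, so they may be worth checking for explicitly.

```r
# Hypothetical helper: give each factor column in newdata the same
# levels, in the same order, as the matching column in training.
align_levels <- function(newdata, training) {
  for (nm in names(training)) {
    if (is.factor(training[[nm]]) && nm %in% names(newdata)) {
      newdata[[nm]] <- factor(as.character(newdata[[nm]]),
                              levels = levels(training[[nm]]))
    }
  }
  newdata
}

# Hypothetical data mirroring the A-E versus B-D example in the message.
train <- data.frame(
  x = c(1, 2, 3, 4, 5),
  f = factor(c("A", "B", "C", "D", "E"))
)
test <- data.frame(
  x = c(6, 7, 8),
  f = factor(c("B", "C", "D"))
)

test_fixed <- align_levels(test, train)
print(levels(test_fixed$f))

# The design matrices now have the same columns, in the same order.
print(colnames(model.matrix(~ x + f, data = train)))
print(colnames(model.matrix(~ x + f, data = test_fixed)))

# A value that never occurred in training ("Z" here) is turned into NA.
unseen <- factor(c("B", "Z"), levels = levels(train$f))
print(is.na(unseen))
```

Rows where a factor comes out as NA would need separate handling before prediction, since the model has no coefficient for a level it never saw.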