Hello everyone,

I am having a problem using the predict() (or predict.glm()) function in R. Basically, I fit a glm model on a "training" data set and try to obtain predictions for a set of new predictors from a "test" data set (i.e., not the predictors that were used to obtain the glm parameter estimates). Unfortunately, every time I attempt this, I obtain the predictions for the predictors that were used to fit the glm model. I have looked at the R mailing list archives and don't believe I am making the same mistakes that have been made in the past, and I have also tried to closely follow the predict.glm() example in the help file. Here is an example of what I am trying to do:

########################################################
set.seed(545345)

################
# Necessary Variables #
################

p <- 2
train.n <- 20
test.n <- 25
mean.vec.1 <- c(1,1)
mean.vec.2 <- c(0,0)

Sigma.1 <- matrix(c(1,.5,.5,1),p,p)
Sigma.2 <- matrix(c(1,.5,.5,1),p,p)

###############
# Load MASS Library #
###############

library(MASS)

###################################
# Data to Parameters for Logistic Regression Model #
###################################

train.data.1 <- mvrnorm(train.n,mu=mean.vec.1,Sigma=Sigma.1)
train.data.2 <- mvrnorm(train.n,mu=mean.vec.2,Sigma=Sigma.2)
train.class.var <- as.factor(c(rep(1,train.n),rep(2,train.n)))
predictors.train <- rbind(train.data.1,train.data.2)

##############################################
# Test Data Where Predictions for Probabilities Using Logistic Reg. #
# From Training Data are of Interest                                #
##############################################

test.data.1 <- mvrnorm(test.n,mu=mean.vec.1,Sigma=Sigma.1)
test.data.2 <- mvrnorm(test.n,mu=mean.vec.2,Sigma=Sigma.2)
predictors.test <- rbind(test.data.1,test.data.2)

##############################
# Run Logistic Regression on Training Data #
##############################

log.reg <- glm(train.class.var~predictors.train,
               family=binomial(link="logit"))
log.reg

#> log.reg
#Call:  glm(formula = train.class.var ~ predictors.train,
#           family = binomial(link = "logit"))
#
#Coefficients:
#      (Intercept)  predictors.train1  predictors.train2
#           0.5105            -0.2945            -1.0811
#
#Degrees of Freedom: 39 Total (i.e. Null);  37 Residual
#Null Deviance:     55.45
#Residual Deviance: 41.67        AIC: 47.67

###########################
# Predicted Probabilities for Test Data #
###########################

New.Data <- data.frame(predictors.train1=predictors.test[,1],
                       predictors.train2=predictors.test[,2])

logreg.pred.prob.test <- predict.glm(log.reg,New.Data,type="response")
logreg.pred.prob.test

#logreg.pred.prob.test
# [1] 0.51106406 0.15597423 0.04948404 0.03863875 0.35587589 0.71331091
# [7] 0.17320087 0.14176632 0.30966718 0.61878952 0.12525988 0.21271139
#[13] 0.70068113 0.18340723 0.10295501 0.44591568 0.72285161 0.31499339
#[19] 0.65789420 0.42750139 0.14435889 0.93008117 0.70798465 0.80109005
#[25] 0.89161472 0.47480625 0.56520952 0.63981834 0.57595189 0.60075882
#[31] 0.96493393 0.77015507 0.87643986 0.62973986 0.63043351 0.45398955
#[37] 0.80855782 0.90835588 0.54809117 0.11568637
########################################################

Of course, notice that the vector of predicted probabilities has only 40 elements, while "New.Data" has 50 rows (since test.n is 25 per group for 2 groups) and thus there should be 50 predicted probabilities. As it turns out, the output is for the training data predictors and not for the "New.Data" as I would like it to be.
I should also note that I have made sure that the names of the predictors in "New.Data" are the same as the names of the predictors within the glm object (i.e., within "log.reg"), as this is what is done in the example for predict.glm() within the help files.

Could someone help me understand either what I am doing incorrectly or what problems there might be within the predict() function? I should mention that I tried the same program using predict.glm() and obtained the same problematic results.

Thanks and take care,

Joe

Joe Rausch, M.A.
Psychology Liaison
Lab for Social Research
917 Flanner Hall
University of Notre Dame
Notre Dame, IN 46556
(574) 631-3910

"If we knew what it was we were doing, it would not be called research, would it?"
- Albert Einstein
Perhaps your approach reflects a method of producing a prediction data frame that is simply unfamiliar to me, but it looks to me as though you have created two predictor variables based on the column names of the original predictor (predictors.train1, predictors.train2). I don't know how the glm function would know that predictors.train1 and predictors.train2 are two stand-ins for predictors.train. Maybe try using just one prediction variable, and give it the original variable name (predictors.train). If this works, just repeat for your second set of values.

Mark Fowler
Marine Fish Division
Bedford Inst of Oceanography
Dept Fisheries & Oceans
Dartmouth NS Canada
fowlerm at mar.dfo-mpo.gc.ca

-----Original Message-----
From: jrausch at nd.edu [mailto:jrausch at nd.edu]
Sent: September 22, 2004 2:53 PM
To: r-help at stat.math.ethz.ch
Subject: [R] Issue with predict() for glm models

[...]
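One quick way to see why the names in "New.Data" are never matched is to ask the fitted object which variables its formula actually uses. A minimal sketch, assuming the objects from the original post are still in the workspace:

# The formula only knows "train.class.var" and "predictors.train";
# there is no "predictors.train1" or "predictors.train2", so predict()
# cannot find its predictor in New.Data and falls back to the variable
# in the formula's environment, i.e. the 40 training rows.
all.vars(formula(log.reg))

# The model frame shows the same thing: one response column plus a
# single matrix-valued predictor column named "predictors.train".
str(model.frame(log.reg))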
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of John Fox
> Sent: Thursday, September 23, 2004 8:49 AM
> To: 'Uwe Ligges'
> Cc: r-help at stat.math.ethz.ch
> Subject: RE: [R] Issue with predict() for glm models
>
> Dear Uwe,
>
> Unless I've somehow messed this up, as I mentioned yesterday, what you
> suggest doesn't seem to work when the predictor is a matrix. Here's a
> simplified example:
>
> > X <- matrix(rnorm(200), 100, 2)
> > y <- (X %*% c(1,2) + rnorm(100)) > 0
> > dat <- data.frame(y=y, X=X)
> > mod <- glm(y ~ X, family=binomial, data=dat)
> > new <- data.frame(X = matrix(rnorm(20), 10, 2))
> > predict(mod, new)
>            1            2            3            4            5            6
>   1.81224443  -5.92955128   1.98718051 -10.05331521   2.65065555  -2.50635812
>            7            8            9           10           11           12
>   5.63728698  -0.94845276  -3.61657377  -1.63141320   5.03417372   1.80400271
>           13           14           15           16           17           18
>   9.32876273  -5.32723406   5.29373023  -3.90822713 -10.95065186   4.90038016
>
> . . .
>
>           97           98           99          100
>  -6.92509812   0.59357486  -1.17205723   0.04209578
>
> Note that there are 100 rather than 10 predicted values.
>
> But with individual predictors (rather than a matrix),
>
> > x1 <- X[,1]
> > x2 <- X[,2]
> > dat.2 <- data.frame(y=y, x1=x1, x2=x2)
> > mod.2 <- glm(y ~ x1 + x2, family=binomial, data=dat.2)
> > new.2 <- data.frame(x1=rnorm(10), x2=rnorm(10))
> > predict(mod.2, new.2)
>          1          2          3          4          5          6          7
>  6.5723823  0.6356392  4.0291018 -4.7914650  2.1435485 -3.1738096 -2.8261585
>          8          9         10
> -1.5255329 -4.7087592  4.0619290
>
> works as expected (?).
>
> Regards,
> John

My apologies: I have not kept up with the earlier messages in this thread, so this might be covered already:

> X <- matrix(rnorm(200), 100, 2)
> y <- (X %*% c(1,2) + rnorm(100)) > 0
> dat <- data.frame(y=y, X=X)
> mod <- glm(y ~ X, family=binomial, data=dat)
> newmat <- matrix(rnorm(20), ncol=2)
> new <- data.frame(X = newmat)
> new2 <- data.frame(X = I(newmat))
> str(predict(mod, new))
 Named num [1:100] 3.30 8.08 4.65 4.72 2.15 ...
 - attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
> str(predict(mod, new2))
 Named num [1:10] 7.358 0.241 -5.418 2.094 -0.885 ...
 - attr(*, "names")= chr [1:10] "1" "2" "3" "4" ...

The bottom line is that the data frame passed to predict() must have variables that match exactly what is used in the original formula. Passing `new' doesn't work as expected because it is a data frame with two variables, rather than one variable whose name is `X' and which is a two-column matrix.

Best,
Andy

> > From: Uwe Ligges
> >
> > jrausch at nd.edu wrote:
> >
> > > Hello everyone, I am having a problem using the predict (or the
> > > predict.glm) function in R. [...]
> > > Here is an example of what I am trying to do:
> > >
> > > [...]
> > >
> > > ##############################
> > > # Run Logistic Regression on Training Data #
> > > ##############################
> > >
> > > log.reg <- glm(train.class.var~predictors.train,
> > >                family=binomial(link="logit"))
> >
> > Well, you haven't specified the "data" argument, but have given the
> > two variables directly. Exactly those variables will be used in the
> > predict() step below! If you want the predict() step to work, use
> > something like:
> >
> > train <- data.frame(class = train.class.var,
> >                     predictors = predictors.train)
> > log.reg <- glm(class ~ ., data = train,
> >                family=binomial(link="logit"))
> >
> > > [...]
> > >
> > > ###########################
> > > # Predicted Probabilities for Test Data #
> > > ###########################
> > >
> > > New.Data <- data.frame(predictors.train1=predictors.test[,1],
> > >                        predictors.train2=predictors.test[,2])
> > >
> > > logreg.pred.prob.test <- predict.glm(log.reg,New.Data,type="response")
> > > logreg.pred.prob.test
> >
> > Instead, use:
> >
> > test <- data.frame(predictors = predictors.test)
> > predict(log.reg, newdata = test, type="response")
> >
> > Note also: please call the generic predict() rather than its
> > glm method.
> >
> > Uwe Ligges
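Applying Andy's I() observation to the original example: if the model is left exactly as fitted (a single matrix-valued predictor named predictors.train), the new data frame needs a single matrix column with that same name. A minimal sketch, assuming the objects from the original post; the AsIs wrapper keeps the 50 x 2 test matrix together as one column:

# Wrap the test matrix in I() so data.frame() stores it as one
# matrix-valued column named "predictors.train", matching the formula.
New.Data.2 <- data.frame(predictors.train = I(predictors.test))

# predict() now finds "predictors.train" in newdata and returns one
# fitted probability per test row (50 here), not per training row.
test.prob <- predict(log.reg, newdata = New.Data.2, type = "response")
length(test.prob)  # should be 50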
Thanks to John Fox, Andy Liaw, and Uwe Ligges for their help with my problem regarding the use of the predict() function to obtain predictions for a new set of predictor values. It appears that the bottom line (at least for my purposes) is that the names and the layout of the predictors in the glm fit and in the new data need to be consistent. The safest way I know to do this, from reading John's, Andy's, and Uwe's responses, is to name each predictor separately and enter the predictors into the glm model separately. Then, when creating a new data frame to use with the predict() function, name the predictors consistently with the fit. For illustrative examples, see the replies from John, Andy, and Uwe.

Thanks again,

Joe

Joe Rausch, M.A.
Psychology Liaison
Lab for Social Research
917 Flanner Hall
University of Notre Dame
Notre Dame, IN 46556
(574) 631-3910
www.nd.edu/~jrausch

"If we knew what it was we were doing, it would not be called research, would it?"
- Albert Einstein
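For completeness, a compact sketch of the pattern the summary describes (fitting from a data frame with explicitly named predictors, along the lines Uwe suggested); the column names x1 and x2 are illustrative, and the point is only that the predictor names in the fitting data and in newdata are identical:

# Fit from a data frame with named predictor columns.
train <- data.frame(class = train.class.var,
                    x1 = predictors.train[, 1],
                    x2 = predictors.train[, 2])
log.reg.2 <- glm(class ~ x1 + x2, data = train,
                 family = binomial(link = "logit"))

# The new data uses exactly the same predictor names as the fit.
test <- data.frame(x1 = predictors.test[, 1],
                   x2 = predictors.test[, 2])

# One predicted probability per row of the test data (50 here).
predict(log.reg.2, newdata = test, type = "response")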