Hi,

It is most probably just my R ignorance, but I have the following problem
with predict(). I train the model using 164 cases and then try to use it
on a data set with 35 cases, but I am getting 164 predictions?

The R code below illustrates in more detail what I am doing.

Truly yours,

R

train = read.csv("train.csv", header = TRUE, row.names = "mol", comment.char="")
yr <- train[,1]   # take Y from 1 column
xr <- train[,-1]  # X is the rest
xr <- scale(xr)   # matrix <- scale(data.frame)
x.center <- attr(xr, "scaled:center")
x.scale <- attr(xr, "scaled:scale")
mask <- apply(xr, 2, function(x) any(is.na(x)))
xr <- xr[,!mask]  # rm NA's
model <- lm(yr ~ xr) # fit linear model

test <- read.csv("test.csv", header = TRUE, row.names = "mol", comment.char="")
ys <- test[,1]
xs <- test[,-1]
xs <- scale(xs, center = x.center, scale = x.scale)
xs <- xs[,!mask]
xs <- as.data.frame(xs)

pr <- predict(model, as.data.frame(xr))
ps <- predict(model, xs)

cat("length(yr) =", length(yr), "; length(pr) =", length(pr),"\n")
cat("dim(xr) =", dim(xr), "; dim(xs) =", dim(xs),"\n")
cat("length(ys) =", length(ys), "; length(ps) =", length(ps), "\n")
cat("why length(ps) != length(ys) ???\n")

# my output:
#
# length(yr) = 164 ; length(pr) = 164
# dim(xr) = 164 118 ; dim(xs) = 35 118
# length(ys) = 35 ; length(ps) = 164
# why length(ps) != length(ys) ???

Ryszard Czerminski   phone: (781)994-0479
ArQule, Inc.         email:ryszard at arqule.com
19 Presidential Way  http://www.arqule.com
Woburn, MA 01801     fax: (781)994-0679

-----Original Message-----
From: Czerminski, Ryszard [mailto:ryszard at arqule.com]
Sent: Thursday, June 20, 2002 12:15 PM
To: r-help at stat.math.ethz.ch
Subject: [R] dist(a,b) ???

Is there a function analogous to "dist" which would calculate distances
between rows of two different data sets?

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
> Does xs contain a variable called xr?

No, it does not.

I was expecting that if I use a model built using (yr,xr) data with
164 cases and 118 variables to predict on test data (xs) with 35 cases

> # dim(xr) = 164 118 ; dim(xs) = 35 118

I should get a vector of 35 responses from predict(), but instead I am
getting 164 responses.

> # length(ys) = 35 ; length(ps) = 164

If I go through the example provided with help(predict.lm) I get the
expected number of responses (13). This is, however, only a 1D example.

I am sure I am missing something and probably not using predict()
correctly, but I am at a loss as to what it is...

R

Ryszard Czerminski

-----Original Message-----
From: Peter Dalgaard BSA [mailto:p.dalgaard at biostat.ku.dk]
Sent: Thursday, June 20, 2002 5:10 PM
To: Czerminski, Ryszard
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] problem with predict()

"Czerminski, Ryszard" <ryszard at arqule.com> writes:

> pr <- predict(model, as.data.frame(xr))
> ps <- predict(model, xs)
>
> cat("length(yr) =", length(yr), "; length(pr) =", length(pr),"\n")
> cat("dim(xr) =", dim(xr), "; dim(xs) =", dim(xs),"\n")
> cat("length(ys) =", length(ys), "; length(ps) =", length(ps), "\n")
> cat("why length(ps) != length(ys) ???\n")
>
> # my output:
> #
> # length(yr) = 164 ; length(pr) = 164
> # dim(xr) = 164 118 ; dim(xs) = 35 118
> # length(ys) = 35 ; length(ps) = 164
> # why length(ps) != length(ys) ???

Does xs contain a variable called xr?

--
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
The second argument to predict.lm, `newdata', is supposed to be a data frame
containing the variables that were used to fit the model. Thus if in your lm
object the predictor is a variable named `xr', the predict method will be
looking for `xr', and will ignore the `xs' that you passed in.

It's probably easier to do the following:

## Put training data in a data frame.
train <- data.frame(y=yr, x=xr)
## Put test data in a data frame.
test <- data.frame(y=ys, x=xs)

myfit <- lm(y ~ x, train)
test.pred <- predict(myfit, test)

Hope this helps,
Andy

> -----Original Message-----
> From: Czerminski, Ryszard [mailto:ryszard at arqule.com]
> Sent: Friday, June 21, 2002 8:10 AM
> To: 'Peter Dalgaard BSA'; Czerminski, Ryszard
> Cc: r-help at stat.math.ethz.ch
> Subject: RE: [R] problem with predict()
>
> [snip]

------------------------------------------------------------------------------
Notice: This e-mail message, together with any attachments, contains
information of Merck & Co., Inc. (Whitehouse Station, New Jersey, USA) that
may be confidential, proprietary copyrighted and/or legally privileged, and
is intended solely for the use of the individual or entity named on this
message. If you are not the intended recipient, and have received this
message in error, please immediately return this by e-mail and then delete
it.
==============================================================================
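The point about `newdata' can be reproduced on simulated data. The sketch
below is only an illustration: the sizes and the names n_train, n_test and
xs_df are made up, not taken from the thread. Because the model is fitted as
lm(yr ~ xr), predict() looks for a variable called xr; newdata has none, so
the xr in the workspace is used and the result has the length of the
training set.

```r
set.seed(1)
n_train <- 20; n_test <- 5; p <- 3             # illustrative sizes only

yr <- rnorm(n_train)
xr <- matrix(rnorm(n_train * p), n_train, p)   # predictors kept as a matrix
xs_df <- as.data.frame(matrix(rnorm(n_test * p), n_test, p))

model <- lm(yr ~ xr)                           # the formula refers to the object `xr`

# newdata has no column named `xr`, so `xr` is taken from the workspace.
# Recent R versions warn about the row mismatch, hence suppressWarnings().
ps <- suppressWarnings(predict(model, xs_df))

length(ps)                                     # n_train, not n_test
"xr" %in% names(xs_df)                         # FALSE: the question Peter asked
```

Checking names(newdata) against the variables in the model formula is a quick
way to spot this situation.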
From Liaw, Andy:

> It's probably easier to do the following:
> ## Put training data in a data frame.
> train <- data.frame(y=yr, x=xr)
> ## Put test data in a data frame.
> test <- data.frame(y=ys, x=xs)
>
> myfit <- lm(y ~ x, train)
> test.pred <- predict(myfit, test)

This looks promising; however, I get an error:

> train <- data.frame(y=yr, x=xr)
> test <- data.frame(y=ys, x=xs)
> myfit <- lm(y ~ x, train)
Error in eval(expr, envir, enclos) : Object "x" not found

R

Ryszard Czerminski

-----Original Message-----
From: Liaw, Andy [mailto:andy_liaw at merck.com]
Sent: Friday, June 21, 2002 8:34 AM
To: 'Czerminski, Ryszard'
Cc: r-help at stat.math.ethz.ch
Subject: RE: [R] problem with predict()

[snip]
You still don't get the point. Please read Peter Dalgaard's reply and the
help page for predict.lm carefully, and try to understand the `Details'
section. See the example below:

> n<-100
> p<-10
> y<-rnorm(n)
> xtr<-matrix(rnorm(n*p), n, p)
> xts<- matrix(rnorm(5*p), 5, p)
> myfit<-lm(y~.,data=as.data.frame(xtr))
> length(predict(myfit,as.data.frame(xts)))
[1] 5

Andy

> -----Original Message-----
> From: Czerminski, Ryszard [mailto:ryszard at arqule.com]
> Sent: Friday, June 21, 2002 10:21 AM
> To: 'Liaw, Andy'; Czerminski, Ryszard
> Cc: r-help at stat.math.ethz.ch
> Subject: RE: [R] problem with predict()
>
> [snip]
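Applied to the workflow from the original post, the pattern in Andy's example
might look like the sketch below. Simulated data stand in for train.csv and
test.csv, whose contents are not shown in the thread, and the sizes and the
name keep are arbitrary choices for illustration. The idea is that both data
frames carry the same predictor column names, so predict() finds every
variable it needs in newdata.

```r
set.seed(42)
n_train <- 30; n_test <- 7; p <- 4             # illustrative sizes only

# Stand-ins for the CSV files: first column is the response
train.data <- data.frame(matrix(rnorm(n_train * (p + 1)), nrow = n_train))
test.data  <- data.frame(matrix(rnorm(n_test  * (p + 1)), nrow = n_test))

yr <- train.data[, 1]
xr <- scale(train.data[, -1])                  # scale() returns a matrix
x.center <- attr(xr, "scaled:center")
x.scale  <- attr(xr, "scaled:scale")
keep <- !apply(xr, 2, function(col) any(is.na(col)))

# Scale the test predictors with the training centre and scale
xs <- scale(test.data[, -1], center = x.center, scale = x.scale)

# Unnamed matrix arguments keep their own column names in data.frame()
train <- data.frame(y = yr, xr[, keep, drop = FALSE])
test  <- data.frame(xs[, keep, drop = FALSE])

myfit <- lm(y ~ ., data = train)
test.pred <- predict(myfit, newdata = test)
length(test.pred)                              # one prediction per test row
```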
The problem is that xr and xs are both matrices in his example, not vectors.

Andy

> -----Original Message-----
> From: Peter Dalgaard BSA [mailto:p.dalgaard at biostat.ku.dk]
> Sent: Friday, June 21, 2002 1:03 PM
> To: Liaw, Andy
> Cc: 'Czerminski, Ryszard'; r-help at stat.math.ethz.ch
> Subject: Re: [R] problem with predict()
>
> "Liaw, Andy" <andy_liaw at merck.com> writes:
>
> > You still don't get the point. Please read Peter Dalgaard's reply and
> > the help page for predict.lm carefully, and try to understand the
> > `Details' section. See the example below:
> [snip]
> > > This looks promising; however, I get an error:
> > >
> > > > train <- data.frame(y=yr, x=xr)
> > > > test <- data.frame(y=ys, x=xs)
> > > > myfit <- lm(y ~ x, train)
> > > Error in eval(expr, envir, enclos) : Object "x" not found
>
> But there's nothing wrong with that code as far as I can see?? I don't
> get an error from it:
>
> > xr <- rnorm(10)
> > yr <- rnorm(10)
> > ys <- rnorm(5)
> > xs <- rnorm(5)
> > train <- data.frame(y=yr, x=xr)
> > test <- data.frame(y=ys, x=xs)
> > myfit <- lm(y ~ x, train)
> > predict(myfit,test)
>           1           2           3           4           5
> -0.03809295  0.11422384  0.35570765  0.55436954  0.22979523
>
> Something must have gone wrong with the creation of "train".
>
> --
>    O__  ---- Peter Dalgaard             Blegdamsvej 3
>   c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
>  (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
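The difference between the matrix and the vector case shows up in the column
names of the data frame. A small sketch on simulated data (sizes and the
names train1, fit1 and res are made up for illustration): with a matrix,
data.frame(y = yr, x = xr) produces columns x.1, x.2, ... and no column
called x, which is why lm(y ~ x, train) cannot find `x'.

```r
set.seed(2)
yr <- rnorm(10)
xr <- matrix(rnorm(10 * 2), 10, 2)      # a matrix, as in the original post

train <- data.frame(y = yr, x = xr)
names(train)                            # "y" "x.1" "x.2": no column called "x"

# lm(y ~ x, train) fails unless some other object named `x` happens to exist
res <- try(lm(y ~ x, data = train), silent = TRUE)
inherits(res, "try-error")

# With a vector, as in Peter's test, the column really is called "x"
train1 <- data.frame(y = yr, x = xr[, 1])
names(train1)                           # "y" "x"
fit1 <- lm(y ~ x, data = train1)
```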
Thank you for the great support so far!
I think I am getting closer, but I still do not quite get it...

Two questions:

(1) What is the difference between lm(y~., and lm(y~x, and why does the
second form fail?

> train <- data.frame(y = yr, x = xr)
> test <- data.frame(y = ys, x = xs)
> model <- lm(y~., train)
> model <- lm(y~x, train)
Error in eval(expr, envir, enclos) : Object "x" not found

(2) The other problem seems to be data related...

Consider the following code:

:::
rm(list=ls())

train.data <- read.csv("train.csv", header = TRUE, row.names = "mol",
                       comment.char="")
test.data <- read.csv("test.csv", header = TRUE, row.names = "mol",
                      comment.char="")

#train.data <- matrix(rnorm(164*119), nrow = 164)
#test.data <- matrix(rnorm(35*119), nrow = 35)

yr <- train.data[,1]; xr <- train.data[,-1]
xr <- scale(xr) # matrix <- scale(data.frame)
x.center <- attr(xr, "scaled:center"); x.scale <- attr(xr, "scaled:scale")
mask <- apply(xr, 2, function(x) any(is.na(x)))
xr <- xr[,!mask] # rm NA's
ys <- test.data[,1]; xs <- test.data[,-1]
xs <- scale(xs, center = x.center, scale = x.scale)
xs <- xs[,!mask]
train <- data.frame(y = yr, x = xr)
test <- data.frame(y = ys, x = xs)
model <- lm(y~., train)
length(predict(model, test))
::::

If I execute it twice, with (S) simulated data and (R) "real" data, I get:

::: for simulated data :::
dim(train) = 164 119 ; dim(test) = 35 119
> length(predict(model, test))
[1] 35

::: for real data :::
dim(train) = 164 119 ; dim(test) = 35 119
> length(predict(model, test))
Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) :
        subscript out of bounds

The shape of the data seems to be the same in both cases, and the only
difference (as far as I can tell) is in the actual values.

R

Ryszard Czerminski

-----Original Message-----
From: Liaw, Andy [mailto:andy_liaw at merck.com]
Sent: Friday, June 21, 2002 1:06 PM
To: 'Peter Dalgaard BSA'
Cc: 'Czerminski, Ryszard'; r-help at stat.math.ethz.ch
Subject: RE: [R] problem with predict()

[snip]
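The thread does not establish why the real data fail while the simulated data
work. Two checks that might help narrow it down are sketched below on made-up
data: whether train and test carry the same predictor columns, and whether the
fit is rank deficient (some coefficients NA because columns are linearly
dependent). The data frame d and the columns a, b, c are hypothetical, and
rank deficiency is only a guess at the cause, not something the posters
confirmed.

```r
set.seed(3)
d <- data.frame(y = rnorm(20), a = rnorm(20), b = rnorm(20))
d$c <- d$a + d$b                   # exact linear combination of a and b
new <- d[1:5, c("a", "b", "c")]    # stand-in for the test set

fit <- lm(y ~ ., data = d)

# Check 1: predictor columns present in train but missing from test
missing_cols <- setdiff(setdiff(names(d), "y"), names(new))
missing_cols                       # character(0) when nothing is missing

# Check 2: is the fit rank deficient, and which coefficients are aliased?
n_coef  <- length(coef(fit))
aliased <- names(coef(fit))[is.na(coef(fit))]
fit$rank < n_coef                  # TRUE for this constructed example
aliased                            # the dependent column
```

If the second check reports aliased coefficients on the real data, dropping
or combining the dependent columns before fitting would be one thing to try.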
--- first problem

If I store 'simulated' data in data frames:

# train.data <- data.frame(matrix(rnorm(164*119), nrow = 164))
# test.data <- data.frame(matrix(rnorm(35*119), nrow = 35))

it still works the same way, i.e. the code below works fine for simulated
data and fails for 'real' data, the only difference being the actual numeric
values stored in data structures of the same shape and type.

Any suggestions why this happens?

--- second problem

> As Andy Liaw pointed out, xr is a matrix. Take a look at the names of
> train. Hint: they do not contain `x'.

Following your hint, I am guessing that the fact that the names do not
contain 'x' explains why the lm(y~., train) form works and lm(y~x, train)
fails, and that "lm(y~., train)" means roughly: correlate column "y" to all
other columns.

Where can I find a more detailed specification of this syntax?
In help(lm) I find this paragraph:

    Models for `lm' are specified symbolically. A typical model has
    the form `response ~ terms' where `response' is the (numeric)...

which does not quite cover this case.

Ryszard Czerminski phone: (781)994-0479
ArQule, Inc. email:ryszard at arqule.com
19 Presidential Way http://www.arqule.com
Woburn, MA 01801 fax: (781)994-0679

-----Original Message-----
From: ripley at stats.ox.ac.uk [mailto:ripley at stats.ox.ac.uk]
Sent: Friday, June 21, 2002 2:31 PM
To: Czerminski, Ryszard
Cc: r-help at stat.math.ethz.ch
Subject: RE: [R] problem with predict()

As Andy Liaw pointed out, xr is a matrix. Take a look at the names of
train. Hint: they do not contain `x'.

Similarly, in your `simulated' example you have matrices and not data
frames. *Store your data in data frames* and you may be less confused.

On Fri, 21 Jun 2002, Czerminski, Ryszard wrote:

> Thank you for great support so far !
> I think I am getting closer, but I still do not quite get it...
>
> Two questions:
>
> (1) what is the difference between lm(y~., and lm(y~x, ???
> with second form failing ?
>
> > train <- data.frame(y = yr, x = xr)
> > test <- data.frame(y = ys, x = xs)
> > model <- lm(y~., train)
> > model <- lm(y~x, train)
> Error in eval(expr, envir, enclos) : Object "x" not found
>
> (2) and the other problem seems to be data related...
>
> Consider following code:
>
> :::
> rm(list=ls())
>
> train.data <- read.csv("train.csv", header = TRUE, row.names = "mol", comment.char="")
> test.data <- read.csv("test.csv", header = TRUE, row.names = "mol", comment.char="")
>
> #train.data <- matrix(rnorm(164*119), nrow = 164)
> #test.data <- matrix(rnorm(35*119), nrow = 35)
>
> yr <- train.data[,1]; xr <- train.data[,-1]
> xr <- scale(xr) # matrix <- scale(data.frame)
> x.center <- attr(xr, "scaled:center"); x.scale <- attr(xr, "scaled:scale")
> mask <- apply(xr, 2, function(x) any(is.na(x)))
> xr <- xr[,!mask] # rm NA's
> ys <- test.data[,1]; xs <- test.data[,-1]
> xs <- scale(xs, center = x.center, scale = x.scale)
> xs <- xs[,!mask]
> train <- data.frame(y = yr, x = xr)
> test <- data.frame(y = ys, x = xs)
> model <- lm(y~., train)
> length(predict(model, test))
> ::::
>
> and execute it twice with: (S) simulated data and (R) "real" data I get:
>
> ::: for simulated data :::
> dim(train) = 164 119 ; dim(test) = 35 119
> > length(predict(model, test))
> [1] 35
>
> ::: for real data :::
> dim(train) = 164 119 ; dim(test) = 35 119
> > length(predict(model, test))
> Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) :
> subscript out of bounds
>
> The shape of data seems to be the same in both cases and
> the only difference (as far as I can tell) is in actual values
>
> R
>
> Ryszard Czerminski phone: (781)994-0479
> ArQule, Inc. email:ryszard at arqule.com
> 19 Presidential Way http://www.arqule.com
> Woburn, MA 01801 fax: (781)994-0679
>
> [snip]

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595
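The hint about the names of train can be checked on a small example. The following is only a sketch on simulated data: the sizes and the column names "a" and "b" are invented for illustration and are not taken from the original files.

```r
# Hypothetical small data; sizes and column names are invented for illustration.
set.seed(1)
xr <- matrix(rnorm(20), nrow = 10, dimnames = list(NULL, c("a", "b")))
yr <- rnorm(10)

# Passing a matrix as `x = xr` expands it into one column per matrix column,
# each named with the prefix "x."; no column called "x" is created.
train <- data.frame(y = yr, x = xr)
print(names(train))

# y ~ . uses every column of `train` other than y, so it finds x.a and x.b.
fit_dot <- lm(y ~ ., data = train)
print(names(coef(fit_dot)))

# y ~ x looks for a variable named x, which `train` does not contain. Unless an
# object called x happens to exist in the workspace, this call fails.
res <- tryCatch(lm(y ~ x, data = train),
                error = function(e) conditionMessage(e))
print(res)
```

Printing names(train) before fitting is usually the quickest way to see which variable names a formula can refer to.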
> From: Czerminski, Ryszard [mailto:ryszard at arqule.com]
>
> --- first problem
>
> If I store 'simulated' data in data frames:
> # train.data <- data.frame(matrix(rnorm(164*119), nrow = 164))
> # test.data <- data.frame(matrix(rnorm(35*119), nrow = 35))
> it still works the same way i.e. the code below works fine
> for simulated data and fails for 'real' data the only
> difference being in actual numeric values stored in data
> structures of the same shape and type.
>
> Any suggestions why this happens ?

As Prof. Ripley hinted, check the names of the data frames. For your "real"
data, do

names(train)
names(test)

to see how they are different.

> --- second problem
>
> > As Andy Liaw pointed out, xr is a matrix. Take a look at the names of
> > train. Hint: they do not contain `x'.
>
> Following your hint I am guessing that the fact that names do not contain
> 'x' explains why lm(y~., train) form works and lm(y~x, train) fails
> and "lm(y~., train)" means roughly: correlate column "y" to all other
> columns
>
> Where I can find more detail specification of this syntax ?
> In help(lm) I find this paragraph:
>
> Models for `lm' are specified symbolically. A typical model has
> the form `response ~ terms' where `response' is the (numeric)...
>
> which does not quite cover this case.

Read the `An Introduction to R' manual, especially parts of Chapter 5.
(The `.' shorthand is explained in section 11.5.)

Andy
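As a rough illustration of the `.' shorthand, the sketch below fits the same model twice on simulated data, once with `.' and once with the predictors written out. The data frame d and its columns a and b are made up for the example.

```r
# Hypothetical data frame with a response y and two predictors a and b.
set.seed(2)
d <- data.frame(y = rnorm(12), a = rnorm(12), b = rnorm(12))

# `.` stands for all columns of `data` that do not appear on the left-hand side.
fit_dot      <- lm(y ~ ., data = d)
fit_explicit <- lm(y ~ a + b, data = d)

# Both calls describe the same regression, so the coefficients agree.
print(all.equal(coef(fit_dot), coef(fit_explicit)))

# The expanded right-hand side is stored in the terms of the fitted model.
print(attr(terms(fit_dot), "term.labels"))
```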
# Yes. You are *still* using a matrix in a data frame. Please do read more
# carefully.

I have read some more R documentation, trying to understand the difference
between matrices and data frames etc., and I repeat my example, this time
executing EXACTLY the same code, the only difference being that in one case
I use smaller data sets ({train,test}-small.csv) and in the second case I
use larger data sets ({train,test}.csv). I get different behaviour.

The small case (10*4) runs fine; the larger case (164*119) fails.

Any ideas why this happens?

R

> rm(list=ls())
> train.data <- read.csv("train-small.csv", header = TRUE, row.names = "mol", comment.char="")
> test.data <- read.csv("test-small.csv", header = TRUE, row.names = "mol", comment.char="")
> yr <- train.data[,1]; xr <- train.data[,-1]
> xr <- scale(xr)
> x.center <- attr(xr, "scaled:center"); x.scale <- attr(xr, "scaled:scale")
> mask <- apply(xr, 2, function(x) any(is.na(x)))
> xr <- xr[,!mask] # rm NA's
> ys <- test.data[,1]; xs <- test.data[,-1]
> xs <- scale(xs, center = x.center, scale = x.scale)
> xs <- xs[,!mask]
> train <- data.frame(y = yr, x = xr)
> test <- data.frame(y = ys, x = xs)
> model <- lm(y~., train)
> cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")
dim(train) = 10 4 ; dim(test) = 10 4
> length(predict(model, test))
[1] 10
> train.data <- read.csv("train.csv", header = TRUE, row.names = "mol", comment.char="")
> test.data <- read.csv("test.csv", header = TRUE, row.names = "mol", comment.char="")
[snip...]
> cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")
dim(train) = 164 119 ; dim(test) = 35 119
> length(predict(model, test))
Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) :
        subscript out of bounds
>

Ryszard Czerminski phone: (781)994-0479
ArQule, Inc. email:ryszard at arqule.com
19 Presidential Way http://www.arqule.com
Woburn, MA 01801 fax: (781)994-0679

-----Original Message-----
From: ripley at stats.ox.ac.uk [mailto:ripley at stats.ox.ac.uk]
Sent: Friday, June 21, 2002 3:41 PM
To: Czerminski, Ryszard
Cc: r-help at stat.math.ethz.ch
Subject: RE: [R] problem with predict()

On Fri, 21 Jun 2002, Czerminski, Ryszard wrote:

> --- first problem
>
> If I store 'simulated' data in data frames:
> # train.data <- data.frame(matrix(rnorm(164*119), nrow = 164))
> # test.data <- data.frame(matrix(rnorm(35*119), nrow = 35))
> it still works the same way i.e. the code below works fine
> for simulated data and fails for 'real' data the only
> difference being in actual numeric values stored in data
> structures of the same shape and type.
>
> Any suggestions why this happens ?

Yes. You are *still* using a matrix in a data frame. Please do read more
carefully.

> --- second problem
>
> > As Andy Liaw pointed out, xr is a matrix. Take a look at the names of
> > train. Hint: they do not contain `x'.
>
> Following your hint I am guessing that the fact that names do not contain
> 'x' explains why lm(y~., train) form works and lm(y~x, train) fails
> and "lm(y~., train)" means roughly: correlate column "y" to all other
> columns

No, it means regress y on all the remaining columns in the data argument.

> Where I can find more detail specification of this syntax ?
> In help(lm) I find this paragraph:
>
> Models for `lm' are specified symbolically. A typical model has
> the form `response ~ terms' where `response' is the (numeric)...
>
> which does not quite cover this case.

In any good book on the subject.
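One way to follow the "store your data in data frames" advice while keeping the scaling step is sketched below. This is only an illustration on simulated data, not the original files: the column names a, b and k are invented, and k is a deliberately constant column so that the NA mask has something to remove.

```r
# Hypothetical stand-ins for train.csv and test.csv.
set.seed(3)
train.data <- data.frame(y = rnorm(30), a = rnorm(30), b = rnorm(30), k = 1)
test.data  <- data.frame(y = rnorm(8),  a = rnorm(8),  b = rnorm(8),  k = 1)

# Scale the training predictors and remember the centring and scaling constants.
xr <- scale(train.data[, -1])
x.center <- attr(xr, "scaled:center")
x.scale  <- attr(xr, "scaled:scale")

# The constant column k has zero standard deviation, so scaling yields NaN.
mask <- apply(xr, 2, function(col) any(is.na(col)))

# Apply the *training* constants to the test predictors.
xs <- scale(test.data[, -1], center = x.center, scale = x.scale)

# Convert the matrices back to data frames so the predictor columns keep
# their own names (a, b) instead of being stored as a single matrix.
train <- data.frame(y = train.data$y, as.data.frame(xr[, !mask, drop = FALSE]))
test  <- data.frame(y = test.data$y,  as.data.frame(xs[, !mask, drop = FALSE]))

model <- lm(y ~ ., data = train)
print(identical(names(train), names(test)))
print(length(predict(model, test)))
```

With matching column names in both data frames, predict() returns one value per row of test.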
You can try:

names(test) <- names(train)

before calling predict() to make sure that the variable names match.
Without your data files, it's hard to tell why your first example worked.

Andy

> -----Original Message-----
> From: Czerminski, Ryszard [mailto:ryszard at arqule.com]
> Sent: Thursday, June 27, 2002 3:29 PM
> To: 'ripley at stats.ox.ac.uk'; Czerminski, Ryszard
> Cc: r-help at stat.math.ethz.ch
> Subject: RE: [R] problem with predict()
>
> [snip]
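Why matching names matters can be seen in a small sketch. It assumes simulated data and made-up sizes; it reproduces the kind of symptom reported at the start of the thread (as many predictions as training cases), but it is not a diagnosis of the original data.

```r
# Hypothetical data: 20 training cases, 5 test cases, 2 predictors.
set.seed(4)
xr <- matrix(rnorm(40), nrow = 20)
yr <- rnorm(20)
xs <- as.data.frame(matrix(rnorm(10), nrow = 5))  # columns V1, V2; no `xr`

# The model refers to a variable called xr. newdata has no such column, so
# predict() falls back to the xr found in the workspace (the training matrix)
# and only warns about the mismatch in the number of rows.
fit <- lm(yr ~ xr)
p_bad <- suppressWarnings(predict(fit, xs))
print(length(p_bad))

# Fitting from a data frame makes the model refer to column names that
# newdata can actually supply.
train <- data.frame(y = yr, as.data.frame(xr))    # columns y, V1, V2
fit2 <- lm(y ~ ., data = train)

# Predictor names used by the model that are missing from newdata (none here).
print(setdiff(all.vars(delete.response(terms(fit2))), names(xs)))

p_ok <- predict(fit2, xs)
print(length(p_ok))
```

A setdiff() check like the one above is a less drastic alternative to overwriting names(test), since it shows which names differ instead of hiding the difference.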
This time I use the same file for train.data and test.data, throwing in
"names(test) <- names(train)" before predict() for double protection (:-),
and it still fails...

Is it some weird problem with this particular data set? Or a bug?
(Why "subscript out of bounds"?)

> rm(list=ls())
> train.data <- read.csv("train.csv", header = TRUE, row.names = "mol", comment.char="")
> test.data <- read.csv("train.csv", header = TRUE, row.names = "mol", comment.char="")
> yr <- train.data[,1]; xr <- train.data[,-1]
> xr <- scale(xr) # matrix <- scale(data.frame)
> x.center <- attr(xr, "scaled:center"); x.scale <- attr(xr, "scaled:scale")
> mask <- apply(xr, 2, function(x) any(is.na(x)))
> xr <- xr[,!mask] # rm NA's
> ys <- test.data[,1]; xs <- test.data[,-1]
> xs <- scale(xs, center = x.center, scale = x.scale)
> xs <- xs[,!mask]
> train <- data.frame(y = yr, x = xr)
> test <- data.frame(y = ys, x = xs)
> model <- lm(y~., train)
> cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")
dim(train) = 164 119 ; dim(test) = 164 119
> names(test) <- names(train)
> length(predict(model, test))
Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) :
        subscript out of bounds
>

Ryszard Czerminski phone: (781)994-0479
ArQule, Inc. email:ryszard at arqule.com
19 Presidential Way http://www.arqule.com
Woburn, MA 01801 fax: (781)994-0679

-----Original Message-----
From: Liaw, Andy [mailto:andy_liaw at merck.com]
Sent: Friday, June 28, 2002 8:46 AM
To: 'Czerminski, Ryszard'
Cc: r-help at stat.math.ethz.ch
Subject: RE: [R] problem with predict()

[snip]
I will try. I tried to use lm() to have basic linear "reference". If this is the reason (i.e. train data set is rank-deficient) and lm cannot handle such situation, what are the other packages in R you could recommend which would handle this type of data ? Also in such case: should not lm() report a problem in a model building phase ? So far I have used SVM approach to do regression with this data set and I am getting rather poor r^2 (~0.25 on test set), but I do not have any numerical problems with SVM. I am also planning to try randomForest() to do classification. This was my immediate motivation to turn to R. All the best, R Ryszard Czerminski phone: (781)994-0479 ArQule, Inc. email:ryszard at arqule.com 19 Presidential Way http://www.arqule.com Woburn, MA 01801 fax: (781)994-0679 -----Original Message----- From: ripley at stats.ox.ac.uk [mailto:ripley at stats.ox.ac.uk] Sent: Friday, June 28, 2002 12:39 PM To: Czerminski, Ryszard Cc: r-help at stat.math.ethz.ch Subject: RE: [R] problem with predict() Have you tried the R debugging tools? If not, please make use of them. My guess is that you have a rank-deficient problem. ?debugger ?recover ?dump.frames ... On Fri, 28 Jun 2002, Czerminski, Ryszard wrote:> This time I use the same file for train.data and test.data > throwing in "names(test) <- names(train)" before predict() for double > protection (:-) > and it still fails... > > Is it some weird problem with this particular data set ? or a bug ? 
> (why "subscript out of bounds" ?)That's what the debugging tools are for.> > > rm(list=ls()) > > train.data <- read.csv("train.csv", header = TRUE, row.names = "mol", > comment.char="") > > test.data <- read.csv("train.csv", header = TRUE, row.names = "mol", > comment.char="") > > yr <- train.data[,1]; xr <- train.data[,-1] > > xr <- scale(xr) # matrix <- scale(data.frame) > > x.center <- attr(xr, "scaled:center"); x.scale <- attr(xr,"scaled:scale")> > mask <- apply(xr, 2, function(x) any(is.na(x))) > > xr <- xr[,!mask] # rm NA's > > ys <- test.data[,1]; xs <- test.data[,-1] > > xs <- scale(xs, center = x.center, scale = x.scale) > > xs <- xs[,!mask] > > train <- data.frame(y = yr, x = xr) > > test <- data.frame(y = ys, x = xs) > > model <- lm(y~., train) > > cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n") > dim(train) = 164 119 ; dim(test) = 164 119 > > names(test) <- names(train) > > length(predict(model, test)) > Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) : > subscript out of bounds > > > > Ryszard Czerminski phone: (781)994-0479 > ArQule, Inc. email:ryszard at arqule.com > 19 Presidential Way http://www.arqule.com > Woburn, MA 01801 fax: (781)994-0679 > > > -----Original Message----- > From: Liaw, Andy [mailto:andy_liaw at merck.com] > Sent: Friday, June 28, 2002 8:46 AM > To: 'Czerminski, Ryszard' > Cc: r-help at stat.math.ethz.ch > Subject: RE: [R] problem with predict() > > > You can try: > > names(test) <- names(train) > > before calling predict() to make sure that the variable names match. > Without your data files, it's hard to tell why your first example worked. > > Andy > > > -----Original Message----- > > From: Czerminski, Ryszard [mailto:ryszard at arqule.com] > > Sent: Thursday, June 27, 2002 3:29 PM > > To: 'ripley at stats.ox.ac.uk'; Czerminski, Ryszard > > Cc: r-help at stat.math.ethz.ch > > Subject: RE: [R] problem with predict() > > > > > > > > # Yes. You are *still* using a matrix in a data frame. 
> > Please do read more > > # carefully. > > > > I have read some more R documentation trying to understand difference > > between > > matrices and data frames etc... and I repeat my example this time > > executing EXACTLY the same code with only difference being > > that in one case > > I use smaller data sets ({train,test}-small.csv) and in the > > second case I > > use larger > > data sets ({train,test}.csv) - and I got different behaviour. > > > > Small case (10*4) runs fine, larger case (164*119) fails. > > > > Any ideas why this happens ? > > > > R > > > > > rm(list=ls()) > > > train.data <- read.csv("train-small.csv", header = TRUE, row.names > > "mol", comment.char="") > > > test.data <- read.csv("test-small.csv", header = TRUE, > > row.names = "mol", > > comment.char="") > > > yr <- train.data[,1]; xr <- train.data[,-1] > > > xr <- scale(xr) > > > x.center <- attr(xr, "scaled:center"); x.scale <- attr(xr, > > "scaled:scale") > > > mask <- apply(xr, 2, function(x) any(is.na(x))) > > > xr <- xr[,!mask] # rm NA's > > > ys <- test.data[,1]; xs <- test.data[,-1] > > > xs <- scale(xs, center = x.center, scale = x.scale) > > > xs <- xs[,!mask] > > > train <- data.frame(y = yr, x = xr) > > > test <- data.frame(y = ys, x = xs) > > > model <- lm(y~., train) > > > cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n") > > dim(train) = 10 4 ; dim(test) = 10 4 > > > length(predict(model, test)) > > [1] 10 > > > train.data <- read.csv("train.csv", header = TRUE, > > row.names = "mol", > > comment.char="") > > > test.data <- read.csv("test.csv", header = TRUE, row.names = "mol", > > comment.char="") > > [snip...] > > > cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n") > > dim(train) = 164 119 ; dim(test) = 35 119 > > > length(predict(model, test)) > > Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) : > > subscript out of bounds > > > > > > > Ryszard Czerminski phone: (781)994-0479 > > ArQule, Inc. 
email:ryszard at arqule.com > > 19 Presidential Way http://www.arqule.com > > Woburn, MA 01801 fax: (781)994-0679 > > > > > > -----Original Message----- > > From: ripley at stats.ox.ac.uk [mailto:ripley at stats.ox.ac.uk] > > Sent: Friday, June 21, 2002 3:41 PM > > To: Czerminski, Ryszard > > Cc: r-help at stat.math.ethz.ch > > Subject: RE: [R] problem with predict() > > > > > > On Fri, 21 Jun 2002, Czerminski, Ryszard wrote: > > > > > --- first problem > > > > > > If I store 'simulated' data in data frames: > > > # train.data <- data.frame(matrix(rnorm(164*119), nrow = 164)) > > > # test.data <- data.frame(matrix(rnorm(35*119), nrow = 35)) > > > it still works the same way i.e. the code below works fine > > > for simulated data and fails for 'real' data the only > > > difference being in actual numeric values stored in data > > > structures of the same shape and type. > > > > > > Any suggestions why this happens ? > > > > Yes. You are *still* using a matrix in a data frame. Please > > do read more > > carefully. > > > > > --- second problem > > > > > > > As Andy Liaw pointed out, xr is a matrix. Take a look at > > the names of > > > > train. Hint: they do not contain `x'. > > > > > > Following your hint I am guessing that the fact that names > > do not contain > > > 'x' > > > explains why lm(y~., train) form works and lm(y~x, train) fails > > > and "lm(y~., train)" means roughly: correlate column "y" to > > all other > > colums > > > > No, it means regress y on all the remaining colums in the > > data argument. > > > > > > > > Where I can find more detail specification of this syntax ? > > > In help(lm) I find this paragraph: > > > > > > Models for `lm' are specified symbolically. A typical > > model has > > > the form `response ~ terms' where `response' is the > > (numeric)... > > > > > > which does not quite cover this case. > > > > In any good book on the subject. > > > > -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-. > > -.-.-.-.-.-.-. 
>
> ------------------------------------------------------------------------------
> Notice: This e-mail message, together with any attachments, contains
> information of Merck & Co., Inc. (Whitehouse Station, New Jersey, USA) that
> may be confidential, proprietary copyrighted and/or legally privileged, and
> is intended solely for the use of the individual or entity named on this
> message. If you are not the intended recipient, and have received this
> message in error, please immediately return this by e-mail and then delete
> it.
> ==============================================================================

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
As Prof. Ripley guessed, the X data is less than full rank. I'm surprised
that lm didn't issue warning. summary() does say:

Coefficients: (17 not defined because of singularities)

I'm also surprised that with such a fitted object, predict(model) works, but
not predict(model, data), where data is the original data used to fit the
model. This does not seem to be user-friendly...

Andy

> -----Original Message-----
> From: Czerminski, Ryszard [mailto:ryszard at arqule.com]
> Sent: Friday, June 28, 2002 1:42 PM
> To: 'ripley at stats.ox.ac.uk'; Czerminski, Ryszard
> Cc: r-help at stat.math.ethz.ch; 'Liaw, Andy'
> Subject: RE: [R] problem with predict()
>
>
> I will try.
>
> I tried to use lm() to have basic linear "reference".
>
> If this is the reason (i.e. train data set is rank-deficient) and lm
> cannot handle such situation, what are the other packages in R
> you could recommend which would handle this type of data ?
> Also in such case: should not lm() report a problem in a model building
> phase ?
>
> So far I have used SVM approach to do regression with this data set and I
> am getting rather poor r^2 (~0.25 on test set), but I do not have any
> numerical problems with SVM.
> I am also planning to try randomForest() to do classification. This was my
> immediate motivation to turn to R.
>
> All the best,
>
> R
>
> Ryszard Czerminski         phone: (781)994-0479
> ArQule, Inc.               email:ryszard at arqule.com
> 19 Presidential Way        http://www.arqule.com
> Woburn, MA 01801           fax: (781)994-0679
>
>
> -----Original Message-----
> From: ripley at stats.ox.ac.uk [mailto:ripley at stats.ox.ac.uk]
> Sent: Friday, June 28, 2002 12:39 PM
> To: Czerminski, Ryszard
> Cc: r-help at stat.math.ethz.ch
> Subject: RE: [R] problem with predict()
>
>
> Have you tried the R debugging tools? If not, please make use of them.
> My guess is that you have a rank-deficient problem.
>
> ?debugger
> ?recover
> ?dump.frames
> ...
>
>
> On Fri, 28 Jun 2002, Czerminski, Ryszard wrote:
>
> > This time I use the same file for train.data and test.data
> > throwing in "names(test) <- names(train)" before predict() for double
> > protection (:-)
> > and it still fails...
> >
> > Is it some weird problem with this particular data set ? or a bug ?
> > (why "subscript out of bounds" ?)
>
> That's what the debugging tools are for.
>
> >
> > > rm(list=ls())
> > > train.data <- read.csv("train.csv", header = TRUE, row.names = "mol", comment.char="")
> > > test.data <- read.csv("train.csv", header = TRUE, row.names = "mol", comment.char="")
> > > yr <- train.data[,1]; xr <- train.data[,-1]
> > > xr <- scale(xr) # matrix <- scale(data.frame)
> > > x.center <- attr(xr, "scaled:center"); x.scale <- attr(xr, "scaled:scale")
> > > mask <- apply(xr, 2, function(x) any(is.na(x)))
> > > xr <- xr[,!mask] # rm NA's
> > > ys <- test.data[,1]; xs <- test.data[,-1]
> > > xs <- scale(xs, center = x.center, scale = x.scale)
> > > xs <- xs[,!mask]
> > > train <- data.frame(y = yr, x = xr)
> > > test <- data.frame(y = ys, x = xs)
> > > model <- lm(y~., train)
> > > cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")
> > dim(train) = 164 119 ; dim(test) = 164 119
> > > names(test) <- names(train)
> > > length(predict(model, test))
> > Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) :
> >         subscript out of bounds
> > >
> >
> > Ryszard Czerminski         phone: (781)994-0479
> > ArQule, Inc.               email:ryszard at arqule.com
> > 19 Presidential Way        http://www.arqule.com
> > Woburn, MA 01801           fax: (781)994-0679
> >
> >
> > -----Original Message-----
> > From: Liaw, Andy [mailto:andy_liaw at merck.com]
> > Sent: Friday, June 28, 2002 8:46 AM
> > To: 'Czerminski, Ryszard'
> > Cc: r-help at stat.math.ethz.ch
> > Subject: RE: [R] problem with predict()
> >
> >
> > You can try:
> >
> >     names(test) <- names(train)
> >
> > before calling predict() to make sure that the variable names match.
> > Without your data files, it's hard to tell why your first example worked.
> >
> > Andy
>
> --
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272860 (secr)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
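
For readers following the thread, here is a minimal sketch of how the singularities that summary() reports can be located directly in a fitted lm object. The toy data are an assumption made for illustration (x2 is an exact copy of x1); they are not the poster's train.csv, and the variable names are hypothetical.

```r
# Toy data (assumed, not the thread's CSV files): x2 duplicates x1 exactly,
# so the design matrix is rank-deficient.
set.seed(1)
x1 <- rnorm(50)
x3 <- rnorm(50)
d  <- data.frame(y = rnorm(50), x1 = x1, x2 = x1, x3 = x3)

fit <- lm(y ~ ., data = d)

# Coefficients that lm could not estimate are stored as NA.
aliased <- names(coef(fit))[is.na(coef(fit))]
print(aliased)

# Compare the numerical rank of the fit with the number of coefficients.
deficient <- fit$rank < length(coef(fit))
print(deficient)

# alias() lists the linear dependencies among the model terms.
print(alias(fit))
```

Checking these before calling predict() with new data shows whether the "not defined because of singularities" note in summary() applies to a given fit.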
> From: Thomas Lumley [mailto:tlumley at u.washington.edu]
> Sent: Friday, June 28, 2002 2:17 PM
>
> On Fri, 28 Jun 2002, Liaw, Andy wrote:
>
> > As Prof. Ripley guessed, the X data is less than full rank. I'm surprised
> > that lm didn't issue warning. summary() does say:
> >
> > Coefficients: (17 not defined because of singularities)
> >
> > I'm also surprised that with such a fitted object, predict(model) works,
> > but not predict(model, data), where data is the original data used to fit
> > the model. This does not seem to be user-friendly...
>
> There's something to be said for more warnings, but I think what predict
> does is reasonable.
>
> If you have fitted a model to rank-deficient data then you can predict on
> data with the same rank deficiency. In particular, you can predict on the
> original data, as predict(model) does. However, you can't predict on new
> data unless they have the same rank-deficiencies, and since this is
> something that can't be tested stably it makes sense to refuse to predict
> on new data.

But in that case shouldn't there be a more informative error message? If
that's the desired behavior, I'd check for rank deficiency in the fitted
object, and issue an error if newdata is supplied, rather than waiting for
something to break downstream.

Just my $0.02...
Andy

>
> -thomas
>
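
The check Andy suggests could be sketched roughly as follows. This is only an illustration under stated assumptions: safe_predict is a hypothetical helper name, not part of R, and the toy data (x2 as an exact copy of x1) stand in for the poster's files.

```r
# Hypothetical wrapper (not an R function): refuse to predict on new data
# when the fitted lm object is rank-deficient, with an explicit message.
safe_predict <- function(fit, newdata) {
  if (fit$rank < length(coef(fit))) {
    stop("fit is rank-deficient: ", sum(is.na(coef(fit))),
         " coefficient(s) not estimated; prediction on newdata refused")
  }
  predict(fit, newdata)
}

# Toy data (assumed): x2 is an exact copy of x1.
set.seed(1)
x1 <- rnorm(100)
x3 <- rnorm(100)
train <- data.frame(y = rnorm(100), x1 = x1, x2 = x1, x3 = x3)

deficient.fit <- lm(y ~ ., train)        # uses x1, x2 and x3
full.fit      <- lm(y ~ x1 + x3, train)  # drops the duplicate column

# Full-rank fit: predictions are returned as usual.
n.pred <- length(safe_predict(full.fit, train[1:5, ]))

# Rank-deficient fit: the wrapper stops early; capture the message.
msg <- tryCatch(safe_predict(deficient.fit, train),
                error = function(e) conditionMessage(e))
print(msg)
```

The wrapper errors before any matrix arithmetic is attempted, which is the "fail early with a clear message" behaviour argued for above; whether predict.lm itself should do this is the open question in the thread.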
I should point out that there is (as I thought) nothing wrong with predict.lm
on a rank-degenerate problem, e.g.

    x1 <- rnorm(100)
    x3 <- rnorm(100)
    y <- rnorm(100)
    # x2 is an exact copy of x1, so the design matrix is rank-deficient
    train <- data.frame(y=y, x1=x1, x2=x1, x3=x3)
    fit <- lm(y ~ ., train)
    # fitted values and predictions on the training data agree
    stopifnot(all.equal(predict(fit), predict(fit, train)))

although as Thomas points out a warning would be useful.

The problem here is that model.matrix is (for me) adding 13 duplicate columns
in lm and not in predict.lm. That's a bug unrelated to predict().

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
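
One way to look for the kind of column mismatch described in this message is to compare the design matrix built at fit time with the one rebuilt from the data passed to predict(). The sketch below uses assumed toy data and hypothetical variable names; it does not reproduce the 13 duplicate columns seen with the poster's files, it only shows the comparison.

```r
# Toy data (assumed): same shape of problem as in the message above.
set.seed(1)
x1 <- rnorm(100)
x3 <- rnorm(100)
train <- data.frame(y = rnorm(100), x1 = x1, x2 = x1, x3 = x3)
fit <- lm(y ~ ., train)

# Design matrix as built when the model was fitted.
X.fit <- model.matrix(fit)

# Design matrix rebuilt from "new" data using the fitted terms,
# with the response dropped.
X.new <- model.matrix(delete.response(terms(fit)), train)

# Columns present on one side only would point to a mismatch between
# what lm() saw and what predict() reconstructs.
only.fit <- setdiff(colnames(X.fit), colnames(X.new))
only.new <- setdiff(colnames(X.new), colnames(X.fit))
print(list(only.fit = only.fit, only.new = only.new))
```

If either set is non-empty, or the two matrices have different numbers of columns, the coefficient vector and the new design matrix no longer line up.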