There is a bug in `predict' whereby the order of variables sometimes gets re-arranged compared to the original fit, and then disaster results. Specifically, the 'variables' and 'predvars' attributes of a 'terms' object get out of synch. This only happens when the terms in the original formula get re-ordered during fitting: test> scrunge.data_ data.frame( contin=1:10, discrete=factor( rep( c( 'cat', 'dog'), 5)), resp=runif( 10)) test> lm.ok_ lm( resp ~ discrete + contin %in% discrete, data=scrunge.data) test> predict( lm.ok, scrunge.data) # no problemo 1 2 3 4 5 6 7 8 9 10 0.29663793 0.04572655 0.42661779 0.31668732 0.55659764 0.58764809 0.68657750 0.85860886 0.81655736 1.12956963 test> lm.bug_ lm( resp ~ contin %in% discrete + discrete, data=scrunge.data) # terms will be re-ordered test> predict( lm.bug, scrunge.data) Error in "contrasts<-"(*tmp*, value = "contr.treatment") : contrasts apply only to factors In addition: Warning message: variable discrete is not a factor in: model.frame.default(object, data, xlev = xlev) This actually turns out to be a bug in `model.frame.default', to do with an inconsistency between `predvars' and `vars' when `model.frame.default' is called inside `predict'. AFAICS it can be fixed by including the commented line below in `model.frame.default': <<...>> vars <- attr(formula, "variables") predvars <- attr(formula, "predvars") if (is.null(predvars)) predvars <- vars varnames <- as.character(predvars[-1]) # MVB: was vars[-1] not predvars[-1] variables <- eval(predvars, data, env) <<...>> This has the side-effect that there are some ugly column names in the model.frame if e.g. a `poly' term is used, but doesn't actually seem to hurt the prediction. However, that doesn't fix it all. There is still a bug in `predict', even after replacing `model.frame.default' with the above: test> predict( lm.bug, scrunge.data) Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) : subscript out of bounds test> # wot??? This time, the bug is in `delete.response', which call `terms' to set most of the attributes including `variables', but adjusts the original `predvars' by hand. Because `terms' returns the variables in a different order when it's called by `predict.lm' to when it was originally called by `lm', things get out of synch. This is a slightly tricky bug to fix, because `predvars' and `variables' can look a bit different e.g. if there are `poly' terms, but I think the following change near the end of `delete.response' does the trick: <<...>> if (length(formula(termobj)) == 3) { # Old code, reliant on maintaining the order of terms: attr(tt, "predvars") <- attr(termobj, "predvars")[-2] reorder <- match( sapply( attr( tt, 'variables'), deparse), sapply( attr( termobj, 'variables'), deparse)) # MVB attr( tt, 'predvars') <- attr( termobj, 'predvars')[ reorder] # MVB } <<...>> cheers Mark ******************************* Mark Bravington CSIRO (CMIS) PO Box 1538 Castray Esplanade Hobart TAS 7001 phone (61) 3 6232 5118 fax (61) 3 6232 5012 Mark.Bravington@csiro.au
You forgot to give the R version, platform etc. This is already fixed in R-devel, and your example works there provided a valid assignment operator is used. It is the same PR#2206, and is marked as fixed in R-bugs. On Wed, 26 Mar 2003 Mark.Bravington@csiro.au wrote:> There is a bug in `predict' whereby the order of variables sometimes gets > re-arranged compared to the original fit, and then disaster results. > Specifically, the 'variables' and 'predvars' attributes of a 'terms' object > get out of synch. This only happens when the terms in the original formula > get re-ordered during fitting: > > test> scrunge.data_ data.frame( contin=1:10, discrete=factor( rep( c( 'cat', > 'dog'), 5)), resp=runif( 10)) > test> lm.ok_ lm( resp ~ discrete + contin %in% discrete, data=scrunge.data) > test> predict( lm.ok, scrunge.data) # no problemo > 1 2 3 4 5 6 7 > 8 9 10 > 0.29663793 0.04572655 0.42661779 0.31668732 0.55659764 0.58764809 0.68657750 > 0.85860886 0.81655736 1.12956963 > > test> lm.bug_ lm( resp ~ contin %in% discrete + discrete, data=scrunge.data) > # terms will be re-ordered > test> predict( lm.bug, scrunge.data) > Error in "contrasts<-"(*tmp*, value = "contr.treatment") : > contrasts apply only to factors > In addition: Warning message: > variable discrete is not a factor in: model.frame.default(object, data, xlev > = xlev) > > This actually turns out to be a bug in `model.frame.default', to do with an > inconsistency between `predvars' and `vars' when `model.frame.default' is > called inside `predict'. AFAICS it can be fixed by including the commented > line below in `model.frame.default': > > <<...>> > vars <- attr(formula, "variables") > predvars <- attr(formula, "predvars") > if (is.null(predvars)) > predvars <- vars > varnames <- as.character(predvars[-1]) # MVB: was vars[-1] not > predvars[-1] > variables <- eval(predvars, data, env) > <<...>> > > This has the side-effect that there are some ugly column names in the > model.frame if e.g. a `poly' term is used, but doesn't actually seem to hurt > the prediction. > > However, that doesn't fix it all. There is still a bug in `predict', even > after replacing `model.frame.default' with the above: > > test> predict( lm.bug, scrunge.data) > Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) : > subscript out of bounds > test> # wot??? > > This time, the bug is in `delete.response', which call `terms' to set most > of the attributes including `variables', but adjusts the original `predvars' > by hand. Because `terms' returns the variables in a different order when > it's called by `predict.lm' to when it was originally called by `lm', things > get out of synch. > > This is a slightly tricky bug to fix, because `predvars' and `variables' can > look a bit different e.g. if there are `poly' terms, but I think the > following change near the end of `delete.response' does the trick: > > <<...>> > if (length(formula(termobj)) == 3) { > # Old code, reliant on maintaining the order of terms: attr(tt, > "predvars") <- attr(termobj, "predvars")[-2] > reorder <- match( sapply( attr( tt, 'variables'), deparse), sapply( > attr( termobj, 'variables'), deparse)) # MVB > attr( tt, 'predvars') <- attr( termobj, 'predvars')[ reorder] # MVB > } > <<...>> > > cheers > Mark > > ******************************* > > Mark Bravington > CSIRO (CMIS) > PO Box 1538 > Castray Esplanade > Hobart > TAS 7001 > > phone (61) 3 6232 5118 > fax (61) 3 6232 5012 > Mark.Bravington@csiro.au > > ______________________________________________ > R-devel@stat.math.ethz.ch mailing list > https://www.stat.math.ethz.ch/mailman/listinfo/r-devel >-- Brian D. Ripley, ripley@stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
( summary: predict now gets round problems caused by re-ordering of terms ) Sorry-- glad it's fixed & surprised I left the info off-- it's in yesterday's other bug reports tho' :). I know that I searched the bug tracking page yesterday to see whether the bugs were already reported before sending them. Didn't find anything yesterday, but there they are today in black & white. Must have done something daft. cheers Mark ******************************* Mark Bravington CSIRO (CMIS) PO Box 1538 Castray Esplanade Hobart TAS 7001 phone (61) 3 6232 5118 fax (61) 3 6232 5012 Mark.Bravington@csiro.au #-----Original Message----- #From: ripley@stats.ox.ac.uk [mailto:ripley@stats.ox.ac.uk] #Sent: Wednesday, 26 March 2003 6:09 PM #To: Mark.Bravington@csiro.au #Cc: r-devel@stat.math.ethz.ch; R-bugs@biostat.ku.dk #Subject: Re: [Rd] predict (PR#2685) # # #You forgot to give the R version, platform etc. # #This is already fixed in R-devel, and your example works there #provided a #valid assignment operator is used. # #It is the same PR#2206, and is marked as fixed in R-bugs. # #On Wed, 26 Mar 2003 Mark.Bravington@csiro.au wrote: # #> There is a bug in `predict' whereby the order of variables #sometimes gets #> re-arranged compared to the original fit, and then disaster results. #> Specifically, the 'variables' and 'predvars' attributes of a #'terms' object #> get out of synch. This only happens when the terms in the #original formula #> get re-ordered during fitting: #> #> test> scrunge.data_ data.frame( contin=1:10, #discrete=factor( rep( c( 'cat', #> 'dog'), 5)), resp=runif( 10)) #> test> lm.ok_ lm( resp ~ discrete + contin %in% discrete, #data=scrunge.data) #> test> predict( lm.ok, scrunge.data) # no problemo #> 1 2 3 4 5 # 6 7 #> 8 9 10 #> 0.29663793 0.04572655 0.42661779 0.31668732 0.55659764 #0.58764809 0.68657750 #> 0.85860886 0.81655736 1.12956963 #> #> test> lm.bug_ lm( resp ~ contin %in% discrete + discrete, #data=scrunge.data) #> # terms will be re-ordered #> test> predict( lm.bug, scrunge.data) #> Error in "contrasts<-"(*tmp*, value = "contr.treatment") : #> contrasts apply only to factors #> In addition: Warning message: #> variable discrete is not a factor in: #model.frame.default(object, data, xlev #> = xlev) #> #> This actually turns out to be a bug in #`model.frame.default', to do with an #> inconsistency between `predvars' and `vars' when #`model.frame.default' is #> called inside `predict'. AFAICS it can be fixed by including #the commented #> line below in `model.frame.default': #> #> <<...>> #> vars <- attr(formula, "variables") #> predvars <- attr(formula, "predvars") #> if (is.null(predvars)) #> predvars <- vars #> varnames <- as.character(predvars[-1]) # MVB: was vars[-1] not #> predvars[-1] #> variables <- eval(predvars, data, env) #> <<...>> #> #> This has the side-effect that there are some ugly column names in the #> model.frame if e.g. a `poly' term is used, but doesn't #actually seem to hurt #> the prediction. #> #> However, that doesn't fix it all. There is still a bug in #`predict', even #> after replacing `model.frame.default' with the above: #> #> test> predict( lm.bug, scrunge.data) #> Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) : #> subscript out of bounds #> test> # wot??? #> #> This time, the bug is in `delete.response', which call #`terms' to set most #> of the attributes including `variables', but adjusts the #original `predvars' #> by hand. Because `terms' returns the variables in a #different order when #> it's called by `predict.lm' to when it was originally called #by `lm', things #> get out of synch. #> #> This is a slightly tricky bug to fix, because `predvars' and #`variables' can #> look a bit different e.g. if there are `poly' terms, but I think the #> following change near the end of `delete.response' does the trick: #> #> <<...>> #> if (length(formula(termobj)) == 3) { #> # Old code, reliant on maintaining the order of terms: attr(tt, #> "predvars") <- attr(termobj, "predvars")[-2] #> reorder <- match( sapply( attr( tt, 'variables'), #deparse), sapply( #> attr( termobj, 'variables'), deparse)) # MVB #> attr( tt, 'predvars') <- attr( termobj, 'predvars')[ #reorder] # MVB #> } #> <<...>> #> #> cheers #> Mark #> #> ******************************* #> #> Mark Bravington #> CSIRO (CMIS) #> PO Box 1538 #> Castray Esplanade #> Hobart #> TAS 7001 #> #> phone (61) 3 6232 5118 #> fax (61) 3 6232 5012 #> Mark.Bravington@csiro.au #> #> ______________________________________________ #> R-devel@stat.math.ethz.ch mailing list #> https://www.stat.math.ethz.ch/mailman/listinfo/r-devel #> # #-- #Brian D. Ripley, ripley@stats.ox.ac.uk #Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ #University of Oxford, Tel: +44 1865 272861 (self) #1 South Parks Road, +44 1865 272866 (PA) #Oxford OX1 3TG, UK Fax: +44 1865 272595 #