# r-bugs@r-project.org

`predict' complains about new factor levels, even if the "new" levels are
merely levels in the original that didn't occur in the original fit and were
sensibly dropped, and that don't occur in the prediction data either. (At
least if `drop.unused.levels' was set to TRUE, which is the default.)

test> scrunge.data.2 <- data.frame( y=runif( 3), disc=factor( c( 'cat', 'dog', 'cat'), levels=c( 'cat', 'dog', 'earwig')))
test> lm.predbug.2 <- lm( y~disc, data=scrunge.data.2)
test> predict(lm.predbug.2, newdata=scrunge.data.2)
Error in model.frame.default(object, data, xlev = xlev) :
        factor disc has new level(s) earwig

A cure for this seems to be to add the commented line below towards the end
of `model.frame.default':

<<...>>
    if (length(xlev) > 0) {
        for (nm in names(xlev)) if (!is.null(xl <- xlev[[nm]])) {
            xi <- data[[nm]]
            if (is.null(nxl <- levels(xi)))
                warning(paste("variable", nm, "is not a factor"))
            else {
                xi <- xi[, drop = TRUE]
                nxl <- levels( xi) # MVB: remove droppees
                if (any(m <- is.na(match(nxl, xl))))
                  stop(paste("factor", nm, "has new level(s)", nxl[m]))
            }
        }
    }
    else if (drop.unused.levels) {
<<...>>

cheers
Mark

*******************************

Mark Bravington
CSIRO (CMIS)
PO Box 1538
Castray Esplanade
Hobart
TAS 7001

phone (61) 3 6232 5118
fax (61) 3 6232 5012
Mark.Bravington@csiro.au

--please do not edit the information below--

Version:
 platform = i386-pc-mingw32
 arch = i386
 os = mingw32
 system = i386, mingw32
 status =
 major = 1
 minor = 6.2
 year = 2003
 month = 01
 day = 10
 language = R

Windows 2000 Professional (build 2195) Service Pack 3.0

Search Path:
 .GlobalEnv, ROOT, package:handy, package:debug, mvb.session.info,
 package:mvbutils, package:tcltk, Autoloads, package:base
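The level check at issue can be looked at in isolation. What follows is only a minimal sketch: `check_levels` and its `drop_unused` argument are hypothetical names made up for this note, not functions or arguments in R, and the body simply mirrors the quoted fragment of `model.frame.default' with and without the proposed extra step.

```r
# Hypothetical stand-alone version of the level check quoted above.
# `check_levels` is NOT part of R; it only mirrors the quoted fragment.
check_levels <- function(xi, xl, drop_unused = FALSE) {
  if (drop_unused) xi <- xi[, drop = TRUE]  # the proposed extra step
  nxl <- levels(xi)
  nxl[is.na(match(nxl, xl))]                # levels the fit does not know
}

# Same factor as in the report: "earwig" is declared but never observed.
disc <- factor(c("cat", "dog", "cat"), levels = c("cat", "dog", "earwig"))

# Levels the fit is assumed to have recorded after dropping unused ones.
xl <- c("cat", "dog")

without_drop <- check_levels(disc, xl)
with_drop    <- check_levels(disc, xl, drop_unused = TRUE)

print(without_drop)  # the declared-but-unused level is reported as new
print(with_drop)     # nothing is reported once unused levels are dropped
```

The sketch only reports the offending levels; the quoted fragment goes on to call `stop' when that set is non-empty.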
This is intentional. The coding for factors is based on the full set of
levels, and should be comparable for different prediction sets.

If you are using factors with fictitious levels the fix is obvious:
improve the design.

On Wed, 26 Mar 2003 Mark.Bravington@csiro.au wrote:

> # r-bugs@r-project.org
>
> `predict' complains about new factor levels, even if the "new" levels are
> merely levels in the original that didn't occur in the original fit and were
> sensibly dropped, and that don't occur in the prediction data either. (At
> least if `drop.unused.levels' was set to TRUE, which is the default.)

Actually, the default is FALSE: see args(model.frame.default). lm and glm
call model.frame.default with non-default args.

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
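Both points in this reply can be checked directly. The snippet below is a small sketch, not part of the original exchange: it reuses the factor from the report with fixed `y' values (an assumption, so the result is reproducible), and `droplevels' is used only to show what the coding looks like on the reduced level set.

```r
# Default value of the argument in model.frame.default itself.
default_drop <- formals(model.frame.default)$drop.unused.levels

# Data as in the report, with fixed y values (assumed for reproducibility).
d <- data.frame(y = c(0.2, 0.6, 0.3),
                disc = factor(c("cat", "dog", "cat"),
                              levels = c("cat", "dog", "earwig")))

# A direct call uses the default and keeps the unused level ...
mf_keep <- model.frame(y ~ disc, data = d)

# ... whereas lm() records only the levels that survive in its model frame.
fit <- lm(y ~ disc, data = d)

# The design-matrix coding follows whichever level set it is given.
cols_full    <- colnames(model.matrix(~ disc, data = d))
cols_dropped <- colnames(model.matrix(~ disc, data = droplevels(d)))

print(default_drop)
print(levels(mf_keep$disc))
print(fit$xlevels$disc)
print(cols_full)
print(cols_dropped)
```

The two column sets differ by one dummy variable, which is the sense in which the coding depends on the level set.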
<Bravington wrote:>
#> `predict' complains about new factor levels, even if the "new" levels are
#> merely levels in the original that didn't occur in the original fit and were
#> sensibly dropped, and that don't occur in the prediction data either.

<Ripley replied:>
#This is intentional. The coding for factors is based on the full set of
#levels, and should be comparable for different prediction sets.
#
#If you are using factors with fictitious levels the fix is obvious:
#improve the design.

There is still an inconsistency bug between `lm' and `predict.lm', though.
`lm' intentionally overlooks inactive levels of a factor, but `predict.lm'
doesn't, even when it legitimately could. In particular, it is a bit odd to
have no problem predicting without a `newdata' argument even when the
original data had inactive factor levels, but then to get an error if
`newdata=<<original data>>' is supplied explicitly! (See example.)

Given that the (IMHO sensible) decision has been taken for `lm' to drop
inactive levels, deliberately so that users don't have to change their
designs when they don't really need to, then surely it's inconsistent for
`predict' not to do the same when it's statistically OK?

[When it's not OK-- i.e. when there are levels in the prediction data that
didn't appear in the fitting data-- the cleanest solution would perhaps be
for `predict' to return NA values and a warning, rather than an error. But
that's a separate issue.]

cheers
Mark

mark.bravington@csiro.au

Slightly expanded example, and suggested fix to `model.frame.default':

test> scrunge.data.2 <- data.frame( y=runif( 3), disc=factor( c( 'cat', 'dog','cat'), levels=c( 'cat', 'dog', 'earwig')))
test> lm.predbug.2 <- lm( y~disc, data=scrunge.data.2)
test> predict( lm.predbug.2) # uses original data
        1         2         3
0.2185388 0.5843139 0.2185388
test> predict(lm.predbug.2, newdata=scrunge.data.2) # newdata = original data
Error in model.frame.default(object, data, xlev = xlev) :
        factor disc has new level(s) earwig

A cure for this seems to be to add the commented line below, towards the end
of `model.frame.default':

<<...>>
    if (length(xlev) > 0) {
        for (nm in names(xlev)) if (!is.null(xl <- xlev[[nm]])) {
            xi <- data[[nm]]
            if (is.null(nxl <- levels(xi)))
                warning(paste("variable", nm, "is not a factor"))
            else {
                xi <- xi[, drop = TRUE]
                nxl <- levels( xi) # MVB: remove droppees
                if (any(m <- is.na(match(nxl, xl))))
                  stop(paste("factor", nm, "has new level(s)", nxl[m]))
            }
        }
    }
    else if (drop.unused.levels) {
<<...>>
On Thu, 27 Mar 2003 Mark.Bravington@csiro.au wrote:

> <Bravington wrote:>
> #> `predict' complains about new factor levels, even if the "new" levels are
> #> merely levels in the original that didn't occur in the original fit and
> #> were sensibly dropped, and that don't occur in the prediction data either.
>
> <Ripley replied:>
> #This is intentional. The coding for factors is based on the full set of
> #levels, and should be comparable for different prediction sets.
> #
> #If you are using factors with fictitious levels the fix is obvious:
> #improve the design.
>
> There is still an inconsistency bug between `lm' and `predict.lm', though.
> `lm' intentionally overlooks inactive levels of a factor, but `predict.lm'

Only if an argument is set, and originally lm did not do so.

> doesn't, even when it legitimately could. In particular, it is a bit odd to
> have no problem predicting without a `newdata' argument even when the
> original data had inactive factor levels, but then to get an error if
> `newdata=<<original data>>' is supplied explicitly! (See example.)

Read again. predict.lm is consistent across its inputs: unlike lm it can
take variable `newdata'. As I said the intention is to be consistent
across *prediction sets*. Omitting newdata is not giving a prediction set.

-- 
Brian D. Ripley, ripley@stats.ox.ac.uk
>> <Bravington wrote:>
>>>> `predict' complains about new factor levels, even if the "new" levels
>>>> are merely levels in the original that didn't occur in the original fit
>>>> and were sensibly dropped, and that don't occur in the prediction data
>>>> either.

>> <Ripley replied:>
>>> This is intentional. The coding for factors is based on the full set of
>>> levels, and should be comparable for different prediction sets.
>>>
>>> If you are using factors with fictitious levels the fix is obvious:
>>> improve the design.

>> <Bravington again:>
>> There is still an inconsistency bug between `lm' and `predict.lm', though.
>> `lm' intentionally overlooks inactive levels of a factor,

> <Ripley again:>
> Only if an argument is set, and originally lm did not do so.

<Bravington again:>
But `lm' always does this now, doesn't it? -- even if it didn't originally.
I think you can't not drop unused levels, even if you wanted to.

>> but `predict.lm' doesn't, even when it legitimately could.
>> In particular, it is a bit odd to
>> have no problem predicting without a `newdata' argument even when the
>> original data had inactive factor levels, but then to get an error if
>> `newdata=<<original data>>' is supplied explicitly! (See example.)
>
> <Ripley:>
> Read again. predict.lm is consistent across its inputs: unlike lm it can
> take variable `newdata'. As I said the intention is to be consistent
> across *prediction sets*. Omitting newdata is not giving a prediction set.

<Bravington again:>
Mmm-- that's getting a bit metaphysical for me-- when is a prediction not a
prediction, and what is ``predict'' actually doing if it is not predicting?!

Anyhow, according to the help page for `predict.lm':

     If the fit is rank-deficient, some of the columns of the design
     matrix will have been dropped. Prediction from such a fit only
     makes sense if `newdata' is contained in the same subspace as the
     original data. That cannot be checked accurately, so a warning is
     issued.

The subspace condition is obviously satisfied if the prediction data is the
same as the original data-- so prediction does "make sense" in that context
according to the documentation (as well as common sense. Normally I am no
fan of slavish adherence to documentation, but in my own interests I'll make
an exception...). And yet there's an error message, not even a warning.

Prediction from the original data was just an example, of course; my general
proposal is that inactive factor levels in the prediction set should be
dropped. I don't see how this could ever cause inconsistent behaviour across
prediction sets-- have I missed something?

cheers
Mark
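The proposal in this message can also be tried from user code, without touching `model.frame.default'. The following is a sketch under stated assumptions: `align_newdata` is a hypothetical helper invented for this note (it is not in R), it relies on the `xlevels' component that `lm' stores in the fit, and the fixed `y' values are made up so that the result is reproducible.

```r
# Hypothetical helper, NOT part of R: drop unused levels in newdata, refuse
# levels the fit has never seen, then re-code each factor on the level set
# stored in the fit.
align_newdata <- function(fit, newdata) {
  for (nm in names(fit$xlevels)) {
    xi <- droplevels(as.factor(newdata[[nm]]))
    unknown <- setdiff(levels(xi), fit$xlevels[[nm]])
    if (length(unknown) > 0)
      stop("factor ", nm, " has new level(s) ",
           paste(unknown, collapse = ", "))
    newdata[[nm]] <- factor(xi, levels = fit$xlevels[[nm]])
  }
  newdata
}

# Data as in the report, with fixed y values (assumed for reproducibility).
d <- data.frame(y = c(0.2, 0.6, 0.3),
                disc = factor(c("cat", "dog", "cat"),
                              levels = c("cat", "dog", "earwig")))
fit <- lm(y ~ disc, data = d)

p_default <- predict(fit)                                   # no newdata
p_aligned <- predict(fit, newdata = align_newdata(fit, d))  # aligned newdata

print(p_default)
print(p_aligned)
```

Because every prediction set is re-coded on the levels stored in the fit, the coding passed to `predict' does not vary with which levels happen to be unused in a particular set. A level that is actually observed in `newdata' but unknown to the fit still stops with an error.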
On Tue, 1 Apr 2003 Mark.Bravington@csiro.au wrote:

> Prediction from the original data was just an example, of course; my general
> proposal is that inactive factor levels in the prediction set should be
> dropped. I don't see how this could ever cause inconsistent behaviour across
> prediction sets-- have I missed something?

Yes, repeatedly: `inactive' depends on the prediction set, and that's not
thought desirable.

-- 
Brian D. Ripley, ripley@stats.ox.ac.uk
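One way to read this last remark is sketched below; it is an illustration added here, not code from either correspondent, and the two prediction sets are invented. Which levels are "inactive" differs between the sets, so dropping them gives each set its own level set and its own integer coding, whereas the full declared level set gives the same coding to both.

```r
all_levels <- c("cat", "dog", "earwig")

# Two invented prediction sets declared on the same full level set.
set_a <- factor(c("cat", "dog"),    levels = all_levels)
set_b <- factor(c("dog", "earwig"), levels = all_levels)

# After dropping, the surviving level sets depend on the prediction set.
levels_a <- levels(droplevels(set_a))
levels_b <- levels(droplevels(set_b))

# Integer code assigned to "dog" in each set, after and before dropping.
code_a_dropped <- as.integer(droplevels(set_a))[set_a == "dog"]
code_b_dropped <- as.integer(droplevels(set_b))[set_b == "dog"]
code_a_full    <- as.integer(set_a)[set_a == "dog"]
code_b_full    <- as.integer(set_b)[set_b == "dog"]

print(levels_a)
print(levels_b)
print(c(code_a_dropped, code_b_dropped))  # codes differ between the sets
print(c(code_a_full, code_b_full))        # codes agree on the full level set
```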