thr3ads.net - R devel - [Rd] predict (PR#2686) [Mar 2003]

Mark.Bravington@csiro.au

2003-Mar-26 00:31 UTC

[Rd] predict (PR#2686)

#       r-bugs@r-project.org

`predict' complains about new factor levels, even if the "new" levels are
merely levels in the original that didn't occur in the original fit and were
sensibly dropped, and that don't occur in the prediction data either. (At
least if `drop.unused.levels' was set to TRUE, which is the default.)

test> scrunge.data.2 <- data.frame( y=runif( 3), disc=factor( c( 'cat', 'dog', 'cat'),
        levels=c( 'cat', 'dog', 'earwig')))
test> lm.predbug.2 <- lm( y~disc, data=scrunge.data.2)
test> predict(lm.predbug.2, newdata=scrunge.data.2)
Error in model.frame.default(object, data, xlev = xlev) : 
        factor disc has new level(s) earwig


A cure for this seems to be to add the commented line below towards the end
of `model.frame.default':

    <<...>>
    if (length(xlev) > 0) {
        for (nm in names(xlev)) if (!is.null(xl <- xlev[[nm]])) {
            xi <- data[[nm]]
            if (is.null(nxl <- levels(xi))) 
                warning(paste("variable", nm, "is not a factor"))
            else {
                xi <- xi[, drop = TRUE]
                nxl <- levels( xi) # MVB: remove droppees
                if (any(m <- is.na(match(nxl, xl)))) 
                  stop(paste("factor", nm, "has new level(s)", nxl[m]))
            }
        }
    }
    else if (drop.unused.levels) {
    <<...>>
    
cheers
Mark

*******************************

Mark Bravington
CSIRO (CMIS)
PO Box 1538
Castray Esplanade
Hobart
TAS 7001

phone (61) 3 6232 5118
fax (61) 3 6232 5012
Mark.Bravington@csiro.au 

--please do not edit the information below--

Version:
 platform = i386-pc-mingw32
 arch = i386
 os = mingw32
 system = i386, mingw32
 status = 
 major = 1
 minor = 6.2
 year = 2003
 month = 01
 day = 10
 language = R

Windows 2000 Professional (build 2195) Service Pack 3.0

Search Path:
 .GlobalEnv, ROOT, package:handy, package:debug, mvb.session.info,
package:mvbutils, package:tcltk, Autoloads, package:base

ripley@stats.ox.ac.uk

2003-Mar-26 08:20 UTC

[Rd] predict (PR#2686)

This is intentional.  The coding for factors is based on the full set of 
levels, and should be comparable for different prediction sets.

If you are using factors with fictitious levels the fix is obvious: 
improve the design.
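
One way to read "improve the design" is to remove the unused level from the data before fitting, so that the model and any later prediction set agree on the factor's levels. A minimal sketch, assuming a version of R that provides `droplevels()`; the object names and the fixed `y` values are illustrative only:

```r
# Example data: 'earwig' is declared as a level but never observed.
d <- data.frame(
  y    = c(1, 5, 3),
  disc = factor(c("cat", "dog", "cat"), levels = c("cat", "dog", "earwig"))
)

# Drop levels that never occur, so the factor carries only observed levels.
d_clean <- droplevels(d)

fit <- lm(y ~ disc, data = d_clean)

# Predicting on the cleaned data reproduces the fitted values.
p <- predict(fit, newdata = d_clean)
```

This changes the data rather than `predict', so it sidesteps the question of what `predict' itself should do with unused levels.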

On Wed, 26 Mar 2003 Mark.Bravington@csiro.au wrote:
> #       r-bugs@r-project.org
> 
> `predict' complains about new factor levels, even if the "new" levels are
> merely levels in the original that didn't occur in the original fit and were
> sensibly dropped, and that don't occur in the prediction data either. (At
> least if `drop.unused.levels' was set to TRUE, which is the default.)
Actually, the default is FALSE: see args(model.frame.default).  lm and glm
call model.frame.default with non-default args.
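
The difference between the two defaults can be inspected directly. A short sketch, assuming a current version of R; the data frame `d` is made up for illustration:

```r
d <- data.frame(
  y    = c(1, 5, 3),
  disc = factor(c("cat", "dog", "cat"), levels = c("cat", "dog", "earwig"))
)

# Default value of the argument in model.frame.default itself.
default_drop <- formals(model.frame.default)$drop.unused.levels

# Called with defaults, the unused level 'earwig' is kept ...
mf_keep <- model.frame(y ~ disc, data = d)
# ... and it is dropped only when explicitly requested.
mf_drop <- model.frame(y ~ disc, data = d, drop.unused.levels = TRUE)

# lm() records the factor levels it actually used in the fit.
fit_levels <- lm(y ~ disc, data = d)$xlevels$disc
```

Comparing `levels(mf_keep$disc)` with `fit_levels` shows the level that `lm' leaves out.
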
> test> scrunge.data.2 <- data.frame( y=runif( 3), disc=factor( c( 'cat', 'dog', 'cat'),
>         levels=c( 'cat', 'dog', 'earwig')))
> test> lm.predbug.2 <- lm( y~disc, data=scrunge.data.2)
> test> predict(lm.predbug.2, newdata=scrunge.data.2)
> Error in model.frame.default(object, data, xlev = xlev) : 
>         factor disc has new level(s) earwig
> 
> 
> A cure for this seems to be to add the commented line below towards the end
> of `model.frame.default':
> 
>     <<...>>
>     if (length(xlev) > 0) {
>         for (nm in names(xlev)) if (!is.null(xl <- xlev[[nm]])) {
>             xi <- data[[nm]]
>             if (is.null(nxl <- levels(xi))) 
>                 warning(paste("variable", nm, "is not a factor"))
>             else {
>                 xi <- xi[, drop = TRUE]
>                 nxl <- levels( xi) # MVB: remove droppees
>                 if (any(m <- is.na(match(nxl, xl)))) 
>                   stop(paste("factor", nm, "has new level(s)", nxl[m]))
>             }
>         }
>     }
>     else if (drop.unused.levels) {
>     <<...>>
>     
> cheers
> Mark
> 
-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Mark.Bravington@csiro.au

2003-Mar-27 02:59 UTC

[Rd] predict (PR#2686)

<Bravington wrote:>
#> `predict' complains about new factor levels, even if the 
#"new" levels are
#> merely levels in the original that didn't occur in the 
#original fit and were
#> sensibly dropped, and that don't occur in the prediction 
#data either. 

<Ripley replied:>
#This is intentional.  The coding for factors is based on the 
#full set of 
#levels, and should be comparable for different prediction sets.
#
#If you are using factors with fictitious levels the fix is obvious: 
#improve the design.

There is still an inconsistency bug between `lm' and `predict.lm', though.
`lm' intentionally overlooks inactive levels of a factor, but `predict.lm'
doesn't, even when it legitimately could. In particular, it is a bit odd to
have no problem predicting without a `newdata' argument even when the
original data had inactive factor levels, but then to get an error if
`newdata=<<original data>>' is supplied explicitly! (See example.)

Given that the (IMHO sensible) decision has been taken for `lm' to drop
inactive levels, deliberately so that users don't have to change their
designs when they don't really need to, then surely it's inconsistent for
`predict' not to do the same when it's statistically OK?

[When it's not OK-- i.e. when there are levels in the prediction data that
didn't appear in the fitting data-- the cleanest solution would perhaps be
for `predict' to return NA values and a warning, rather than an error. But
that's a separate issue.]
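
The NA-with-warning behaviour suggested above can be sketched as a user-level wrapper. This is only an illustration: `predict_na_new_levels` is a hypothetical helper, not part of R, and it assumes every factor in the model appears in `newdata` as a plain column with the same name as in `fit$xlevels`.

```r
# Hypothetical wrapper: rows whose factor values were not seen in the fit
# get NA (with a warning) instead of stopping with an error.
predict_na_new_levels <- function(fit, newdata) {
  ok <- rep(TRUE, nrow(newdata))
  for (nm in names(fit$xlevels)) {
    ok <- ok & (as.character(newdata[[nm]]) %in% fit$xlevels[[nm]])
  }
  out <- rep(NA_real_, nrow(newdata))
  if (!all(ok)) {
    warning("rows with factor levels not present in the fit are set to NA")
  }
  if (any(ok)) {
    known <- newdata[ok, , drop = FALSE]
    # Re-code each factor on exactly the levels used in the fit.
    for (nm in names(fit$xlevels)) {
      known[[nm]] <- factor(as.character(known[[nm]]),
                            levels = fit$xlevels[[nm]])
    }
    out[ok] <- predict(fit, newdata = known)
  }
  out
}

# Illustrative data: 'earwig' never occurs in the fitting data.
d <- data.frame(y = c(1, 5, 3), disc = factor(c("cat", "dog", "cat")))
fit <- lm(y ~ disc, data = d)
nd <- data.frame(disc = factor(c("cat", "earwig", "dog")))
res <- suppressWarnings(predict_na_new_levels(fit, nd))
```

Here `res` has an NA in the second position, for the unseen level, and ordinary predictions elsewhere.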

cheers
Mark

mark.bravington@csiro.au

Slightly expanded example, and suggested fix to `model.frame.default':

test> scrunge.data.2 <- data.frame( y=runif( 3), disc=factor( c( 'cat', 'dog', 'cat'),
        levels=c( 'cat', 'dog', 'earwig')))
test> lm.predbug.2 <- lm( y~disc, data=scrunge.data.2)

test> predict( lm.predbug.2) # uses original data
         1         2         3 
 0.2185388 0.5843139 0.2185388 

test> predict(lm.predbug.2, newdata=scrunge.data.2) # newdata = original data
Error in model.frame.default(object, data, xlev = xlev) : 
        factor disc has new level(s) earwig


A cure for this seems to be to add the commented line below, towards the end
of `model.frame.default':

    <<...>>
    if (length(xlev) > 0) {
        for (nm in names(xlev)) if (!is.null(xl <- xlev[[nm]])) {
            xi <- data[[nm]]
            if (is.null(nxl <- levels(xi))) 
                warning(paste("variable", nm, "is not a factor"))
            else {
                xi <- xi[, drop = TRUE]
                nxl <- levels( xi) # MVB: remove droppees
                if (any(m <- is.na(match(nxl, xl)))) 
                  stop(paste("factor", nm, "has new level(s)", nxl[m]))
            }
        }
    }
    else if (drop.unused.levels) {
    <<...>>

ripley@stats.ox.ac.uk

2003-Mar-27 08:23 UTC

[Rd] predict (PR#2686)

On Thu, 27 Mar 2003 Mark.Bravington@csiro.au wrote:
> <Bravington wrote:>
> #> `predict' complains about new factor levels, even if the 
> #"new" levels are
> #> merely levels in the original that didn't occur in the 
> #original fit and were
> #> sensibly dropped, and that don't occur in the prediction 
> #data either. 
> 
> <Ripley replied:>
> #This is intentional.  The coding for factors is based on the 
> #full set of 
> #levels, and should be comparable for different prediction sets.
> #
> #If you are using factors with fictitious levels the fix is obvious: 
> #improve the design.
> 
> There is still an inconsistency bug between `lm' and `predict.lm', though.
> `lm' intentionally overlooks inactive levels of a factor, but `predict.lm'

Only if an argument is set, and originally lm did not do so.

> doesn't, even when it legitimately could. In particular, it is a bit odd to
> have no problem predicting without a `newdata' argument even when the
> original data had inactive factor levels, but then to get an error if
> `newdata=<<original data>>' is supplied explicitly! (See example.)

Read again.  predict.lm is consistent across its inputs: unlike lm it can
take variable `newdata'.  As I said the intention is to be consistent
across *prediction sets*.  Omitting newdata is not giving a prediction
set.

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Mark.Bravington@csiro.au

2003-Apr-01 09:49 UTC

[Rd] predict (PR#2686)

>> <Bravington wrote:>
>>>> `predict' complains about new factor levels, even if the "new" levels are
>>>> merely levels in the original that didn't occur in the original fit and were
>>>> sensibly dropped, and that don't occur in the prediction data either.
>> <Ripley replied:>
>>> This is intentional.  The coding for factors is based on the full set of
>>> levels, and should be comparable for different prediction sets.
>>>
>>> If you are using factors with fictitious levels the fix is obvious:
>>> improve the design.
>> <Bravington again:>
>> There is still an inconsistency bug between `lm' and `predict.lm', though.
>> `lm' intentionally overlooks inactive levels of a factor,
> <Ripley again:>
> Only if an argument is set, and originally lm did not do so.
<Bravington again:>
But `lm' always does this now, doesn't it? -- even if it didn't originally.
I think you can't not drop unused levels, even if you wanted to.
>> but `predict.lm' doesn't, even when it legitimately could.
>> In particular, it is a bit odd to
>> have no problem predicting without a `newdata' argument even when the
>> original data had inactive factor levels, but then to get an error if
>> `newdata=<<original data>>' is supplied explicitly! (See example.)
>
> <Ripley:>
> Read again.  predict.lm is consistent across its inputs: unlike lm it can
> take variable `newdata'.  As I said the intention is to be consistent
> across *prediction sets*.  Omitting newdata is not giving a prediction
> set.
<Bravington again:>
Mmm-- that's getting a bit metaphysical for me-- when is a prediction not a
prediction, and what is ``predict'' actually doing if it is not predicting?!


Anyhow, according to the help page for `predict.lm':

     If the fit is rank-deficient, some of the columns of the design
     matrix will have been dropped.  Prediction from such a fit only
     makes sense if `newdata' is contained in the same subspace as the
     original data. That cannot be checked accurately, so a warning is
     issued.
     
The subspace condition is obviously satisfied if the prediction data is the
same as the original data-- so prediction does "make sense" in that context
according to the documentation (as well as common sense. Normally I am no
fan of slavish adherence to documentation, but in my own interests I'll make
an exception...). And yet there's an error message, not even a warning.

Prediction from the original data was just an example, of course; my general
proposal is that inactive factor levels in the prediction set should be
dropped. I don't see how this could ever cause inconsistent behaviour across
prediction sets-- have I missed something?
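
The proposal can be stated as a small standalone check. A minimal sketch: `new_levels` is a hypothetical helper, not the code in `model.frame.default', and it uses `droplevels()` (assumed available) with the same intent as the drop in the suggested patch.

```r
# Hypothetical helper: which levels of factor `x` are absent from `xlev`,
# the levels recorded at fit time? With drop_unused = TRUE, levels of `x`
# that do not actually occur in `x` are ignored first.
new_levels <- function(x, xlev, drop_unused = FALSE) {
  stopifnot(is.factor(x))
  if (drop_unused) x <- droplevels(x)
  nxl <- levels(x)
  nxl[is.na(match(nxl, xlev))]
}

x <- factor(c("cat", "dog", "cat"), levels = c("cat", "dog", "earwig"))
fit_levels <- c("cat", "dog")

strict  <- new_levels(x, fit_levels)                      # declared levels
relaxed <- new_levels(x, fit_levels, drop_unused = TRUE)  # occurring levels
```

With the declared levels, `earwig` is reported as new; after dropping unused levels, nothing is.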

cheers
Mark

*******************************

Mark Bravington
CSIRO (CMIS)
PO Box 1538
Castray Esplanade
Hobart
TAS 7001

phone (61) 3 6232 5118
fax (61) 3 6232 5012
Mark.Bravington@csiro.au

ripley@stats.ox.ac.uk

2003-Apr-01 10:22 UTC

[Rd] predict (PR#2686)

On Tue, 1 Apr 2003 Mark.Bravington@csiro.au wrote:
> Prediction from the original data was just an example, of course; my general
> proposal is that inactive factor levels in the prediction set should be
> dropped. I don't see how this could ever cause inconsistent behaviour across
> prediction sets-- have I missed something?

Yes, repeatedly: `inactive' depends on the prediction set, and that's not
thought desirable.
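
A small sketch of what "depends on the prediction set" can mean in practice, assuming `droplevels()` is available; the two factors below are invented for illustration:

```r
full <- c("cat", "dog", "earwig")

# Two prediction sets declared on the same full set of levels.
new_a <- factor(c("cat", "dog"), levels = full)
new_b <- factor(c("cat", "earwig"), levels = full)

# After dropping unused levels, the surviving levels differ between sets.
lev_a <- levels(droplevels(new_a))
lev_b <- levels(droplevels(new_b))
```

The declared levels are identical for both sets, but which level counts as inactive differs: `earwig` for the first, `dog` for the second.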

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
