thr3ads.net - R help - [R] lm() silently drops NAs [Jul 2016]

If this information is useful, please help other people find it:
Share via:

Martin Maechler

2016-Jul-26 08:24 UTC

[R] lm() silently drops NAs

I have been asked (in private)

    > Hi Martin,

    y <- c(1, 2, 3, NA, 4)
    x <- c(1, 2, 2, 1, 1)

    t.test(y ~ x)
    lm(y ~ x)

    > Normally, most R functions follow the principle that
    > "missings should never silently go missing". Do you have
    > any background on why these functions (and most model/test
    > function in stats, I presume) silently drop missing values?

And I think, the issue and an answer are important enough to be
public, hence this posting to R-help :

First note that in some sense it is not true that lm(y ~ x) silently drops 
NAs: Everybody who is taught about lm() is taught to look at
   summary( fm )  where  fm <- lm(y ~ x)

and that (for the above case)  "says"

 ' (1 observation deleted due to missingness) '

and so is not entirely silent.


This goes all back to the idea of having an 'na.action' argument
which may be a function (a very good idea, "functional
programming" in a double sense!)... which Chambers et al
introduced in The White Book (*1) and which I think to remember 
was quite a revolutionary idea; at least I had liked that very much
once I understood the beauty of passing functions as arguments
to other functions.
One problem already back then has been that we already had the
---much more primitive but often sufficient--- standard of an
'na.rm = FALSE' (i.e. default FALSE) argument.

This has been tought in all good classes/course about statistical
modeling with S and R ever since ... I had hoped ....
(but it seems I was too optimistic, .. or many students have too
 quickly forgotten what they were taught ..)
Notably the white book itself, and the MASS (*2) book do teach
this.. though possibly not loudly enough.

Two more decisions about this were made back then, as well:

  1) The default for na.action to be na.omit  (==> "silently
dropping")

  2) na.action being governed by  options(na.action = ..)

'1)' may have been mostly "historical": I think it had been
the behavior of
other "main stream" statistical packages back then (and now?) *and*
possibly more importantly, notably with the later (than white book =
"S3")
advent of na.replace, you did want to keep the missing in your
data frame, for later analysis; e.g. drawing (-> "gaps" in
plots) so the NA *were* carried along and would not be
forgotten, something very important in most case.s.
   
and '2)' is something I personally no longer like very
much, as it is "killing" the functional paradigm.
OTOH, it has to be said in favor of that "session wide" /
"global" setting
      options(na.action = *)
that indeed it depends on the specific data analysis, or even
the specific *phase* of a data analysis, *what* behavior of NA
treatment is desired.... and at the time it was thought smart
that all methods (also functions "deep down" called by
user-called functtions) would automatically use the "currently
desired" NA handling.

There have been recommendations (I don't know exactly where and
by whom) to always set

   options(na.action = na.fail)

in your global .First() or nowadays rather your  Rprofile, and
I assume that some of the CRAN packages and some of the "teaching
setups" would do that (and if you do that, the lm() and t.test()
calls above give an error). 

Of course, there's much more to say about this, quite a bit of which
has been published in scientific papers and books..

Martin Maechler
ETH Zurich and R Core Team

----------------

*1) "The White Book": Chambers and Hastie (1992):
 https://www.r-project.org/doc/bib/R-books_bib.html#R:Chambers+Hastie:1992

*2) "MASS - the book": Venables and Ripley (2002 (= 4th ed.)):
 https://www.r-project.org/doc/bib/R-books_bib.html#R:Venables+Ripley:2002

Hadley Wickham

2016-Jul-26 20:26 UTC

head link

[R] lm() silently drops NAs

On Tue, Jul 26, 2016 at 3:24 AM, Martin Maechler
<maechler at stat.math.ethz.ch> wrote:> I have been asked (in private)
Martin was very polite to not share my name, but it was me :)
>     > Hi Martin,
>
>     y <- c(1, 2, 3, NA, 4)
>     x <- c(1, 2, 2, 1, 1)
>
>     t.test(y ~ x)
>     lm(y ~ x)
>
>     > Normally, most R functions follow the principle that
>     > "missings should never silently go missing". Do you have
>     > any background on why these functions (and most model/test
>     > function in stats, I presume) silently drop missing values?
>
> And I think, the issue and an answer are important enough to be
> public, hence this posting to R-help :
>
> First note that in some sense it is not true that lm(y ~ x) silently drops
> NAs: Everybody who is taught about lm() is taught to look at
>    summary( fm )  where  fm <- lm(y ~ x)
Good point - unfortunately the message was a bit subtle for me, and
don't remember being taught to look for it :(
> and that (for the above case)  "says"
>
>  ' (1 observation deleted due to missingness) '
>
> and so is not entirely silent.
>
>
> This goes all back to the idea of having an 'na.action' argument
> which may be a function (a very good idea, "functional
> programming" in a double sense!)... which Chambers et al
> introduced in The White Book (*1) and which I think to remember
> was quite a revolutionary idea; at least I had liked that very much
> once I understood the beauty of passing functions as arguments
> to other functions.
> One problem already back then has been that we already had the
> ---much more primitive but often sufficient--- standard of an
> 'na.rm = FALSE' (i.e. default FALSE) argument.
>
> This has been tought in all good classes/course about statistical
> modeling with S and R ever since ... I had hoped ....
> (but it seems I was too optimistic, .. or many students have too
>  quickly forgotten what they were taught ..)
> Notably the white book itself, and the MASS (*2) book do teach
> this.. though possibly not loudly enough.
>
> Two more decisions about this were made back then, as well:
>
>   1) The default for na.action to be na.omit  (==> "silently
dropping")
>
>   2) na.action being governed by  options(na.action = ..)
>
> '1)' may have been mostly "historical": I think it had
been the behavior of
> other "main stream" statistical packages back then (and now?)
*and*
> possibly more importantly, notably with the later (than white book =
"S3")
> advent of na.replace, you did want to keep the missing in your
> data frame, for later analysis; e.g. drawing (-> "gaps" in
> plots) so the NA *were* carried along and would not be
> forgotten, something very important in most case.s.
>
> and '2)' is something I personally no longer like very
> much, as it is "killing" the functional paradigm.
> OTOH, it has to be said in favor of that "session wide" /
"global" setting
>       options(na.action = *)
> that indeed it depends on the specific data analysis, or even
> the specific *phase* of a data analysis, *what* behavior of NA
> treatment is desired.... and at the time it was thought smart
> that all methods (also functions "deep down" called by
> user-called functtions) would automatically use the "currently
> desired" NA handling.
>
> There have been recommendations (I don't know exactly where and
> by whom) to always set
>
>    options(na.action = na.fail)
>
> in your global .First() or nowadays rather your  Rprofile, and
> I assume that some of the CRAN packages and some of the "teaching
> setups" would do that (and if you do that, the lm() and t.test()
> calls above give an error).
I think that's a bit too strict for me, so I wrote my own:

na.warn <- function(object, ...) {
  missing <- complete.cases(object)
  if (any(missing)) {
    warning("Dropping ", sum(missing), " rows with missing
values",
call. = FALSE)
  }

  na.exclude(object, ...)
}

That gives me the behaviour I want:

options(na.action = na.warn)

y <- c(1, 2, 3, NA, 4)
x <- c(1, 2, 2, 1, 1)
mod <- lm(y ~ x)
#> Warning: Dropping 4 rows with missing values

predict(mod)
#>   1   2   3   4   5
#> 2.5 2.5 2.5  NA 2.5
resid(mod)
#>    1    2    3    4    5
#> -1.5 -0.5  0.5   NA  1.5

To me, this would be the most sensible default behaviour, but I
realise it's too late to change without breaking many existing
expectations.

On a related note, I've never really understood why it's called
na.exclude - from my perspective it causes the _inclusion_ of missing
values in the predictions/residuals.

Thanks for the (as always!) informative response, Martin.

Hadley

-- 
http://hadley.nz

Hadley Wickham

2016-Jul-26 20:37 UTC

head link

[R] lm() silently drops NAs

> I think that's a bit too strict for me, so I wrote my own:
>
> na.warn <- function(object, ...) {
>   missing <- complete.cases(object)
>   if (any(missing)) {
>     warning("Dropping ", sum(missing), " rows with missing
values",
> call. = FALSE)
>   }
>
>   na.exclude(object, ...)
> }
That should, of course, have been:

missing <- !complete.cases(object)

:-/

Hadley

-- 
http://hadley.nz

peter dalgaard

2016-Jul-26 21:30 UTC

head link

[R] lm() silently drops NAs

> On 26 Jul 2016, at 22:26 , Hadley Wickham <h.wickham at gmail.com>
wrote:
> 
> On Tue, Jul 26, 2016 at 3:24 AM, Martin Maechler
> <maechler at stat.math.ethz.ch> wrote:
>> 
...> To me, this would be the most sensible default behaviour, but I
> realise it's too late to change without breaking many existing
> expectations.
Probably. 

Re. the default choice, my recollection is that at the time the only choices
available were na.omit and na.fail. S-PLUS was using na.fail for all the usual
good reasons, but from a practical perspective, the consequence was that, since
almost every data set has NA values,  you got an error unless you added
na.action=na.omit to every single lm() call. And habitually typing
na.action=na.omit doesn't really solve any of the issues with different
models being fit to different subsets and all that. So the rationale for doing
it differently in R was that it was better to get some probably meaningful
output rather than to be certain of getting nothing. And, that was what the
mainstream packages of the time were doing.
> On a related note, I've never really understood why it's called
> na.exclude - from my perspective it causes the _inclusion_ of missing
> values in the predictions/residuals.
I think the notion is that you exclude them from the analysis, but keep them
around for the other purposes.

-pd
> 
> Thanks for the (as always!) informative response, Martin.
> 
> Hadley
> 
> -- 
> http://hadley.nz
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com

R help - Jul 2016 - lm() silently drops NAs

[R] lm() silently drops NAs

[R] lm() silently drops NAs

[R] lm() silently drops NAs

[R] lm() silently drops NAs