thr3ads.net - R devel - [Rd] For wishlist: sanity checks for subsets in lm, glm (PR#515) [Apr 2000]

plummer@iarc.fr

2000-Apr-12 09:55 UTC

[Rd] For wishlist: sanity checks for subsets in lm, glm (PR#515)

I got stung by this yesterday:

R> x <- seq(0,1,length=101)
R> y <- x + rnorm(101)/5
R> test.data <- data.frame(x=x, y=y, cond=(x <= 0.5))
R> # A perfect fit! But very implausible parameter estimates.
R> glm(y ~ x, data=test.data, subset=cond)

Call:  glm(formula = y ~ x, data = test.data, subset = cond) 

Coefficients:
(Intercept)            x  
    -0.1283      29.5571  

Degrees of Freedom: 100 Total (i.e. Null);  99 Residual
Null Deviance:      2.206 
Residual Deviance: 3.929e-32    AIC: -7477 
R> # Here's why ...
R> test.data2 <- test.data[test.data$cond, ]
R> table(test.data2$x, test.data2$y)
      
       -0.128278372512167 0.167292488167292
  0                    50                 0
  0.01                  0                51
R> # ... I just replicated the first two rows
R> test.data[1:2,]
     x          y cond
1 0.00 -0.1282784 TRUE
2 0.01  0.1672925 TRUE

Now, I admit that this isn't a bug. I should have used

R> test.data <- data.frame(x=x, y=y, cond=I(x <= 0.5))

to preserve cond as a logical vector, as documented in the help for
data.frame.  However, it is still very unpleasant behaviour, and no
warnings were issued.  I can't help thinking that some sanity checks on
the subset argument could have helped me.
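
The mechanism can be reproduced along these lines. This is only a sketch under one assumption: current versions of R keep a logical column as logical in data.frame(), so the conversion to a factor that the report describes is imitated here with an explicit factor() call. The names cond.factor and codes are illustrative, not from the original report.

```r
# Rebuild the predictor from the report.
x <- seq(0, 1, length.out = 101)

# Assumption: imitate the old data.frame() behaviour by converting the
# logical condition to a factor by hand.
cond.factor <- factor(x <= 0.5)   # levels are "FALSE", "TRUE"

# Used as an index, a factor contributes its integer codes:
# FALSE -> 1, TRUE -> 2.
codes <- as.integer(cond.factor)

# So the "subset" only ever asks for rows 1 and 2, many times over.
table(codes)
```

With these codes, subsetting picks out the first two rows of the data frame repeatedly instead of the rows where the condition holds, which is consistent with the table shown above.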

So I suggest that the subset argument should either be
* A logical vector of the same length as the number of rows in the data
  frame, or
* A numeric vector of unique integers
to ensure that we really do get a subset of the data frame. The more
flexible behaviour of "[.data.frame" really isn't required here.
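
As a rough illustration of the kind of check proposed above, something like the following might do. check.subset is a hypothetical helper, not an existing R function, and this is not how lm or glm actually process their subset argument; it only encodes the two rules from the list.

```r
# Hypothetical helper: accept 'subset' only if it is a logical vector
# with one entry per row, or a numeric vector of unique whole numbers.
check.subset <- function(subset, n) {
  if (is.logical(subset)) {
    if (length(subset) != n)
      stop("logical 'subset' must have one entry per row of the data")
  } else if (is.numeric(subset)) {
    # is.numeric() is FALSE for factors, so a factor falls through
    # to the final branch and is rejected.
    if (anyNA(subset) || any(subset != round(subset)))
      stop("numeric 'subset' must contain whole numbers only")
    if (any(duplicated(subset)))
      stop("numeric 'subset' must not repeat rows")
  } else {
    stop("'subset' must be logical or numeric, not ", class(subset)[1])
  }
  invisible(TRUE)
}

# A factor such as the accidental 'cond' column is refused.
res <- try(check.subset(factor(c(TRUE, FALSE)), 2), silent = TRUE)
inherits(res, "try-error")
```

Under these rules the factor-valued cond column would have produced an error rather than a silently wrong fit.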

Martyn

--please do not edit the information below--

Version:
 platform = i686-unknown-linux
 arch = i686
 os = linux
 system = i686, linux
 status  major = 1
 minor = 0.0
 year = 2000
 month = February
 day = 29
 language = R

Search Path:
 .GlobalEnv, Autoloads, package:base

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To:
r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

Peter Dalgaard BSA

2000-Apr-12 10:12 UTC

[Rd] For wishlist: sanity checks for subsets in lm, glm (PR#515)

plummer@iarc.fr writes:
> to preserve cond as a logical vector, as documented in the help for
> data.frame.  However, it is still very unpleasant behaviour, and no
> warnings were issued.  I can't help thinking that some sanity checks on
> the subset argument could have helped me.
> 
> So I suggest that the subset argument should either be
> * A logical vector of the same length as the number of rows in the data
>   frame, or
> * A numeric vector of unique integers
> to ensure that we really do get a subset of the data frame. The more
> flexible behaviour of "[.data.frame" really isn't required here.

Might be a good idea. Mind you, Splus 3.4 does exactly the same thing,
and I'm a little worried that the uniqueness assumption might kill
some bootstrapping applications:

 glm(y ~ x, data=test.data, subset=sample(seq(along=y),replace=T))
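
If a uniqueness check were adopted, a bootstrap fit like the one above could presumably still be written by resampling the rows of the data frame directly rather than through subset. A minimal sketch, assuming the same simulated data as in the original report (the seed and the names idx and boot.fit are illustrative):

```r
set.seed(1)   # illustrative seed, only to make the sketch repeatable

# Simulated data as in the original report.
x <- seq(0, 1, length.out = 101)
y <- x + rnorm(101) / 5
test.data <- data.frame(x = x, y = y)

# Draw row numbers with replacement and index the data frame itself,
# leaving the 'subset' argument unused.
idx <- sample(seq_len(nrow(test.data)), replace = TRUE)
boot.fit <- glm(y ~ x, data = test.data[idx, ])

coef(boot.fit)
```

This keeps the flexible indexing in "[.data.frame", where repeated rows are intended, and out of subset.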


-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907