plummer@iarc.fr
2000-Apr-12 09:55 UTC
[Rd] For wishlist: sanity checks for subsets in lm, glm (PR#515)
I got stung by this yesterday: R> x <- seq(0,1,length=101) R> y <- x + rnorm(101)/5 R> test.data <- data.frame(x=x, y=y, cond=(x <= 0.5)) R> # A perfect fit! But very implausible parameter estimates. R> glm(y ~ x, data=test.data, subset=cond) Call: glm(formula = y ~ x, data = test.data, subset = cond) Coefficients: (Intercept) x -0.1283 29.5571 Degrees of Freedom: 100 Total (i.e. Null); 99 Residual Null Deviance: 2.206 Residual Deviance: 3.929e-32 AIC: -7477 R> # Here's why ... R> test.data2 <- test.data[test.data$cond, ] R> table(test.data2$x, test.data2$y) -0.128278372512167 0.167292488167292 0 50 0 0.01 0 51 R> # ... I just replicated the first two rows R> test.data[1:2,] x y cond 1 0.00 -0.1282784 TRUE 2 0.01 0.1672925 TRUE Now, I admit that this isn't a bug. I should have used R> test.data <- data.frame(x=x, y=y, cond=I(x <= 0.5)) to preserve cond as a logical vector, as documented in the help for data.frame. However, it is still very unpleasant behaviour, and no warnings were issued. I can't help thinking that some sanity checks on the subset argument could have helped me. So I suggest that the subset argument should either be * A logical vector of the same length as the number of rows in the data frame, or * A numeric vector of unique integers to ensure that we really do get a subset of the data frame. The more flexible behaviour of "[.data.frame" really isn't required here. Martyn --please do not edit the information below-- Version: platform = i686-unknown-linux arch = i686 os = linux system = i686, linux status major = 1 minor = 0.0 year = 2000 month = February day = 29 language = R Search Path: .GlobalEnv, Autoloads, package:base -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Peter Dalgaard BSA
2000-Apr-12 10:12 UTC
[Rd] For wishlist: sanity checks for subsets in lm, glm (PR#515)
plummer@iarc.fr writes:> to preserve cond as a logical vector, as documented in the help for > data.frame. However, it is still very unpleasant behaviour, and no > warnings were issued. I can't help thinking that some sanity checks on > the subset argument could have helped me. > > So I suggest that the subset argument should either be > * A logical vector of the same length as the number of rows in the data > frame, or > * A numeric vector of unique integers > to ensure that we really do get a subset of the data frame. The more > flexible behaviour of "[.data.frame" really isn't required here.Might be a good idea. Mind you, Splus 3.4 does exatly the same thing, and I'm a little worried that the uniqueness assumption might kill some bootstrapping applications: glm(y ~ x, data=test.data, subset=sample(seq(along=y),replace=T)) -- O__ ---- Peter Dalgaard Blegdamsvej 3 c/ /'_ --- Dept. of Biostatistics 2200 Cph. N (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk) FAX: (+45) 35327907 -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._