Thomas P C Chu
2008-Aug-01 09:00 UTC
[R] drop1() seems to give unexpected results compared to anova()
Dear all,

I have been trying to investigate the behaviour of different weights in weighted regression for a dataset with a lot of missing data. As a start I simulated some data as follows:

library(MASS)
N <- 200
sigma <- matrix(c(1, .5, .5, 1), nrow = 2)
sim.set <- as.data.frame(mvrnorm(N, c(0, 0), sigma))
colnames(sim.set) <- c('x1', 'x2')   # x1 & x2 are correlated
sim.set$x3 <- rnorm(N, 0, 1)         # x3 is independent
sim.set$x4 <- rnorm(N, 0, 1)         # x4 is a red herring
sim.set$y = 4 * sim.set$x1 + 5 * sim.set$x2 + 6 * sim.set$x3   # y is the outcome

I then checked the correlation of my simulated data and fitted a linear regression to check that y = 4 * x1 + 5 * x2 + 6 * x3 indeed:

round(cor(sim.set), 2)
summary(model <- lm(y ~ ., data = sim.set))

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)
(Intercept) -5.423e-17  1.256e-16 -4.320e-01    0.666
x1           4.000e+00  1.388e-16  2.881e+16   <2e-16 ***
x2           5.000e+00  1.441e-16  3.470e+16   <2e-16 ***
x3           6.000e+00  1.188e-16  5.051e+16   <2e-16 ***
x4          -8.150e-18  1.165e-16 -7.000e-02    0.944

anova(model)

           Df    Sum Sq   Mean Sq    F value Pr(>F)
x1          1    8686.1    8686.1 2.8218e+33 <2e-16 ***
x2          1    5568.7    5568.7 1.8091e+33 <2e-16 ***
x3          1    7852.1    7852.1 2.5509e+33 <2e-16 ***
x4          1 1.507e-32 1.507e-32 4.9000e-03 0.9443
Residuals 195 6.002e-28 3.078e-30

All was well so far: x4 was identified as not significant and its coefficient was almost 0 (because I made it so in the first place). I therefore expected it to be dropped in a stepwise search:

step(model, direction = 'both', test = 'F')
drop1(model, test = 'F')
dropterm(model, test = 'F')

       Df Sum of Sq       RSS      AIC    F value     Pr(F)
<none>              6.002e-28 -13585.7
x1      1    2555.1    2555.1    517.5 8.3006e+32 < 2.2e-16 ***
x2      1    3707.0    3707.0    591.9 1.2043e+33 < 2.2e-16 ***
x3      1    7851.9    7851.9    742.0 2.5508e+33 < 2.2e-16 ***
x4      1 2.118e-27 2.718e-27 -13285.6 6.8806e+02 < 2.2e-16 ***

None of those three commands dropped x4, and its P value magically decreased from 0.94 to almost 0! I am also baffled by how R calculated those RSS values. However, if I fitted the smaller model and compared it with the original one by hand, I got the expected answer:

summary(model2 <- lm(y ~ x1 + x2 + x3, data = sim.set))
anova(model, model2, test = 'F')

Model 1: y ~ x1 + x2 + x3 + x4
Model 2: y ~ x1 + x2 + x3
  Res.Df        RSS Df   Sum of Sq      F Pr(>F)
1    195 6.0025e-28
2    196 6.0026e-28 -1 -2.0000e-32 0.0049 0.9443

Interestingly, if I started from a null model instead, I ended up with y = 4 * x1 + 5 * x2 + 6 * x3, and R did not add x4 to the model, as expected:

summary(model1 <- lm(y ~ 1, data = sim.set))
step(model1, direction = 'both', scope = .~. + x1 + x2 + x3 + x4, test = 'F')
add1(model1, scope = .~. + x1 + x2 + x3 + x4, test = 'F')
addterm(model1, scope = .~. + x1 + x2 + x3 + x4, test = 'F')

       Df Sum of Sq     RSS   AIC  F value  Pr(F)
<none>            22107.0 943.1
x1      1    8686.1 13420.8 845.2 128.1478 <2e-16 ***
x2      1   11658.7 10448.3 795.2 220.9377 <2e-16 ***
x3      1   11045.4 11061.6 806.6 197.7096 <2e-16 ***
x4      1      13.4 22093.6 944.9   0.1199 0.7295

I'm not sure what is going on. I am running R 2.7.1 on Ubuntu Linux, with all components up to date. Thank you in advance for all your thoughts and replies.

Yours sincerely,
Thomas P C Chu

University of Leicester
LE1 7RH
UK
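For reference, the "by hand" comparison above is just the usual partial F test for dropping x4. A minimal sketch of computing it directly from the two residual sums of squares, assuming the model and model2 objects defined in the code above:

## Partial F test for dropping x4, built from the two RSS values
rss.full    <- deviance(model)     # RSS of y ~ x1 + x2 + x3 + x4
rss.reduced <- deviance(model2)    # RSS of y ~ x1 + x2 + x3
df.resid    <- df.residual(model)  # 195 in this example
f.stat <- ((rss.reduced - rss.full) / 1) / (rss.full / df.resid)
p.val  <- pf(f.stat, 1, df.resid, lower.tail = FALSE)
c(F = f.stat, p = p.val)           # should agree with the anova(model, model2) row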
Thomas P C Chu
2008-Aug-01 14:18 UTC
[R] drop1() seems to give unexpected results compared to anova()
Interestingly, if I fitted the model using glm() rather than lm(), drop1() behaved as expected:

summary(model.glm <- glm(y ~ ., data = sim.set, family = 'gaussian'))
summary(model.lm <- lm(y ~ ., data = sim.set))

drop1(model.glm, test = 'F')
drop1(model.lm, test = 'F')

model.glm <- step(model.glm, direction = 'both', test = 'F')
model.lm <- step(model.lm, direction = 'both', test = 'F')

summary(model.glm)
summary(model.lm)

It is debatable whether one should use glm(family = 'gaussian') rather than lm() for fitting models with normally distributed residuals. This raises the suspicion that there could be a bug in drop1() and in step(), which I think calls add1() and drop1() repeatedly.

Thomas P C Chu
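One way to check how close the two fits actually are is to compare their coefficients, residual sums of squares and fitted values directly. A minimal sketch, assuming the model.lm and model.glm objects from the code above:

## Compare the lm() and glm(family = 'gaussian') fits numerically
all.equal(coef(model.lm), coef(model.glm))             # estimated coefficients
c(lm = deviance(model.lm), glm = deviance(model.glm))  # residual sums of squares
max(abs(fitted(model.lm) - fitted(model.glm)))         # largest difference in fitted values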
Jeroen Ooms
2008-Aug-01 16:00 UTC
[R] drop1() seems to give unexpected results compared to anova()
Thomas Chu wrote:
> None of those three commands dropped x4, and its P value
> magically decreased from 0.94 to almost 0! I am also baffled by how R
> calculated those RSS values.

Maybe it is using a different type of SS. If I have an lm() model and I do:

options(contrasts = c("contr.sum", "contr.poly"))
anovafit <- drop1(model, attributes(model$terms)$term.labels, test = "F")

then I get identical SS, F and p values as in SPSS. Maybe this post is helpful:
http://www.nabble.com/set-type-of-SS-in-anova()-to18287076.html#a18287076

Also check out the post from 2008-05-24, "Obtaining the same ANOVA results in R as in SPSS - the difficulties with Type II and Type III sums of squares", on this blog:
http://myowelt.blogspot.com/
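To make the distinction concrete, here is a small self-contained sketch of sequential versus drop-one sums of squares, using the built-in mtcars data purely as a hypothetical example (it is not from the original thread); the contrasts setting only matters once factors are involved:

## Sequential (Type I) SS vs marginal, drop-one SS
options(contrasts = c("contr.sum", "contr.poly"))  # as suggested above
fit <- lm(mpg ~ factor(cyl) + wt, data = mtcars)   # hypothetical example model
anova(fit)              # sequential SS: each term adjusted only for earlier terms
drop1(fit, test = "F")  # drop-one SS: each term adjusted for all other terms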
Thomas P C Chu
2008-Aug-02 08:27 UTC
[R] drop1() seems to give unexpected results compared to anova()
I am not sure why my messages are not being threaded together.

Thank you to the author of this post:
https://stat.ethz.ch/pipermail/r-help/2008-August/169691.html

I have tried the suggestions, but I got the same results as in my original query:
https://stat.ethz.ch/pipermail/r-help/2008-August/169647.html

I have considered the issue of partial versus sequential sums of squares. Given that the variable x4 (the red herring) entered the model last in the sequence, I thought the partial and sequential SS ought to be numerically the same.

However, I later found out that using glm() instead of lm() gave the expected results. See:
https://stat.ethz.ch/pipermail/r-help/2008-August/169673.html

I have read that glm() uses iteratively reweighted least squares (which I think is related to maximum likelihood) for fitting, whereas lm() uses a matrix (QR) decomposition. I do expect slightly different answers from these two functions, but not ones that are so far apart! I can use glm() as a workaround for now, but I just want to make sure there are no bugs in drop1(). Hopefully more people can give their opinions on whether there is a bug.

Thomas P C Chu
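As a check of that reasoning, a minimal sketch on well-behaved data (the built-in mtcars set again, used only for illustration) showing that for the term fitted last, the sequential SS from anova() does equal the drop-one SS from drop1():

## For the last term entered, sequential SS equals partial SS
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)  # hypothetical example model
anova(fit)["qsec", "Sum Sq"]                    # sequential SS for the last term
drop1(fit, test = "F")["qsec", "Sum of Sq"]     # drop-one SS for the same term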
Peter Dalgaard
2008-Aug-02 23:13 UTC
[R] drop1() seems to give unexpected results compared to anova()
Thomas P C Chu wrote:
> Dear all,
>
> I have been trying to investigate the behaviour of different weights
> in weighted regression for a dataset with a lot of missing data. As a
> start I simulated some data as follows:
>
> library(MASS)
> N <- 200
> sigma <- matrix(c(1, .5, .5, 1), nrow = 2)
> sim.set <- as.data.frame(mvrnorm(N, c(0, 0), sigma))
> colnames(sim.set) <- c('x1', 'x2')   # x1 & x2 are correlated
> sim.set$x3 <- rnorm(N, 0, 1)         # x3 is independent
> sim.set$x4 <- rnorm(N, 0, 1)         # x4 is a red herring
> sim.set$y = 4 * sim.set$x1 + 5 * sim.set$x2 + 6 * sim.set$x3   # y is the outcome

This is your problem: There is no noise term in there. As a result, F and t tests are basically 0/0 except for errors introduced by roundoff. These errors can affect numerators and denominators differently when different numerical methods are used, and it is really a toss-up whether this results in a significant test or not.

> [...]

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
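A minimal sketch of how to see this point directly, assuming the sim.set and model objects from the original post: the residuals of the full fit are pure rounding error, so any F ratio built from the residual sum of squares is numerically meaningless.

## The fit is exact up to floating-point error, so the RSS is pure roundoff
max(abs(residuals(model)))  # on the order of 1e-14 or smaller
deviance(model)             # RSS around 1e-28, i.e. (roundoff)^2
.Machine$double.eps         # about 2.2e-16, the relative machine precision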
Thomas P C Chu
2008-Aug-03 13:53 UTC
[R] drop1() seems to give unexpected results compared to anova()
Thanks to Mr Dalgaard for his advice, and to everyone else who has contributed. Including an error term at the end of the sim.set$y = ... line did cure my problems with drop1() and step(). I suppose it was my own inexperience in carrying out simulations that caused such a gaffe.

Thomas
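For completeness, a minimal sketch of the corrected simulation line; the error standard deviation of 1 is an arbitrary choice here, since the thread does not say what was actually used:

## Corrected outcome with an explicit noise term (sd = 1 chosen arbitrarily)
sim.set$y <- 4 * sim.set$x1 + 5 * sim.set$x2 + 6 * sim.set$x3 + rnorm(N, 0, 1)
model <- lm(y ~ ., data = sim.set)
drop1(model, test = 'F')                      # x4 should now have a large p-value
step(model, direction = 'both', test = 'F')   # and be dropped by the stepwise search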