Warnes, Gregory R
2002-Dec-13 21:34 UTC
[R] Surprising results from summary(lm()) on data with NO variation
I have some data (from Affymetrix experiments) where I fit an aov() model to a large number of outcome variables. A reasonable fraction of the outcome variables have 0 variability because values below a cutoff have been replaced with the cutoff value (often 20) . In this case, the overall p-value from summary(lm(..)) is misleadingly small: For example (equivalent results in R 1.6.1 on both Solaris and for Windows 2000): > x <- rep(20, 4+5+5) # no variation > Treatment <- factor(rep( c("Control","Low","High"), c(4,5,5) )) # 3 treatment levels > summary( lm( x ~ Treatment) ) Call: lm(formula = x ~ Treatment) Residuals: Min 1Q Median 3Q Max -1.994e-14 7.889e-31 7.889e-31 7.889e-31 6.647e-15 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.000e+01 3.471e-15 5.762e+15 <2e-16 *** TreatmentHigh 6.647e-15 4.657e-15 1.427e+00 0.181 TreatmentLow 6.647e-15 4.657e-15 1.427e+00 0.181 --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Residual standard error: 6.942e-15 on 11 degrees of freedom Multiple R-Squared: 0.5333, Adjusted R-squared: 0.4485 F-statistic: 6.286 on 2 and 11 DF, p-value: 0.01512 ^^^^^^^^^^^^^^^^ Note F statistic and the overall p-value. For some values (eg 0 and 7), fitting this model will generate NA's for the estimates. For most values, we get similar results. I presume that this is due to numerical problems in the fitting algorithm. Still the result is quite surprising. It appears that summary.aov doesn't have the same problem: > x <- rep(20, 4+5+5) # no variation > Treatment <- factor(rep( c("Control","Low","High"), c(4,5,5) )) # 3 treatment levels > summary( aov( x ~ Treatment) ) Df Sum Sq Mean Sq F value Pr(>F) Treatment 2 1.2622e-28 6.3110e-29 1.3095 0.3089 Residuals 11 5.3011e-28 4.8190e-29 Perhaps the mechanism for computing the F statistic that summary.aov uses is less susceptible to the 'no variability' problem than the computation in summary.lm. I'm posting this here to raise a warning flag. For my particular application, I'll just check for the no variability case and handle it separately. I'll leave it to the R core folks to decide if this is a big enough problem to do something about in the code. -Greg LEGAL NOTICE\ Unless expressly stated otherwise, this message is ... [[dropped]]
ripley@stats.ox.ac.uk
2002-Dec-13 21:50 UTC
[R] Surprising results from summary(lm()) on data with NO variation
It's just rounding errors. We can't tell if you are genuinely interested in studying rounding error or not. In other words, whether something is really constant or has a very small range of variation is not something you can get from the numbers alone. lda tries hard to warn here, but doing that well is expensive. On Fri, 13 Dec 2002, Warnes, Gregory R wrote:> > I have some data (from Affymetrix experiments) where I fit an aov() model to > a large number of outcome variables. A reasonable fraction of the outcome > variables have 0 variability because values below a cutoff have been > replaced with the cutoff value (often 20) . > > In this case, the overall p-value from summary(lm(..)) is misleadingly > small: > > For example (equivalent results in R 1.6.1 on both Solaris and for Windows > 2000): > > > > x <- rep(20, 4+5+5) # no variation > > Treatment <- factor(rep( c("Control","Low","High"), c(4,5,5) )) # > 3 treatment levels > > summary( lm( x ~ Treatment) ) > > Call: > lm(formula = x ~ Treatment) > > Residuals: > Min 1Q Median 3Q Max > -1.994e-14 7.889e-31 7.889e-31 7.889e-31 6.647e-15 > > Coefficients: > Estimate Std. Error t value Pr(>|t|) > (Intercept) 2.000e+01 3.471e-15 5.762e+15 <2e-16 *** > TreatmentHigh 6.647e-15 4.657e-15 1.427e+00 0.181 > TreatmentLow 6.647e-15 4.657e-15 1.427e+00 0.181 > --- > Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 > > Residual standard error: 6.942e-15 on 11 degrees of freedom > Multiple R-Squared: 0.5333, Adjusted R-squared: 0.4485 > F-statistic: 6.286 on 2 and 11 DF, p-value: 0.01512 > ^^^^^^^^^^^^^^^^ > > Note F statistic and the overall p-value. > > For some values (eg 0 and 7), fitting this model will generate NA's for the > estimates. For most values, we get similar results. I presume that this is > due to numerical problems in the fitting algorithm. Still the result is > quite surprising. > > It appears that summary.aov doesn't have the same problem: > > > x <- rep(20, 4+5+5) # no variation > > Treatment <- factor(rep( c("Control","Low","High"), c(4,5,5) )) # > 3 treatment levels > > summary( aov( x ~ Treatment) ) > Df Sum Sq Mean Sq F value Pr(>F) > Treatment 2 1.2622e-28 6.3110e-29 1.3095 0.3089 > Residuals 11 5.3011e-28 4.8190e-29 > > Perhaps the mechanism for computing the F statistic that summary.aov uses is > less susceptible to the 'no variability' problem than the computation in > summary.lm. > > I'm posting this here to raise a warning flag. For my particular > application, I'll just check for the no variability case and handle it > separately. I'll leave it to the R core folks to decide if this is a big > enough problem to do something about in the code. > > -Greg > > > LEGAL NOTICE\ Unless expressly stated otherwise, this message is ... [[dropped]] > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > http://www.stat.math.ethz.ch/mailman/listinfo/r-help >-- Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272860 (secr) Oxford OX1 3TG, UK Fax: +44 1865 272595