Lars Kunert
2009-Mar-16 22:24 UTC
[R] the effect of blocking on the size of confidence intervals - analysis using aov
Hi,

I am a method developer in drug discovery. I have developed a similarity searching method for small (drug-like) molecules, and I want to compare the performance of my method to the performance of other methods. Similarity searching methods are commonly assessed by their ability to (re)discover a set of molecules that is active against a given target, given one randomly selected query molecule.

Thus my dataset comprises one quantitative response variable, namely "response", and three categorical explanatory variables, namely "method", "target", and "query". The method variable has 4 levels: XXX.pg.trained, XXX.rpg.trained, MACCS, and FTREES. The target variable has 28 levels. The query variable has 84 levels. For every target, three queries are selected; thus "query" is nested in "target". The design is completely balanced.

"response" "method" "target" "query"
0.68 "XXX.pg.trained" "06243" "06243:218379"
0.96 "XXX.pg.trained" "06243" "06243:218384"
0.72 "XXX.pg.trained" "06243" "06243:246192"
0.7696 "XXX.pg.trained" "06245" "06245:235027"
...

For the complete dataset see the attached file data.txt.

Because I am interested in comparing the actual levels of "method", I model "method" as a fixed factor. Because I consider both the levels of "target" and the levels of "query" to be randomly chosen samples from the overall distributions of targets and queries, I model "target" and "query" as random factors. An analysis of variance shows that both (1) the between-target variance and (2) the within-target, between-query variance contribute a large part of the overall variance.
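As a sketch of the design described above (assuming data.txt has the header shown and has been read into a data frame d; the factor conversions are mine, not part of the original analysis), the nesting and balance can be checked with:

```r
## Hypothetical check of the nested, balanced design described above.
d <- read.table("data.txt", header = TRUE)
d$method <- factor(d$method)
d$target <- factor(d$target)
d$query  <- factor(d$query)

nlevels(d$query)    # should be 84 if the design is as described
nlevels(d$target)   # should be 28

## every query label occurs under exactly one target -> query nested in target
all(rowSums(table(d$query, d$target) > 0) == 1)

## a completely balanced design: exactly one observation per method x query cell
all(xtabs(~ method + query, data = d) == 1)
```

The last two expressions should both return TRUE for the design as described.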
Thus, differences between the different levels of "method" should be emphasised by blocking on "target" or "query".

> m.aov = aov(formula = response ~ method + Error(target/query), data = d)
> summary( m.aov )

Error: target
          Df  Sum Sq Mean Sq F value Pr(>F)
Residuals 27 18.8539  0.6983

Error: target:query
          Df  Sum Sq Mean Sq F value Pr(>F)
Residuals 56 3.15927 0.05642

Error: Within
           Df Sum Sq Mean Sq F value   Pr(>F)
method      3 1.0891  0.3630  13.062 5.97e-08 ***
Residuals 249 6.9201  0.0278

I analyze three models:

> m0.aov = aov(formula = response ~ method - 1, data = d)
> m1.aov = aov(formula = response ~ method - 1 + Error(target), data = d)
> m2.aov = aov(formula = response ~ method - 1 + Error(query), data = d)

I understand that the fitted parameters for the different levels of "method" should be identical for all models. However, the confidence intervals of the parameters should become narrower in the order m0, m1, m2. In Maindonald's book "Data Analysis and Graphics Using R" I found a description of how confidence intervals of fitted parameters can be derived from the corresponding standard errors (p. 154). For the m0.aov model the approach is straightforward:

> summary.lm(m0.aov)
...
Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
methodFTREES           0.34163    0.03221   10.61   <2e-16 ***
methodMACCS            0.45525    0.03221   14.13   <2e-16 ***
methodXXX.pg.trained   0.49498    0.03221   15.37   <2e-16 ***
methodXXX.rpg.trained  0.45049    0.03221   13.99   <2e-16 ***
...
Residual standard error: 0.2952 on 332 degrees of freedom
...
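For reference, Maindonald's recipe applied to the summary.lm output above can be sketched in R (estimate, standard error, and residual degrees of freedom are taken directly from the table; the variable names are mine):

```r
## 95% confidence interval for methodFTREES from the m0.aov fit above
est <- 0.34163   # Estimate for methodFTREES from summary.lm(m0.aov)
se  <- 0.03221   # its Std. Error
df  <- 332       # residual degrees of freedom

est + c(-1, 1) * qt(0.975, df) * se
## reproduces the interval (0.2782686, 0.4049914) quoted in the text
```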
Thus the confidence interval for "methodFTREES" is given by:

0.34163 +- qt( 0.975, 332 ) * 0.03221 = ( 0.2782686, 0.4049914 )

However, for m1.aov and m2.aov the call of summary.lm fails:

> summary.lm(m1.aov)
Error in if (p == 0) { : argument is of length zero

And, even more disturbingly, the fixed parameters of both models deviate from the fixed parameters of the m0.aov model:

> coef( m1.aov )
target :
methodFTREES
    1.742346

Within :
         methodFTREES           methodMACCS  methodXXX.pg.trained
         -0.108865476           0.004753571           0.044491667

Note that I added a "-1" to the model in order to exclude the intercept term, so as to get four directly comparable parameters. With the intercept term I get the following result:

> coef(m1.aov)
(Intercept) :
(Intercept)
  0.4355866

target :
numeric(0)

Within :
          methodMACCS  methodXXX.pg.trained methodXXX.rpg.trained
            0.1136190             0.1533571             0.1088655

I have the following questions:

- Is the overall approach valid, i.e. using blocking factors to reduce the residual variance and hence to decrease the width of the confidence intervals?
- Why does the call of summary.lm fail for m1.aov and m2.aov?
- Why are the fitted parameters different for m0.aov compared to m1.aov and m2.aov?

Please note that I made similar attempts using lme and lmer, and failed in both cases. In order to keep the story short, I decided to post my questions concerning lme and lmer in a follow-up mail.

Thanks for your help,
Lars

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: data.txt
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090316/7b7d5845/attachment-0003.txt>