Cormac Long
2010-Sep-15 10:38 UTC
[R] A question on modelling binary response data using factors
Dear all, A question on modelling proportional data in R. I have a test experiment that was designed in a particular way, and which I can analyse "by hand" to an extent. I am really struggling to get R to give me sensible results in modelling it "properly", so must be doing something wrong here. As background, I conduct a series of experiments and count the "hits" and "misses" - I believe this could be modelled using a glm. The experiments are done under a range of conditions, altering three different "factors", lets call them A, B, and C. Factors A and B each have 2 levels (A1,A2,B1,B2) while factor C has 5 levels (C1,C2,C3,C4,C5). The experiment has only partial coverage, that is not every A is tested with every B and every C. However, I was careful in the experimental design to ensure that every A and every B was tested against at least one C. Here is my experimental data: FactorA FactorB FactorC Hit Miss A1 B1 C1 17 83 A1 B1 C2 17 83 A1 B1 C3 18 82 A1 B1 C4 NA NA A1 B1 C5 NA NA A1 B2 C1 11 89 A1 B2 C2 17 83 A1 B2 C3 17 83 A1 B2 C4 NA NA A1 B2 C5 NA NA A2 B1 C1 NA NA A2 B1 C2 NA NA A2 B1 C3 23 77 A2 B1 C4 19 81 A2 B1 C5 29 71 A2 B2 C1 NA NA A2 B2 C2 NA NA A2 B2 C3 13 87 A2 B2 C4 20 80 A2 B2 C5 24 76 I to compare individual rates against the overall rate, and look at difference across pooled rate for each factor Doing hand calculations, it is possible to calculate the individual hit rates, the overall hit rate of the pool, and the standard errors on these using hitrate = hits/(hits+misses) and hitrateerror sqrt((hits+misses)*(hits/(hits+misses))*(misses/(hits+misses)))/(hits+misses) Again doing hand calculations, it is possible to extract the influence of each variant of each factor, by producing a "pool" and comparing the influence of each variant to the pool. For the overall pool, there were 225 hits from 1200 trials, for a pooled hitrate of 18.8%. For factor C the analysis is relatively straightforward, in that we calculate the hitrate for each C1 (28 hits from 200 trials = 14.0%), C2, ... etc. We then calculate the influence of a given factor compared to the pool - for example C1 has a hitrate of 14.0% compared to a pooled hitrate of 18.8%, so has an influence of (14.0%/18.8%)-100% = -25%. For factor A, the analysis is slightly complicated by the fact that factor C3 is the only variant that "fairly" tests factor A (none of factors C1,C2,C4,C5 do not have experiments for both factors A1 and A2). Therefore, just taking the factor C3 data, we can show that the pool of 400 trials had 71 hits, for a hitrate of 17.8%. We can then calculate the hitrate for A1 and A2 as 17.5% and 18.0% respectively. From this we can say that the effect of A1 and A2 was -1.4% and +1.4%, that is to say (17.5%/17.8%)-100% and (18.0/17.8%)-100%. The trick here is that the pooled hitrate for factor A is 17.8%, i.e. the hitrate for factor C3 alone, not 18.8% which is the overall pooled hitrate. However we have taken out the influence of C3 by normalising to the C3 hitrate, i.e. we could calculate the expected hitrate of the combination A2 B2 C1, which was never tested, by combining the influence of A2, B2 and C1 on the overall pool hitrate of 18.8%. It should be possible to model this using e.g. a glm with appropriate contrasts and model, however for the life of me I cannot work out how to get this to work. Getting R to provide a sensible "intercept" to the model, i.e. the overall hitrate of 18.8%, and also get sensible indications of the importance or otherwise of each factor. In particular, I am really struggling to get R to test the relevance of factor A - this is because when tested using all factor C, not just C3, there is a huge bias in that C1 and C2 have comparitively low hit rates, and were only tested against A1, whereas C4 and C5 have comparitively high hitrates, but were only tested against A2. That means that it appears, unless you look at the C3 data only, that factor A is highly important - whereas it clearly is not. The reason for using R to model this "correctly", rather than just doing the hand calculations above, is that I would then like to be able to test for interactions as well (and the significance thereof), and perform rather more complex sets of experiments using the same general ideas. Apologies for the massive post, but I have been banging my head against this one for more than a week now and am absolutely sure there is something simple that I am doing wrong! Yours sincerely, Cormac Long. [[alternative HTML version deleted]]