Cormac Long
2010-Sep-15 10:38 UTC
[R] A question on modelling binary response data using factors
Dear all,
A question on modelling proportional data in R. I have a test experiment
that was designed in a particular way, and which I can analyse "by
hand" to
an extent. I am really struggling to get R to give me sensible results in
modelling it "properly", so must be doing something wrong here.
As background, I conduct a series of experiments and count the "hits"
and
"misses" - I believe this could be modelled using a glm. The
experiments
are done under a range of conditions, altering three different
"factors",
lets call them A, B, and C. Factors A and B each have 2 levels
(A1,A2,B1,B2)
while factor C has 5 levels (C1,C2,C3,C4,C5). The experiment has only
partial coverage, that is not every A is tested with every B and every C.
However,
I was careful in the experimental design to ensure that every A and every B
was tested against at least one C.
Here is my experimental data:
FactorA FactorB FactorC Hit Miss
A1 B1 C1 17 83
A1 B1 C2 17 83
A1 B1 C3 18 82
A1 B1 C4 NA NA
A1 B1 C5 NA NA
A1 B2 C1 11 89
A1 B2 C2 17 83
A1 B2 C3 17 83
A1 B2 C4 NA NA
A1 B2 C5 NA NA
A2 B1 C1 NA NA
A2 B1 C2 NA NA
A2 B1 C3 23 77
A2 B1 C4 19 81
A2 B1 C5 29 71
A2 B2 C1 NA NA
A2 B2 C2 NA NA
A2 B2 C3 13 87
A2 B2 C4 20 80
A2 B2 C5 24 76
I to compare individual rates against the overall rate, and look at
difference across pooled rate for each factor
Doing hand calculations, it is possible to calculate the individual hit
rates, the overall hit rate of the pool, and the standard errors on these
using hitrate = hits/(hits+misses) and hitrateerror
sqrt((hits+misses)*(hits/(hits+misses))*(misses/(hits+misses)))/(hits+misses)
Again doing hand calculations, it is possible to extract the influence of
each variant of each factor, by producing a "pool" and comparing the
influence
of each variant to the pool. For the overall pool, there were 225 hits
from 1200 trials, for a pooled hitrate of 18.8%.
For factor C the analysis is relatively straightforward, in that we
calculate the hitrate for each C1 (28 hits from 200 trials = 14.0%), C2, ...
etc. We then
calculate the influence of a given factor compared to the pool - for
example C1 has a hitrate of 14.0% compared to a pooled hitrate of 18.8%, so
has an
influence of (14.0%/18.8%)-100% = -25%.
For factor A, the analysis is slightly complicated by the fact that factor
C3 is the only variant that "fairly" tests factor A
(none of factors C1,C2,C4,C5 do not have experiments for both factors A1
and A2). Therefore, just taking the factor C3 data, we can show that the
pool
of 400 trials had 71 hits, for a hitrate of 17.8%. We can then calculate
the hitrate for A1 and A2 as 17.5% and 18.0% respectively. From this we can
say that the effect of A1 and A2 was -1.4% and +1.4%, that is to say
(17.5%/17.8%)-100% and (18.0/17.8%)-100%.
The trick here is that the pooled hitrate for factor A is 17.8%, i.e. the
hitrate for factor C3 alone, not 18.8% which is the overall pooled hitrate.
However we have taken out the influence of C3 by normalising to the C3
hitrate, i.e. we could calculate the expected hitrate of the combination
A2 B2 C1, which was never tested, by combining the influence of A2, B2 and
C1 on the overall pool hitrate of 18.8%.
It should be possible to model this using e.g. a glm with appropriate
contrasts and model, however for the life of me I cannot work out how to get
this to work. Getting R to provide a sensible "intercept" to the
model,
i.e. the overall hitrate of 18.8%, and also get sensible indications of the
importance or otherwise of each factor.
In particular, I am really struggling to get R to test the relevance of
factor A - this is because when tested using all factor C, not just C3,
there
is a huge bias in that C1 and C2 have comparitively low hit rates, and were
only tested against A1, whereas C4 and C5 have comparitively high hitrates,
but were only tested against A2. That means that it appears, unless you
look at the C3 data only, that factor A is highly important - whereas it
clearly
is not.
The reason for using R to model this "correctly", rather than just
doing
the hand calculations above, is that I would then like to be able to test
for
interactions as well (and the significance thereof), and perform rather
more complex sets of experiments using the same general ideas.
Apologies for the massive post, but I have been banging my head against
this one for more than a week now and am absolutely sure there is something
simple that I am doing wrong!
Yours sincerely,
Cormac Long.
[[alternative HTML version deleted]]
