thr3ads.net - R help - [R] Predicted values from glm() when linear predictor is NA. [Jul 2022]

If this information is useful, please help other people find it:
Share via:

Jeff Newmiller

2022-Jul-28 16:50 UTC

[R] Predicted values from glm() when linear predictor is NA.

Yes, I am familiar with linear algebra. I nod along with you until you get to
"hence" and then you make a leap. The user made the decision (or
accepted the default) that the rank problem be ignored... they have the option
to reduce their model, but they wanted this coefficient (communicated via the
model and the parameter) to behave as if it were zero. The issue is that that
decision is getting hidden behind this additional implementation decision to
exclude certain columns in the model matrix _even after the model has been
solved_. If they try to use the coefficients themselves then they have to apply
this odd interpretation of NA that is embedded in predict.glm but not apparent
in their model instead of using a simple newdata %*% coefs.

It may be easier to implement summary.glm with an NA there as a hint to the rank
deficiency, but at the cost of mathematical inconsistency in the reporting of
coefficients. IMO either the NA should always be presented to the user as if it
were a zero or the rank deficiency should be recorded separately.

On July 28, 2022 8:46:13 AM PDT, John Fox <jfox at mcmaster.ca>
wrote:>Dear Jeff,
>
>On 2022-07-28 11:12 a.m., Jeff Newmiller wrote:
>> No, in this case I think I needed the "obvious" breakdown.
Still digesting, though... I would prefer that if an arbitrary selection had
been made that it be explicit .. the NA should be replaced with zero if the
singular.ok argument is TRUE, rather than making that interpretation in
predict.glm.
>
>That's one way to think about, but another is that the model matrix X
has 10 columns but is of rank 9. Thus 9 basis vectors are needed to span the
column space of X, and a simple way to provide a basis is to eliminate a
redundant column, hence the NA. The fitted values y-hat in a linear model are
the orthogonal projection of y onto the space spanned by the columns of X, and
are thus independent of the basis chosen. A GLM is a little more complicated,
but it's still the column space of X that's important.
>
>Best,
> John
>
>> 
>> On July 28, 2022 5:45:35 AM PDT, John Fox <jfox at mcmaster.ca>
wrote:
>>> Dear Jeff,
>>> 
>>> On 2022-07-28 1:31 a.m., Jeff Newmiller wrote:
>>>> But "disappearing" is not what NA is supposed to do
normally. Why is it being treated that way here?
>>> 
>>> NA has a different meaning here than in data.
>>> 
>>> By default, in glm() the argument singular.ok is TRUE, and so
estimates are provided even when there are singularities, and even though the
singularities are resolved arbitrarily.
>>> 
>>> In this model, the columns of the model matrix labelled LifestageL1
and TrtTime:LifestageL1 are perfectly collinear -- the second is 12 times the
first (both have 0s in the same rows and either 1 or 12 in three of the rows) --
and thus both can't be estimated simultaneously, but the model can be
estimated by eliminating one or the other (effectively setting its coefficient
to 0), or by taking any linear combination of the two regressors (i.e., using
any regressor with 0s and some other value). The fitted values under the model
are invariant with respect to this arbitrary choice.
>>> 
>>> My apologies if I'm stating the obvious and misunderstand your
objection.
>>> 
>>> Best,
>>> John
>>> 
>>>> 
>>>> On July 27, 2022 7:04:20 PM PDT, John Fox <jfox at
mcmaster.ca> wrote:
>>>>> Dear Rolf,
>>>>> 
>>>>> The coefficient of TrtTime:LifestageL1 isn't estimable
(as you explain) and by setting it to NA, glm() effectively removes it from the
model. An equivalent model is therefore
>>>>> 
>>>>>> fit2 <- glm(cbind(Dead,Alive) ~ TrtTime + Lifestage
+
>>>>> +               I((Lifestage == "Egg +
L1")*TrtTime) +
>>>>> +               I((Lifestage == "L1 +
L2")*TrtTime) +
>>>>> +               I((Lifestage == "L3")*TrtTime),
>>>>> +             family=binomial, data=demoDat)
>>>>> Warning message:
>>>>> glm.fit: fitted probabilities numerically 0 or 1 occurred
>>>>> 
>>>>>> cbind(coef(fit, complete=FALSE), coef(fit2))
>>>>>                                    [,1]         [,2]
>>>>> (Intercept)                -0.91718302  -0.91718302
>>>>> TrtTime                     0.88846195   0.88846195
>>>>> LifestageEgg + L1         -45.36420974 -45.36420974
>>>>> LifestageL1                14.27570572  14.27570572
>>>>> LifestageL1 + L2           -0.30332697  -0.30332697
>>>>> LifestageL3                -3.58672631  -3.58672631
>>>>> TrtTime:LifestageEgg + L1   8.10482459   8.10482459
>>>>> TrtTime:LifestageL1 + L2    0.05662651   0.05662651
>>>>> TrtTime:LifestageL3         1.66743472   1.66743472
>>>>> 
>>>>> There is no problem computing fitted values for the model,
specified either way. That the fitted values when Lifestage == "L1"
all round to 1 on the probability scale is coincidental -- that is, a
consequence of the data.
>>>>> 
>>>>> I hope this helps,
>>>>> John
>>>>> 
>>>>> On 2022-07-27 8:26 p.m., Rolf Turner wrote:
>>>>>> 
>>>>>> I have a data frame with a numeric
("TrtTime") and a categorical
>>>>>> ("Lifestage") predictor.
>>>>>> 
>>>>>> Level "L1" of Lifestage occurs only with a
single value of TrtTime,
>>>>>> explicitly 12, whence it is not possible to estimate a
TrtTime "slope"
>>>>>> when Lifestage is "L1".
>>>>>> 
>>>>>> Indeed, when I fitted the model
>>>>>> 
>>>>>>        fit <- glm(cbind(Dead,Alive) ~
TrtTime*Lifestage, family=binomial,
>>>>>>                   data=demoDat)
>>>>>> 
>>>>>> I got:
>>>>>> 
>>>>>>> as.matrix(coef(fit))
>>>>>>>                                      [,1]
>>>>>>> (Intercept)                -0.91718302
>>>>>>> TrtTime                     0.88846195
>>>>>>> LifestageEgg + L1         -45.36420974
>>>>>>> LifestageL1                14.27570572
>>>>>>> LifestageL1 + L2           -0.30332697
>>>>>>> LifestageL3                -3.58672631
>>>>>>> TrtTime:LifestageEgg + L1   8.10482459
>>>>>>> TrtTime:LifestageL1                 NA
>>>>>>> TrtTime:LifestageL1 + L2    0.05662651
>>>>>>> TrtTime:LifestageL3         1.66743472
>>>>>> 
>>>>>> That is, TrtTime:LifestageL1 is NA, as expected.
>>>>>> 
>>>>>> I would have thought that fitted or predicted values
corresponding to
>>>>>> Lifestage = "L1" would thereby be NA, but
this is not the case:
>>>>>> 
>>>>>>> predict(fit)[demoDat$Lifestage=="L1"]
>>>>>>>          26       65      131
>>>>>>> 24.02007 24.02007 24.02007
>>>>>>> 
>>>>>>> fitted(fit)[demoDat$Lifestage=="L1"]
>>>>>>>     26  65 131
>>>>>>>      1   1   1
>>>>>> 
>>>>>> That is, the predicted values on the scale of the
linear predictor are
>>>>>> large and positive, rather than being NA.
>>>>>> 
>>>>>> What this amounts to, it seems to me, is saying that if
the linear
>>>>>> predictor in a Binomial glm is NA, then
"success" is a certainty.
>>>>>> This strikes me as being a dubious proposition.  My gut
feeling is that
>>>>>> misleading results could be produced.
>>>>>> 
>>>>>> Can anyone explain to me a rationale for this behaviour
pattern?
>>>>>> Is there some justification for it that I am not
currently seeing?
>>>>>> Any other comments?  (Please omit comments to the
effect of "You are as
>>>>>> thick as two short planks!". :-) )
>>>>>> 
>>>>>> I have attached the example data set in a file
"demoDat.txt", should
>>>>>> anyone want to experiment with it.  The file was
created using dput() so
>>>>>> you should access it (if you wish to do so) via
something like
>>>>>> 
>>>>>>        demoDat <- dget("demoDat.txt")
>>>>>> 
>>>>>> Thanks for any enlightenment.
>>>>>> 
>>>>>> cheers,
>>>>>> 
>>>>>> Rolf Turner
>>>>>> 
>>>>>> 
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE
and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained,
reproducible code.
>>>> 
>> 
-- 
Sent from my phone. Please excuse my brevity.

Bill Dunlap

2022-Jul-28 18:00 UTC

head link

[R] Predicted values from glm() when linear predictor is NA.

In this example, x has an unknown influence on y (since x doesn't vary)
   > coef(lm(y ~ x, data=data.frame(y=101:105,x=rep(7,5))))
   (Intercept)           x
           103          NA
while in this example x has no influence on y, a very different conclusion
   > coef(lm(y ~ x, data=data.frame(y=rep(103,5),x=1:5)))
    (Intercept)           x
            103           0

Are you arguing that lm's (or glm's) output should have another
component
telling us whether a column was included in the model or not (i.e., if a
coefficient was well defined or not)?  Then coef() would have to copy this
information so it could show the user that (e.g., by printing '-'
instead
of NA).  That might be reasonable.  Using NA for an unknown coefficient
value also seems reasonable and a lot of code currently depends upon it.
Does it need to be documented better?

-Bill


On Thu, Jul 28, 2022 at 9:50 AM Jeff Newmiller <jdnewmil at
dcn.davis.ca.us>
wrote:
> Yes, I am familiar with linear algebra. I nod along with you until you get
> to "hence" and then you make a leap. The user made the decision
(or
> accepted the default) that the rank problem be ignored... they have the
> option to reduce their model, but they wanted this coefficient
> (communicated via the model and the parameter) to behave as if it were
> zero. The issue is that that decision is getting hidden behind this
> additional implementation decision to exclude certain columns in the model
> matrix _even after the model has been solved_. If they try to use the
> coefficients themselves then they have to apply this odd interpretation of
> NA that is embedded in predict.glm but not apparent in their model instead
> of using a simple newdata %*% coefs.
>
> It may be easier to implement summary.glm with an NA there as a hint to
> the rank deficiency, but at the cost of mathematical inconsistency in the
> reporting of coefficients. IMO either the NA should always be presented to
> the user as if it were a zero or the rank deficiency should be recorded
> separately.
>
> On July 28, 2022 8:46:13 AM PDT, John Fox <jfox at mcmaster.ca>
wrote:
> >Dear Jeff,
> >
> >On 2022-07-28 11:12 a.m., Jeff Newmiller wrote:
> >> No, in this case I think I needed the "obvious"
breakdown. Still
> digesting, though... I would prefer that if an arbitrary selection had been
> made that it be explicit .. the NA should be replaced with zero if the
> singular.ok argument is TRUE, rather than making that interpretation in
> predict.glm.
> >
> >That's one way to think about, but another is that the model matrix
X has
> 10 columns but is of rank 9. Thus 9 basis vectors are needed to span the
> column space of X, and a simple way to provide a basis is to eliminate a
> redundant column, hence the NA. The fitted values y-hat in a linear model
> are the orthogonal projection of y onto the space spanned by the columns of
> X, and are thus independent of the basis chosen. A GLM is a little more
> complicated, but it's still the column space of X that's important.
> >
> >Best,
> > John
> >
> >>
> >> On July 28, 2022 5:45:35 AM PDT, John Fox <jfox at
mcmaster.ca> wrote:
> >>> Dear Jeff,
> >>>
> >>> On 2022-07-28 1:31 a.m., Jeff Newmiller wrote:
> >>>> But "disappearing" is not what NA is supposed to
do normally. Why is
> it being treated that way here?
> >>>
> >>> NA has a different meaning here than in data.
> >>>
> >>> By default, in glm() the argument singular.ok is TRUE, and so
> estimates are provided even when there are singularities, and even though
> the singularities are resolved arbitrarily.
> >>>
> >>> In this model, the columns of the model matrix labelled
LifestageL1
> and TrtTime:LifestageL1 are perfectly collinear -- the second is 12 times
> the first (both have 0s in the same rows and either 1 or 12 in three of the
> rows) -- and thus both can't be estimated simultaneously, but the model
can
> be estimated by eliminating one or the other (effectively setting its
> coefficient to 0), or by taking any linear combination of the two
> regressors (i.e., using any regressor with 0s and some other value). The
> fitted values under the model are invariant with respect to this arbitrary
> choice.
> >>>
> >>> My apologies if I'm stating the obvious and misunderstand
your
> objection.
> >>>
> >>> Best,
> >>> John
> >>>
> >>>>
> >>>> On July 27, 2022 7:04:20 PM PDT, John Fox <jfox at
mcmaster.ca> wrote:
> >>>>> Dear Rolf,
> >>>>>
> >>>>> The coefficient of TrtTime:LifestageL1 isn't
estimable (as you
> explain) and by setting it to NA, glm() effectively removes it from the
> model. An equivalent model is therefore
> >>>>>
> >>>>>> fit2 <- glm(cbind(Dead,Alive) ~ TrtTime +
Lifestage +
> >>>>> +               I((Lifestage == "Egg +
L1")*TrtTime) +
> >>>>> +               I((Lifestage == "L1 +
L2")*TrtTime) +
> >>>>> +               I((Lifestage ==
"L3")*TrtTime),
> >>>>> +             family=binomial, data=demoDat)
> >>>>> Warning message:
> >>>>> glm.fit: fitted probabilities numerically 0 or 1
occurred
> >>>>>
> >>>>>> cbind(coef(fit, complete=FALSE), coef(fit2))
> >>>>>                                    [,1]         [,2]
> >>>>> (Intercept)                -0.91718302  -0.91718302
> >>>>> TrtTime                     0.88846195   0.88846195
> >>>>> LifestageEgg + L1         -45.36420974 -45.36420974
> >>>>> LifestageL1                14.27570572  14.27570572
> >>>>> LifestageL1 + L2           -0.30332697  -0.30332697
> >>>>> LifestageL3                -3.58672631  -3.58672631
> >>>>> TrtTime:LifestageEgg + L1   8.10482459   8.10482459
> >>>>> TrtTime:LifestageL1 + L2    0.05662651   0.05662651
> >>>>> TrtTime:LifestageL3         1.66743472   1.66743472
> >>>>>
> >>>>> There is no problem computing fitted values for the
model, specified
> either way. That the fitted values when Lifestage == "L1" all
round to 1 on
> the probability scale is coincidental -- that is, a consequence of the
data.
> >>>>>
> >>>>> I hope this helps,
> >>>>> John
> >>>>>
> >>>>> On 2022-07-27 8:26 p.m., Rolf Turner wrote:
> >>>>>>
> >>>>>> I have a data frame with a numeric
("TrtTime") and a categorical
> >>>>>> ("Lifestage") predictor.
> >>>>>>
> >>>>>> Level "L1" of Lifestage occurs only with
a single value of TrtTime,
> >>>>>> explicitly 12, whence it is not possible to
estimate a TrtTime
> "slope"
> >>>>>> when Lifestage is "L1".
> >>>>>>
> >>>>>> Indeed, when I fitted the model
> >>>>>>
> >>>>>>        fit <- glm(cbind(Dead,Alive) ~
TrtTime*Lifestage,
> family=binomial,
> >>>>>>                   data=demoDat)
> >>>>>>
> >>>>>> I got:
> >>>>>>
> >>>>>>> as.matrix(coef(fit))
> >>>>>>>                                      [,1]
> >>>>>>> (Intercept)                -0.91718302
> >>>>>>> TrtTime                     0.88846195
> >>>>>>> LifestageEgg + L1         -45.36420974
> >>>>>>> LifestageL1                14.27570572
> >>>>>>> LifestageL1 + L2           -0.30332697
> >>>>>>> LifestageL3                -3.58672631
> >>>>>>> TrtTime:LifestageEgg + L1   8.10482459
> >>>>>>> TrtTime:LifestageL1                 NA
> >>>>>>> TrtTime:LifestageL1 + L2    0.05662651
> >>>>>>> TrtTime:LifestageL3         1.66743472
> >>>>>>
> >>>>>> That is, TrtTime:LifestageL1 is NA, as expected.
> >>>>>>
> >>>>>> I would have thought that fitted or predicted
values corresponding
> to
> >>>>>> Lifestage = "L1" would thereby be NA,
but this is not the case:
> >>>>>>
> >>>>>>>
predict(fit)[demoDat$Lifestage=="L1"]
> >>>>>>>          26       65      131
> >>>>>>> 24.02007 24.02007 24.02007
> >>>>>>>
> >>>>>>> fitted(fit)[demoDat$Lifestage=="L1"]
> >>>>>>>     26  65 131
> >>>>>>>      1   1   1
> >>>>>>
> >>>>>> That is, the predicted values on the scale of the
linear predictor
> are
> >>>>>> large and positive, rather than being NA.
> >>>>>>
> >>>>>> What this amounts to, it seems to me, is saying
that if the linear
> >>>>>> predictor in a Binomial glm is NA, then
"success" is a certainty.
> >>>>>> This strikes me as being a dubious proposition. 
My gut feeling is
> that
> >>>>>> misleading results could be produced.
> >>>>>>
> >>>>>> Can anyone explain to me a rationale for this
behaviour pattern?
> >>>>>> Is there some justification for it that I am not
currently seeing?
> >>>>>> Any other comments?  (Please omit comments to the
effect of "You
> are as
> >>>>>> thick as two short planks!". :-) )
> >>>>>>
> >>>>>> I have attached the example data set in a file
"demoDat.txt", should
> >>>>>> anyone want to experiment with it.  The file was
created using
> dput() so
> >>>>>> you should access it (if you wish to do so) via
something like
> >>>>>>
> >>>>>>        demoDat <- dget("demoDat.txt")
> >>>>>>
> >>>>>> Thanks for any enlightenment.
> >>>>>>
> >>>>>> cheers,
> >>>>>>
> >>>>>> Rolf Turner
> >>>>>>
> >>>>>>
> >>>>>> ______________________________________________
> >>>>>> R-help at r-project.org mailing list -- To
UNSUBSCRIBE and more, see
> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> >>>>>> and provide commented, minimal, self-contained,
reproducible code.
> >>>>
> >>
>
> --
> Sent from my phone. Please excuse my brevity.
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
	[[alternative HTML version deleted]]

R help - Jul 2022 - Predicted values from glm() when linear predictor is NA.

[R] Predicted values from glm() when linear predictor is NA.

[R] Predicted values from glm() when linear predictor is NA.