Marian K Talbert
2012-May-10 15:19 UTC
[R] disagreement in loglikelihood and deviace in GLM with weights leads to different models selected using step()
In species distribution modeling where one uses a large sample of
background points to capture background variation in
presence\pseudo-absence or use\available models (0\1 response) it is
frequently recommended that one weight the data so the sum of the absence
weights is equal to the sum of presence weights so that the model isn?t
swamped by an overwhelming and arbitrary number of background points.
I?m trying to do this in R in the standard glm in the stats package and am
a bit confused. I understand the issue with noninteger weights in glm and
specifying reasonable starting values to ensure convergence to something
meaningful and for any individual model fit I can get mostly the same
results. For example I have a dataset with 75 presence and 75 absence if
instead I had randomly sampled 3150 background points I could set the
weights to 1 for presence and 75/3150 for the background points. I
simulate this example by repeating my 75 absences 42 times and setting the
weights for the absence to 75/3150 and setting mustart equal to the
predicted response from the 75/75 model and fitting glms to each data set.
The output agrees in the coefficient estimates null and residual deviance
but NOT in the AIC or the value returned from logLik() it appears the AIC
is calculated from the likelihood though as I understand it these should
be directly related to deviance so I?m not sure why these don?t all agree.
My ultimate goal is to fit these models and run automatic model selection
using the step function. This doesn?t seem to work for two reasons:
1. it uses the logLik value of likelihood and I feel it should be
using .5*resid deviance. I'm not quite sure why these are different
2. it doesn?t use the correct penalty because the degrees of
freedom are 3225 rather than something reasonable (I adjust this by
multiplying by what I assume is a reasonable penalty so if I were using
75/75 I'd use the penalty k=2 but here I use 2*sum(weight)/length(weight)
For an individual model fit for example we have
For No Repeated observations:
Call: glm(formula = as.formula(paste("response", "~",
paste(out$dat$used.covs,
collapse = "+"))), family = "binomial", data =
test.data)
Coefficients:
(Intercept) asp_2k_alb beetle_yr bio_1 bio_10
bio_11
-3.057e+01 -3.045e-03 4.474e-01 -1.577e-01 -9.599e-01
2.326e+00
bio_12 bio_13 bio_14 bio_15 bio_16
bio_17
2.205e-02 NA 1.461e-01 1.227e-01 -4.705e-02
-8.138e-02
bio_18 bio_19 bio_4 bio_5 bio_6
bio_7
-9.938e-03 9.128e-03 2.713e-03 -5.724e+04 5.725e+04
5.725e+04
bio_8 bio_9 damage_yr dem_2k_alb dist_line_all
dist_pts
-4.155e-01 -2.874e-01 -6.129e-01 4.910e-03 -5.575e-05
3.965e-06
dist_trail_3 dist_transgov dist_trans_01 dist_util4568 dist_water
eastness
-1.510e-04 1.158e-05 NA 8.621e-06 2.392e-04
-4.421e-01
northness slp_2k_alb beetle_all1 damage_all1
3.066e-01 2.975e-02 -6.609e+00 1.894e+00
Degrees of Freedom: 149 Total (i.e. Null); 118 Residual
Null Deviance: 207.9
Residual Deviance: 112.5 AIC: 176.5
logLik(no.rep)
'log Lik.' -56.23042 (df=32)
For repeated observations (of course in a real example these wouldn't be
repeated observations but oversampled background points):
Call: glm(formula = as.formula(paste("response", "~",
paste(out$dat$used.covs,
collapse = "+"))), family = "binomial", data =
psdAbsDat,
weights =train$weight)
Coefficients:
(Intercept) asp_2k_alb beetle_yr bio_1 bio_10
bio_11
-3.057e+01 -3.045e-03 4.474e-01 -1.577e-01 -9.599e-01
2.326e+00
bio_12 bio_13 bio_14 bio_15 bio_16
bio_17
2.205e-02 NA 1.461e-01 1.227e-01 -4.705e-02
-8.138e-02
bio_18 bio_19 bio_4 bio_5 bio_6
bio_7
-9.938e-03 9.128e-03 2.713e-03 -5.724e+04 5.725e+04
5.725e+04
bio_8 bio_9 damage_yr dem_2k_alb dist_line_all
dist_pts
-4.155e-01 -2.874e-01 -6.129e-01 4.910e-03 -5.575e-05
3.965e-06
dist_trail_3 dist_transgov dist_trans_01 dist_util4568 dist_water
eastness
-1.510e-04 1.158e-05 NA 8.621e-06 2.392e-04
-4.421e-01
northness slp_2k_alb beetle_all1 damage_all1
3.066e-01 2.975e-02 -6.609e+00 1.894e+00
Degrees of Freedom: 3224 Total (i.e. Null); 3193 Residual
Null Deviance: 207.9
Residual Deviance: 112.5 AIC: 117
logLik(weight)
'log Lik.' -26.51084 (df=32)
Now using stepwise with my adjusted penalty for the degrees of freedom and
setting mustart equal to the predicted response for the 75/75 model leads
to different models being selected I think because the logLik isn't what I
would expect
For the model with 75 pres/75 abs the stepwise selects 9 predictors
Degrees of Freedom: 149 Total (i.e. Null); 141 Residual
Null Deviance: 207.9
Residual Deviance: 140.3 AIC: 158.3
For the model with 75/3150 the stepwise selects 19 predictors
Degrees of Freedom: 3224 Total (i.e. Null); 3206 Residual
Null Deviance: 207.9
Residual Deviance: 123.8 AIC: 97.78
I'm wondering if there is some way I can get what I expect (which I think
should be the same answer for both examples though maybe I need my
thinking debugged). I've thought about using svyglm in the survey package
since technically these are unequally sampled strata but I'm not sure how
to do the stepwise model selection without a likelihood being produced.
And also I'm curious as to why the likelihood is different for the two
examples.
Thanks for any help and sorry for the long winded question. I'm using R
vesrsion 2.14.2 in case it matters.
[[alternative HTML version deleted]]
