Marian K Talbert
2012-May-10  15:19 UTC
[R] disagreement in loglikelihood and deviace in GLM with weights leads to different models selected using step()
In species distribution modeling where one uses a large sample of 
background points to capture background variation in 
presence\pseudo-absence or use\available models (0\1 response) it is 
frequently recommended that one weight the data so the sum of the absence 
weights is equal to the sum of presence weights so that the model isn?t 
swamped by an overwhelming and arbitrary number of background points.   
I?m trying to do this in R in the standard glm in the stats package and am 
a bit confused.  I understand the issue with noninteger weights in glm and 
specifying reasonable starting values to ensure convergence to something 
meaningful and for any individual model fit I can get mostly the same 
results.  For example I have a dataset with 75 presence and 75 absence if 
instead I had randomly sampled 3150 background points I could set the 
weights to 1 for presence and 75/3150 for the background points.  I 
simulate this example by repeating my 75 absences 42 times and setting the 
weights for the absence to 75/3150 and setting mustart equal to the 
predicted response from the 75/75 model and fitting glms to each data set. 
 The output agrees in the coefficient estimates null and residual deviance 
but NOT in the AIC or the value returned from logLik() it appears the AIC 
is calculated from the likelihood though as I understand it these should 
be directly related to deviance so I?m not sure why these don?t all agree. 
 
My ultimate goal is to fit these models and run automatic model selection 
using the step function.  This doesn?t seem to work for two reasons: 
1.        it uses the logLik value of likelihood and I feel it should be 
using .5*resid deviance.  I'm not quite sure why these are different 
2.        it doesn?t use the correct penalty because the degrees of 
freedom are 3225 rather than something reasonable (I adjust this by 
multiplying by what I assume is a reasonable penalty so if I were using 
75/75 I'd use the penalty k=2 but here I use 2*sum(weight)/length(weight) 
For an individual model fit for example we have 
For No Repeated observations: 
Call:  glm(formula = as.formula(paste("response", "~", 
paste(out$dat$used.covs, 
    collapse = "+"))), family = "binomial", data =
test.data)
Coefficients: 
  (Intercept)     asp_2k_alb      beetle_yr          bio_1         bio_10  
      bio_11   
   -3.057e+01     -3.045e-03      4.474e-01     -1.577e-01     -9.599e-01  
   2.326e+00   
       bio_12         bio_13         bio_14         bio_15         bio_16  
      bio_17   
    2.205e-02             NA      1.461e-01      1.227e-01     -4.705e-02  
  -8.138e-02   
       bio_18         bio_19          bio_4          bio_5          bio_6  
       bio_7   
   -9.938e-03      9.128e-03      2.713e-03     -5.724e+04      5.725e+04  
   5.725e+04   
        bio_8          bio_9      damage_yr     dem_2k_alb  dist_line_all  
    dist_pts   
   -4.155e-01     -2.874e-01     -6.129e-01      4.910e-03     -5.575e-05  
   3.965e-06   
 dist_trail_3  dist_transgov  dist_trans_01  dist_util4568     dist_water  
    eastness   
   -1.510e-04      1.158e-05             NA      8.621e-06      2.392e-04  
  -4.421e-01   
    northness     slp_2k_alb    beetle_all1    damage_all1   
    3.066e-01      2.975e-02     -6.609e+00      1.894e+00   
Degrees of Freedom: 149 Total (i.e. Null);  118 Residual 
Null Deviance:      207.9 
Residual Deviance: 112.5        AIC: 176.5 
 logLik(no.rep) 
'log Lik.' -56.23042 (df=32) 
For repeated observations (of course in a real example these wouldn't be 
repeated observations but oversampled background points): 
Call:  glm(formula = as.formula(paste("response", "~", 
paste(out$dat$used.covs, 
    collapse = "+"))), family = "binomial", data =
psdAbsDat,
    weights =train$weight) 
Coefficients: 
  (Intercept)     asp_2k_alb      beetle_yr          bio_1         bio_10  
      bio_11   
   -3.057e+01     -3.045e-03      4.474e-01     -1.577e-01     -9.599e-01  
   2.326e+00   
       bio_12         bio_13         bio_14         bio_15         bio_16  
      bio_17   
    2.205e-02             NA      1.461e-01      1.227e-01     -4.705e-02  
  -8.138e-02   
       bio_18         bio_19          bio_4          bio_5          bio_6  
       bio_7   
   -9.938e-03      9.128e-03      2.713e-03     -5.724e+04      5.725e+04  
   5.725e+04   
        bio_8          bio_9      damage_yr     dem_2k_alb  dist_line_all  
    dist_pts   
   -4.155e-01     -2.874e-01     -6.129e-01      4.910e-03     -5.575e-05  
   3.965e-06   
 dist_trail_3  dist_transgov  dist_trans_01  dist_util4568     dist_water  
    eastness   
   -1.510e-04      1.158e-05             NA      8.621e-06      2.392e-04  
  -4.421e-01   
    northness     slp_2k_alb    beetle_all1    damage_all1   
    3.066e-01      2.975e-02     -6.609e+00      1.894e+00   
Degrees of Freedom: 3224 Total (i.e. Null);  3193 Residual 
Null Deviance:      207.9 
Residual Deviance: 112.5        AIC: 117 
logLik(weight) 
'log Lik.' -26.51084 (df=32) 
Now using stepwise with my adjusted penalty for the degrees of freedom and 
setting mustart equal to the predicted response for the 75/75 model leads 
to different models being selected I think because the logLik isn't what I 
would expect 
For the model with 75 pres/75 abs the stepwise selects 9 predictors 
Degrees of Freedom: 149 Total (i.e. Null);  141 Residual 
Null Deviance:      207.9 
Residual Deviance: 140.3        AIC: 158.3 
For the model with 75/3150 the stepwise selects 19 predictors 
Degrees of Freedom: 3224 Total (i.e. Null);  3206 Residual 
Null Deviance:      207.9 
Residual Deviance: 123.8        AIC: 97.78 
I'm wondering if there is some way I can get what I expect (which I think 
should be the same answer for both examples though maybe I need my 
thinking debugged).  I've thought about using svyglm in the survey package 
since technically these are unequally sampled strata but I'm not sure how 
to do the stepwise model selection without a likelihood being produced. 
And also I'm curious as to why the likelihood is different for the two 
examples.   
Thanks for any help and sorry for the long winded question.  I'm using R 
vesrsion 2.14.2 in case it matters.   
	[[alternative HTML version deleted]]
