Marian K Talbert
2012-May-10 15:19 UTC
[R] disagreement in loglikelihood and deviace in GLM with weights leads to different models selected using step()
In species distribution modeling where one uses a large sample of background points to capture background variation in presence\pseudo-absence or use\available models (0\1 response) it is frequently recommended that one weight the data so the sum of the absence weights is equal to the sum of presence weights so that the model isn?t swamped by an overwhelming and arbitrary number of background points. I?m trying to do this in R in the standard glm in the stats package and am a bit confused. I understand the issue with noninteger weights in glm and specifying reasonable starting values to ensure convergence to something meaningful and for any individual model fit I can get mostly the same results. For example I have a dataset with 75 presence and 75 absence if instead I had randomly sampled 3150 background points I could set the weights to 1 for presence and 75/3150 for the background points. I simulate this example by repeating my 75 absences 42 times and setting the weights for the absence to 75/3150 and setting mustart equal to the predicted response from the 75/75 model and fitting glms to each data set. The output agrees in the coefficient estimates null and residual deviance but NOT in the AIC or the value returned from logLik() it appears the AIC is calculated from the likelihood though as I understand it these should be directly related to deviance so I?m not sure why these don?t all agree. My ultimate goal is to fit these models and run automatic model selection using the step function. This doesn?t seem to work for two reasons: 1. it uses the logLik value of likelihood and I feel it should be using .5*resid deviance. I'm not quite sure why these are different 2. it doesn?t use the correct penalty because the degrees of freedom are 3225 rather than something reasonable (I adjust this by multiplying by what I assume is a reasonable penalty so if I were using 75/75 I'd use the penalty k=2 but here I use 2*sum(weight)/length(weight) For an individual model fit for example we have For No Repeated observations: Call: glm(formula = as.formula(paste("response", "~", paste(out$dat$used.covs, collapse = "+"))), family = "binomial", data = test.data) Coefficients: (Intercept) asp_2k_alb beetle_yr bio_1 bio_10 bio_11 -3.057e+01 -3.045e-03 4.474e-01 -1.577e-01 -9.599e-01 2.326e+00 bio_12 bio_13 bio_14 bio_15 bio_16 bio_17 2.205e-02 NA 1.461e-01 1.227e-01 -4.705e-02 -8.138e-02 bio_18 bio_19 bio_4 bio_5 bio_6 bio_7 -9.938e-03 9.128e-03 2.713e-03 -5.724e+04 5.725e+04 5.725e+04 bio_8 bio_9 damage_yr dem_2k_alb dist_line_all dist_pts -4.155e-01 -2.874e-01 -6.129e-01 4.910e-03 -5.575e-05 3.965e-06 dist_trail_3 dist_transgov dist_trans_01 dist_util4568 dist_water eastness -1.510e-04 1.158e-05 NA 8.621e-06 2.392e-04 -4.421e-01 northness slp_2k_alb beetle_all1 damage_all1 3.066e-01 2.975e-02 -6.609e+00 1.894e+00 Degrees of Freedom: 149 Total (i.e. Null); 118 Residual Null Deviance: 207.9 Residual Deviance: 112.5 AIC: 176.5 logLik(no.rep) 'log Lik.' -56.23042 (df=32) For repeated observations (of course in a real example these wouldn't be repeated observations but oversampled background points): Call: glm(formula = as.formula(paste("response", "~", paste(out$dat$used.covs, collapse = "+"))), family = "binomial", data = psdAbsDat, weights =train$weight) Coefficients: (Intercept) asp_2k_alb beetle_yr bio_1 bio_10 bio_11 -3.057e+01 -3.045e-03 4.474e-01 -1.577e-01 -9.599e-01 2.326e+00 bio_12 bio_13 bio_14 bio_15 bio_16 bio_17 2.205e-02 NA 1.461e-01 1.227e-01 -4.705e-02 -8.138e-02 bio_18 bio_19 bio_4 bio_5 bio_6 bio_7 -9.938e-03 9.128e-03 2.713e-03 -5.724e+04 5.725e+04 5.725e+04 bio_8 bio_9 damage_yr dem_2k_alb dist_line_all dist_pts -4.155e-01 -2.874e-01 -6.129e-01 4.910e-03 -5.575e-05 3.965e-06 dist_trail_3 dist_transgov dist_trans_01 dist_util4568 dist_water eastness -1.510e-04 1.158e-05 NA 8.621e-06 2.392e-04 -4.421e-01 northness slp_2k_alb beetle_all1 damage_all1 3.066e-01 2.975e-02 -6.609e+00 1.894e+00 Degrees of Freedom: 3224 Total (i.e. Null); 3193 Residual Null Deviance: 207.9 Residual Deviance: 112.5 AIC: 117 logLik(weight) 'log Lik.' -26.51084 (df=32) Now using stepwise with my adjusted penalty for the degrees of freedom and setting mustart equal to the predicted response for the 75/75 model leads to different models being selected I think because the logLik isn't what I would expect For the model with 75 pres/75 abs the stepwise selects 9 predictors Degrees of Freedom: 149 Total (i.e. Null); 141 Residual Null Deviance: 207.9 Residual Deviance: 140.3 AIC: 158.3 For the model with 75/3150 the stepwise selects 19 predictors Degrees of Freedom: 3224 Total (i.e. Null); 3206 Residual Null Deviance: 207.9 Residual Deviance: 123.8 AIC: 97.78 I'm wondering if there is some way I can get what I expect (which I think should be the same answer for both examples though maybe I need my thinking debugged). I've thought about using svyglm in the survey package since technically these are unequally sampled strata but I'm not sure how to do the stepwise model selection without a likelihood being produced. And also I'm curious as to why the likelihood is different for the two examples. Thanks for any help and sorry for the long winded question. I'm using R vesrsion 2.14.2 in case it matters. [[alternative HTML version deleted]]