eugen pircalabelu
2008-Sep-14 15:29 UTC
[R] Question on glm.nb vs zeroinfl vs hurdle models
Good afternoon,

I need advice on the proper use of glm.nb, zeroinfl, and hurdle with my data frame. I cannot provide a self-contained example, since I need advice on this particular dataset and its "contradictory" results.

My dataset contains 1309 cases and 11 variables. The variables are highly right-skewed and heavily zero-inflated: over 1100 cases have a value of 0, for both the dependent and the independent variables (e.g., variable A has 1220 cases with value 0, variable B has 1283 cases with value 0, and so on).

I fitted three models (glm.nb, zeroinfl, and hurdle) and expected similar results and similar conclusions. What was similar was the log-likelihood (very close for all three models) and the number of predicted zeros (identical for each model). What surprised me were the following results:

- glm.nb identified as influential the same variables that the hurdle model identified in its zero model;
- the zeroinfl model also identified variable d as influential.

My question is the following. In the vignette (Regression Models for Count Data in R), glm.nb, hurdle, and zeroinfl give similar results for the count model, while the zero components of hurdle and zeroinfl may differ somewhat more. In my example, however, the count model from glm.nb is similar to the zero-component part of hurdle and zeroinfl. Why is that? Is it a problem that my dataset is extremely zero-inflated, with few cases having values different from 0?

Any kind of help would be most welcome.

Thank you and have a great day ahead.

> summary(aaa)

Call:
hurdle(formula = as.integer(x) ~ as.integer(a) + as.integer(b) + as.integer(c) +
    as.integer(d) + as.integer(e) + as.integer(f) + as.integer(g) + as.integer(h),
    data = dep, dist = "negbin")

Count model coefficients (truncated negbin with log link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.02178    0.30753  -0.071    0.944
as.integer(a) -0.48886    0.54023  -0.905    0.366
as.integer(b) -0.09555    0.11688  -0.817    0.414
as.integer(c) -0.08654    0.20809  -0.416    0.678
as.integer(d)  0.17446    0.16956   1.029    0.304
as.integer(e)  0.27180    0.55702   0.488    0.626
as.integer(f)  0.15512    0.42721   0.363    0.717
as.integer(g) -0.07687    0.21750  -0.353    0.724
as.integer(h) -0.16906    0.44986  -0.376    0.707
Log(theta)    -0.76274    0.51800  -1.472    0.141

Zero hurdle model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -1.13498    0.07906 -14.356  < 2e-16 ***
as.integer(a) -0.33134    0.30239  -1.096  0.27320
as.integer(b) -0.26394    0.08397  -3.143  0.00167 **
as.integer(c)  0.06689    0.12796   0.523  0.60115
as.integer(d) -0.12045    0.11984  -1.005  0.31486
as.integer(e) -0.79314    0.29106  -2.725  0.00643 **
as.integer(f) -0.28547    0.40790  -0.700  0.48402
as.integer(g) -0.33186    0.18887  -1.757  0.07890 .
as.integer(h) -0.37008    0.31035  -1.192  0.23308
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta: count = 0.4664
Number of iterations in BFGS optimization: 28
Log-likelihood: -1073 on 19 Df

> summary(a)

Call:
glm.nb(formula = as.integer(x) ~ as.integer(a) + as.integer(b) + as.integer(c) +
    as.integer(d) + as.integer(e) + as.integer(f) + as.integer(g) + as.integer(h),
    data = dep, init.theta = 0.187836108765364, link = log)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.8607  -0.7236  -0.6809  -0.4610   2.7575

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.56381    0.08820  -6.392 1.64e-10 ***
as.integer(a) -0.51517    0.33477  -1.539  0.12384
as.integer(b) -0.21835    0.07250  -3.011  0.00260 **
as.integer(c)  0.08920    0.14546   0.613  0.53974
as.integer(d) -0.01742    0.10877  -0.160  0.87274
as.integer(e) -0.69085    0.23446  -2.946  0.00321 **
as.integer(f) -0.14182    0.42142  -0.337  0.73647
as.integer(g) -0.24976    0.15819  -1.579  0.11437
as.integer(h) -0.37652    0.30043  -1.253  0.21009
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(0.1878) family taken to be 1)

    Null deviance: 707.18  on 1308  degrees of freedom
Residual deviance: 677.09  on 1300  degrees of freedom
AIC: 2181.5

Number of Fisher Scoring iterations: 1

              Theta:  0.1878
          Std. Err.:  0.0186
Warning while fitting theta: alternation limit reached

> summary(aa)

Call:
zeroinfl(formula = as.integer(x) ~ as.integer(a) + as.integer(b) + as.integer(c) +
    as.integer(d) + as.integer(e) + as.integer(f) + as.integer(g) + as.integer(h),
    data = dep, dist = "negbin")

Count model coefficients (negbin with log link):
               Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.030225   0.237197  -0.127   0.8986
as.integer(a) -0.419544   0.667512  -0.629   0.5297
as.integer(b) -0.128478   0.132001  -0.973   0.3304
as.integer(c) -0.226652   0.146983  -1.542   0.1231
as.integer(d)  0.226577   0.157547   1.438   0.1504
as.integer(e)  0.374845   0.650778   0.576   0.5646
as.integer(f)  0.381320   0.399210   0.955   0.3395
as.integer(g) -0.006804   0.195869  -0.035   0.9723
as.integer(h) -0.161501   0.426027  -0.379   0.7046
Log(theta)    -0.776709   0.393571  -1.973   0.0484 *

Zero-inflation model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    -0.3705     0.5458  -0.679   0.4973
as.integer(a)   0.1848     1.1336   0.163   0.8705
as.integer(b)   0.2453     0.1775   1.382   0.1669
as.integer(c)  -1.2289     0.8108  -1.516   0.1296
as.integer(d)   0.3749     0.2015   1.861   0.0628 .
as.integer(e)   1.2458     0.4929   2.527   0.0115 *
as.integer(f)   1.1177     0.7105   1.573   0.1157
as.integer(g)   0.5752     0.3332   1.726   0.0843 .
as.integer(h)   0.3890     0.5272   0.738   0.4606
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta = 0.4599
Number of iterations in BFGS optimization: 36
Log-likelihood: -1072 on 19 Df
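
Since the real data cannot be shared, the comparison described in this post could be reproduced on simulated data along the following lines. This is only a sketch under stated assumptions: the two predictors (b and e), their effect sizes, and the whole data-generating process are hypothetical and chosen purely for illustration, not taken from the dataset above. It uses glm.nb from MASS and a plain logistic regression on I(x > 0), which is the quantity that the zero part of a hurdle model with a binomial/logit specification estimates.

```r
library(MASS)  # provides glm.nb(); MASS ships with standard R installations

set.seed(1)
n <- 1309                      # same number of cases as in the post

# Hypothetical binary predictors (assumption: not the real variables)
b <- rbinom(n, 1, 0.3)
e <- rbinom(n, 1, 0.3)

# Assumed data-generating process: the predictors lower the chance of
# observing a non-zero count, but do not affect the size of positive counts.
p_pos  <- plogis(-1.1 - 1.0 * b - 1.0 * e)
is_pos <- rbinom(n, 1, p_pos)
x      <- is_pos * (1 + rnbinom(n, size = 0.5, mu = 1))

mean(x == 0)                   # share of zeros in the simulated response

# Single negative binomial model for the whole response
fit_nb <- glm.nb(x ~ b + e)

# Logistic regression for "count is non-zero"
fit_zero <- glm(I(x > 0) ~ b + e, family = binomial)

# Side-by-side coefficients: with this many zeros, most of the information
# about the mean comes from zero vs. non-zero, so the two columns can be
# expected to point in the same direction.
round(cbind(negbin = coef(fit_nb), logit_nonzero = coef(fit_zero)), 3)
```

If the pscl package is installed, coef(hurdle(x ~ b + e, dist = "negbin"), model = "zero") could be added as a third column. Note also that the two zero components are parameterized in opposite directions: the hurdle zero part models the probability of a positive count, whereas the zeroinfl zero part models the probability of an excess zero, so their coefficients are expected to differ in sign.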