Xochitl CORMON
2013-Feb-27 17:04 UTC
[R] Separation issue in binary response models - glm, brglm, logistf
Dear all, I am encountering some issues with my data and need some help. I am trying to run glm analysis with a presence/absence variable as response variable and several explanatory variable (time, location, presence/absence data, abundance data). First I tried to use the glm() function, however I was having 2 warnings concerning glm.fit () : # 1: glm.fit: algorithm did not converge # 2: glm.fit: fitted probabilities numerically 0 or 1 occurred After some investigation I found out that the problem was most probably quasi complete separation and therefor decide to use brglm and/or logistf. * logistf : analysis does not run When running logistf() I get a error message saying : # error in chol.default(x) : # leading minor 39 is not positive definite I looked into logistf package manual, on Internet, in the theoretical and technical paper of Heinze and Ploner and cannot find where this function is used and if the error can be fixed by some settings. * brglm : analysis run However I get a warning message saying : # In fit.proc(x = X, y = Y, weights = weights, start = start, etastart # = etastart, : # Iteration limit reached Like before i cannot find where and why this function is used while running the package and if it can be fixed by adjusting some settings. In a more general way, I was wondering what are the fundamental differences of these packages. I hope this make sense enough and I am sorry if this is kind of statistical evidence that I'm not aware of. It is my first time asking a question so I apologize if it's not like it should be and kindly ask you to not hesitate to let me know about it. Thank you for your help Xochitl C. ----------------------------------------------------------------------- Here an extract of my table and the different formula I run : > head (CPUE_table) Year Quarter Subarea Latitude Longitude Presence.S CPUE.S Presence.H CPUE.H Presence.NP CPUE.NP Presence.BW CPUE.BW Presence.C CPUE.C Presence.P CPUE.P Presence.W CPUE.W 1 2000 1 31F1 51.25 1.5 0 0 0 0 0 0 0 0 1 76.002 0 0 1 3358.667 2 2000 1 31F2 51.25 2.5 0 0 0 0 0 0 0 0 1 12.500 0 0 1 3028.500 3 2000 1 32F1 51.75 1.5 0 0 0 0 0 0 0 0 1 5.500 0 0 1 2256.750 4 2000 1 32F2 51.75 2.5 0 0 0 0 0 0 0 0 1 10.000 0 0 1 808.000 5 2000 1 32F3 51.75 3.5 0 0 0 0 0 0 0 0 1 19.000 0 0 1 277.000 6 2000 1 33F1 52.25 1.5 0 0 0 0 0 0 0 0 0 0.000 0 0 1 2.000 > tail (CPUE_table) Year Quarter Subarea Latitude Longitude Presence.S CPUE.S Presence.H CPUE.H Presence.NP CPUE.NP Presence.BW CPUE.BW Presence.C CPUE.C Presence.P CPUE.P Presence.W CPUE.W 4435 2012 3 50F3 60.75 3.5 1 103.000 1 110.000 1 2379.00 1 20.000 1 6.000 0 0 1 22.000 4436 2012 3 51E8 61.25 -1.5 1 1311.600 1 12.000 1 4194.78 0 0.000 1 18.000 0 0 0 0.000 4437 2012 3 51E9 61.25 -0.5 1 34.336 1 46.671 1 11031.56 1 2.668 1 3.335 0 0 1 3.333 4438 2012 3 51F0 61.25 0.5 1 430.500 1 148.000 1 1212.22 1 3279.200 1 2.000 0 0 1 2.000 4439 2012 3 51F1 61.25 1.5 1 115.000 1 85.000 1 2089.50 1 1.000 1 22.000 1 2 1 40.000 4440 2012 3 51F2 61.25 2.5 1 72.500 1 35.500 1 270.48 1 516.300 1 11.500 1 1 1 16.000 logistf_binomPres <- logistf (Presence.S ~ (Presence.BW + Presence.W + Presence.C + Presence.NP +Presence.P + Presence.H +CPUE.BW + CPUE.H + CPUE.P + CPUE.NP + CPUE.W + CPUE.C + Year + Quarter + Latitude + Longitude)^2, data = CPUE_table) Brglm_binomPres <- brglm (Presence.S ~ (Presence.BW + Presence.W + Presence.C + Presence.NP +Presence.P + Presence.H +CPUE.BW + CPUE.H + CPUE.P + CPUE.NP + CPUE.W + CPUE.C + Year + Quarter + Latitude + Longitude)^2, family = binomial, data = CPUE_table) ----------------------------------------------------------------------- -- <>< <>< <>< <>< Xochitl CORMON +33 (0)3 21 99 56 84 Doctorante en sciences halieutiques PhD student in fishery sciences <>< <>< <>< <>< IFREMER Centre Manche Mer du Nord 150 quai Gambetta 62200 Boulogne-sur-Mer <>< <>< <>< <><
Ben Bolker
2013-Feb-28 16:22 UTC
[R] Separation issue in binary response models - glm, brglm, logistf
Xochitl CORMON <Xochitl.Cormon <at> ifremer.fr> writes:> Dear all, > > I am encountering some issues with my data and need some help. > I am trying to run glm analysis with a presence/absence variable as > response variable and several explanatory variable (time, location, > presence/absence data, abundance data). > > First I tried to use the glm() function, however I was having 2 warnings > concerning glm.fit () : > # 1: glm.fit: algorithm did not converge > # 2: glm.fit: fitted probabilities numerically 0 or 1 occurred > After some investigation I found out that the problem was most probably > quasi complete separation and therefor decide to use brglm and/or logistf. > > * logistf : analysis does not run > When running logistf() I get a error message saying : > # error in chol.default(x) : > # leading minor 39 is not positive definite > I looked into logistf package manual, on Internet, in the theoretical > and technical paper of Heinze and Ploner and cannot find where this > function is used and if the error can be fixed by some settings.chol.default is a function for Cholesky decomposition, which is going to be embedded fairly deeply in the code ...> * brglm : analysis run > However I get a warning message saying : > # In fit.proc(x = X, y = Y, weights = weights, start = start, etastart # > = etastart, : > # Iteration limit reached > Like before i cannot find where and why this function is used while > running the package and if it can be fixed by adjusting some settings. > > In a more general way, I was wondering what are the fundamental > differences of these packages.You might also take a crack with bayesglm() in the arm package, which should (?) be able to overcome the separation problem by specifying a not-completely-uninformative prior.> I hope this make sense enough and I am sorry if this is kind of > statistical evidence that I'm not aware of. > > ----------------------------------------------------------------------- > > Here an extract of my table and the different formula I run : > > > head (CPUE_table) > Year Quarter Subarea Latitude Longitude Presence.S CPUE.S Presence.H > CPUE.H Presence.NP CPUE.NP Presence.BW CPUE.BW Presence.C CPUE.C > Presence.P CPUE.P Presence.W CPUE.W > 1 2000 1 31F1 51.25 1.5 0 0 0 > 0 0 0 0 0 1 76.002 > 0 0 1 3358.667[snip]> logistf_binomPres <- logistf (Presence.S ~ (Presence.BW + Presence.W + > Presence.C + Presence.NP +Presence.P + Presence.H +CPUE.BW + CPUE.H + > CPUE.P + CPUE.NP + CPUE.W + CPUE.C + Year + Quarter + Latitude + > Longitude)^2, data = CPUE_table) > > Brglm_binomPres <- brglm (Presence.S ~ (Presence.BW + Presence.W + > Presence.C + Presence.NP +Presence.P + Presence.H +CPUE.BW + CPUE.H + > CPUE.P + CPUE.NP + CPUE.W + CPUE.C + Year + Quarter + Latitude + > Longitude)^2, family = binomial, data = CPUE_table)It's not much to go on, but: * are you overfitting your data? That is, do you have at least 20 times as many 1's or 0's (whichever is rarer) as the number of parameters you are trying to estimated? * have you examined your data graphically and looked for any strong outliers that might be throwing off the fit? * do you have some strongly correlated/multicollinear predictors? * for what it's worth it looks like a variety of your variables might be dummy variables, which you can often express more compactly by using a factor variable and letting R construct the design matrix (i.e. generating the dummy variables on the fly), although that shouldn't change your results