thr3ads.net - R help - [R] Binomial GLM, chisq.test, or? [May 2012]

If this information is useful, please help other people find it:
Share via:

lincoln

2012-May-04 15:38 UTC

[R] Binomial GLM, chisq.test, or?

Hi,

I have a data set with 999 observations, for each of them I have data on
four variables:
site, colony, gender (quite a few NA values), and cohort.

This is how the data set looks like:> str(dispersal)'data.frame':	999 obs. of  4 variables:
 $ site  : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 2 2
...
 $ gender: Factor w/ 2 levels "0","1": NA NA 2 1 2 NA 1 2 2
NA ...
 $ colony: Factor w/ 2 levels "main","other": 2 2 2 2 2 2 2
2 2 2 ...
 $ cohort: Factor w/ 11 levels "1996","2000",..: 8 8 8 8 8 8
8 8 8 6 ...

Now, I want to estimate if sites 1 and site 2 differ on some of the other
variables. For instance there are relatively more males in site 1 with
respect to site 2, more individuals of the main colony in site 2 with
respect to site 1 and this sort of things.

I thought I might do a binomial GLM considering as response variable the
site, I tried to run the more general model to have a look to overdispersion
but I believe there is something wrong even before worrying about
overdispersion. I know (I did a chisq.test) that cohort2004 is very diversly
represented between the two sites but it is not reflected in the results of
GLM. Here there are the results of chisq.test and the GLM:

1) 
/>
age_cohort<-as.table(rbind(c(142,95,46,33,14,59,18,12,7,1,0),c(258,144,54,70,20,11,6,8,2,3,1)))> dimnames(age_cohort)<-list(site=c("M","D"),+                       
cohorts=c(2010,2009,2008,2007,2006,2004,2003,2002,2001,2000,1996))> age_cohort    cohorts
site 2010 2009 2008 2007 2006 2004 2003 2002 2001 2000 1996
   M  142   95   46   33   14   59   18   12    7    1    0
   D  258  144   54   70   20   11    6    8    2    3   
1> (Xsqagec <- chisq.test(age_cohort))  # Prints test summary
	Pearson's Chi-squared test

data:  age_cohort 
X-squared = 82.6016, df = 10, p-value = 1.549e-13

Mensajes de aviso perdidos
In chisq.test(age_cohort) : Chi-squared approximation may be
incorrect> Xsqagec$observed   # observed counts    cohorts
site 2010 2009 2008 2007 2006 2004 2003 2002 2001 2000 1996
   M  142   95   46   33   14   59   18   12    7    1    0
   D  258  144   54   70   20   11    6    8    2    3   
1> Xsqagec$expected   # expected counts under the null    cohorts
site     2010     2009     2008     2007     2006     2004     2003
   M 170.1195 101.6464 42.52988 43.80578 14.46016 29.77092 10.20717
   D 229.8805 137.3536 57.47012 59.19422 19.53984 40.22908 13.79283
    cohorts
site      2002     2001     2000      1996
   M  8.505976 3.827689 1.701195 0.4252988
   D 11.494024 5.172311 2.298805 0.5747012> Xsqagec$residuals  # Pearson residuals    cohorts
site       2010       2009       2008       2007       2006
   M -2.1559111 -0.6592367  0.5321050 -1.6326395 -0.1210101
   D  1.8546283  0.5671101 -0.4577448  1.4044825  0.1040993
    cohorts
site       2004       2003       2002       2001       2000
   M  5.3569686  2.4391720  1.1980192  1.6214643 -0.5376032
   D -4.6083465 -2.0983042 -1.0305993 -1.3948690  0.4624746
    cohorts
site       1996
   M -0.6521494
   D  0.5610132> Xsqagec$stdres     # standardized residuals    cohorts
site       2010       2009       2008       2007       2006
   M -3.6665549 -0.9962228  0.7397057 -2.2733888 -0.1623984
   D  3.6665549  0.9962228 -0.7397057  2.2733888  0.1623984
    cohorts
site       2004       2003       2002       2001       2000
   M  7.3264142  3.2566808  1.5962909  2.1485311 -0.7105713
   D -7.3264142 -3.2566808 -1.5962909 -2.1485311  0.7105713
    cohorts
site       1996
   M -0.8606814
   D  0.8606814
/


2)
/> model1<-glm(site~gender*colony*cohort,binomial)> summary(model1)
Call:
glm(formula = site ~ gender * colony * cohort, family = binomial)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.84648  -0.96954  -0.00036   1.11269   2.03933  

Coefficients: (12 not defined because of singularities)
                                 Estimate Std. Error z value
(Intercept)                    -1.657e+01  2.400e+03  -0.007
gender1                        -2.231e-01  9.220e-01  -0.242
colonyother                     9.531e-02  8.006e-01   0.119
cohort2002                      1.717e-08  3.393e+03   0.000
cohort2003                      1.766e+01  2.400e+03   0.007
cohort2004                      1.807e+01  2.400e+03   0.008
cohort2006                      1.697e+01  2.400e+03   0.007
cohort2007                      1.726e+01  2.400e+03   0.007
cohort2008                      1.606e+01  2.400e+03   0.007
cohort2009                      1.657e+01  2.400e+03   0.007
cohort2010                      1.587e+01  2.400e+03   0.007
gender1:colonyother             9.163e-01  1.087e+00   0.843
gender1:cohort2002              1.719e+01  2.400e+03   0.007
gender1:cohort2003             -1.823e-01  1.713e+00  -0.106
gender1:cohort2004              2.231e-01  1.329e+00   0.168
gender1:cohort2006             -5.878e-01  1.586e+00  -0.371
gender1:cohort2007             -6.454e-02  1.784e+00  -0.036
gender1:cohort2008              8.881e-01  1.156e+00   0.768
gender1:cohort2009             -2.817e-02  1.199e+00  -0.023
gender1:cohort2010                     NA         NA      NA
colonyother:cohort2002                 NA         NA      NA
colonyother:cohort2003                 NA         NA      NA
colonyother:cohort2004          1.497e+01  1.697e+03   0.009
colonyother:cohort2006         -1.707e+01  2.400e+03  -0.007
colonyother:cohort2007         -7.885e-01  1.772e+00  -0.445
colonyother:cohort2008                 NA         NA      NA
colonyother:cohort2009                 NA         NA      NA
colonyother:cohort2010                 NA         NA      NA
gender1:colonyother:cohort2002         NA         NA      NA
gender1:colonyother:cohort2003         NA         NA      NA
gender1:colonyother:cohort2004 -9.163e-01  2.400e+03   0.000
gender1:colonyother:cohort2006         NA         NA      NA
gender1:colonyother:cohort2007 -2.575e+00  2.379e+00  -1.082
gender1:colonyother:cohort2008         NA         NA      NA
gender1:colonyother:cohort2009         NA         NA      NA
gender1:colonyother:cohort2010         NA         NA      NA
                               Pr(>|z|)
(Intercept)                       0.994
gender1                           0.809
colonyother                       0.905
cohort2002                        1.000
cohort2003                        0.994
cohort2004                        0.994
cohort2006                        0.994
cohort2007                        0.994
cohort2008                        0.995
cohort2009                        0.994
cohort2010                        0.995
gender1:colonyother               0.399
gender1:cohort2002                0.994
gender1:cohort2003                0.915
gender1:cohort2004                0.867
gender1:cohort2006                0.711
gender1:cohort2007                0.971
gender1:cohort2008                0.442
gender1:cohort2009                0.981
gender1:cohort2010                   NA
colonyother:cohort2002               NA
colonyother:cohort2003               NA
colonyother:cohort2004            0.993
colonyother:cohort2006            0.994
colonyother:cohort2007            0.656
colonyother:cohort2008               NA
colonyother:cohort2009               NA
colonyother:cohort2010               NA
gender1:colonyother:cohort2002       NA
gender1:colonyother:cohort2003       NA
gender1:colonyother:cohort2004    1.000
gender1:colonyother:cohort2006       NA
gender1:colonyother:cohort2007    0.279
gender1:colonyother:cohort2008       NA
gender1:colonyother:cohort2009       NA
gender1:colonyother:cohort2010       NA

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 311.91  on 224  degrees of freedom
Residual deviance: 271.61  on 201  degrees of freedom
  (774 observations deleted due to missingness)
AIC: 319.61

Number of Fisher Scoring iterations: 15
/

I thought that perhaps keeping the gender as explanatory variable was
reducing a lot the sample size and it was the matter (I removed it):



/> model1<-glm(site~colony*cohort,binomial)> summary(model1)
Call:
glm(formula = site ~ colony * cohort, family = binomial)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8683  -0.9712  -0.8757   1.3468   1.7470  

Coefficients: (6 not defined because of singularities)
                        Estimate Std. Error z value Pr(>|z|)
(Intercept)             -15.5661  1455.3976  -0.011    0.991
colonyother               0.2544     0.2167   1.174    0.240
cohort2000               14.4675  1455.3981   0.010    0.992
cohort2001               16.8188  1455.3978   0.012    0.991
cohort2002               15.9715  1455.3977   0.011    0.991
cohort2003               16.6647  1455.3977   0.011    0.991
cohort2004               17.1194  1455.3976   0.012    0.991
cohort2006               15.2607  1455.3976   0.010    0.992
cohort2007               14.9470  1455.3976   0.010    0.992
cohort2008               15.3837  1455.3976   0.011    0.992
cohort2009               15.1762  1455.3976   0.010    0.992
cohort2010               14.8053  1455.3976   0.010    0.992
colonyother:cohort2000        NA         NA      NA       NA
colonyother:cohort2001        NA         NA      NA       NA
colonyother:cohort2002        NA         NA      NA       NA
colonyother:cohort2003        NA         NA      NA       NA
colonyother:cohort2004   13.7583   550.0887   0.025    0.980
colonyother:cohort2006  -15.5151  1455.3976  -0.011    0.991
colonyother:cohort2007   -0.9163     0.5979  -1.533    0.125
colonyother:cohort2008        NA         NA      NA       NA
colonyother:cohort2009   -0.6912     0.5214  -1.326    0.185
colonyother:cohort2010        NA         NA      NA       NA

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1361.4  on 998  degrees of freedom
Residual deviance: 1267.9  on 983  degrees of freedom
AIC: 1299.9

Number of Fisher Scoring iterations: 14
  /

Any comment/suggestion on this?
Thanks for any help

--
View this message in context:
http://r.789695.n4.nabble.com/Binomial-GLM-chisq-test-or-tp4608941.html
Sent from the R help mailing list archive at Nabble.com.

Tal Galili

2012-May-04 18:14 UTC

head link

[R] Binomial GLM, chisq.test, or?

Without going deeply into your analysis, 2 comments:

1) Use the anova command to test two nested models using:
anova(model1, model2, test="Chisq")

2) glm's are non-trivial models (at least to me), be sure to google for
some tutorials in order to understand what you are looking at...

Cheers,
Tal





----------------Contact
Details:-------------------------------------------------------
Contact me: Tal.Galili@gmail.com |  972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
----------------------------------------------------------------------------------------------




On Fri, May 4, 2012 at 6:38 PM, lincoln <miseno77@hotmail.com> wrote:
> Hi,
>
> I have a data set with 999 observations, for each of them I have data on
> four variables:
> site, colony, gender (quite a few NA values), and cohort.
>
> This is how the data set looks like:
> > str(dispersal)
> 'data.frame':   999 obs. of  4 variables:
>  $ site  : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1
2 2 ...
>  $ gender: Factor w/ 2 levels "0","1": NA NA 2 1 2 NA 1
2 2 NA ...
>  $ colony: Factor w/ 2 levels "main","other": 2 2 2 2 2
2 2 2 2 2 ...
>  $ cohort: Factor w/ 11 levels "1996","2000",..: 8 8 8
8 8 8 8 8 8 6 ...
>
> Now, I want to estimate if sites 1 and site 2 differ on some of the other
> variables. For instance there are relatively more males in site 1 with
> respect to site 2, more individuals of the main colony in site 2 with
> respect to site 1 and this sort of things.
>
> I thought I might do a binomial GLM considering as response variable the
> site, I tried to run the more general model to have a look to
> overdispersion
> but I believe there is something wrong even before worrying about
> overdispersion. I know (I did a chisq.test) that cohort2004 is very
> diversly
> represented between the two sites but it is not reflected in the results of
> GLM. Here there are the results of chisq.test and the GLM:
>
> 1)
> />
>
>
age_cohort<-as.table(rbind(c(142,95,46,33,14,59,18,12,7,1,0),c(258,144,54,70,20,11,6,8,2,3,1)))
> > dimnames(age_cohort)<-list(site=c("M","D"),
> +
> cohorts=c(2010,2009,2008,2007,2006,2004,2003,2002,2001,2000,1996))
> > age_cohort
>    cohorts
> site 2010 2009 2008 2007 2006 2004 2003 2002 2001 2000 1996
>   M  142   95   46   33   14   59   18   12    7    1    0
>   D  258  144   54   70   20   11    6    8    2    3    1
> > (Xsqagec <- chisq.test(age_cohort))  # Prints test summary
>
>        Pearson's Chi-squared test
>
> data:  age_cohort
> X-squared = 82.6016, df = 10, p-value = 1.549e-13
>
> Mensajes de aviso perdidos
> In chisq.test(age_cohort) : Chi-squared approximation may be incorrect
> > Xsqagec$observed   # observed counts
>    cohorts
> site 2010 2009 2008 2007 2006 2004 2003 2002 2001 2000 1996
>   M  142   95   46   33   14   59   18   12    7    1    0
>   D  258  144   54   70   20   11    6    8    2    3    1
> > Xsqagec$expected   # expected counts under the null
>    cohorts
> site     2010     2009     2008     2007     2006     2004     2003
>   M 170.1195 101.6464 42.52988 43.80578 14.46016 29.77092 10.20717
>   D 229.8805 137.3536 57.47012 59.19422 19.53984 40.22908 13.79283
>    cohorts
> site      2002     2001     2000      1996
>   M  8.505976 3.827689 1.701195 0.4252988
>   D 11.494024 5.172311 2.298805 0.5747012
> > Xsqagec$residuals  # Pearson residuals
>    cohorts
> site       2010       2009       2008       2007       2006
>   M -2.1559111 -0.6592367  0.5321050 -1.6326395 -0.1210101
>   D  1.8546283  0.5671101 -0.4577448  1.4044825  0.1040993
>    cohorts
> site       2004       2003       2002       2001       2000
>   M  5.3569686  2.4391720  1.1980192  1.6214643 -0.5376032
>   D -4.6083465 -2.0983042 -1.0305993 -1.3948690  0.4624746
>    cohorts
> site       1996
>   M -0.6521494
>   D  0.5610132
> > Xsqagec$stdres     # standardized residuals
>    cohorts
> site       2010       2009       2008       2007       2006
>   M -3.6665549 -0.9962228  0.7397057 -2.2733888 -0.1623984
>   D  3.6665549  0.9962228 -0.7397057  2.2733888  0.1623984
>    cohorts
> site       2004       2003       2002       2001       2000
>   M  7.3264142  3.2566808  1.5962909  2.1485311 -0.7105713
>   D -7.3264142 -3.2566808 -1.5962909 -2.1485311  0.7105713
>    cohorts
> site       1996
>   M -0.8606814
>   D  0.8606814
> /
>
>
> 2)
> /> model1<-glm(site~gender*colony*cohort,binomial)
> > summary(model1)
>
> Call:
> glm(formula = site ~ gender * colony * cohort, family = binomial)
>
> Deviance Residuals:
>     Min        1Q    Median        3Q       Max
> -1.84648  -0.96954  -0.00036   1.11269   2.03933
>
> Coefficients: (12 not defined because of singularities)
>                                 Estimate Std. Error z value
> (Intercept)                    -1.657e+01  2.400e+03  -0.007
> gender1                        -2.231e-01  9.220e-01  -0.242
> colonyother                     9.531e-02  8.006e-01   0.119
> cohort2002                      1.717e-08  3.393e+03   0.000
> cohort2003                      1.766e+01  2.400e+03   0.007
> cohort2004                      1.807e+01  2.400e+03   0.008
> cohort2006                      1.697e+01  2.400e+03   0.007
> cohort2007                      1.726e+01  2.400e+03   0.007
> cohort2008                      1.606e+01  2.400e+03   0.007
> cohort2009                      1.657e+01  2.400e+03   0.007
> cohort2010                      1.587e+01  2.400e+03   0.007
> gender1:colonyother             9.163e-01  1.087e+00   0.843
> gender1:cohort2002              1.719e+01  2.400e+03   0.007
> gender1:cohort2003             -1.823e-01  1.713e+00  -0.106
> gender1:cohort2004              2.231e-01  1.329e+00   0.168
> gender1:cohort2006             -5.878e-01  1.586e+00  -0.371
> gender1:cohort2007             -6.454e-02  1.784e+00  -0.036
> gender1:cohort2008              8.881e-01  1.156e+00   0.768
> gender1:cohort2009             -2.817e-02  1.199e+00  -0.023
> gender1:cohort2010                     NA         NA      NA
> colonyother:cohort2002                 NA         NA      NA
> colonyother:cohort2003                 NA         NA      NA
> colonyother:cohort2004          1.497e+01  1.697e+03   0.009
> colonyother:cohort2006         -1.707e+01  2.400e+03  -0.007
> colonyother:cohort2007         -7.885e-01  1.772e+00  -0.445
> colonyother:cohort2008                 NA         NA      NA
> colonyother:cohort2009                 NA         NA      NA
> colonyother:cohort2010                 NA         NA      NA
> gender1:colonyother:cohort2002         NA         NA      NA
> gender1:colonyother:cohort2003         NA         NA      NA
> gender1:colonyother:cohort2004 -9.163e-01  2.400e+03   0.000
> gender1:colonyother:cohort2006         NA         NA      NA
> gender1:colonyother:cohort2007 -2.575e+00  2.379e+00  -1.082
> gender1:colonyother:cohort2008         NA         NA      NA
> gender1:colonyother:cohort2009         NA         NA      NA
> gender1:colonyother:cohort2010         NA         NA      NA
>                               Pr(>|z|)
> (Intercept)                       0.994
> gender1                           0.809
> colonyother                       0.905
> cohort2002                        1.000
> cohort2003                        0.994
> cohort2004                        0.994
> cohort2006                        0.994
> cohort2007                        0.994
> cohort2008                        0.995
> cohort2009                        0.994
> cohort2010                        0.995
> gender1:colonyother               0.399
> gender1:cohort2002                0.994
> gender1:cohort2003                0.915
> gender1:cohort2004                0.867
> gender1:cohort2006                0.711
> gender1:cohort2007                0.971
> gender1:cohort2008                0.442
> gender1:cohort2009                0.981
> gender1:cohort2010                   NA
> colonyother:cohort2002               NA
> colonyother:cohort2003               NA
> colonyother:cohort2004            0.993
> colonyother:cohort2006            0.994
> colonyother:cohort2007            0.656
> colonyother:cohort2008               NA
> colonyother:cohort2009               NA
> colonyother:cohort2010               NA
> gender1:colonyother:cohort2002       NA
> gender1:colonyother:cohort2003       NA
> gender1:colonyother:cohort2004    1.000
> gender1:colonyother:cohort2006       NA
> gender1:colonyother:cohort2007    0.279
> gender1:colonyother:cohort2008       NA
> gender1:colonyother:cohort2009       NA
> gender1:colonyother:cohort2010       NA
>
> (Dispersion parameter for binomial family taken to be 1)
>
>    Null deviance: 311.91  on 224  degrees of freedom
> Residual deviance: 271.61  on 201  degrees of freedom
>  (774 observations deleted due to missingness)
> AIC: 319.61
>
> Number of Fisher Scoring iterations: 15
> /
>
> I thought that perhaps keeping the gender as explanatory variable was
> reducing a lot the sample size and it was the matter (I removed it):
>
>
>
> /> model1<-glm(site~colony*cohort,binomial)
> > summary(model1)
>
> Call:
> glm(formula = site ~ colony * cohort, family = binomial)
>
> Deviance Residuals:
>    Min       1Q   Median       3Q      Max
> -1.8683  -0.9712  -0.8757   1.3468   1.7470
>
> Coefficients: (6 not defined because of singularities)
>                        Estimate Std. Error z value Pr(>|z|)
> (Intercept)             -15.5661  1455.3976  -0.011    0.991
> colonyother               0.2544     0.2167   1.174    0.240
> cohort2000               14.4675  1455.3981   0.010    0.992
> cohort2001               16.8188  1455.3978   0.012    0.991
> cohort2002               15.9715  1455.3977   0.011    0.991
> cohort2003               16.6647  1455.3977   0.011    0.991
> cohort2004               17.1194  1455.3976   0.012    0.991
> cohort2006               15.2607  1455.3976   0.010    0.992
> cohort2007               14.9470  1455.3976   0.010    0.992
> cohort2008               15.3837  1455.3976   0.011    0.992
> cohort2009               15.1762  1455.3976   0.010    0.992
> cohort2010               14.8053  1455.3976   0.010    0.992
> colonyother:cohort2000        NA         NA      NA       NA
> colonyother:cohort2001        NA         NA      NA       NA
> colonyother:cohort2002        NA         NA      NA       NA
> colonyother:cohort2003        NA         NA      NA       NA
> colonyother:cohort2004   13.7583   550.0887   0.025    0.980
> colonyother:cohort2006  -15.5151  1455.3976  -0.011    0.991
> colonyother:cohort2007   -0.9163     0.5979  -1.533    0.125
> colonyother:cohort2008        NA         NA      NA       NA
> colonyother:cohort2009   -0.6912     0.5214  -1.326    0.185
> colonyother:cohort2010        NA         NA      NA       NA
>
> (Dispersion parameter for binomial family taken to be 1)
>
>    Null deviance: 1361.4  on 998  degrees of freedom
> Residual deviance: 1267.9  on 983  degrees of freedom
> AIC: 1299.9
>
> Number of Fisher Scoring iterations: 14
>  /
>
> Any comment/suggestion on this?
> Thanks for any help
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Binomial-GLM-chisq-test-or-tp4608941.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
	[[alternative HTML version deleted]]

lincoln

2012-May-07 07:28 UTC

head link

[R] Binomial GLM, chisq.test, or?

Thanks Tal for answering,

Anyway I still have no idea on why the binomial GLM is missing the
relationship between the response variable and the explanatory variable
"cohort".

Is there anyone who might help me to understand this?


--
View this message in context:
http://r.789695.n4.nabble.com/Binomial-GLM-chisq-test-or-tp4608941p4614184.html
Sent from the R help mailing list archive at Nabble.com.

Apparently Analagous Threads

Search for more seemingly similar threads

R help - May 2012 - Binomial GLM, chisq.test, or?

[R] Binomial GLM, chisq.test, or?

[R] Binomial GLM, chisq.test, or?

[R] Binomial GLM, chisq.test, or?

Apparently Analagous Threads