Mark Wardle
2007-Jul-21 20:43 UTC
[R] Binomial multi-level (hierarchical) modelling [partly stats question, not completely R related]
Dear all, This question is partly statistics and partly R and I apologise in advance for my (usual) verbosity! My data is a little more complicated that this suggests, but essentially I have proportion data from different studies (id), each from a specific country and region of the World. I would like to examine the variables that affect the proportion, but these factors are hierarchical. In case I am not explaining the problem correctly, I have created a simple analysis with simulated data below! My question is: How does one perform logistic regression using interdependent (hieraerchical) variables and decide how much each factor contributes to the variability? Originally I wanted to look at whether there was only a regional difference between studies, and simply aggregated the data and performed a standard contingency analysis. However, I wanted to make sure I wasn't aggregating and losing valuable covariates (?Simpson's paradox...). Hence I thought I would be better off using binomial regression instead, just to check other factors are not possible confounders: # At present I have a data frame with nested groups with proportion data (counts and totals) # make up some equivalent simulated data: d1 <- data.frame( id = factor(seq(1,20)), count = round(c(rnorm(10, mean=10, sd=4),rnorm(10, mean=20, sd=2)),0), total = round(rnorm(20, mean=50, sd=5),0), country = c('UK','UK','Fr','Fr','Sp','Sp',rep('Other',4), rep('Japan',8),rep('China',2)), region = c(rep('Europe',10), rep('Asia',10)) ) d1$remain <- with(d1, total-count) summary(d1) # show what we've done # current analysis - simple aggregate and perform assessment of contigency table d2 <- aggregate(d1[,2:3], by=list(d1$region), FUN=sum ) d2$remain <- with(d2, total-count) d2.f <- fisher.test(matrix(c(d2$count, d2$remain), ncol=2)) print(d2.f) # highly significant difference between europe and asia # I liked the result of this - highly significant for both simulated and real data, and easy to interpret for a stats-idiot like me! # a better (?) analysis? m1 <- glm(cbind(count, remain) ~ region, "binomial", data=d1) summary(m1) # with real data: # Call: # glm(formula = cbind(count, remain) ~ region, family = "binomial", # data = d1) # # Deviance Residuals: # Min 1Q Median 3Q Max # -5.2427766 -2.4016616 -1.0373422 -0.0004486 4.9658610 # # Coefficients: # Estimate Std. Error z value Pr(>|z|) # (Intercept) -1.7402 0.1007 -17.287 < 2e-16 *** # regionAsia, Americas -0.6781 0.2128 -3.187 0.00144 ** # regionEurope -0.8692 0.1665 -5.219 1.80e-07 *** # regionIndia -16.6492 956.1380 -0.017 0.98611 # regionOther 0.3107 0.2882 1.078 0.28096 # --- # Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 # # (Dispersion parameter for binomial family taken to be 1) # # Null deviance: 187.02 on 18 degrees of freedom # Residual deviance: 141.37 on 14 degrees of freedom # (2 observations deleted due to missingness) # AIC: 209.28 # # Number of Fisher Scoring iterations: 14 # *** but how does one include nested parameters (ie. decide how much of the variability is dependent on individual study, country and region? especially given that these factors are inter-dependent - hierarchical.) m2 <- glm(cbind(count, remain) ~ region + country + id, "binomial", data=d1) # with the fake data, nothing is significant - how does one decide which effect to remove? # and with REAL data: # Null deviance: 1.8702e+02 on 18 degrees of freedom # Residual deviance: 5.4476e-10 on 0 degrees of freedom # (2 observations deleted due to missingness) # AIC: 95.914 # # Number of Fisher Scoring iterations: 22 # Therefore, take out "id" as this clearly isn't the correct approach! m3 <- glm(cbind(count, remain) ~ region + country, "binomial", data=d1) summary(m3) # and with REAL data: # Call: # glm(formula = cbind(count, remain) ~ region + country, family "binomial", # data = d1) # # Deviance Residuals: # Basu:2000 Brusco:2004 Filla:2000 Hadjivassiliou:2003 Juvonen:2005 Leggo:1997 Maruyama:2002 Nagaoka:1999 Onodera:2000 Pujana:1999 Saleem:2000 Schols:1997 # -6.322e-08 -2.978e-01 3.468e-01 7.383e-02 0.000e+00 -9.928e-02 3.328e+00 -1.225e+00 -2.354e+00 -1.718e-07 -2.002e-04 4.215e-08 # Silveira:2002 Soong:2001 Storey:2000 Takano:1998 Tang:2000 Warrenburg:2002 Watanabe:1998 # 0.000e+00 0.000e+00 -1.421e-07 0.000e+00 -2.019e-04 0.000e+00 -3.590e+00 # # Coefficients: (3 not defined because of singularities) # Estimate Std. Error z value Pr(>|z|) # (Intercept) -2.1102 0.3744 -5.637 1.73e-08 *** # regionAsia, Americas -0.3080 0.4187 -0.736 0.46193 # regionEurope -0.8342 0.7007 -1.191 0.23385 # regionIndia -18.1705 4285.1131 -0.004 0.99662 # regionOther 0.6807 0.4616 1.475 0.14026 # countryChina -20.0407 4247.6466 -0.005 0.99624 # countryFinland -0.9268 1.1712 -0.791 0.42877 # countryGermany 1.6833 0.6530 2.578 0.00994 ** # countryIndia -1.1087 1.0863 -1.021 0.30747 # countryItaly -1.3562 0.7773 -1.745 0.08104 . # countryJapan 0.5989 0.3893 1.538 0.12396 # countryJapan, USA NA NA NA NA # countryNetherlands 1.2081 0.6209 1.946 0.05167 . # countryPortugal / Brazil -1.7095 1.1664 -1.466 0.14273 # countrySpain -1.3182 1.1683 -1.128 0.25918 # countryTaiwan NA NA NA NA # countryUK NA NA NA NA # --- # Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 # # (Dispersion parameter for binomial family taken to be 1) # # Null deviance: 187.019 on 18 degrees of freedom # Residual deviance: 31.233 on 5 degrees of freedom # (2 observations deleted due to missingness) # AIC: 117.15 # # Number of Fisher Scoring iterations: 17 So I have evidence of marked overdispersion (if I read this correctly!) with my real data. It seems to me that this analysis needs a mixed model with hierarchical terms and a binomial link function (?using nlme). Is this possible? I tried using lme4's lmer() function like this: library(lme4) d1$prop <- with(d1, count/total) me1 <- lmer(prop ~ country + (1|region) + (0+country|region), family='binomial', weights=total, data=d1) But I don't think I am using this properly and can't seem to determine the variance due to the different hierarchical terms. Can one use the interaction of country*region instead of using a mixed model (where presumably country would be a fixed effect and region a random effct?) Any pointers much appreciated! Many thanks, Mark P.S. while experimenting while creating the line:> me1 <- lmer(prop ~ country + (1|region) + (0+country|region), family='binomial', weights=total, data=d1)I accidentally pressed <RETURN> and issued the following command: me1 <- lmer(prop ~ country + (0|region), family='binomial', weights=total, data=d1) which causes R to crash (and the application closes completely). This is with R version> sessionInfo()R version 2.5.1 (2007-06-27) i386-apple-darwin8.9.1 locale: en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] "stats" "graphics" "grDevices" "datasets" "utils" "methods" "base" other attached packages: lme4 Matrix lattice RODBC "0.99875-4" "0.999375-0" "0.15-11" "1.2-1" -- Dr. Mark Wardle Clinical research fellow and specialist registrar, Neurology Cardiff, UK