Renaud Lancelot
2002-Jan-13 19:48 UTC
[R] weighted regression: Osius & Rojek's test for logistic regression models
Dear all, I am trying to implement goodness-of-fit tests for logistic regression models described in : Hosmer, D.W., Lemeshow, S., 2000. Applied logistic regression. New-York, John Wiley & Sons, Inc., 373 p. (pp. 152-154) Namely I would like to reproduce Osius & Rojek's test (Osius, G., Rojek, D., 1992. Normal goodness-of-fit tests for multinomial models with large degrees of freedom. JASA, 87: 1145:1152.). This test involves a weighted linear regression of the transformed fitted probabilities against covariates, from which residual sum of squares is extracted. Using the UIS dataset provided at http://www-unix.oit.umass.edu/~statdata/stat-logistic.html and the same model than in the book, I find a large discrepancy between "my" RSS and the one reported by Hosmer and Lemeshow: 392.454 vs. 189.685. Weighted linear regression results, weights in "nu":> summary(fm)[snip] Coefficients: Value Std. Error t value Pr(>|t|) (Intercept) 16.5302 0.7327 22.5616 0.0000 age -0.2175 0.0158 -13.7293 0.0000 ndrgfp1 -4.1627 0.2578 -16.1472 0.0000 ndrgfp2 -1.5025 0.0977 -15.3835 0.0000 ivhxPrevious 2.1504 0.2595 8.2872 0.0000 ivhxRecent 2.4429 0.2290 10.6676 0.0000 race -0.8732 0.2016 -4.3316 0.0000 treat -1.4424 0.1779 -8.1096 0.0000 site -0.6621 0.1977 -3.3486 0.0009 Residual standard error: 0.8755 on 512 degrees of freedom [snip]> sum( nu * resid(fm)^2 )[1] 392.454> sqrt(392.454 / 512)[1] 0.876 Other intermediate calculations are the same (Pearson's chi square, etc.) so I don't think the problem comes from the dataset. The whole code of the function and its application on the UIS data set are given below. I would highly appreciate any insight on this problem. Renaud ################### or.test <- function(object) { ### ancillary function ## ags adapted from agg.sum provided by Bill Dunlap ags <- function(x, by){ by <- data.frame(by) ord <- do.call("order", unname(by)) x <- x[ord] by <- by[ord, ] logical.diff <- function(group) group[-1] != group[ - length(group)] change <- logical.diff(by[[1]]) for(i in seq(along = by)[-1]) change <- change | logical.diff(by[[i]]) by <- by[c(T, change), , drop = F] by$x <- diff(c(0, cumsum(x)[c(change, T)])) by } ### ### computations ### mf <- model.frame(object) ## collapse the original data by covariate pattern xx <- ags(rep(1, nrow(mf)), mf[-1]) ## observed number of cases by covariate pattern yy <- unname(unlist(ags(mf[ , 1], mf[-1])[ncol(xx)])) ## fitted proba pp <- predict(object, newdata = xx, type = "response") ## number of rows with the same covariate pattern mm <- unname(unlist(xx[ncol(xx)])) ## new model frame xx <- xx[ , - ncol(xx)] ## weights nu <- mm * pp * (1 - pp) ## new response cc <- (1 - 2 * pp) / nu ### Pearson's X2 X2 <- sum( (yy - mm * pp)^2 / nu) ### weighted regression mod <- lm(cc ~ . , weights = nu, data = xx) rss <- sum( nu * resid(mod)^2 ) ### compute the stat. J <- nrow(xx) A <- 2 * (J - sum( 1 / mm)) z <- abs( (X2 - (J - length( coef(object) ) ) ) / sqrt(A + rss) ) ### report results print(object$call) cat("Osius & Rojek's goodness-of-fit test for logistic models.\n") cat("Null hypothesis: model fits the data well.\n") cat("z =", round(z, 3), "; P =", round(2 * (1 - pnorm(z)), 3), "\n") }> mod2 <- glm(dfree ~ age + ndrgfp1 + ndrgfp2 + ivhx + race + treat + site + age:ndrgfp1 + race:site,family = binomial(logit), data = uis, epsilon = 1e-010)> summary(mod2, cor = 0)Call: glm(formula = dfree ~ age + ndrgfp1 + ndrgfp2 + ivhx + race + treat + site + age:ndrgfp1 + race:site, family = binomial(logit), data = uis, epsilon = 1e-010) Deviance Residuals: Min 1Q Median 3Q Max -1.30291 -0.7884782 -0.5787787 0.9882528 2.616352 Coefficients: Value Std. Error t value (Intercept) -6.84386380 1.21931572 -5.612873 age 0.11663848 0.02887493 4.039437 ndrgfp1 1.66903502 0.40715174 4.099295 ndrgfp2 0.43368849 0.11690510 3.709748 ivhxPrevious -0.63463065 0.29871912 -2.124506 ivhxRecent -0.70494755 0.26158047 -2.694955 race 0.68410676 0.26413543 2.589985 treat 0.43492548 0.20375957 2.134503 site 0.51620102 0.25488809 2.025207 age:ndrgfp1 -0.01526971 0.00602675 -2.533656 race:site -1.42945705 0.52978052 -2.698206 (Dispersion Parameter for Binomial family taken to be 1 ) Null Deviance: 653.7289 on 574 degrees of freedom Residual Deviance: 597.9629 on 564 degrees of freedom Number of Fisher Scoring Iterations: 5> or.test(mod2)glm(formula = dfree ~ age + ndrgfp1 + ndrgfp2 + ivhx + race + treat + site + age:ndrgfp1 + race:site, family = binomial(logit), data = uis, epsilon = 1e-010) Osius & Rojek's goodness-of-fit test for logistic models. Null hypothesis: model fits the data well. z = 0.085 ; P = 0.933 -- Dr Renaud Lancelot, v?t?rinaire CIRAD, D?partement Elevage et M?decine V?t?rinaire (CIRAD-Emvt) Programme Productions Animales http://www.cirad.fr/presentation/programmes/prod-ani.shtml ISRA-LNERV tel (221) 832 49 02 BP 2057 Dakar-Hann fax (221) 821 18 79 (CIRAD) Senegal e-mail renaud.lancelot at cirad.fr -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._