Jürgen Biedermann
2011-Apr-12 10:01 UTC
[R] cross-validation complex model AUC Nagelkerke R squared code
Hi there,

I really tried hard to understand this and to find my own solution, but now I think I have to ask for your help. I have already developed some script code for my problem, but I doubt that it is correct.

The problem is the following: imagine you develop a logistic regression model with a binary outcome Y (0/1) and possible predictors (X1, X2, X3, ...). The development of the final model is quite complex and involves several steps (stepwise forward selection with LR-test statistics, incorporating interaction effects, etc.). The final prediction at the end, however, is made through a glm object (called fit.glm). Then, I think, it is no problem to calculate a Nagelkerke R-squared measure and an AUC value (for example with the pROC package) along the lines of this script (n is the sample size; L0 and LM are the log-likelihoods of the null model and the fitted model, so the likelihood-scale formulas R2 = 1 - (L(0)/L(M))^(2/n) and R2_max = 1 - L(0)^(2/n) become exp() expressions):

BaseRate <- table(Data$Y)[[1]] / sum(table(Data$Y))
L0 <- (BaseRate*log(BaseRate) + (1-BaseRate)*log(1-BaseRate)) * sum(table(Data$Y))  # log-likelihood of the null model
LIKM <- predict(fit.glm, type="response")
LM <- sum(Data$Y*log(LIKM) + (1-Data$Y)*log(1-LIKM))  # log-likelihood of the fitted model
R2 <- 1 - exp((2/n) * (L0 - LM))   # Cox-Snell R^2
R2_max <- 1 - exp((2/n) * L0)
R2_Nagelkerke <- R2 / R2_max
library(pROC)
AUC <- auc(Data$Y, LIKM)

I checked this calculation of R2_Nagelkerke and the AUC value against the built-in calculation in the package "Design" and got consistent results.

Now I implement a cross-validation procedure, dividing the sample randomly into k subsamples of equal size. I then calculate the predicted probabilities for each k-th subsample with a model (fit.glm_s) developed by the same algorithm as for the whole-data model (stepwise forward selection etc.), but fitted on all but the k-th subsample. I store the predicted probabilities and build up my LIKM vector (see above) in the following way (sub indexes the held-out subsample, so predict() needs newdata=Data[sub,]):

LIKM[sub] <- predict(fit.glm_s, newdata=Data[sub,], type="response")

Now I use the same formulas/script as above; the only change is in how the LIKM vector is calculated:
BaseRate <- table(Data$Y)[[1]] / sum(table(Data$Y))
L0 <- (BaseRate*log(BaseRate) + (1-BaseRate)*log(1-BaseRate)) * sum(table(Data$Y))
# ... calculation of the cross-validated LIKM, see above ...
LM <- sum(Data$Y*log(LIKM) + (1-Data$Y)*log(1-LIKM))
R2 <- 1 - exp((2/n) * (L0 - LM))
R2_max <- 1 - exp((2/n) * L0)
R2_Nagelkerke <- R2 / R2_max
AUC <- auc(Data$Y, LIKM)

When I compare my results (using more simply developed models) with the validate method in the package "Design" (method="cross", B=10), it seems to me that I consistently underestimate the true expected Nagelkerke R-squared. Additionally, I am very unsure about the way I try to calculate a cross-validated AUC.

Do I have an error in my reasoning about how to easily obtain a cross-validated AUC and R-squared for a model developed to predict a binary outcome? I hope my problem is understandable and that you can help me.

Best regards,
Jürgen

--
-----------------------------------
Jürgen Biedermann
Bergmannstraße 3
10961 Berlin-Kreuzberg
Mobil: +49 176 247 54 354
Home: +49 30 250 11 713
e-mail: juergen.biedermann at gmail.com
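P.S. To make the question concrete, here is a self-contained toy version of what I am doing. All object names are my own, the data are simulated, a fixed model formula stands in for the stepwise selection, and the AUC is computed with the rank (Wilcoxon) formula instead of pROC so the example runs with base R only:

```r
set.seed(1)
n <- 200
Data <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
Data$Y <- rbinom(n, 1, plogis(-0.5 + Data$X1 + 0.5 * Data$X2))

## Apparent (full-sample) fit; in my real problem this step is a stepwise selection
fit.glm <- glm(Y ~ X1 + X2, data = Data, family = binomial)

## Log-likelihoods of the null model and the fitted model
p0  <- mean(Data$Y)                                  # base rate of Y = 1
LL0 <- n * (p0 * log(p0) + (1 - p0) * log(1 - p0))
phat <- predict(fit.glm, type = "response")
LLM <- sum(Data$Y * log(phat) + (1 - Data$Y) * log(1 - phat))

## Nagelkerke R^2 = Cox-Snell R^2 divided by its maximum
R2_cs  <- 1 - exp((2 / n) * (LL0 - LLM))
R2_max <- 1 - exp((2 / n) * LL0)
R2_Nagelkerke <- R2_cs / R2_max

## AUC via the rank (Mann-Whitney) formulation, base R only
auc_rank <- function(y, p) {
  r <- rank(p)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n0 * n1)
}
AUC <- auc_rank(Data$Y, phat)

## k-fold cross-validated predicted probabilities:
## refit on all but fold i, predict the held-out fold i
k <- 10
fold <- sample(rep(1:k, length.out = n))
p_cv <- numeric(n)
for (i in 1:k) {
  sub <- which(fold == i)
  fit_s <- glm(Y ~ X1 + X2, data = Data[-sub, ], family = binomial)
  p_cv[sub] <- predict(fit_s, newdata = Data[sub, ], type = "response")
}

## Same formulas as above, only LIKM (here p_cv) changes
LLM_cv <- sum(Data$Y * log(p_cv) + (1 - Data$Y) * log(1 - p_cv))
R2_cv  <- (1 - exp((2 / n) * (LL0 - LLM_cv))) / R2_max
AUC_cv <- auc_rank(Data$Y, p_cv)
```

On data like this the cross-validated values typically come out somewhat below the apparent (full-sample) ones, which is the pattern I also see with my real data.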