Jürgen Biedermann
2011-Apr-12 10:01 UTC
[R] cross-validation complex model AUC Nagelkerke R squared code
Hi there,

I really tried hard to understand this and to find my own solution, but now I think I have to ask for your help. I have already developed some script code for my problem, but I doubt that it is correct.

The problem is the following: imagine you develop a logistic regression model with a binary outcome Y (0/1) and possible predictors (X1, X2, X3, ...). The development of the final model is quite complex and involves several steps (stepwise forward selection with LR-test statistics, incorporating interaction effects, etc.). The final prediction at the end, however, is made through a glm object (called fit.glm). Then, I think, it is no problem to calculate a Nagelkerke R-squared measure and an AUC value (for example with the pROC package) along the lines of this script (n is the sample size; L0 and LM are the log-likelihoods of the null model and the fitted model, so the likelihood-scale formulas R2 = 1 - (L(0)/L(M))^(2/n) and R2_max = 1 - L(0)^(2/n) become exp() expressions):

BaseRate <- table(Data$Y)[[1]] / sum(table(Data$Y))
L0 <- (BaseRate*log(BaseRate) + (1-BaseRate)*log(1-BaseRate)) * sum(table(Data$Y))  # log-likelihood of the null model
LIKM <- predict(fit.glm, type="response")
LM <- sum(Data$Y*log(LIKM) + (1-Data$Y)*log(1-LIKM))  # log-likelihood of the fitted model
R2 <- 1 - exp((2/n) * (L0 - LM))   # Cox-Snell R^2
R2_max <- 1 - exp((2/n) * L0)
R2_Nagelkerke <- R2 / R2_max
library(pROC)
AUC <- auc(Data$Y, LIKM)

I checked this calculation of R2_Nagelkerke and the AUC value against the built-in calculation in the package "Design" and got consistent results.

Now I implement a cross-validation procedure, dividing the sample randomly into k subsamples of equal size. I then calculate the predicted probabilities for each k-th subsample with a model (fit.glm_s) developed by the same algorithm as for the whole-data model (stepwise forward selection etc.), but fitted on all but the k-th subsample. I store the predicted probabilities and build up my LIKM vector (see above) in the following way (sub indexes the held-out subsample, so predict() needs newdata=Data[sub,]):

LIKM[sub] <- predict(fit.glm_s, newdata=Data[sub,], type="response")

Now I use the same formulas/script as above; the only change is in how the LIKM vector is calculated:
BaseRate <- table(Data$Y)[[1]] / sum(table(Data$Y))
L0 <- (BaseRate*log(BaseRate) + (1-BaseRate)*log(1-BaseRate)) * sum(table(Data$Y))
# ... calculation of the cross-validated LIKM, see above ...
LM <- sum(Data$Y*log(LIKM) + (1-Data$Y)*log(1-LIKM))
R2 <- 1 - exp((2/n) * (L0 - LM))
R2_max <- 1 - exp((2/n) * L0)
R2_Nagelkerke <- R2 / R2_max
AUC <- auc(Data$Y, LIKM)

When I compare my results (using more simply developed models) with the validate method in the package "Design" (method="cross", B=10), it seems to me that I consistently underestimate the true expected Nagelkerke R-squared. Additionally, I am very unsure about the way I try to calculate a cross-validated AUC.

Do I have an error in my reasoning about how to easily obtain a cross-validated AUC and R-squared for a model developed to predict a binary outcome? I hope my problem is understandable and that you can help me.

Best regards,
Jürgen

--
-----------------------------------
Jürgen Biedermann
Bergmannstraße 3
10961 Berlin-Kreuzberg
Mobil: +49 176 247 54 354
Home: +49 30 250 11 713
e-mail: juergen.biedermann at gmail.com
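P.S. To make the question concrete, here is a self-contained toy version of what I am doing. All object names are my own, the data are simulated, a fixed model formula stands in for the stepwise selection, and the AUC is computed with the rank (Wilcoxon) formula instead of pROC so the example runs with base R only:

```r
set.seed(1)
n <- 200
Data <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
Data$Y <- rbinom(n, 1, plogis(-0.5 + Data$X1 + 0.5 * Data$X2))

## Apparent (full-sample) fit; in my real problem this step is a stepwise selection
fit.glm <- glm(Y ~ X1 + X2, data = Data, family = binomial)

## Log-likelihoods of the null model and the fitted model
p0  <- mean(Data$Y)                                  # base rate of Y = 1
LL0 <- n * (p0 * log(p0) + (1 - p0) * log(1 - p0))
phat <- predict(fit.glm, type = "response")
LLM <- sum(Data$Y * log(phat) + (1 - Data$Y) * log(1 - phat))

## Nagelkerke R^2 = Cox-Snell R^2 divided by its maximum
R2_cs  <- 1 - exp((2 / n) * (LL0 - LLM))
R2_max <- 1 - exp((2 / n) * LL0)
R2_Nagelkerke <- R2_cs / R2_max

## AUC via the rank (Mann-Whitney) formulation, base R only
auc_rank <- function(y, p) {
  r <- rank(p)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n0 * n1)
}
AUC <- auc_rank(Data$Y, phat)

## k-fold cross-validated predicted probabilities:
## refit on all but fold i, predict the held-out fold i
k <- 10
fold <- sample(rep(1:k, length.out = n))
p_cv <- numeric(n)
for (i in 1:k) {
  sub <- which(fold == i)
  fit_s <- glm(Y ~ X1 + X2, data = Data[-sub, ], family = binomial)
  p_cv[sub] <- predict(fit_s, newdata = Data[sub, ], type = "response")
}

## Same formulas as above, only LIKM (here p_cv) changes
LLM_cv <- sum(Data$Y * log(p_cv) + (1 - Data$Y) * log(1 - p_cv))
R2_cv  <- (1 - exp((2 / n) * (LL0 - LLM_cv))) / R2_max
AUC_cv <- auc_rank(Data$Y, p_cv)
```

On data like this the cross-validated values typically come out somewhat below the apparent (full-sample) ones, which is the pattern I also see with my real data.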