Kyle Werner
2009-Jul-17 03:48 UTC
[R] Getting the C-index for a dataset that was not used to generate the logistic model
Does anyone know how to get the C-index from a logistic model - not using the dataset that was used to train the model, but instead using a fresh dataset on the same model?

I have a dataset of 400 points that I've split into two halves, one for training the logistic model, and the other for evaluating it. The structure is as follows:

column headers are "got a loan" (dichotomous), "hourly income" (continuous), and "owns own home" (dichotomous)

The training data is
    trainingData[1,] = c(0,12,0)
    ...
etc

and the validation data is
    validationData[1,] = c(1,35,1)
    ...
etc

I use Prof. Harrell's excellent Design modules to perform a logistic regression on the training data like so:
    logit.lrm <- lrm(gotALoan ~ hourlyIncome+ownsHome, data=trainingData)
    lrm(formula = logit.lrm)$stats[6]
(output is C 0.8739827 - i.e., just the C-index)

I really like the ability to extract the C-index (or ROC AUC), because this is a factor that I find very helpful in comparing various models. However, I don't really want to get that from the data that the model was built on. Using that C-statistic would be cheating, in a sense, since I'm just testing the model on the data it was built against. I would rather get the C-statistic by applying the model I just generated to the other half of the data that I saved.

I have tried doing this:
    lrm(formula = logit.lrm,data=validationData)
However, this actually generates a new model (giving different coefficients to the variables). It doesn't simply apply the new data to the model from logit.lrm that I generated before.

So, can someone point me in the right direction for evaluating the model that I built with trainingData, but getting the C-statistic against my validationData?

Thanks so much,

Kyle Werner
Kyle Werner
2009-Jul-17 04:14 UTC
[R] Getting the C-index for a dataset that was not used to generate the logistic model
Does anyone know how to get the C-index from a logistic model - not using the dataset that was used to train the model, but instead using a fresh dataset on the same model?

I have a dataset of 400 points that I've split into two halves, one for training the logistic model, and the other for evaluating it. The structure is as follows:

column headers are "got a loan" (dichotomous), "hourly income" (continuous), and "owns own home" (dichotomous)

The training data is
    trainingData[1,] = c(0,12,0)
    ...
etc

and the validation data is
    validationData[1,] = c(1,35,1)
    ...
etc

I use Prof. Harrell's excellent Design modules to perform a logistic regression on the training data like so:
    logit.lrm <- lrm(gotALoan ~ hourlyIncome+ownsHome, data=trainingData)
    lrm(formula = logit.lrm)$stats[6]
(output is C 0.8739827 - i.e., just the C-index)

I really like the ability to extract the C-index (or ROC AUC), because this is a factor that I find very helpful in comparing various models. However, I don't really want to get that from the data that the model was built on. Using that C-statistic would be cheating, in a sense, since I'm just testing the model on the data it was built against. I would rather get the C-statistic by applying the model I just generated to the other half of the data that I saved.

I have tried doing this:
    lrm(formula = logit.lrm,data=validationData)
However, this actually generates a new model (giving different coefficients to the variables). It doesn't simply apply the new data to the model from logit.lrm that I generated before.

So, can someone point me in the right direction for evaluating the model that I built with trainingData, but getting the C-statistic against my validationData?

Thanks so much,

Kyle Werner

(Resending because I accidentally HTML formatted my original post so it was scrubbed.)
Frank E Harrell Jr
2009-Jul-17 04:18 UTC
[R] Getting the C-index for a dataset that was not used to generate the logistic model
Kyle Werner wrote:
> Does anyone know how to get the C-index from a logistic model - not using
> the dataset that was used to train the model, but instead using a fresh
> dataset on the same model?
>
> I have a dataset of 400 points that I've split into two halves, one for
> training the logistic model, and the other for evaluating it. The structure
> is as follows:

Kyle - I would not trust data splitting with N < 20,000.

> column headers are "got a loan" (dichotomous), "hourly income" (continuous),
> and "owns own home" (dichotomous)
> The training data is
> trainingData[1,] = c(0,12,0)
> ...
> etc
>
> and the validation data is
> validationData[1,] = c(1,35,1)
> ...
> etc
>
> I use Prof. Harrell's excellent Design modules to perform a logistic
> regression on the training data like so:
> logit.lrm <- lrm(gotALoan ~ hourlyIncome+ownsHome, data=trainingData)
> lrm(formula = logit.lrm)$stats[6]
> (output is C 0.8739827 - i.e., just the C-index)
>
> I really like the ability to extract the C-index (or ROC AUC), because this
> is a factor that I find very helpful in comparing various models. However, I
> don't really want to get that from the data that the model was built on.
> Using that C-statistic would be cheating, in a sense, since I'm just testing
> the model on the data it was built against. I would rather get the
> C-statistic by applying the model I just generated to the other half of the
> data that I saved.
>
> I have tried doing this:
> lrm(formula = logit.lrm,data=validationData)
> However, this actually generates a new model (giving different coefficients
> to the variables). It doesn't simply apply the new data to the model from
> logit.lrm that I generated before.

If you are just fitting a new model with the only predictor being the predicted log odds, it is true you will get a new slope and intercept, but this will not affect the c-index. So you can trust the output (for the c-index and other rank measures such as Dxy, tau, gamma).

Or use rcorr.cens(predict(fit, newdata), newdata$y) and use Dxy=2*(C-.5). You can use somers2( ) if you don't need the standard error.

Frank

> So, can someone point me in the right direction for evaluating the model
> that I built with trainingData, but getting the C-statistic against my
> validationData?
>
> Thanks so much,
>
> Kyle Werner

--
Frank E Harrell Jr   Professor and Chair   School of Medicine
Department of Biostatistics   Vanderbilt University
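A minimal sketch of the approach described in the reply, written in base R so it runs without extra packages. Several things here are assumptions and not part of the thread: the data are simulated stand-ins for the 400-row loan dataset, glm() stands in for lrm(), and c_index() is a hypothetical helper that computes the C-index from mid-ranks (the Wilcoxon/Mann-Whitney form). It scores the validation half with the training-half coefficients, derives Dxy from C, and then checks that refitting on the predicted log odds alone leaves C unchanged.

```r
# Simulated stand-in for the loan data (hypothetical values, not from the thread).
set.seed(1)
n <- 400
loans <- data.frame(
  hourlyIncome = round(runif(n, 8, 60)),
  ownsHome     = rbinom(n, 1, 0.4)
)
loans$gotALoan <- rbinom(
  n, 1, plogis(-4 + 0.1 * loans$hourlyIncome + 1.2 * loans$ownsHome)
)

trainingData   <- loans[1:200, ]
validationData <- loans[201:400, ]

# Fit on the training half only (glm is used here in place of lrm).
fit <- glm(gotALoan ~ hourlyIncome + ownsHome,
           family = binomial, data = trainingData)

# Predicted log odds for the validation half, using the training coefficients.
lp <- predict(fit, newdata = validationData, type = "link")

# Hypothetical helper: C-index from mid-ranks, so tied scores count 1/2.
c_index <- function(score, y) {
  r  <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Validation C-index and the corresponding Dxy = 2 * (C - 0.5).
C_valid   <- c_index(lp, validationData$gotALoan)
Dxy_valid <- 2 * (C_valid - 0.5)

# Refit on the validation half with the predicted log odds as the only
# predictor: intercept and slope are re-estimated, but as long as the
# slope is positive the ordering of the predictions, and hence C, is unchanged.
validationData$lp <- lp
refit   <- glm(gotALoan ~ lp, family = binomial, data = validationData)
C_refit <- c_index(predict(refit, type = "link"), validationData$gotALoan)

print(c(C_valid = C_valid, Dxy_valid = Dxy_valid, C_refit = C_refit))
```

With Hmisc loaded, somers2(lp, validationData$gotALoan) or the rcorr.cens() call quoted above would take the place of the hand-written c_index(); the helper is used here only to keep the sketch self-contained.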