thr3ads.net - R help - [R] Getting the C-index for a dataset that was not used to generate the logistic model [Jul 2009]

Kyle Werner

2009-Jul-17 03:48 UTC

[R] Getting the C-index for a dataset that was not used to generate the logistic model

Does anyone know how to get the C-index from a logistic model - not using
the dataset that was used to train the model, but instead using a fresh
dataset on the same model?

I have a dataset of 400 points that I've split into two halves, one for
training the logistic model, and the other for evaluating it. The structure
is as follows:

Column headers are "got a loan" (dichotomous), "hourly income" (continuous),
and "owns own home" (dichotomous).
The training data is
trainingData[1,] = c(0,12,0)
...
etc

and the validation data is
validationData[1,] = c(1,35,1)
...
etc

I use Prof. Harrell's excellent Design modules to perform a logistic
regression on the training data like so:
logit.lrm <- lrm(gotALoan ~ hourlyIncome+ownsHome, data=trainingData)
lrm(formula = logit.lrm)$stats[6]
(output is C 0.8739827 - i.e., just the C-index)

I really like the ability to extract the C-index (or ROC AUC), because this
is a factor that I find very helpful in comparing various models. However, I
don't really want to get that from the data that the model was built on.
Using that C-statistic would be cheating, in a sense, since I'm just testing
the model on the data it was built against. I would rather get the
C-statistic by applying the model I just generated to the other half of the
data that I saved.

I have tried doing this:
lrm(formula = logit.lrm,data=validationData)
However, this actually generates a new model (giving different coefficients
to the variables). It doesn't simply apply the new data to the model from
logit.lrm that I generated before.

So, can someone point me in the right direction for evaluating the model
that I built with trainingData, but getting the C-statistic against my
validationData?

Thanks so much,

Kyle Werner


Kyle Werner

2009-Jul-17 04:14 UTC

(Resending because I accidentally HTML formatted my original post so
it was scrubbed.)

Frank E Harrell Jr

2009-Jul-17 04:18 UTC

Kyle Werner wrote:
> Does anyone know how to get the C-index from a logistic model - not using
> the dataset that was used to train the model, but instead using a fresh
> dataset on the same model?
> 
> I have a dataset of 400 points that I've split into two halves, one for
> training the logistic model, and the other for evaluating it. The structure
> is as follows:
Kyle - I would not trust data splitting with N < 20,000.
>
> column headers are "got a loan" (dichotomous), "hourly income" (continuous),
> and "owns own home" (dichotomous)
> The training data is
> trainingData[1,] = c(0,12,0)
> ...
> etc
>
> and the validation data is
> validationData[1,] = c(1,35,1)
> ...
> etc
>
> I use Prof. Harrell's excellent Design modules to perform a logistic
> regression on the training data like so:
> logit.lrm <- lrm(gotALoan ~ hourlyIncome+ownsHome, data=trainingData)
> lrm(formula = logit.lrm)$stats[6]
> (output is C 0.8739827 - i.e., just the C-index)
>
> I really like the ability to extract the C-index (or ROC AUC), because this
> is a factor that I find very helpful in comparing various models. However, I
> don't really want to get that from the data that the model was built on.
> Using that C-statistic would be cheating, in a sense, since I'm just testing
> the model on the data it was built against. I would rather get the
> C-statistic by applying the model I just generated to the other half of the
> data that I saved.
>
> I have tried doing this:
> lrm(formula = logit.lrm,data=validationData)
> However, this actually generates a new model (giving different coefficients
> to the variables). It doesn't simply apply the new data to the model from
> logit.lrm that I generated before.
If you are just fitting a new model with the only predictor being the 
predicted log odds, it is true you will get a new slope and intercept, 
but this will not affect the c-index.  So you can trust the output (for 
the c-index and other rank measures such as Dxy, tau, gamma).
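
A minimal sketch of why such a refit leaves the c-index alone: refitting on the predicted log odds only applies a monotone transformation to them, so the ordering of the observations (and hence C) is unchanged, provided the refitted slope is positive. The example uses simulated data and base R's glm() as a stand-in for lrm(); the simulated coefficients and the cIndex() helper are illustrative assumptions, not taken from the original post.

```r
# Simulated stand-in for the loan data (hypothetical values, not the poster's data).
set.seed(1)
n <- 400
dat <- data.frame(
  hourlyIncome = round(runif(n, 8, 60)),
  ownsHome     = rbinom(n, 1, 0.4)
)
dat$gotALoan <- rbinom(n, 1,
                       plogis(-4 + 0.1 * dat$hourlyIncome + 1.2 * dat$ownsHome))

trainRows      <- sample(n, n / 2)
trainingData   <- dat[trainRows, ]
validationData <- dat[-trainRows, ]

# C-index from the rank-sum (Mann-Whitney) formula; tied predictions count 1/2.
cIndex <- function(pred, y) {
  r  <- rank(pred)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Model fitted on the training half only (glm() used here in place of lrm()).
fit <- glm(gotALoan ~ hourlyIncome + ownsHome,
           family = binomial, data = trainingData)

# Predicted log odds for the validation half, using the training coefficients.
validationData$lp <- predict(fit, newdata = validationData)

# Refit on the validation half with the predicted log odds as the only predictor.
refit <- glm(gotALoan ~ lp, family = binomial, data = validationData)

cRaw   <- cIndex(validationData$lp, validationData$gotALoan)
cRefit <- cIndex(fitted(refit), validationData$gotALoan)

print(coef(refit))                       # new intercept and slope
print(c(raw = cRaw, refit = cRefit))     # the two C values agree
```

Note that this is not the same as refitting the original formula on the validation data: that re-estimates a separate coefficient for each predictor, which can change the ranking.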

Or use rcorr.cens(predict(fit, newdata), newdata$y) and use 
Dxy=2*(C-.5).  You can use somers2( ) if you don't need the standard error.
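
The predict-then-score route might look like the following sketch. It uses simulated data and glm() in place of lrm() (both assumptions made to keep the example self-contained), scores the held-out half with the training coefficients, computes C from the rank-sum formula, and converts it with Dxy = 2*(C - .5). The cIndex() helper is hypothetical; if Hmisc happens to be installed, somers2() is called on the same inputs for comparison.

```r
# Simulated stand-in for the loan data (hypothetical values, not the poster's data).
set.seed(2009)
n <- 400
dat <- data.frame(
  hourlyIncome = round(runif(n, 8, 60)),
  ownsHome     = rbinom(n, 1, 0.4)
)
dat$gotALoan <- rbinom(n, 1,
                       plogis(-4 + 0.1 * dat$hourlyIncome + 1.2 * dat$ownsHome))

trainRows      <- sample(n, n / 2)
trainingData   <- dat[trainRows, ]
validationData <- dat[-trainRows, ]

# Fit on the training half only (glm() used here in place of lrm()).
fit <- glm(gotALoan ~ hourlyIncome + ownsHome,
           family = binomial, data = trainingData)

# Predicted log odds for the validation rows; no coefficients are re-estimated.
lpNew <- predict(fit, newdata = validationData)

# C-index from the rank-sum (Mann-Whitney) formula; tied predictions count 1/2.
cIndex <- function(pred, y) {
  r  <- rank(pred)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

C   <- cIndex(lpNew, validationData$gotALoan)
Dxy <- 2 * (C - 0.5)
print(c(C = C, Dxy = Dxy))

# Optional cross-check, run only when Hmisc is available.
# rcorr.cens(lpNew, validationData$gotALoan) is the variant that also
# reports a standard error.
if (requireNamespace("Hmisc", quietly = TRUE)) {
  print(Hmisc::somers2(lpNew, validationData$gotALoan))
}
```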

Frank
>
> So, can someone point me in the right direction for evaluating the model
> that I built with trainingData, but getting the C-statistic against my
> validationData?
>
> Thanks so much,
>
> Kyle Werner
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University
