Dear List,

I am familiar with binary models; however, I am now trying to get predictions from an ordinal model and have a question.

I have a data set made up of 12 categorical predictors. The response variable is classed as 1, 2, 3, 4, 5 or 6, which corresponds to the threat level of the species (on the IUCN rating).

Previously I combined levels 1 and 2 to form "non-threatened" and levels 3-6 to form "threatened", and ran a binary model. I tested its performance with the Brier score (0.198) and the AUC or C value (0.75). I also partitioned the data set into training and test data and used the predict function to get predicted probabilities for the new data. When visualising the results with a cutoff value calculated with Epi, the prediction was correct roughly 75% of the time.

Now I am interested in predicting the threat level of a species not merely as threatened or not, but at the specific IUCN levels. I have used predict.lrm (predict.lrm(model1, type="fitted.ind")) to generate probabilities for each level, and also predict(model1, traist, type="fitted"); see below.

When I call the model, the Brier score is 0.201 and the C value 0.677. However, when I inspect the output and relate it to the corresponding species (for which I know the true IUCN rating), the model performs very badly, getting only 43% correct. Interestingly, I have noticed the probabilities are always highest for levels 1 and 6; on no occasion do levels 2, 3, 4 or 5 have high probabilities.

I am unsure whether this is just because the model cannot discriminate between the various levels due to insufficient data, or whether I am doing something wrong. If it is the latter, I would greatly appreciate any advice or the suggestion of a different method.

Thanks in advance,

Chris

model1 <- lrm(EXTINCTION ~ BS*FR + WO + LIF + REG + ALT + BIO + HAB + PD + SEA, x=TRUE, y=TRUE)

predict.lrm(model1, type="fitted.ind")

   EXTINCTION=1 EXTINCTION=2 EXTINCTION=3 EXTINCTION=4 EXTINCTION=5 EXTINCTION=6
1    0.19748393   0.05895033   0.12811186  0.086140778  0.068137104   0.46117600
2    0.27790178   0.07247496   0.14384976  0.087487677  0.064584865   0.35370096
3    0.24605628   0.06777939   0.13931242  0.087996215  0.066625303   0.39223038
4    0.24605628   0.06777939   0.13931242  0.087996215  0.066625303   0.39223038
5    0.24605628   0.06777939   0.13931242  0.087996215  0.066625303   0.39223038
6    0.24605628   0.06777939   0.13931242  0.087996215  0.066625303   0.39223038
7    0.13928899   0.04558050   0.10636220  0.077770389  0.065500459   0.56549746
8    0.24605628   0.06777939   0.13931242  0.087996215  0.066625303   0.39223038
9    0.24605628   0.06777939   0.13931242  0.087996215  0.066625303   0.39223038
10   0.33865077   0.07915126   0.14744522  0.083923247  0.059387585   0.29144192

predict(model1, traist, type="fitted")

         y>=2       y>=3       y>=4       y>=5       y>=6
1  0.80251607 0.74356575 0.61545388 0.52931311 0.46117600
2  0.72209822 0.64962327 0.50577351 0.41828583 0.35370096
3  0.75394372 0.68616432 0.54685190 0.45885569 0.39223038
4  0.75394372 0.68616432 0.54685190 0.45885569 0.39223038
5  0.75394372 0.68616432 0.54685190 0.45885569 0.39223038
6  0.75394372 0.68616432 0.54685190 0.45885569 0.39223038
7  0.86071101 0.81513051 0.70876831 0.63099792 0.56549746
8  0.75394372 0.68616432 0.54685190 0.45885569 0.39223038
9  0.75394372 0.68616432 0.54685190 0.45885569 0.39223038
10 0.66134923 0.58219797 0.43475276 0.35082951 0.29144192
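(An illustrative aside, not part of the original post: one minimal way to reduce the type="fitted.ind" probabilities to a single predicted category is to pick the most probable level for each species. This sketch assumes model1 is the lrm fit shown above.

probs <- predict(model1, type="fitted.ind")  # one column of probabilities per level
pred  <- max.col(probs)                      # column index of the largest probability
table(pred)                                  # how often each level gets predicted

With probabilities shaped like those above, peaked at levels 1 and 6, max.col can never choose levels 2-5, which is consistent with the low percent-correct figure reported above.)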
You sent a private note about this which I just took the time to answer. Please send only one note, and please post my reply to you to r-help.

Frank

-----
Frank Harrell
Department of Biostatistics, Vanderbilt University
On Sep 20, 2010, at 15:50, Frank Harrell wrote:

> You sent a private note about this which I just took the time to answer.
> Please send only one note, and please post my reply to you to r-help.

I have been annoyed by this at times as well. However, I have come to suspect that it is actually the mailer (computer program) rather than the sender (human being) that is doing this. It appears that _sometimes_, depending on the phase of the moon and the water levels in Lake Wivenhoe, as well as perhaps other circumstances, a wide reply gets sent as two separate mails: one to the original sender and another to the Cc: list. This has the familiar effect that you follow up in a private mail, only to see the question go public an instant later.

On principle, I rarely do follow-ups in private, so it is not that much of an issue for me.

-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
Thanks Frank,

I have one small question regarding this. I understand you are very busy, and if you can't answer I would greatly appreciate any thoughts from the list.

> Split-sample validation is not reliable unless you have say 10,000 samples to begin with

I am a little confused. When I ran the model with a binary response I had a Brier score of 0.199 and a C-value of 0.73, and when I looked at the data, 75% of the predictions were "correct". When I ran the model with an ordinal response, the Brier score was 0.201 and the C-value 0.677, which to me, albeit with limited knowledge in the area, suggests the model performs worse, but not by a great deal. However, when I compare the predicted mean values with the "real" data I get only 45% accuracy; it is this discrepancy I don't understand.

I appreciate you need a large dataset to partition it, but surely if the C-value and Brier score are similar then the number predicted "correct" should also be similar?

Thanks

Chris

On 20 Sep 2010, at 14:46, Frank Harrell wrote:

A few comments; sorry I don't have time for any more.

- Combining categories is almost always a bad idea.
- It can be harder to discriminate more categories, but that's only because the task is more difficult.
- Split-sample validation is not reliable unless you have say 10,000 samples to begin with; otherwise, 100 repeats of 10-fold cross-validation, or (faster) 200 or so bootstraps, is better. Be sure to re-do all modeling decisions from scratch for each re-sample; if you did no statistical variable selection this may not be an issue.
- You don't need ".lrm" after predict.
- Not sure why your variable names are all upper case; they are harder to read this way.

Good luck
Frank

Frank E Harrell Jr
Professor and Chairman
School of Medicine
Department of Biostatistics
Vanderbilt University

On Mon, 20 Sep 2010, Chris Mcowen wrote:

> [Chris's original question and output, quoted in full; see the first message above.]
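(An illustrative aside: the resampling validation Frank describes can be run with the validate function in the Design package (the predecessor of rms), which re-fits the model on each resample. A minimal sketch, assuming model1 was fitted with x=TRUE, y=TRUE as above, and using the roughly 200 bootstrap repetitions he suggests:

library(Design)                                # provides lrm() and validate()
val <- validate(model1, method="boot", B=200)  # optimism-corrected indexes
val                                            # includes Dxy and the Brier score B
(val["Dxy", "index.corrected"] + 1) / 2        # corrected C-index, since C = (Dxy + 1)/2

Any variable-selection or other modeling decisions would need to be built into the resampling, as Frank notes; validate alone only re-fits the stated model.)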
That's great, thanks.

I guess it is hard not to use % correct as a performance measure when that is what is commonly used in everyday life.

So when I come to predicting the response for new data (using the estimated mean Y), which I am more comfortable with, I can say:

Species A - 2.12 - therefore this is category 2
Species B - 2.72 - therefore this is category 3

(On a side note, I had no species with a rating of 6, the upper category?)

The problem comes in explaining this to my peers who are not statistically minded. What you are saying is that if I bootstrap and report the C-values and Brier scores etc., this is sufficient to give an indication of confidence?

On 22 Sep 2010, at 12:36, Frank Harrell wrote:

% correct is an improper scoring rule, and a discontinuous one to boot, so it will not always agree with more proper scoring rules. When you have a more difficult task, e.g. discriminating more categories, indexes such as the generalized c-index that utilize all the categories will recognize the difficulty of the task and give a lower value. No cause for alarm.

Frank

-----
Frank Harrell
Department of Biostatistics, Vanderbilt University
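(An illustrative aside: the estimated mean Y that Chris refers to can be obtained directly from predict for lrm fits. A minimal sketch, assuming the response is coded numerically (1-6) so that a mean is meaningful, and that the installed version of the package supports type="mean" in predict.lrm:

m <- predict(model1, type="mean")  # expected value E(Y) of the ordinal response
round(m)                           # nearest-category reading, e.g. 2.72 -> 3

A mean near 6 would require almost all of the probability mass to sit on the top level, which may be why no species reached the upper category.)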