Michal Figurski
2008-Jul-24 14:55 UTC
[R] [Fwd: Re: Coefficients of Logistic Regression from bootstrap - how to get them?]
Thank you Frank and all for your advice.

Here I attach the raw data from Pawinski's paper. I have obtained permission from the corresponding author to post it here for everyone. The only condition of use is that the authors retain ownership of the data, and any publication resulting from these data must be managed by them.

The dataset is composed as follows: patient number / MMF dose in [g] / day of study (since start of drug administration) / MPA concentrations [mg/L] in plasma at the following time points: 0, 0.5, ..., 12 hours / and the value of AUC(0-12h) calculated using all time points.

The goal of the analysis, as you can read in the paper, was to estimate the value of AUC using at most 3 time points within 2 hours post dose, that is, using only 3 of the 4 time points 0, 0.5, 1, 2 - but always including the "0" time point.

In my analysis of a similar problem I was also concerned about the fact that the data come from several visits of a single patient. I have examined the effect of "PT" with repeated "day" using a mixed-effects model, and these effects turned out to be insignificant. Do you guys think that is enough justification to use the dataset as if it came from 50 separate patients?

Also, as to estimation of the bias, variance, etc., Pawinski used CI and Sy/x. In my analysis I additionally used RMSE values. Please excuse another naive question, but: do you think that is sufficient information to compare between models and account for bias?

Regarding the "multiple stepwise regression" - according to the cited SPSS manual, there are 5 options to select from. I don't think they used the 'stepwise selection' option, because their models were already pre-defined. Variables were pre-selected based on knowledge of the pharmacokinetics of this drug and other factors. I think I understand this part pretty well.

I see Frank's point about recalibration in Fig. 2 - although the expectation was set that the prediction be within 15% of the original value.
In my opinion that is *very strict* - I actually used 20% in my work. This is because of the very high variability and imprecision in the results themselves. These are real biological data, and you have to account for errors such as analytical errors (HPLC method), timing errors and so on when you look at them. In other words, if you take two blood samples at each time point from a particular patient and run them, you will certainly get two distinct (although similar) profiles. You will get even more difference if you run one set of samples on one day and the other set on a second day.

Therefore the value of AUC(0-12) itself, to which we compare the predicted AUC, is not 'holy' - some variability here is inherent.

Nevertheless, I see that Fig. 2 may be incorrect if we look from an orthodox statistical perspective. I used the same plots in my work as well - it's too late now. How should I properly estimate the Rsq then?

I greatly appreciate your time and advice in this matter.

--
Michal J. Figurski

Frank E Harrell Jr wrote:
> Gustaf Rydevik wrote:
>> On Wed, Jul 23, 2008 at 4:08 PM, Michal Figurski
>> <figurski at mail.med.upenn.edu> wrote:
>>
>> Hi,
>>
>> I believe that you misunderstand the passage. Do you know what
>> multiple stepwise regression is?
>>
>> Since they used SPSS, I copied from
>> http://www.visualstatistics.net/SPSS%20workbook/stepwise_multiple_regression.htm
>>
>> "Stepwise selection is a combination of forward and backward procedures.
>>
>> Step 1
>> The first predictor variable is selected in the same way as in forward
>> selection. If the probability associated with the test of significance
>> is less than or equal to the default .05, the predictor variable with
>> the largest correlation with the criterion variable enters the
>> equation first.
>>
>> Step 2
>> The second variable is selected based on the highest partial
>> correlation. If it can pass the entry requirement (PIN=.05), it also
>> enters the equation.
>>
>> Step 3
>> From this point, stepwise selection differs from forward selection:
>> the variables already in the equation are examined for removal
>> according to the removal criterion (POUT=.10), as in backward
>> elimination.
>>
>> Step 4
>> Variables not in the equation are examined for entry. Variable
>> selection ends when no more variables meet the entry and removal
>> criteria."
>>
>> It is the outcome of this *entire process*, steps 1-4, that they
>> compare with the outcome of their *entire
>> bootstrap/cross-validation/selection process*, steps 1-4 in the
>> methods section, and find that their approach gives better results.
>> What you are doing is only step 4 in the article's method section:
>> estimating the parameters of a model *when you already know which
>> variables to include*. It is the way this step is conducted that I am
>> sceptical about.
>>
>> Regards,
>>
>> Gustaf
>
> Perfectly stated, Gustaf. This is a great example of needing to truly
> understand a method to be able to use it in the right context.
>
> After having read most of the paper by Pawinski et al. now, there are
> other problems.
>
> 1. The paper nowhere uses bootstrapping. It uses repeated 2-fold
> cross-validation, a procedure not usually recommended.
>
> 2. The resampling procedure used in the paper treated the 50
> pharmacokinetic profiles on 21 renal transplant patients as if these
> were from 50 patients. The cluster bootstrap should have been used
> instead.
>
> 3. Figure 2 showed the fitted regression line for the predicted vs.
> observed AUCs. It should have shown the line of identity instead. In
> other words, the authors allowed a subtle recalibration to creep into
> the analysis (and inverted the x- and y-variables in the plots). The
> fitted lines are far enough away from the line of identity to show
> that the predicted values are not well calibrated. The r^2 values
> claimed by the authors used the wrong formulas, which allowed an
> automatic after-the-fact recalibration (a new overall slope and
> intercept are estimated in the test dataset). Hence the achieved r^2
> are misleading.

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Dataset.csv
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20080724/f8ce0b2b/attachment.pl>
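The cluster bootstrap mentioned in point 2 above can be sketched in a few lines of R. This is a toy illustration only, not the paper's analysis: the data are simulated stand-ins for Dataset.csv, and the column names (PT for patient, C0/C1 for concentrations, AUC) are assumptions. The idea is to compare the bootstrap standard error of a coefficient when whole patients are resampled versus when the 50 profiles are resampled as if independent.

```r
## Sketch: ordinary vs. cluster bootstrap SE for clustered PK profiles.
## Simulated data stand in for Dataset.csv; column names are assumed.
set.seed(1)
npat  <- 21                                  # patients
nprof <- sample(2:3, npat, replace = TRUE)   # profiles per patient
PT    <- rep(seq_len(npat), nprof)
u     <- rep(rnorm(npat), nprof)             # shared patient effect
C0    <- rnorm(length(PT), 2, 0.5)
C1    <- rnorm(length(PT), 8, 2)
AUC   <- 10 + 3 * C0 + 2 * C1 + u + rnorm(length(PT))
d     <- data.frame(PT, C0, C1, AUC)

boot_se <- function(data, cluster = FALSE, B = 500) {
  coefs <- replicate(B, {
    if (cluster) {
      ## resample whole patients, keeping their profiles together
      ids <- sample(unique(data$PT), replace = TRUE)
      dd  <- do.call(rbind, lapply(ids, function(i) data[data$PT == i, ]))
    } else {
      ## resample individual profiles as if independent
      dd <- data[sample(nrow(data), replace = TRUE), ]
    }
    coef(lm(AUC ~ C0 + C1, data = dd))["C1"]
  })
  sd(coefs)   # bootstrap standard error of the C1 coefficient
}

se_ordinary <- boot_se(d, cluster = FALSE)
se_cluster  <- boot_se(d, cluster = TRUE)
c(ordinary = se_ordinary, cluster = se_cluster)
```

If the two standard errors are close, treating the 50 profiles as independent may be roughly defensible; if the cluster SE is clearly larger, the within-patient correlation matters and the ordinary bootstrap (or any row-level resampling) is too optimistic.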
Frank E Harrell Jr
2008-Jul-24 18:38 UTC
[R] [Fwd: Re: Coefficients of Logistic Regression from bootstrap - how to get them?]
Michal Figurski wrote:
> Thank you Frank and all for your advice.
>
> [description of the attached dataset and the goal of the analysis
> snipped]
>
> In my analysis of a similar problem I was also concerned about the
> fact that the data come from several visits of a single patient. I
> have examined the effect of "PT" with repeated "day" using a
> mixed-effects model, and these effects turned out to be
> insignificant. Do you guys think that is enough justification to use
> the dataset as if it came from 50 separate patients?

I don't think that is the way to assess it; rather, the intra-subject correlation should be estimated. Or compare variances from the cluster bootstrap and the ordinary bootstrap.

> Also, as to estimation of the bias, variance, etc., Pawinski used CI
> and Sy/x. In my analysis I additionally used RMSE values. Please
> excuse another naive question, but: do you think that is sufficient
> information to compare between models and account for bias?

RMSE is usually a good approach.

> Regarding the "multiple stepwise regression" - according to the cited
> SPSS manual, there are 5 options to select from. I don't think they
> used the 'stepwise selection' option, because their models were
> already pre-defined. Variables were pre-selected based on knowledge
> of the pharmacokinetics of this drug and other factors. I think I
> understand this part pretty well.
>
> [discussion of the 15% expectation and measurement variability
> snipped]
>
> Nevertheless, I see that Fig. 2 may be incorrect if we look from an
> orthodox statistical perspective. I used the same plots in my work as
> well - it's too late now. How should I properly estimate the Rsq then?

Validation Rsq is 1 - (sum of squared errors)/(total sum of squares).

Frank

> I greatly appreciate your time and advice in this matter.
>
> --
> Michal J. Figurski
>
> [earlier quoted exchange between Frank E Harrell Jr and Gustaf
> Rydevik snipped]

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
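Frank's validation Rsq, and its contrast with the "recalibrated" r^2 he objects to, can be written out in a few lines of R. This is a minimal sketch with made-up predicted and observed AUC vectors (nothing here comes from the paper's data); it also computes the RMSE that Michal mentions.

```r
## Validation Rsq as defined above: 1 - SSE/SST, computed on the test
## data with NO refitting of slope or intercept. Data are simulated.
set.seed(2)
observed  <- rnorm(50, mean = 60, sd = 15)            # made-up observed AUC(0-12)
predicted <- 5 + 0.85 * observed + rnorm(50, sd = 6)  # deliberately miscalibrated

sse <- sum((observed - predicted)^2)
sst <- sum((observed - mean(observed))^2)
rsq_validation <- 1 - sse / sst

## RMSE as an accuracy summary
rmse <- sqrt(mean((observed - predicted)^2))

## The misleading alternative: squaring the correlation silently refits
## a new slope and intercept - the after-the-fact recalibration.
rsq_recalibrated <- cor(observed, predicted)^2

c(validation = rsq_validation, recalibrated = rsq_recalibrated, RMSE = rmse)
```

Because the correlation minimizes the residuals over a refitted slope and intercept, rsq_recalibrated can never be smaller than rsq_validation; with biased predictions like these it is visibly larger. The validation Rsq (and a plot of predictions against the line of identity, rather than against a fitted line) correctly penalizes the miscalibration.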