Michal Figurski
2008-Jul-21 15:04 UTC
[R] Coefficients of Logistic Regression from bootstrap - how to get them?
Hello all,

I am trying to optimize my logistic regression model by using the bootstrap. I was previously using SAS for this kind of task, but I am now switching to R.

My data frame consists of 5 columns and has 109 rows. Each row is a single record composed of the following values: Subject_name, numeric1, numeric2, numeric3 and outcome (yes or no). All three numerics are used to predict outcome using LR.

In SAS I had written a macro that split the dataset, ran LR on one half of the data, and made predictions on the second half. It then collected the equation coefficients from each iteration of the bootstrap. Afterwards I simply took the medians of these coefficients across all iterations and used them as an optimal model - it really worked well!

Now I want to do the same in R. I tried to use the 'validate' and 'calibrate' functions from package "Design", and I also experimented with the function 'sm.binomial.bootstrap' from package "sm". I also tried the function 'boot' from package "boot", though without success - in my case it randomly selected _columns_ from my data frame, while I wanted it to select _rows_.

The main point here, though, is the optimized LR equation. I would appreciate any help on how to extract the LR equation coefficients from any of these bootstrap functions, in the same form as given by 'glm' or 'lrm'.

Many thanks in advance!

--
Michal J. Figurski
HUP, Pathology & Laboratory Medicine
Xenobiotics Toxicokinetics Research Laboratory
3400 Spruce St. 7 Maloney
Philadelphia, PA 19104
tel. (215) 662-3413
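The row-resampling the poster describes can be done with boot() directly: the statistic function is called with row indices, which is how you make 'boot' resample rows rather than columns. A minimal sketch, with simulated stand-in data and hypothetical column names (num1..num3, outcome) in place of the poster's:

```r
# Sketch: bootstrap ROWS of a data frame and collect logistic-regression
# coefficients, then take their medians as in the SAS macro described above.
# 'dat' and its column names are placeholders, not the poster's real data.
library(boot)

set.seed(1)
dat <- data.frame(num1 = rnorm(109), num2 = rnorm(109), num3 = rnorm(109))
dat$outcome <- rbinom(109, 1, plogis(0.5 * dat$num1 - 0.3 * dat$num2))

coef_fun <- function(data, idx) {
  # 'idx' indexes rows: each bootstrap replicate fits on a row-resampled set
  fit <- glm(outcome ~ num1 + num2 + num3, family = binomial,
             data = data[idx, ])
  coef(fit)
}

b <- boot(dat, coef_fun, R = 200)
# b$t is an R x 4 matrix of coefficients; column medians give the
# "median-coefficient" equation the poster wants
apply(b$t, 2, median)
```

The key design point is that boot() always passes indices into the first dimension of the data object, so handing it a data frame and subsetting with `data[idx, ]` resamples records, not variables.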
Frank E Harrell Jr
2008-Jul-21 18:41 UTC
[R] Coefficients of Logistic Regression from bootstrap - how to get them?
Michal Figurski wrote:
> Hello all,
>
> I am trying to optimize my logistic regression model by using bootstrap.
> I was previously using SAS for this kind of tasks, but I am now
> switching to R.
>
> My data frame consists of 5 columns and has 109 rows. Each row is a
> single record composed of the following values: Subject_name, numeric1,
> numeric2, numeric3 and outcome (yes or no). All three numerics are used
> to predict outcome using LR.
>
> In SAS I have written a macro, that was splitting the dataset, running
> LR on one half of data and making predictions on second half. Then it
> was collecting the equation coefficients from each iteration of
> bootstrap. Later I was just taking medians of these coefficients from
> all iterations, and used them as an optimal model - it really worked well!

Why not use maximum likelihood estimation, i.e., the coefficients from
the original fit? How does the bootstrap improve on that?

> Now I want to do the same in R. I tried to use the 'validate' or
> 'calibrate' functions from package "Design", and I also experimented
> with function 'sm.binomial.bootstrap' from package "sm". I tried also
> the function 'boot' from package "boot", though without success - in my
> case it randomly selected _columns_ from my data frame, while I wanted
> it to select _rows_.

validate and calibrate in Design do resampling on the rows. Resampling
is mainly used to get a nearly unbiased estimate of the model
performance, i.e., to correct for overfitting.

Frank Harrell

> Though the main point here is the optimized LR equation. I would
> appreciate any help on how to extract the LR equation coefficients from
> any of these bootstrap functions, in the same form as given by 'glm' or
> 'lrm'.
>
> Many thanks in advance!

--
Frank E Harrell Jr   Professor and Chair
Department of Biostatistics, School of Medicine, Vanderbilt University
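The validate/calibrate route Frank points to looks roughly like this. A sketch with simulated stand-in data; 'Design' was the package name in 2008, its successor is 'rms', and the lrm()/validate() calls are the same:

```r
# Sketch: bootstrap validation of a logistic model with lrm() + validate(),
# which resamples ROWS and reports optimism-corrected performance indexes.
# Data below are simulated placeholders, not the poster's dataset.
library(rms)  # on 2008-era R: library(Design)

set.seed(1)
d <- data.frame(x1 = rnorm(109), x2 = rnorm(109), x3 = rnorm(109))
d$y <- rbinom(109, 1, plogis(d$x1))

# x = TRUE, y = TRUE store the design matrix so validate() can resample
fit <- lrm(y ~ x1 + x2 + x3, data = d, x = TRUE, y = TRUE)
v <- validate(fit, B = 100)  # optimism-corrected Dxy, R2, slope, ...
v["Dxy", "index.corrected"]
```

Note that validate() does not hand back a "better" set of coefficients; it quantifies how much the apparent performance of the original maximum-likelihood fit is inflated by overfitting.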
Gustaf Rydevik
2008-Jul-23 14:39 UTC
[R] Coefficients of Logistic Regression from bootstrap - how to get them?
On Wed, Jul 23, 2008 at 4:08 PM, Michal Figurski
<figurski at mail.med.upenn.edu> wrote:
> Gustaf,
>
> I am sorry, but I don't get the point. Let's just focus on predictive
> performance from the cited passage, that is the number of values
> predicted within 15% of the original value.
> So, the predictive performance of the model fit on the entire dataset
> was 56% of profiles, while from the bootstrapped model it was 82% of
> profiles. Well - I see a stunning purpose in the bootstrap step here: it
> turns a useless equation into a clinically applicable model!
>
> Honestly, I also can't see how this can be better than fitting on the
> entire dataset, but here you have proof that it is.
>
> I think that another argument supporting this approach is model
> validation. If you fit the model on the entire data, you have no data
> left to validate its predictions.
>
> On the other hand, I agree with you that the passage in the methods
> section looks awkward.
>
> In my work on a similar problem, which is going to appear in August in
> Ther Drug Monit, I used medians from the beginning, and all the
> comparisons were done based on models with median coefficients. I think
> this is what the authors of that paper did, though they might just have
> had a problem with describing it correctly, and unfortunately it passed
> through the review process unchanged.

Hi,

I believe that you misunderstand the passage. Do you know what multiple
stepwise regression is? Since they used SPSS, I copied from
http://www.visualstatistics.net/SPSS%20workbook/stepwise_multiple_regression.htm

"Stepwise selection is a combination of forward and backward procedures.

Step 1
The first predictor variable is selected in the same way as in forward
selection. If the probability associated with the test of significance is
less than or equal to the default .05, the predictor variable with the
largest correlation with the criterion variable enters the equation first.
Step 2
The second variable is selected based on the highest partial correlation.
If it can pass the entry requirement (PIN=.05), it also enters the
equation.

Step 3
From this point, stepwise selection differs from forward selection: the
variables already in the equation are examined for removal according to
the removal criterion (POUT=.10), as in backward elimination.

Step 4
Variables not in the equation are examined for entry. Variable selection
ends when no more variables meet the entry and removal criteria."

-----------

It is the outcome of this *entire process*, Steps 1-4, that they compare
with the outcome of their *entire bootstrap/crossvalidation/selection
process*, Steps 1-4 in the methods section, and they find that their
approach gives the better result.

What you are doing is only Step 4 of the article's method section:
estimating the parameters of a model *when you already know which
variables to include*. It is the way this step is conducted that I am
sceptical about.

Regards,

Gustaf

--
Gustaf Rydevik, M.Sci.
tel: +46(0)703 051 451
address: Essingetorget 40, 112 66 Stockholm, SE
skype: gustaf_rydevik
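For comparison, R's built-in analogue of the stepwise procedure described above is step(), which adds and drops terms in both directions (though it uses AIC rather than SPSS's p-value entry/removal thresholds). A sketch on simulated stand-in data:

```r
# Sketch: bidirectional stepwise selection on a logistic model with step().
# Note: step() uses AIC, not the PIN/POUT p-value criteria SPSS uses, so
# it is an analogue of Steps 1-4 above, not an exact reimplementation.
set.seed(1)
dat <- data.frame(num1 = rnorm(109), num2 = rnorm(109), num3 = rnorm(109))
dat$outcome <- rbinom(109, 1, plogis(dat$num1))  # only num1 is informative

full <- glm(outcome ~ num1 + num2 + num3, family = binomial, data = dat)
sel <- step(full, direction = "both", trace = 0)  # silent add/drop steps
formula(sel)  # the selected model
```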
Michal Figurski
2008-Jul-23 16:22 UTC
[R] Coefficients of Logistic Regression from bootstrap - how to get them?
Thank you all for your words of wisdom. I am starting to understand what
you mean by bootstrap. Not surprisingly, it seems to be something other
than what I do.

The bootstrap is a tool, and I would rather compare it to a hammer than to
a gun. People say that a hammer is for driving nails. This situation is as
if I planned to use it to break rocks.

The key point is that I don't really care about the bias or variance of
the mean in the model. These things are useful for statisticians; regular
people (like me, also a chemist) do not understand them and have no use
for them (well, now I somewhat understand). My goal is very practical: I
need an equation that can predict a patient's outcome, based on some data,
with maximum reliability and accuracy.

I have found from the mentioned paper (and from my own experience) that
re-sampling and running the regression on the re-sampled dataset multiple
times does improve predictions. You have proof of that in that paper, page
1502, and to me it is rather stunning proof: compare 56% to 82% of
correctly predicted values (correct means within 15% of the original
value).

I can understand that it's somewhat new for many of you, and some tried to
discourage me from this approach (shooting my foot). This concept was
devised by, I believe, Mr Michael Hale, a respectable biostatistician. It
utilises the bootstrap concept of resampling, though, after the recent
discussion, I think it should be called by another name.

In addition to better predictive performance, using this concept I also
get a second dataset with each iteration that can be used for validation
of the model. In this approach the validation data are accumulated
throughout the bootstrap, and then used at the end to calculate log
residuals using the equation with median coefficients. I am sure you can
question that in many ways, but to me this is as good as you can get.
To be more practical, I will ask the authors of this paper if I can post
their original dataset in this forum (I have it somewhere) - if you guys
think it's interesting enough. Then any one of you could use it, follow
the procedure, and criticize, if they wish.

--
Michal J. Figurski
HUP, Pathology & Laboratory Medicine
Xenobiotics Toxicokinetics Research Laboratory
3400 Spruce St. 7 Maloney
Philadelphia, PA 19104
tel. (215) 662-3413

S Ellison wrote:
> jeez, but you've kicked up a storm!
>
> A penn'orth on the bootstrap; and since I'm a chemist, you can ignore at
> will.
>
> The bootstrap starts with your data and the model you developed with it.
> Resampling gives a fair idea of what the variance _around your current
> estimate_ is. But it cannot tell you how biased you are or improve your
> estimate, because there is no more information in your data.
>
> Toy example. Let's say I get some results from some measurement
> procedure, like this.
>
> set.seed(408)  # so we get the same random sample (!)
> y <- rnorm(12, 5)  # OK, not a very convincing measurement, but....
>
> # Now let's add a bit of bias
> y <- y + 3
>
> mean(y)  # ... is my (biased) estimate of the mean value.
>
> # Now let's pretend I don't know the true answer OR the bias, which is
> # what happens in the real world, and try bootstrapping. Let's get a
> # rather generous 10000 resamples from my data:
>
> m <- matrix(sample(y, length(y) * 10000, replace = TRUE),
>             ncol = length(y))
> # This gives me a matrix with 10000 rows, each of which is a resample
> # of my 12 data.
>
> # And now we can calculate 10000 bootstrapped means in one shot:
> bs.mean <- apply(m, 1, mean)  # which applies 'mean' to each row.
>
> # We hope the variance of these things is about 1/12, 'cos we got y
> # from a normal distribution with var 1 and we had 12 of them.
> # Let's see...
> var(bs.mean)
>
> # which should resemble
> 1/12
>
> # and does... roughly.
>
> # And for interest, compare with what we get direct from the data:
> var(y) / 12
> # which in this case was slightly further from the 'true' variance.
> # It won't always be, though; that depends on the data.
>
> # Anyway, the bootstrap variance looks about right. So ... on to bias.
>
> # Now, where would we expect the bootstrapped mean value to be?
> # At the true value, or where we started?
> mean(bs.mean)
>
> # Oh dear. It's still biased. And it looks very much like the mean of y.
> # It's clearly told us nothing about the true mean.
>
> # Bottom line: all you have is your data. Bootstrapping uses your data.
> # Therefore, bootstrapping can tell you no more than you can get from
> # your data. But it's still useful if you have some rather more
> # complicated statistic derived from a non-linear fit, because it lets
> # you get some idea of the variance. But not the bias.
>
> This may be why some folk felt that your solution as worded (an
> ever-present peril, wording) was not an answer to the right question.
> The fitting procedure already gives you the 'best estimate' (where
> 'best' means max likelihood, this time), and bootstrapping really
> cannot improve on that. It can only start at your current 'best' and
> move away from it in a random direction. That can't possibly improve
> the estimated coefficients. And the more you bootstrap, the closer the
> mean gets to where you started.
>
> So "how does the bootstrap improve on that?" was a very pertinent
> question - to which the correct answer was "it can't - but it can
> suggest what the variance might be".
>
> As to whether you wanted advice on whether to bootstrap or not; well,
> it's an open forum and aid is voluntary. R help always generates at
> least three replies, one of which is "tell me more about the problem",
> one of which is "why are you doing it that way?" and one of which is
> "that is probably not the problem you should be trying to solve". On a
> good day you also get the one that goes "this might solve it".
>
> Incidentally, the boot package and the simpleboot package both do
> bootstrapping; they might solve your problem.
>
> Then there's advice. Folk obviously can't impose unless you let them -
> but they do know a lot about statistics, and if they say something is
> silly, it is at least worth finding out why, so that you (and I, for
> that matter) can better defend our silliness.
>
> Also, of course, if you see someone trying to do something silly - e.g.
> pull the trigger while the gun is pointed at their foot - would you
> really give them the instruction they asked for on how to get the
> safety catch off? Or tell them that what they are doing is silly?
> (Me, well, it's their foot; but if I help them, they may sue me later.)
>
> If any of the above helps without sounding horribly patronising, I win.
> If not, well, you have another email to burn!
>
> happy booting
>
> Steve Ellison
>
>>>> Michal Figurski <figurski at mail.med.upenn.edu> 22/07/2008 20:42 >>>
>
> *******************************************************************
> This email and any attachments are confidential. Any u...{{dropped:8}}
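Steve's toy example can also be run through the boot package he mentions; this sketch (not from the thread) reuses his simulated biased data and makes the same two points - the bootstrap recovers the variance of the mean, but its center stays at the biased sample mean:

```r
# Sketch: the boot-package version of the toy example above.
# The bootstrap estimates the statistic's variance, but mean(b$t)
# sits at the (biased) sample mean - no bias is removed.
library(boot)

set.seed(408)
y <- rnorm(12, 5) + 3  # "measurements" with an unknown bias of +3

b <- boot(y, function(d, i) mean(d[i]), R = 10000)

mean(b$t) - mean(y)      # approximately 0: still centered on the biased mean
var(as.vector(b$t))      # approximately var(y)/12, the variance of the mean
```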
Michal Figurski
2008-Jul-30 20:05 UTC
[R] Coefficients of Logistic Regression from bootstrap - how to get them?
Tim,

If I understand correctly, you are saying that one can't improve on
estimating a mean by doing bootstrap and summarizing the means of many
such steps. As far as I understand (again), you're saying that this way
one can only add bias without any improvement...

Well, this contradicts some guides to the bootstrap that I found on the
web (I did my homework), for example this one:
http://people.revoledu.com/kardi/tutorial/Bootstrap/Lyra/Bootstrap Statistic Mean.htm

It is all confusing, guys... Somebody once said that there are as many
opinions on a topic as there are statisticians...

Also, translating your statements into the example of the hammer and the
rock, you are saying that one cannot use a hammer to break rocks because
it was created to drive nails. With all respect, despite my limited
knowledge, I do not agree.

The big point is that the mean, or standard error, or confidence intervals
of the data itself are *meaningless* in a pharmacokinetic dataset. These
data are time series of a highly variable quantity that is known to
display a peak (or two, in the case of Pawinski's paper). It is as if you
tried to calculate the mean of a chromatogram (an example for chemists,
sorry).

Nevertheless, I thank all of you, experts, for your insight and advice. In
the end, I learned a lot, though I keep my initial view. Summarizing your
criticism of the procedure described in Pawinski's paper:

- Some of you say that this isn't bootstrap at all. In terms of
terminology I totally submit to that, because I know too little. Would
anyone suggest a name?

- Most of you say that this procedure is not the best one, that there are
better ways. I will definitely do my homework on penalized regression,
though none of you has actually discredited this methodology. Therefore,
though possibly not optimal, it remains valid.

- The criticism of "predictive performance" is that one also has to take
into account other important quantities, like bias, variance, etc.
Fortunately I did that in my work, using RMSE and log residuals from the
validation process. I just observed that models with relatively small RMSE
and log residuals (compared to other models) usually possess good
predictive performance, and vice versa. Predictive performance also has a
great advantage over RMSE or variance or anything else suggested here - it
is easily understood by non-statisticians. I don't think it is /too
simple/ in Einstein's terms; it's just simple.

Kind regards,

--
Michal J. Figurski

Tim Hesterberg wrote:
> I'll address the question of whether you can use the bootstrap to
> improve estimates, and whether you can use the bootstrap to "virtually
> increase the size of the sample".
>
> Short answer - no, with some exceptions (bumping / random forests).
>
> Longer answer:
> Suppose you have data (x1, ..., xn) and a statistic ThetaHat, that you
> take a number of bootstrap samples (all of size n), and that you let
> ThetaHatBar be the average of the bootstrap statistics from those
> samples.
>
> Is ThetaHatBar better than ThetaHat? Usually not. Usually it is worse.
> You have not collected any new data; you are just using the existing
> data in a different way, which is usually harmful:
> * If the statistic is the sample mean, all this does is add some noise
>   to the estimate.
> * If the statistic is nonlinear, this gives an estimate that has
>   roughly double the bias, without improving the variance.
>
> What are the exceptions? The prime example is tree models (random
> forests) - taking bootstrap averages helps smooth out the
> discontinuities in tree models. For a simple example, suppose that a
> simple linear regression model really holds:
>     y = beta x + epsilon
> but that you fit a tree model; the tree model predictions are a step
> function. If you bootstrap the data, the boundaries of the step
> function will differ from one sample to another, so the average over
> the bootstrap samples smears out the steps, getting closer to the
> smooth linear relationship.
>
> Aside from such exceptions, the bootstrap is used for inference (bias,
> standard error, confidence intervals), not for improving on ThetaHat.
>
> Tim Hesterberg
>
>> Hi Doran,
>>
>> Maybe I am wrong, but I think the bootstrap is a general resampling
>> method which can be used for different purposes... Usually it works
>> well when you do not have a representative sample set (maybe with a
>> limited number of samples). Therefore, I am positive with Michal...
>>
>> P.S. Overfitting, in my opinion, describes when you have a model
>> which is quite specific to the training dataset but cannot be
>> generalized to new samples...
>>
>> Thanks,
>>
>> --Jerry
>> 2008/7/21 Doran, Harold <HDoran at air.org>:
>>
>>>> I used bootstrap to virtually increase the size of my dataset; it
>>>> should result in estimates closer to those from the population -
>>>> isn't that the purpose of bootstrap?
>>>
>>> No, not really. The bootstrap is a resampling method for variance
>>> estimation. It is often used when there is not an easy way, or a
>>> closed-form expression, for estimating the sampling variance of a
>>> statistic.
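The tree-model exception Tim describes can be sketched in a few lines of R (this is an illustration added here, not code from the thread): fit many trees to bootstrap resamples and average their step-function predictions, which smooths them toward the underlying linear relationship.

```r
# Sketch of bagging ("bootstrap aggregating") smoothing tree predictions.
# Data are simulated from the linear model y = 2x + noise that Tim's
# example assumes; rpart fits regression trees (step functions).
library(rpart)

set.seed(1)
d <- data.frame(x = runif(200))
d$y <- 2 * d$x + rnorm(200, sd = 0.2)
grid <- data.frame(x = seq(0, 1, length.out = 50))

# 100 trees, each on a bootstrap resample of the rows
preds <- replicate(100, {
  idx <- sample(nrow(d), replace = TRUE)
  predict(rpart(y ~ x, data = d[idx, ]), grid)
})

bagged <- rowMeans(preds)  # averaged steps: much smoother than one tree
```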
Michal Figurski
2008-Jul-31 17:36 UTC
[R] Coefficients of Logistic Regression from bootstrap - how to get them?
Gustaf,

Summarizing the things I don't understand:

- Honestly, I was thinking I could use the bootstrap to obtain a better
estimate of a mean - provided that I want it. So, I can't?

- If I can't obtain reliable estimates of CI and variance from a small
dataset, but I can do it with bootstrap - isn't that a "virtual increase"
of the size of the dataset? OK, these are just words; I won't fight over
that.

- I don't understand why a procedure works for 26 models and doesn't work
for one... Intuitively this doesn't make sense...

- I don't understand why resampling *cannot* improve... while it does? I
know the proof is going to be hard to follow, but let me try! (The proof
of the opposite is in the paper.)

- I truly don't understand what I don't understand about what I am doing.
This is getting too convoluted for me...

And a remark about what I don't agree with, Gustaf: the text below, quoted
from Pawinski et al ("Twenty-six..."), is missing an important piece of
information - that they repeated that step 50 times, each time with a
"randomly selected subset". Excuse my ignorance again, but this looks like
bootstrap (re-sampling), doesn't it? Although I won't argue about names.

I want to assure everyone here that I did *exactly* what they did. I work
in the same lab that this paper came from, and I just had their procedure
in SPSS translated to SAS. Moreover, the translation was done with the
help of a _trustworthy biostatistician_ - I was not good enough with SAS
at the time to do it myself. The biostatistician wrote the randomization
and regression subroutines. I later improved them using macros (less code)
and added the validation part. It was then approved by that
biostatistician.

OK, I did not do exactly the same, because I repeated the step 100 times
for 34 *pre-defined* models and on a different dataset. But that's about
all the difference. I hope this resolves everyone's dilemma about whether
I did what is described in Pawinski's paper or not.

This discussion, though, started with my question on how to do it in R,
instead of SAS, and with logistic (not linear) regression. Thank you,
Gustaf, for the code - this was the help I needed.

--
Michal J. Figurski

Gustaf Rydevik wrote:
> "For example, in here, the statistical estimator is the sample mean.
> Using bootstrap sampling, you can do beyond your statistical
> estimators. You can now get even the distribution of your estimator and
> the statistics (such as confidence interval, variance) of your
> estimator."
>
> Again you are misinterpreting the text. The phrase about "doing beyond
> your statistical estimators" is explained in the next sentence, where
> he says that using bootstrap gives you information about the mean
> *estimator* (and not more information about the population mean). And
> since you're not interested in this information, in your case
> bootstrap/resampling is not useful at all.
>
> As another example of misinterpretation: in your email from a week ago,
> it sounds like you believe that the authors of the original paper are
> trying to improve on a fixed model.
>
> Figurski:
> "Regarding the 'multiple stepwise regression' - according to the cited
> SPSS manual, there are 5 options to select from. I don't think they
> used the 'stepwise selection' option, because their models were already
> pre-defined. Variables were pre-selected based on knowledge of the
> pharmacokinetics of this drug and other factors. I think this part I
> understand pretty well."
>
> This paragraph is wrong. Sorry, no way around it.
>
> Quoting from the paper by Pawinski et al.:
> "*Twenty-six* (!) 1-, 2-, or 3-sample estimation models were fit
> (r2 0.341-0.862) to a randomly selected subset of the profiles using
> linear regression and were used to estimate AUC0-12h for the profiles
> not included in the regression fit, comparing those estimates with the
> corresponding AUC0-12h values, calculated with the linear trapezoidal
> rule, including all 12 timed MPA concentrations. The 3-sample models
> were constrained to include no samples past 2 h." (emph. mine)
>
> They clearly state that they are choosing among 26 different models by
> using their bootstrap-like procedure, not improving on a single,
> predefined model. This procedure is statistically sound (more or less,
> at least), and not controversial.
>
> However, (again) what you are wanting to do is *not* what they did in
> their paper!
>
> Resampling cannot improve on the performance of a pre-specified model.
> This is intuitively obvious, but moreover it's mathematically provable!
> That's why we're so certain of our standpoint. If you really wish, I
> (or someone else) could write out a proof, but I'm unsure if you would
> be able to follow it.
>
> In the end, it doesn't really matter. What you are doing amounts to
> doing a regression 50 times, when once would suffice. No big harm done,
> just a bit of unnecessary work - and proof to a statistically competent
> reviewer that you don't really understand what you're doing. The better
> option would be to either study some more statistics yourself, or find
> a statistician who can do your analysis for you, and trust him to do it
> right.
>
> Anyhow, good luck with your research.
>
> Best regards,
>
> Gustaf
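Gustaf's "a regression 50 times, when once would suffice" claim is easy to check empirically for a fixed model. A sketch (added illustration, simulated data): medians of coefficients from repeated half-sample fits land very close to the single full-data maximum-likelihood fit, just with extra noise.

```r
# Sketch: for a pre-specified logistic model, the median of coefficients
# from 100 half-sample refits is essentially the full-data glm() fit.
set.seed(1)
d <- data.frame(x = rnorm(109))
d$y <- rbinom(109, 1, plogis(d$x))

full <- coef(glm(y ~ x, family = binomial, data = d))

cf <- replicate(100, {
  idx <- sample(nrow(d), nrow(d) %/% 2)  # random half, as in the paper
  coef(glm(y ~ x, family = binomial, data = d[idx, ]))
})

apply(cf, 1, median)  # close to 'full' - no systematic improvement
```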