Michal Figurski
2008-Jul-21 15:04 UTC
[R] Coefficients of Logistic Regression from bootstrap - how to get them?
Hello all,

I am trying to optimize my logistic regression model by using the bootstrap. I was previously using SAS for this kind of task, but I am now switching to R.

My data frame consists of 5 columns and has 109 rows. Each row is a single record composed of the following values: Subject_name, numeric1, numeric2, numeric3 and outcome (yes or no). All three numerics are used to predict outcome using LR.

In SAS I had written a macro that split the dataset, ran LR on one half of the data, and made predictions on the second half. It then collected the equation coefficients from each iteration of the bootstrap. Afterwards I simply took the medians of these coefficients across all iterations and used them as an optimal model - it really worked well!

Now I want to do the same in R. I tried to use the 'validate' and 'calibrate' functions from package "Design", and I also experimented with the function 'sm.binomial.bootstrap' from package "sm". I also tried the function 'boot' from package "boot", though without success - in my case it randomly selected _columns_ from my data frame, while I wanted it to select _rows_.

The main point here, though, is the optimized LR equation. I would appreciate any help on how to extract the LR equation coefficients from any of these bootstrap functions, in the same form as given by 'glm' or 'lrm'.

Many thanks in advance!

--
Michal J. Figurski
HUP, Pathology & Laboratory Medicine
Xenobiotics Toxicokinetics Research Laboratory
3400 Spruce St. 7 Maloney
Philadelphia, PA 19104
tel. (215) 662-3413
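The row-resampling the poster describes can be done with boot() directly: the statistic function is called with row indices, which is how you make 'boot' resample rows rather than columns. A minimal sketch, with simulated stand-in data and hypothetical column names (num1..num3, outcome) in place of the poster's:

```r
# Sketch: bootstrap ROWS of a data frame and collect logistic-regression
# coefficients, then take their medians as in the SAS macro described above.
# 'dat' and its column names are placeholders, not the poster's real data.
library(boot)

set.seed(1)
dat <- data.frame(num1 = rnorm(109), num2 = rnorm(109), num3 = rnorm(109))
dat$outcome <- rbinom(109, 1, plogis(0.5 * dat$num1 - 0.3 * dat$num2))

coef_fun <- function(data, idx) {
  # 'idx' indexes rows: each bootstrap replicate fits on a row-resampled set
  fit <- glm(outcome ~ num1 + num2 + num3, family = binomial,
             data = data[idx, ])
  coef(fit)
}

b <- boot(dat, coef_fun, R = 200)
# b$t is an R x 4 matrix of coefficients; column medians give the
# "median-coefficient" equation the poster wants
apply(b$t, 2, median)
```

The key design point is that boot() always passes indices into the first dimension of the data object, so handing it a data frame and subsetting with `data[idx, ]` resamples records, not variables.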
Frank E Harrell Jr
2008-Jul-21 18:41 UTC
[R] Coefficients of Logistic Regression from bootstrap - how to get them?
Michal Figurski wrote:
> Hello all,
>
> I am trying to optimize my logistic regression model by using bootstrap.
> I was previously using SAS for this kind of tasks, but I am now
> switching to R.
>
> My data frame consists of 5 columns and has 109 rows. Each row is a
> single record composed of the following values: Subject_name, numeric1,
> numeric2, numeric3 and outcome (yes or no). All three numerics are used
> to predict outcome using LR.
>
> In SAS I have written a macro, that was splitting the dataset, running
> LR on one half of data and making predictions on second half. Then it
> was collecting the equation coefficients from each iteration of
> bootstrap. Later I was just taking medians of these coefficients from
> all iterations, and used them as an optimal model - it really worked well!

Why not use maximum likelihood estimation, i.e., the coefficients from
the original fit? How does the bootstrap improve on that?

> Now I want to do the same in R. I tried to use the 'validate' or
> 'calibrate' functions from package "Design", and I also experimented
> with function 'sm.binomial.bootstrap' from package "sm". I tried also
> the function 'boot' from package "boot", though without success - in my
> case it randomly selected _columns_ from my data frame, while I wanted
> it to select _rows_.

validate and calibrate in Design do resampling on the rows. Resampling
is mainly used to get a nearly unbiased estimate of the model
performance, i.e., to correct for overfitting.

Frank Harrell

> Though the main point here is the optimized LR equation. I would
> appreciate any help on how to extract the LR equation coefficients from
> any of these bootstrap functions, in the same form as given by 'glm' or
> 'lrm'.
>
> Many thanks in advance!

--
Frank E Harrell Jr   Professor and Chair
Department of Biostatistics, School of Medicine, Vanderbilt University
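The validate/calibrate route Frank points to looks roughly like this. A sketch with simulated stand-in data; 'Design' was the package name in 2008, its successor is 'rms', and the lrm()/validate() calls are the same:

```r
# Sketch: bootstrap validation of a logistic model with lrm() + validate(),
# which resamples ROWS and reports optimism-corrected performance indexes.
# Data below are simulated placeholders, not the poster's dataset.
library(rms)  # on 2008-era R: library(Design)

set.seed(1)
d <- data.frame(x1 = rnorm(109), x2 = rnorm(109), x3 = rnorm(109))
d$y <- rbinom(109, 1, plogis(d$x1))

# x = TRUE, y = TRUE store the design matrix so validate() can resample
fit <- lrm(y ~ x1 + x2 + x3, data = d, x = TRUE, y = TRUE)
v <- validate(fit, B = 100)  # optimism-corrected Dxy, R2, slope, ...
v["Dxy", "index.corrected"]
```

Note that validate() does not hand back a "better" set of coefficients; it quantifies how much the apparent performance of the original maximum-likelihood fit is inflated by overfitting.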
Gustaf Rydevik
2008-Jul-23 14:39 UTC
[R] Coefficients of Logistic Regression from bootstrap - how to get them?
On Wed, Jul 23, 2008 at 4:08 PM, Michal Figurski
<figurski at mail.med.upenn.edu> wrote:
> Gustaf,
>
> I am sorry, but I don't get the point. Let's just focus on predictive
> performance from the cited passage, that is the number of values
> predicted within 15% of the original value.
> So, the predictive performance of the model fit on the entire dataset
> was 56% of profiles, while from the bootstrapped model it was 82% of
> profiles. Well - I see a stunning purpose in the bootstrap step here: it
> turns a useless equation into a clinically applicable model!
>
> Honestly, I also can't see how this can be better than fitting on the
> entire dataset, but here you have proof that it is.
>
> I think that another argument supporting this approach is model
> validation. If you fit the model on the entire data, you have no data
> left to validate its predictions.
>
> On the other hand, I agree with you that the passage in the methods
> section looks awkward.
>
> In my work on a similar problem, which is going to appear in August in
> Ther Drug Monit, I used medians from the beginning, and all the
> comparisons were done based on models with median coefficients. I think
> this is what the authors of that paper did, though they might just have
> had a problem with describing it correctly, and unfortunately it passed
> through the review process unchanged.

Hi,

I believe that you misunderstand the passage. Do you know what multiple
stepwise regression is? Since they used SPSS, I copied from
http://www.visualstatistics.net/SPSS%20workbook/stepwise_multiple_regression.htm

"Stepwise selection is a combination of forward and backward procedures.

Step 1
The first predictor variable is selected in the same way as in forward
selection. If the probability associated with the test of significance is
less than or equal to the default .05, the predictor variable with the
largest correlation with the criterion variable enters the equation first.
Step 2
The second variable is selected based on the highest partial correlation.
If it can pass the entry requirement (PIN=.05), it also enters the
equation.

Step 3
From this point, stepwise selection differs from forward selection: the
variables already in the equation are examined for removal according to
the removal criterion (POUT=.10), as in backward elimination.

Step 4
Variables not in the equation are examined for entry. Variable selection
ends when no more variables meet the entry and removal criteria."

-----------

It is the outcome of this *entire process*, Steps 1-4, that they compare
with the outcome of their *entire bootstrap/crossvalidation/selection
process*, Steps 1-4 in the methods section, and they find that their
approach gives the better result.

What you are doing is only Step 4 of the article's method section:
estimating the parameters of a model *when you already know which
variables to include*. It is the way this step is conducted that I am
sceptical about.

Regards,

Gustaf

--
Gustaf Rydevik, M.Sci.
tel: +46(0)703 051 451
address: Essingetorget 40, 112 66 Stockholm, SE
skype: gustaf_rydevik
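For comparison, R's built-in analogue of the stepwise procedure described above is step(), which adds and drops terms in both directions (though it uses AIC rather than SPSS's p-value entry/removal thresholds). A sketch on simulated stand-in data:

```r
# Sketch: bidirectional stepwise selection on a logistic model with step().
# Note: step() uses AIC, not the PIN/POUT p-value criteria SPSS uses, so
# it is an analogue of Steps 1-4 above, not an exact reimplementation.
set.seed(1)
dat <- data.frame(num1 = rnorm(109), num2 = rnorm(109), num3 = rnorm(109))
dat$outcome <- rbinom(109, 1, plogis(dat$num1))  # only num1 is informative

full <- glm(outcome ~ num1 + num2 + num3, family = binomial, data = dat)
sel <- step(full, direction = "both", trace = 0)  # silent add/drop steps
formula(sel)  # the selected model
```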
Michal Figurski
2008-Jul-23 16:22 UTC
[R] Coefficients of Logistic Regression from bootstrap - how to get them?
Thank you all for your words of wisdom. I am starting to understand what
you mean by bootstrap. Not surprisingly, it seems to be something other
than what I do.

The bootstrap is a tool, and I would rather compare it to a hammer than to
a gun. People say that a hammer is for driving nails. This situation is as
if I planned to use it to break rocks.

The key point is that I don't really care about the bias or variance of
the mean in the model. These things are useful for statisticians; regular
people (like me, also a chemist) do not understand them and have no use
for them (well, now I somewhat understand). My goal is very practical: I
need an equation that can predict a patient's outcome, based on some data,
with maximum reliability and accuracy.

I have found from the mentioned paper (and from my own experience) that
re-sampling and running the regression on the re-sampled dataset multiple
times does improve predictions. You have proof of that in that paper, page
1502, and to me it is rather stunning proof: compare 56% to 82% of
correctly predicted values (correct means within 15% of the original
value).

I can understand that it's somewhat new for many of you, and some tried to
discourage me from this approach (shooting my foot). This concept was
devised by, I believe, Mr Michael Hale, a respectable biostatistician. It
utilises the bootstrap concept of resampling, though, after the recent
discussion, I think it should be called by another name.

In addition to better predictive performance, using this concept I also
get a second dataset with each iteration that can be used for validation
of the model. In this approach the validation data are accumulated
throughout the bootstrap, and then used at the end to calculate log
residuals using the equation with median coefficients. I am sure you can
question that in many ways, but to me this is as good as you can get.
To be more practical, I will ask the authors of this paper if I can post
their original dataset in this forum (I have it somewhere) - if you guys
think it's interesting enough. Then any one of you could use it, follow
the procedure, and criticize, if they wish.

--
Michal J. Figurski
HUP, Pathology & Laboratory Medicine
Xenobiotics Toxicokinetics Research Laboratory
3400 Spruce St. 7 Maloney
Philadelphia, PA 19104
tel. (215) 662-3413

S Ellison wrote:
> jeez, but you've kicked up a storm!
>
> A penn'orth on the bootstrap; and since I'm a chemist, you can ignore at
> will.
>
> The bootstrap starts with your data and the model you developed with it.
> Resampling gives a fair idea of what the variance _around your current
> estimate_ is. But it cannot tell you how biased you are or improve your
> estimate, because there is no more information in your data.
>
> Toy example. Let's say I get some results from some measurement
> procedure, like this.
>
> set.seed(408)  # so we get the same random sample (!)
> y <- rnorm(12, 5)  # OK, not a very convincing measurement, but....
>
> # Now let's add a bit of bias
> y <- y + 3
>
> mean(y)  # ... is my (biased) estimate of the mean value.
>
> # Now let's pretend I don't know the true answer OR the bias, which is
> # what happens in the real world, and try bootstrapping. Let's get a
> # rather generous 10000 resamples from my data:
>
> m <- matrix(sample(y, length(y) * 10000, replace = TRUE),
>             ncol = length(y))
> # This gives me a matrix with 10000 rows, each of which is a resample
> # of my 12 data.
>
> # And now we can calculate 10000 bootstrapped means in one shot:
> bs.mean <- apply(m, 1, mean)  # which applies 'mean' to each row.
>
> # We hope the variance of these things is about 1/12, 'cos we got y
> # from a normal distribution with var 1 and we had 12 of them.
> # Let's see...
> var(bs.mean)
>
> # which should resemble
> 1/12
>
> # and does... roughly.
>
> # And for interest, compare with what we get direct from the data:
> var(y) / 12
> # which in this case was slightly further from the 'true' variance.
> # It won't always be, though; that depends on the data.
>
> # Anyway, the bootstrap variance looks about right. So ... on to bias.
>
> # Now, where would we expect the bootstrapped mean value to be?
> # At the true value, or where we started?
> mean(bs.mean)
>
> # Oh dear. It's still biased. And it looks very much like the mean of y.
> # It's clearly told us nothing about the true mean.
>
> # Bottom line: all you have is your data. Bootstrapping uses your data.
> # Therefore, bootstrapping can tell you no more than you can get from
> # your data. But it's still useful if you have some rather more
> # complicated statistic derived from a non-linear fit, because it lets
> # you get some idea of the variance. But not the bias.
>
> This may be why some folk felt that your solution as worded (an
> ever-present peril, wording) was not an answer to the right question.
> The fitting procedure already gives you the 'best estimate' (where
> 'best' means max likelihood, this time), and bootstrapping really
> cannot improve on that. It can only start at your current 'best' and
> move away from it in a random direction. That can't possibly improve
> the estimated coefficients. And the more you bootstrap, the closer the
> mean gets to where you started.
>
> So "how does the bootstrap improve on that?" was a very pertinent
> question - to which the correct answer was "it can't - but it can
> suggest what the variance might be".
>
> As to whether you wanted advice on whether to bootstrap or not; well,
> it's an open forum and aid is voluntary. R help always generates at
> least three replies, one of which is "tell me more about the problem",
> one of which is "why are you doing it that way?" and one of which is
> "that is probably not the problem you should be trying to solve". On a
> good day you also get the one that goes "this might solve it".
>
> Incidentally, the boot package and the simpleboot package both do
> bootstrapping; they might solve your problem.
>
> Then there's advice. Folk obviously can't impose unless you let them -
> but they do know a lot about statistics, and if they say something is
> silly, it is at least worth finding out why, so that you (and I, for
> that matter) can better defend our silliness.
>
> Also, of course, if you see someone trying to do something silly - e.g.
> pull the trigger while the gun is pointed at their foot - would you
> really give them the instruction they asked for on how to get the
> safety catch off? Or tell them that what they are doing is silly?
> (Me, well, it's their foot; but if I help them, they may sue me later.)
>
> If any of the above helps without sounding horribly patronising, I win.
> If not, well, you have another email to burn!
>
> happy booting
>
> Steve Ellison
>
>>>> Michal Figurski <figurski at mail.med.upenn.edu> 22/07/2008 20:42 >>>
>
> *******************************************************************
> This email and any attachments are confidential. Any u...{{dropped:8}}
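Steve's toy example can also be run through the boot package he mentions; this sketch (not from the thread) reuses his simulated biased data and makes the same two points - the bootstrap recovers the variance of the mean, but its center stays at the biased sample mean:

```r
# Sketch: the boot-package version of the toy example above.
# The bootstrap estimates the statistic's variance, but mean(b$t)
# sits at the (biased) sample mean - no bias is removed.
library(boot)

set.seed(408)
y <- rnorm(12, 5) + 3  # "measurements" with an unknown bias of +3

b <- boot(y, function(d, i) mean(d[i]), R = 10000)

mean(b$t) - mean(y)      # approximately 0: still centered on the biased mean
var(as.vector(b$t))      # approximately var(y)/12, the variance of the mean
```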
Michal Figurski
2008-Jul-30 20:05 UTC
[R] Coefficients of Logistic Regression from bootstrap - how to get them?
Tim,

If I understand correctly, you are saying that one can't improve on
estimating a mean by doing bootstrap and summarizing the means of many
such steps. As far as I understand (again), you're saying that this way
one can only add bias without any improvement...

Well, this contradicts some guides to the bootstrap that I found on the
web (I did my homework), for example this one:
http://people.revoledu.com/kardi/tutorial/Bootstrap/Lyra/Bootstrap Statistic Mean.htm

It is all confusing, guys... Somebody once said that there are as many
opinions on a topic as there are statisticians...

Also, translating your statements into the example of the hammer and the
rock, you are saying that one cannot use a hammer to break rocks because
it was created to drive nails. With all respect, despite my limited
knowledge, I do not agree.

The big point is that the mean, or standard error, or confidence intervals
of the data itself are *meaningless* in a pharmacokinetic dataset. These
data are time series of a highly variable quantity that is known to
display a peak (or two, in the case of Pawinski's paper). It is as if you
tried to calculate the mean of a chromatogram (an example for chemists,
sorry).

Nevertheless, I thank all of you, experts, for your insight and advice. In
the end, I learned a lot, though I keep my initial view. Summarizing your
criticism of the procedure described in Pawinski's paper:

- Some of you say that this isn't bootstrap at all. In terms of
terminology I totally submit to that, because I know too little. Would
anyone suggest a name?

- Most of you say that this procedure is not the best one, that there are
better ways. I will definitely do my homework on penalized regression,
though none of you has actually discredited this methodology. Therefore,
though possibly not optimal, it remains valid.

- The criticism of "predictive performance" is that one also has to take
into account other important quantities, like bias, variance, etc.
Fortunately I did that in my work, using RMSE and log residuals from the
validation process. I just observed that models with relatively small RMSE
and log residuals (compared to other models) usually possess good
predictive performance, and vice versa. Predictive performance also has a
great advantage over RMSE or variance or anything else suggested here - it
is easily understood by non-statisticians. I don't think it is /too
simple/ in Einstein's terms; it's just simple.

Kind regards,

--
Michal J. Figurski

Tim Hesterberg wrote:
> I'll address the question of whether you can use the bootstrap to
> improve estimates, and whether you can use the bootstrap to "virtually
> increase the size of the sample".
>
> Short answer - no, with some exceptions (bumping / random forests).
>
> Longer answer:
> Suppose you have data (x1, ..., xn) and a statistic ThetaHat, that you
> take a number of bootstrap samples (all of size n), and that you let
> ThetaHatBar be the average of the bootstrap statistics from those
> samples.
>
> Is ThetaHatBar better than ThetaHat? Usually not. Usually it is worse.
> You have not collected any new data; you are just using the existing
> data in a different way, which is usually harmful:
> * If the statistic is the sample mean, all this does is add some noise
>   to the estimate.
> * If the statistic is nonlinear, this gives an estimate that has
>   roughly double the bias, without improving the variance.
>
> What are the exceptions? The prime example is tree models (random
> forests) - taking bootstrap averages helps smooth out the
> discontinuities in tree models. For a simple example, suppose that a
> simple linear regression model really holds:
>     y = beta x + epsilon
> but that you fit a tree model; the tree model predictions are a step
> function. If you bootstrap the data, the boundaries of the step
> function will differ from one sample to another, so the average over
> the bootstrap samples smears out the steps, getting closer to the
> smooth linear relationship.
>
> Aside from such exceptions, the bootstrap is used for inference (bias,
> standard error, confidence intervals), not for improving on ThetaHat.
>
> Tim Hesterberg
>
>> Hi Doran,
>>
>> Maybe I am wrong, but I think the bootstrap is a general resampling
>> method which can be used for different purposes... Usually it works
>> well when you do not have a representative sample set (maybe with a
>> limited number of samples). Therefore, I am positive with Michal...
>>
>> P.S. Overfitting, in my opinion, describes when you have a model
>> which is quite specific to the training dataset but cannot be
>> generalized to new samples...
>>
>> Thanks,
>>
>> --Jerry
>> 2008/7/21 Doran, Harold <HDoran at air.org>:
>>
>>>> I used bootstrap to virtually increase the size of my dataset; it
>>>> should result in estimates closer to those from the population -
>>>> isn't that the purpose of bootstrap?
>>>
>>> No, not really. The bootstrap is a resampling method for variance
>>> estimation. It is often used when there is not an easy way, or a
>>> closed-form expression, for estimating the sampling variance of a
>>> statistic.
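The tree-model exception Tim describes can be sketched in a few lines of R (this is an illustration added here, not code from the thread): fit many trees to bootstrap resamples and average their step-function predictions, which smooths them toward the underlying linear relationship.

```r
# Sketch of bagging ("bootstrap aggregating") smoothing tree predictions.
# Data are simulated from the linear model y = 2x + noise that Tim's
# example assumes; rpart fits regression trees (step functions).
library(rpart)

set.seed(1)
d <- data.frame(x = runif(200))
d$y <- 2 * d$x + rnorm(200, sd = 0.2)
grid <- data.frame(x = seq(0, 1, length.out = 50))

# 100 trees, each on a bootstrap resample of the rows
preds <- replicate(100, {
  idx <- sample(nrow(d), replace = TRUE)
  predict(rpart(y ~ x, data = d[idx, ]), grid)
})

bagged <- rowMeans(preds)  # averaged steps: much smoother than one tree
```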
Michal Figurski
2008-Jul-31 17:36 UTC
[R] Coefficients of Logistic Regression from bootstrap - how to get them?
Gustaf,

Summarizing the things I don't understand:

- Honestly, I was thinking I could use the bootstrap to obtain a better
estimate of a mean - provided that I want it. So, I can't?

- If I can't obtain reliable estimates of CI and variance from a small
dataset, but I can do it with bootstrap - isn't that a "virtual increase"
of the size of the dataset? OK, these are just words; I won't fight over
that.

- I don't understand why a procedure works for 26 models and doesn't work
for one... Intuitively this doesn't make sense...

- I don't understand why resampling *cannot* improve... while it does? I
know the proof is going to be hard to follow, but let me try! (The proof
of the opposite is in the paper.)

- I truly don't understand what I don't understand about what I am doing.
This is getting too convoluted for me...

And a remark about what I don't agree with, Gustaf: the text below, quoted
from Pawinski et al ("Twenty-six..."), is missing an important piece of
information - that they repeated that step 50 times, each time with a
"randomly selected subset". Excuse my ignorance again, but this looks like
bootstrap (re-sampling), doesn't it? Although I won't argue about names.

I want to assure everyone here that I did *exactly* what they did. I work
in the same lab that this paper came from, and I just had their procedure
in SPSS translated to SAS. Moreover, the translation was done with the
help of a _trustworthy biostatistician_ - I was not good enough with SAS
at the time to do it myself. The biostatistician wrote the randomization
and regression subroutines. I later improved them using macros (less code)
and added the validation part. It was then approved by that
biostatistician.

OK, I did not do exactly the same, because I repeated the step 100 times
for 34 *pre-defined* models and on a different dataset. But that's about
all the difference. I hope this resolves everyone's dilemma about whether
I did what is described in Pawinski's paper or not.

This discussion, though, started with my question on how to do it in R,
instead of SAS, and with logistic (not linear) regression. Thank you,
Gustaf, for the code - this was the help I needed.

--
Michal J. Figurski

Gustaf Rydevik wrote:
> "For example, in here, the statistical estimator is the sample mean.
> Using bootstrap sampling, you can do beyond your statistical
> estimators. You can now get even the distribution of your estimator and
> the statistics (such as confidence interval, variance) of your
> estimator."
>
> Again you are misinterpreting the text. The phrase about "doing beyond
> your statistical estimators" is explained in the next sentence, where
> he says that using bootstrap gives you information about the mean
> *estimator* (and not more information about the population mean). And
> since you're not interested in this information, in your case
> bootstrap/resampling is not useful at all.
>
> As another example of misinterpretation: in your email from a week ago,
> it sounds like you believe that the authors of the original paper are
> trying to improve on a fixed model.
>
> Figurski:
> "Regarding the 'multiple stepwise regression' - according to the cited
> SPSS manual, there are 5 options to select from. I don't think they
> used the 'stepwise selection' option, because their models were already
> pre-defined. Variables were pre-selected based on knowledge of the
> pharmacokinetics of this drug and other factors. I think this part I
> understand pretty well."
>
> This paragraph is wrong. Sorry, no way around it.
>
> Quoting from the paper by Pawinski et al.:
> "*Twenty-six* (!) 1-, 2-, or 3-sample estimation models were fit
> (r2 0.341-0.862) to a randomly selected subset of the profiles using
> linear regression and were used to estimate AUC0-12h for the profiles
> not included in the regression fit, comparing those estimates with the
> corresponding AUC0-12h values, calculated with the linear trapezoidal
> rule, including all 12 timed MPA concentrations. The 3-sample models
> were constrained to include no samples past 2 h." (emph. mine)
>
> They clearly state that they are choosing among 26 different models by
> using their bootstrap-like procedure, not improving on a single,
> predefined model. This procedure is statistically sound (more or less,
> at least), and not controversial.
>
> However, (again) what you are wanting to do is *not* what they did in
> their paper!
>
> Resampling cannot improve on the performance of a pre-specified model.
> This is intuitively obvious, but moreover it's mathematically provable!
> That's why we're so certain of our standpoint. If you really wish, I
> (or someone else) could write out a proof, but I'm unsure if you would
> be able to follow it.
>
> In the end, it doesn't really matter. What you are doing amounts to
> doing a regression 50 times, when once would suffice. No big harm done,
> just a bit of unnecessary work - and proof to a statistically competent
> reviewer that you don't really understand what you're doing. The better
> option would be to either study some more statistics yourself, or find
> a statistician who can do your analysis for you, and trust him to do it
> right.
>
> Anyhow, good luck with your research.
>
> Best regards,
>
> Gustaf
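Gustaf's "a regression 50 times, when once would suffice" claim is easy to check empirically for a fixed model. A sketch (added illustration, simulated data): medians of coefficients from repeated half-sample fits land very close to the single full-data maximum-likelihood fit, just with extra noise.

```r
# Sketch: for a pre-specified logistic model, the median of coefficients
# from 100 half-sample refits is essentially the full-data glm() fit.
set.seed(1)
d <- data.frame(x = rnorm(109))
d$y <- rbinom(109, 1, plogis(d$x))

full <- coef(glm(y ~ x, family = binomial, data = d))

cf <- replicate(100, {
  idx <- sample(nrow(d), nrow(d) %/% 2)  # random half, as in the paper
  coef(glm(y ~ x, family = binomial, data = d[idx, ]))
})

apply(cf, 1, median)  # close to 'full' - no systematic improvement
```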