Michal Figurski
2008-Jul-24 14:55 UTC
[R] [Fwd: Re: Coefficients of Logistic Regression from bootstrap - how to get them?]
Thank you Frank and all for your advice.

Here I attach the raw data from Pawinski's paper. I have obtained permission from the corresponding author to post it here for everyone. The only condition of use is that the authors retain ownership of the data, and any publication resulting from these data must be managed by them.

The dataset is composed as follows: patient number / MMF dose in [g] / day of study (since start of drug administration) / MPA concentrations [mg/L] in plasma at the following time points: 0, 0.5, ..., 12 hours / and the value of AUC(0-12h) calculated using all time points.

The goal of the analysis, as you can read in the paper, was to estimate the value of AUC using at most 3 time points within 2 hours post dose, that is, using only 3 of the 4 time points 0, 0.5, 1, 2 - but always including the "0" time point.

In my analysis of a similar problem I was also concerned about the fact that the data come from several visits of a single patient. I have examined the effect of "PT" with repeated "day" using a mixed-effects model, and these effects turned out to be insignificant. Do you guys think that is enough justification to use the dataset as if it came from 50 separate patients?

Also, as to estimation of the bias, variance, etc., Pawinski used CI and Sy/x. In my analysis I additionally used RMSE values. Please excuse another naive question, but: do you think that is sufficient information to compare between models and account for bias?

Regarding the "multiple stepwise regression" - according to the cited SPSS manual, there are 5 options to select from. I don't think they used the 'stepwise selection' option, because their models were already pre-defined. Variables were pre-selected based on knowledge of the pharmacokinetics of this drug and other factors. I think I understand this part pretty well.

I see Frank's point about recalibration in Fig. 2 - although the expectation was set that the prediction be within 15% of the original value.
In my opinion that is *very strict* - I actually used 20% in my work. This is because of the very high variability and imprecision in the results themselves. These are real biological data, and you have to account for errors such as analytical errors (HPLC method), timing errors and so on when you look at them. In other words, if you take two blood samples at each time point from a particular patient and run them, you will certainly get two distinct (although similar) profiles. You will get even more difference if you run one set of samples on one day and the other set on a second day.

Therefore the value of AUC(0-12) itself, to which we compare the predicted AUC, is not 'holy' - some variability here is inherent.

Nevertheless, I see that Fig. 2 may be incorrect if we look from an orthodox statistical perspective. I used the same plots in my work as well - it's too late now. How should I properly estimate the Rsq then?

I greatly appreciate your time and advice in this matter.

--
Michal J. Figurski

Frank E Harrell Jr wrote:
> Gustaf Rydevik wrote:
>> On Wed, Jul 23, 2008 at 4:08 PM, Michal Figurski
>> <figurski at mail.med.upenn.edu> wrote:
>>
>> Hi,
>>
>> I believe that you misunderstand the passage. Do you know what
>> multiple stepwise regression is?
>>
>> Since they used SPSS, I copied from
>> http://www.visualstatistics.net/SPSS%20workbook/stepwise_multiple_regression.htm
>>
>> "Stepwise selection is a combination of forward and backward procedures.
>>
>> Step 1
>> The first predictor variable is selected in the same way as in forward
>> selection. If the probability associated with the test of significance
>> is less than or equal to the default .05, the predictor variable with
>> the largest correlation with the criterion variable enters the
>> equation first.
>>
>> Step 2
>> The second variable is selected based on the highest partial
>> correlation. If it can pass the entry requirement (PIN=.05), it also
>> enters the equation.
>>
>> Step 3
>> From this point, stepwise selection differs from forward selection:
>> the variables already in the equation are examined for removal
>> according to the removal criterion (POUT=.10), as in backward
>> elimination.
>>
>> Step 4
>> Variables not in the equation are examined for entry. Variable
>> selection ends when no more variables meet the entry and removal
>> criteria."
>>
>> It is the outcome of this *entire process*, steps 1-4, that they
>> compare with the outcome of their *entire
>> bootstrap/cross-validation/selection process*, steps 1-4 in the
>> methods section, and find that their approach gives better results.
>> What you are doing is only step 4 in the article's method section:
>> estimating the parameters of a model *when you already know which
>> variables to include*. It is the way this step is conducted that I am
>> sceptical about.
>>
>> Regards,
>>
>> Gustaf
>
> Perfectly stated, Gustaf. This is a great example of needing to truly
> understand a method to be able to use it in the right context.
>
> After having read most of the paper by Pawinski et al. now, there are
> other problems.
>
> 1. The paper nowhere uses bootstrapping. It uses repeated 2-fold
> cross-validation, a procedure not usually recommended.
>
> 2. The resampling procedure used in the paper treated the 50
> pharmacokinetic profiles on 21 renal transplant patients as if these
> were from 50 patients. The cluster bootstrap should have been used
> instead.
>
> 3. Figure 2 showed the fitted regression line for the predicted vs.
> observed AUCs. It should have shown the line of identity instead. In
> other words, the authors allowed a subtle recalibration to creep into
> the analysis (and inverted the x- and y-variables in the plots). The
> fitted lines are far enough away from the line of identity to show
> that the predicted values are not well calibrated. The r^2 values
> claimed by the authors used the wrong formulas, which allowed an
> automatic after-the-fact recalibration (a new overall slope and
> intercept are estimated in the test dataset). Hence the achieved r^2
> are misleading.

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Dataset.csv
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20080724/f8ce0b2b/attachment.pl>
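The cluster bootstrap mentioned in point 2 above can be sketched in a few lines of R. This is a toy illustration only, not the paper's analysis: the data are simulated stand-ins for Dataset.csv, and the column names (PT for patient, C0/C1 for concentrations, AUC) are assumptions. The idea is to compare the bootstrap standard error of a coefficient when whole patients are resampled versus when the 50 profiles are resampled as if independent.

```r
## Sketch: ordinary vs. cluster bootstrap SE for clustered PK profiles.
## Simulated data stand in for Dataset.csv; column names are assumed.
set.seed(1)
npat  <- 21                                  # patients
nprof <- sample(2:3, npat, replace = TRUE)   # profiles per patient
PT    <- rep(seq_len(npat), nprof)
u     <- rep(rnorm(npat), nprof)             # shared patient effect
C0    <- rnorm(length(PT), 2, 0.5)
C1    <- rnorm(length(PT), 8, 2)
AUC   <- 10 + 3 * C0 + 2 * C1 + u + rnorm(length(PT))
d     <- data.frame(PT, C0, C1, AUC)

boot_se <- function(data, cluster = FALSE, B = 500) {
  coefs <- replicate(B, {
    if (cluster) {
      ## resample whole patients, keeping their profiles together
      ids <- sample(unique(data$PT), replace = TRUE)
      dd  <- do.call(rbind, lapply(ids, function(i) data[data$PT == i, ]))
    } else {
      ## resample individual profiles as if independent
      dd <- data[sample(nrow(data), replace = TRUE), ]
    }
    coef(lm(AUC ~ C0 + C1, data = dd))["C1"]
  })
  sd(coefs)   # bootstrap standard error of the C1 coefficient
}

se_ordinary <- boot_se(d, cluster = FALSE)
se_cluster  <- boot_se(d, cluster = TRUE)
c(ordinary = se_ordinary, cluster = se_cluster)
```

If the two standard errors are close, treating the 50 profiles as independent may be roughly defensible; if the cluster SE is clearly larger, the within-patient correlation matters and the ordinary bootstrap (or any row-level resampling) is too optimistic.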
Frank E Harrell Jr
2008-Jul-24 18:38 UTC
[R] [Fwd: Re: Coefficients of Logistic Regression from bootstrap - how to get them?]
Michal Figurski wrote:
> Thank you Frank and all for your advice.
>
> [description of the attached dataset and the goal of the analysis
> snipped]
>
> In my analysis of a similar problem I was also concerned about the
> fact that the data come from several visits of a single patient. I
> have examined the effect of "PT" with repeated "day" using a
> mixed-effects model, and these effects turned out to be
> insignificant. Do you guys think that is enough justification to use
> the dataset as if it came from 50 separate patients?

I don't think that is the way to assess it; rather, the intra-subject correlation should be estimated. Or compare variances from the cluster bootstrap and the ordinary bootstrap.

> Also, as to estimation of the bias, variance, etc., Pawinski used CI
> and Sy/x. In my analysis I additionally used RMSE values. Please
> excuse another naive question, but: do you think that is sufficient
> information to compare between models and account for bias?

RMSE is usually a good approach.

> Regarding the "multiple stepwise regression" - according to the cited
> SPSS manual, there are 5 options to select from. I don't think they
> used the 'stepwise selection' option, because their models were
> already pre-defined. Variables were pre-selected based on knowledge
> of the pharmacokinetics of this drug and other factors. I think I
> understand this part pretty well.
>
> [discussion of the 15% expectation and measurement variability
> snipped]
>
> Nevertheless, I see that Fig. 2 may be incorrect if we look from an
> orthodox statistical perspective. I used the same plots in my work as
> well - it's too late now. How should I properly estimate the Rsq then?

Validation Rsq is 1 - (sum of squared errors)/(total sum of squares).

Frank

> I greatly appreciate your time and advice in this matter.
>
> --
> Michal J. Figurski
>
> [earlier quoted exchange between Frank E Harrell Jr and Gustaf
> Rydevik snipped]

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
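Frank's validation Rsq, and its contrast with the "recalibrated" r^2 he objects to, can be written out in a few lines of R. This is a minimal sketch with made-up predicted and observed AUC vectors (nothing here comes from the paper's data); it also computes the RMSE that Michal mentions.

```r
## Validation Rsq as defined above: 1 - SSE/SST, computed on the test
## data with NO refitting of slope or intercept. Data are simulated.
set.seed(2)
observed  <- rnorm(50, mean = 60, sd = 15)            # made-up observed AUC(0-12)
predicted <- 5 + 0.85 * observed + rnorm(50, sd = 6)  # deliberately miscalibrated

sse <- sum((observed - predicted)^2)
sst <- sum((observed - mean(observed))^2)
rsq_validation <- 1 - sse / sst

## RMSE as an accuracy summary
rmse <- sqrt(mean((observed - predicted)^2))

## The misleading alternative: squaring the correlation silently refits
## a new slope and intercept - the after-the-fact recalibration.
rsq_recalibrated <- cor(observed, predicted)^2

c(validation = rsq_validation, recalibrated = rsq_recalibrated, RMSE = rmse)
```

Because the correlation minimizes the residuals over a refitted slope and intercept, rsq_recalibrated can never be smaller than rsq_validation; with biased predictions like these it is visibly larger. The validation Rsq (and a plot of predictions against the line of identity, rather than against a fitted line) correctly penalizes the miscalibration.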