I've been using Frank Harrell's rms package to do bootstrap model
validation. Is it the case that the optimum penalization may still
give a model which is substantially overfitted?

I calculated corrected R^2, optimism in R^2, and corrected slope for
various penalties for a simple example:

library(rms)

# Note: no set.seed() call, so the exact numbers below depend on the RNG state
x1 <- rnorm(45)
x2 <- rnorm(45)
x3 <- rnorm(45)
y <- x1 + 2*x2 + rnorm(45, 0, 3)

ols0 <- ols(y ~ x1 + x2 + x3, x=TRUE, y=TRUE)

corrected.Rsq <- rep(0, 60)
optimism.Rsq <- rep(0, 60)
corrected.slope <- rep(0, 60)

for (pen in 1:60) {
  olspen <- ols(y ~ x1 + x2 + x3, penalty=pen, x=TRUE, y=TRUE)
  val <- validate(olspen, B=200)
  corrected.Rsq[pen] <- val["R-square", "index.corrected"]
  optimism.Rsq[pen] <- val["R-square", "optimism"]
  corrected.slope[pen] <- val["Slope", "index.corrected"]
}

plot(corrected.Rsq)
x11(); plot(optimism.Rsq)      # x11() opens a new plotting window
x11(); plot(corrected.slope)

p <- pentrace(ols0, penalty=1:60)   # search penalties 1..60 for the best corrected AIC

ols9 <- ols(y ~ x1 + x2 + x3, penalty=9, x=TRUE, y=TRUE)
validate(ols9, B=200)

          index.orig  training       test    optimism index.corrected   n
R-square   0.2523404 0.2820734  0.1911878  0.09088563       0.1614548 200
MSE        7.8497722 7.3525300  8.4918212 -1.13929116       8.9890634 200
Intercept  0.0000000 0.0000000 -0.1353572  0.13535717      -0.1353572 200
Slope      1.0000000 1.0000000  1.1707137 -0.17071372       1.1707137 200

pentrace tells me that of the penalties 1, 2, ..., 60, corrected AIC is
maximised by a penalty of 9. This is consistent with the corrected R^2
plot, which shows a maximum somewhere around 10. However, a penalty of 9
still gives an R^2 optimism of 0.09 (training R^2 = 0.28, test R^2 = 0.19),
suggesting overfitting.

Do we just have to live with this R^2 optimism? It can be decreased by
taking a bigger penalty, but then the corrected R^2 is reduced. Also, a
penalty of 9 gives a corrected slope of about 1.17 (a corrected slope of 1
is achieved with a penalty of about 1 or 2).

Thanks for any help/advice you can give.

Mark

--
Mark Seeto
Statistician

National Acoustic Laboratories
A Division of Australian Hearing
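For reference, validate()'s optimism and index.corrected columns follow
from the other columns by simple arithmetic. A quick check in R, using the
numbers from the R-square row of the table above:

orig  <- 0.2523404    # index.orig: apparent R^2 of ols9 on the full sample
train <- 0.2820734    # training: mean R^2 of bootstrap fits on their own samples
test  <- 0.1911878    # test: mean R^2 of those same fits on the original sample
train - test          # 0.0908856, matching the reported optimism
orig - (train - test) # 0.1614548, the reported index.corrected

So the 0.09 optimism is simply the average train-test gap across the 200
bootstrap fits, and it is subtracted from the apparent R^2 to give the
corrected value.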
Charles C. Berry
2010-Jun-29 04:38 UTC
[R] Model validation and penalization with rms package
On Tue, 29 Jun 2010, Mark Seeto wrote:

> I've been using Frank Harrell's rms package to do bootstrap model
> validation. Is it the case that the optimum penalization may still
> give a model which is substantially overfitted?
> [...]
> Do we just have to live with this R^2 optimism? It can be decreased by
> taking a bigger penalty, but then the corrected R^2 is reduced. Also,
> a penalty of 9 gives a corrected slope of about 1.17 (corrected slope
> of 1 is achieved with a penalty of about 1 or 2).

Your best bet, as you are a statistician, is to read up on the
bias-variance tradeoff (aka dilemma). I recommend that you focus on
"Section 5. Bias, variance, and estimation error" in Friedman's "On Bias,
Variance, 0/1 Loss, and the Curse-of-Dimensionality", available online at

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.6820&rep=rep1&type=pdf

In short, overfitting (aka 'optimism') is due to variance, and you can
temper it by adding bias. But as you add more bias, you lose the signal in
the data. The resubstituted training data will tend to overstate the
prediction accuracy unless you penalize so severely that the prediction is
unrelated to the training data.

HTH,

Chuck

Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu             UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
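To make the bias-variance point concrete, here is a minimal simulation
sketch in base R. It uses plain ridge regression, solve(X'X + lambda*I) X'y,
as a stand-in for ols()'s penalty (the exact penalty scaling in rms
differs, so the numbers are illustrative only); the test point, penalty
grid, and replication count are arbitrary choices for the illustration.

set.seed(1)
n <- 45; p <- 3; nsim <- 500
beta <- c(1, 2, 0)          # true coefficients, as in Mark's example (no intercept)
x.new <- c(1, 1, 1)         # a fixed test point
f.true <- sum(x.new * beta) # true mean response at x.new
lambdas <- c(0, 1, 5, 10, 30, 60)

bias2 <- var.pred <- numeric(length(lambdas))
for (j in seq_along(lambdas)) {
  preds <- replicate(nsim, {
    X <- matrix(rnorm(n * p), n, p)
    y <- X %*% beta + rnorm(n, 0, 3)
    # ridge coefficients: (X'X + lambda*I)^{-1} X'y
    b.ridge <- solve(crossprod(X) + lambdas[j] * diag(p), crossprod(X, y))
    sum(x.new * b.ridge)    # prediction at the test point
  })
  bias2[j]    <- (mean(preds) - f.true)^2  # squared bias over replications
  var.pred[j] <- var(preds)                # variance over replications
}
round(cbind(lambda = lambdas, bias2 = bias2,
            variance = var.pred, mse = bias2 + var.pred), 3)

Running this shows the pattern Chuck describes: as lambda grows, the
variance of the prediction falls while the squared bias rises, and the
total prediction MSE is minimised at some intermediate penalty, at which
point a nonzero train-test gap (optimism) still remains.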