This isn't a question about R, but I'm hoping someone will be willing to help. I've been looking at calibration plots in multiple regression (plotting the observed response Y on the vertical axis against the predicted response Y-hat on the horizontal axis).

According to Frank Harrell's "Regression Modeling Strategies" book (pp. 61-63), when making such a plot on new data (having obtained a model from other data), we should expect the points to lie around a line with slope < 1, indicating overfitting. As he writes, "Typically, low predictions will be too low and high predictions too high."

However, when I make these plots, both with real data and with simple simulated data, I get the opposite: the points are scattered around a line with slope > 1. Low predictions are too high and high predictions are too low.

For example, I generated 200 cases, fitted models on the first half of the data, and made calibration plots for those models on the second half of the data:

x1 <- rnorm(200, 0, 1)
x2 <- rnorm(200, 0, 1)
x3 <- rnorm(200, 0, 1)
x4 <- rnorm(200, 0, 1)
x5 <- rnorm(200, 0, 1)
x6 <- rnorm(200, 0, 1)
y <- x1 + x2 + rnorm(200, 0, 2)
d <- data.frame(y, x1, x2, x3, x4, x5, x6)

lm1 <- lm(y ~ ., data = d[1:100, ])
lm2 <- lm(y ~ x1 + x2, data = d[1:100, ])

plot(predict(lm1, d[101:200, ]), d$y[101:200]); abline(0, 1)
x11(); plot(predict(lm2, d[101:200, ]), d$y[101:200]); abline(0, 1)

The plots for both lm1 and lm2 show the points scattered around a line with slope > 1, contrary to what Frank Harrell says should happen.

I am either misunderstanding something or making a mistake in the code (I'm almost certain I'm not mixing up the axes). I would be most appreciative if anyone could explain where I'm going wrong.

Thanks for any help you can provide.

Mark Seeto
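[A quick way to check the impression, rather than judging the slope by eye, is to regress observed on predicted in the test set and read off the coefficient. This is a minimal base-R sketch with a fresh two-predictor simulation, not the exact data from the post:]

```r
# Sketch: compute the calibration slope numerically instead of eyeballing it.
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
y  <- x1 + x2 + rnorm(200, 0, 2)
d  <- data.frame(y, x1, x2)

fit  <- lm(y ~ x1 + x2, data = d[1:100, ])   # fit on the first half
pred <- predict(fit, d[101:200, ])           # predict on the second half
obs  <- d$y[101:200]

cal <- lm(obs ~ pred)   # calibration regression: observed on predicted
coef(cal)[2]            # the slope; values below 1 indicate overfitting
```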
Frank E Harrell Jr
2010-Apr-09 00:15 UTC
[R] Overfitting/Calibration plots (Statistics question)
Mark Seeto wrote:
> [original question quoted in full; trimmed]
Mark,

Try

set.seed(1)
slope1 <- slope2 <- numeric(100)

for(i in 1:100) {
  x1 <- rnorm(200, 0, 1)
  x2 <- rnorm(200, 0, 1)
  x3 <- rnorm(200, 0, 1)
  x4 <- rnorm(200, 0, 1)
  x5 <- rnorm(200, 0, 1)
  x6 <- rnorm(200, 0, 1)
  y <- x1 + x2 + rnorm(200, 0, 2)
  d <- data.frame(y, x1, x2, x3, x4, x5, x6)

  lm1 <- lm(y ~ ., data = d[1:100,])
  lm2 <- lm(y ~ x1 + x2, data = d[1:100,])

  slope1[i] <- coef(lsfit(predict(lm1, d[101:200, ]), d$y[101:200]))[2]
  slope2[i] <- coef(lsfit(predict(lm2, d[101:200, ]), d$y[101:200]))[2]
}

mean(slope1)
mean(slope2)

I get

[1] 0.8873878
[1] 0.9603158

Frank

______________________________________________
R-help at r-project.org mailing list
stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

--
Frank E Harrell Jr    Professor and Chairman    School of Medicine
                      Department of Biostatistics    Vanderbilt University
Thank you very much for your help, Prof. Harrell. I was making the mistake of judging the appearance of the calibration plots without actually calculating the regression line: I was misjudging slopes of 0.8 or 0.9 as slopes greater than 1.

Kind regards,
Mark Seeto

> [Prof. Harrell's reply quoted in full; trimmed]
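[The lesson above can be built into the plot itself: draw both the 45-degree identity line and the fitted calibration line, so a slope of 0.8 or 0.9 is visible rather than guessed. A minimal base-R sketch, using a fresh simulation rather than the original data:]

```r
# Sketch: overlay the fitted calibration line on the calibration plot,
# alongside the ideal 45-degree line, to avoid misjudging the slope by eye.
set.seed(2)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- rnorm(200)
y  <- x1 + x2 + rnorm(200, 0, 2)
d  <- data.frame(y, x1, x2, x3)

fit  <- lm(y ~ ., data = d[1:100, ])
pred <- predict(fit, d[101:200, ])
obs  <- d$y[101:200]

plot(pred, obs, xlab = "Predicted", ylab = "Observed")
abline(0, 1, lty = 2)        # ideal calibration (slope 1)
cal <- lsfit(pred, obs)
abline(cal, col = "red")     # actual calibration line on new data
coef(cal)[2]                 # the slope to report
```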