I'm trying to learn how to use R to:

  * Make a random partition of a data frame between in-sample and
    out-of-sample
  * Estimate a model (e.g. lm()) for the in-sample
  * Make predictions for all observations
  * Compare the in-sample error sigma against the out-of-sample
    error sigma.

I came up with the following code. I think it's okay, but I can't help
feeling this is still clunky. Could all ye R wizards please comment on
this, and tell me how I can do it better?

---------------------------------------------------------------------------
# Simulate some data for a linear regression (100 points)
x = runif(100); y = 2 + 3*x + rnorm(100)
D = data.frame(x, y)

# Choose a random subset of 25 points which will be "in sample"
d = sort(sample(100, 25))             # Sorting just makes d more readable
cat("Subset of insample points --\n"); print(d)

# Estimate a linear regression using all points
m1 = lm(y ~ x, D)
# Estimate a linear regression using only the subset
m2 = lm(y ~ x, D, subset=d)

# Get to predictions --
yhat1 = predict.lm(m1, D); yhat2 = predict.lm(m2, D)
# And standard deviations of errors --
full.s = sd(y - yhat1)
insample.s = sd(y[d] - yhat2[d])
outsample.s = sd(y[-d] - yhat2[-d])
cat("Sigmas of prediction errors --\n")
cat("  All points used in estimation, in sample     : ", full.s, "\n")
cat("  25 points used in estimation, in sample      : ", insample.s, "\n")
cat("  25 points used in estimation, out of sample  : ", outsample.s, "\n")
---------------------------------------------------------------------------

Here's what I get when I run it:

$ R --slave < insampleoutsample.R
Subset of insample points --
[1]  4  6  7 13 20 21 24 25 26 27 29 33 34 36 39 45 47 48 59 60 88 89 91 96 98
Sigmas of prediction errors --
  All points used in estimation, in sample     :  0.9405517
  25 points used in estimation, in sample      :  1.000709
  25 points used in estimation, out of sample  :  0.9586921

--
Ajay Shah                                 Consultant
ajayshah at mayin.org                     Department of Economic Affairs
http://www.mayin.org/ajayshah             Ministry of Finance, New Delhi
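
One possible tidier arrangement of the four steps listed at the top of this post, offered as a sketch rather than a definitive answer. It wraps the split, fit, predict and compare steps in a single function so they can be reused with other data frames or formulas. The function name split_and_compare() and its arguments are hypothetical, and the fixed seed is an assumption added only to make the split reproducible. It calls the generic predict() rather than predict.lm() directly, which should dispatch to the same method for an lm fit.

```r
# Sketch only: split_and_compare() and its argument names are hypothetical.
set.seed(1)   # assumption: fixed seed so the random split can be reproduced

# Simulate the same kind of data as in the post (100 points)
n <- 100
D <- data.frame(x = runif(n))
D$y <- 2 + 3 * D$x + rnorm(n)

split_and_compare <- function(data, formula, n_in) {
  # Random in-sample row numbers; all remaining rows are out-of-sample
  in_idx <- sample(nrow(data), n_in)

  # Fit on the in-sample rows only
  fit <- lm(formula, data = data[in_idx, ])

  # Prediction errors for every row, using the in-sample fit
  response <- model.response(model.frame(formula, data))
  err <- response - predict(fit, newdata = data)

  # Standard deviation of errors, in-sample versus out-of-sample
  c(in_sample = sd(err[in_idx]), out_of_sample = sd(err[-in_idx]))
}

s <- split_and_compare(D, y ~ x, n_in = 25)
print(s)
```

Because the result is a named numeric vector, several random partitions could be compared with something like replicate(200, split_and_compare(D, y ~ x, 25)), which would give a 2 x 200 matrix of sigmas.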