The subject is a Generalized Additive Model. Experts caution us against overfitting the data, which can cause inaccurate results. I am not a statistician (my background is in Computer Science). Perhaps some kind soul would take a look and vet the model for overfitting the data.

The study estimated the ebb and flow of traffic through a voting place. Just one voting place was studied; the election was the U.S. mid-term election about a year ago. Procedure: the voting day was divided into five-minute bins, and the number of voters arriving in each bin was recorded. The voting day was 13 hours long, giving 156 bins.

See http://tinyurl.com/36vzop for the scatterplot. There is rather high random variation, due in part to the fact that the bin width was intentionally set narrow, in order to improve the amount of timing information gathered.

http://tinyurl.com/3xjsyo displays the fitted curve. A GAM was used, with the loess smoothing algorithm (locally weighted regression). The default span was used. http://tinyurl.com/34av6l gives the scatterplot and the fitted curve. The two seem to match reasonably well.

However, when I tried to generate the standard errors, things went awry. (Please see http://tinyurl.com/38ej2t ) There are three curves, seemingly the fitted curve and the curves for plus and minus two standard errors. The shapes seem okay, but there are large errors in the y values.

Question: Have I overfitted the data?

Feedback?

Tom
Thomas L. Jones, PhD, Computer Science
Thomas L. Jones asks:

> The subject is a Generalized Additive Model. Experts caution us
> against overfitting the data, which can cause inaccurate results.

Inaccurate *predictions*, to be more precise. The main problem with overfitting is that your model will capture too much of the noise in the data along with the signal. This noise then becomes prediction error. The thing about randomness is not the absence of pattern: randomness can sometimes appear as a fairly striking pattern. The problem is that next time it's a different pattern.

> I am not a statistician (my background is in Computer
> Science). Perhaps some kind soul would take a look and vet the model
> for overfitting the data.

You haven't given us very much to go on: just plots. To help you we need to see what you have really done, not just what you think you've done. That requires us to see some code (and data wouldn't hurt, either).

> The study estimated the ebb and flow of traffic through a voting
> place. Just one voting place was studied; the election was the
> U.S. mid-term election about a year ago. Procedure: The voting day
> was divided into five-minute bins, and the number of voters arriving
> in each bin was recorded. The voting day was 13 hours long, giving
> 156 bins.
>
> See http://tinyurl.com/36vzop for the scatterplot. There is a rather
> high random variation, due in part to the fact that the bin width
> was intentionally set to be narrow, in order to improve the amount
> of timing information gathered.

A natural sort of model to consider first would be Poisson with a log link, since the response is a count of arrivals per bin. Is that what you used? You may need to be a bit careful about overdispersion if you want realistic standard errors.

> http://tinyurl.com/3xjsyo displays the fitted curve. A GAM was used,
> with the loess smoothing algorithm (locally weighted
> regression). The default span was used. http://tinyurl.com/34av6l
> gives the scatterplot and the fitted curve. The two seem to match
> reasonably well.
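A minimal sketch of that suggestion, assuming the `gam` package (the one providing `lo()`); `t` and `counts` here are hypothetical stand-ins for the real bin midpoints and per-bin arrival counts, which were not posted:

```r
## Sketch only: Poisson GAM with a log link, plus a crude
## overdispersion check. Fake data stands in for the real counts.
library(gam)                      # provides gam() with lo() terms

set.seed(1)
t <- seq(0.5, 13, length = 156)                          # bin midpoints, hours
counts <- rpois(156, lambda = 5 + 3 * sin(pi * t / 13))  # fake arrivals

fit <- gam(counts ~ lo(t), family = poisson)

## Residual deviance over residual df should be near 1 for a
## well-specified Poisson model; values much above 1 suggest
## overdispersion, i.e. standard errors that are too small.
deviance(fit) / df.residual(fit)
```

With real data, a ratio well above 1 would be the cue to consider a quasipoisson family instead.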
This looks pretty reasonable to me.

> However, when I tried to generate the standard errors, things went
> awry. (Please see http://tinyurl.com/38ej2t ) There are three
> curves, seemingly the fitted curve and the curves for plus and minus
> two standard errors. The shapes seem okay, but there are large
> errors in the y values.

How did you "try to generate standard errors"? This is where actual code becomes important in working out what you have really done.

This looks to me like a plot of the additive component of the model on the log scale, with standard errors on that. That would explain why the component is on a totally different scale from the one you show above (there you had the response scale), and in particular why it goes negative. It would also account for the apparent distortion in the curve itself relative to its image on the response scale. Components, by construction, have mean zero; it is the intercept that lifts them to the right level for predictions, and the inverse link that takes them back to the response scale.

> Question: Have I overfitted the data?

Most likely not. You may need to understand the model you are fitting a bit more, though, as well as the tools.

> Feedback?
>
> Tom
> Thomas L. Jones, PhD, Computer Science
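The distinction between the centred component and the response-scale fit can be seen side by side. A sketch, again with hypothetical data and assuming a `gam`-package fit with `lo()` (the poster's actual code was not shown):

```r
## Sketch: why plot() of the model looks nothing like the fitted counts.
library(gam)

set.seed(1)
t <- seq(0.5, 13, length = 156)
counts <- rpois(156, lambda = 5 + 3 * sin(pi * t / 13))
fit <- gam(counts ~ lo(t), family = poisson)

## (1) The additive component on the linear-predictor (log) scale:
## centred to mean zero, so it dips below zero and its +/- 2 SE band
## sits on a completely different scale from the raw counts.
plot(fit, se = TRUE)

## (2) The fitted curve on the response (counts) scale: the intercept
## is added back and the inverse link (exp) applied.
plot(t, counts)
lines(t, predict(fit, type = "response"))
```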
The figures don't obviously scream out `overfitting' to me, and the standard errors don't look excessively wide, given the data.

Unless there is a strong reason for using `lo', you could also try the `gam' function in package `mgcv': it attempts to estimate the appropriate degree of smoothing automatically. If you get similar curves using mgcv::gam, then you have some reassurance that you don't have overfitting here.

On Saturday 16 February 2008 22:25, Thomas L Jones, PhD wrote:
> The subject is a Generalized Additive Model. Experts caution us against
> overfitting the data, which can cause inaccurate results. [...]
>
> However, when I tried to generate the standard errors, things went awry.
> (Please see http://tinyurl.com/38ej2t ) There are three curves, seemingly
> the fitted curve and the curves for plus and minus two standard errors. The
> shapes seem okay, but there are large errors in the y values.
>
> Question: Have I overfitted the data?
>
> Feedback?
> Tom
> Thomas L. Jones, PhD, Computer Science
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html and provide commented, minimal,
> self-contained, reproducible code.

--
Simon Wood, Mathematical Sciences, University of Bath, Bath, BA2 7AY UK
+44 1225 386603  www.maths.bath.ac.uk/~sw283
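The mgcv cross-check suggested above can be sketched as follows; as before, `t` and `counts` are hypothetical stand-ins for the unposted per-bin data:

```r
## Sketch: refit with mgcv::gam, which selects the amount of smoothing
## automatically (by GCV in mgcv of that era). A curve similar to the
## loess-based fit is some reassurance against overfitting.
library(mgcv)

set.seed(1)
t <- seq(0.5, 13, length = 156)
counts <- rpois(156, lambda = 5 + 3 * sin(pi * t / 13))

fit2 <- gam(counts ~ s(t), family = poisson)

plot(t, counts)
lines(t, predict(fit2, type = "response"))

## Effective degrees of freedom chosen for s(t): a small value relative
## to the 156 data points indicates a smooth, not overfitted, curve.
summary(fit2)
```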