The subject is a Generalized Additive Model. Experts caution us against overfitting the data, which can cause inaccurate results. I am not a statistician (my background is in Computer Science). Perhaps some kind soul would take a look and vet the model for overfitting the data.

The study estimated the ebb and flow of traffic through a voting place. Just one voting place was studied; the election was the U.S. mid-term election about a year ago. Procedure: the voting day was divided into five-minute bins, and the number of voters arriving in each bin was recorded. The voting day was 13 hours long, giving 156 bins.

See http://tinyurl.com/36vzop for the scatterplot. There is rather high random variation, due in part to the fact that the bin width was intentionally set narrow, in order to improve the amount of timing information gathered.

http://tinyurl.com/3xjsyo displays the fitted curve. A GAM was used, with the loess smoothing algorithm (locally weighted regression). The default span was used. http://tinyurl.com/34av6l gives the scatterplot and the fitted curve. The two seem to match reasonably well.

However, when I tried to generate the standard errors, things went awry. (Please see http://tinyurl.com/38ej2t ) There are three curves, seemingly the fitted curve and the curves for plus and minus two standard errors. The shapes seem okay, but there are large errors in the y values.

Question: Have I overfitted the data?

Feedback?

Tom
Thomas L. Jones, PhD, Computer Science
Thomas L. Jones asks:

> The subject is a Generalized Additive Model. Experts caution us
> against overfitting the data, which can cause inaccurate results.

Inaccurate *predictions*, to be more precise. The main problem with overfitting is that your model will capture too much of the noise in the data along with the signal. This noise then becomes prediction error. The thing about randomness is not the absence of pattern: randomness can sometimes appear as a fairly striking pattern. The problem is that next time it's a different pattern.

> I am not a statistician (my background is in Computer
> Science). Perhaps some kind soul would take a look and vet the model
> for overfitting the data.

You haven't given us very much to go on: just plots. To help you we need to see what you have really done, not just what you think you've done. That requires us to see some code (and data wouldn't hurt, either).

> The study estimated the ebb and flow of traffic through a voting
> place. Just one voting place was studied; the election was the
> U.S. mid-term election about a year ago. Procedure: The voting day
> was divided into five-minute bins, and the number of voters arriving
> in each bin was recorded. The voting day was 13 hours long, giving
> 156 bins.
>
> See http://tinyurl.com/36vzop for the scatterplot. There is a rather
> high random variation, due in part to the fact that the bin width
> was intentionally set to be narrow, in order to improve the amount
> of timing information gathered.

A natural sort of model to consider first would be Poisson with a log link, since the response is a count of arrivals per bin. Is that what you used? You may need to be a bit careful about overdispersion if you want realistic standard errors.

> http://tinyurl.com/3xjsyo displays the fitted curve. A GAM was used,
> with the loess smoothing algorithm (locally weighted
> regression). The default span was used. http://tinyurl.com/34av6l
> gives the scatterplot and the fitted curve. The two seem to match
> reasonably well.
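A minimal sketch of that suggestion, assuming the `gam` package (the one providing `lo()`); `t` and `counts` here are hypothetical stand-ins for the real bin midpoints and per-bin arrival counts, which were not posted:

```r
## Sketch only: Poisson GAM with a log link, plus a crude
## overdispersion check. Fake data stands in for the real counts.
library(gam)                      # provides gam() with lo() terms

set.seed(1)
t <- seq(0.5, 13, length = 156)                          # bin midpoints, hours
counts <- rpois(156, lambda = 5 + 3 * sin(pi * t / 13))  # fake arrivals

fit <- gam(counts ~ lo(t), family = poisson)

## Residual deviance over residual df should be near 1 for a
## well-specified Poisson model; values much above 1 suggest
## overdispersion, i.e. standard errors that are too small.
deviance(fit) / df.residual(fit)
```

With real data, a ratio well above 1 would be the cue to consider a quasipoisson family instead.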
This looks pretty reasonable to me.

> However, when I tried to generate the standard errors, things went
> awry. (Please see http://tinyurl.com/38ej2t ) There are three
> curves, seemingly the fitted curve and the curves for plus and minus
> two standard errors. The shapes seem okay, but there are large
> errors in the y values.

How did you "try to generate standard errors"? This is where actual code becomes important in working out what you have really done.

This looks to me like a plot of the additive component of the model on the log scale, with standard errors on that. That would explain why the component is on a totally different scale from the one you show above (there you had the response scale), and in particular why it goes negative. It would also account for the apparent distortion in the curve itself relative to its image on the response scale. Components, by construction, have mean zero; it is the intercept that lifts them to the right level for predictions, and the inverse link that takes them back to the response scale.

> Question: Have I overfitted the data?

Most likely not. You may need to understand the model you are fitting a bit more, though, as well as the tools.

> Feedback?
>
> Tom
> Thomas L. Jones, PhD, Computer Science
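The distinction between the centred component and the response-scale fit can be seen side by side. A sketch, again with hypothetical data and assuming a `gam`-package fit with `lo()` (the poster's actual code was not shown):

```r
## Sketch: why plot() of the model looks nothing like the fitted counts.
library(gam)

set.seed(1)
t <- seq(0.5, 13, length = 156)
counts <- rpois(156, lambda = 5 + 3 * sin(pi * t / 13))
fit <- gam(counts ~ lo(t), family = poisson)

## (1) The additive component on the linear-predictor (log) scale:
## centred to mean zero, so it dips below zero and its +/- 2 SE band
## sits on a completely different scale from the raw counts.
plot(fit, se = TRUE)

## (2) The fitted curve on the response (counts) scale: the intercept
## is added back and the inverse link (exp) applied.
plot(t, counts)
lines(t, predict(fit, type = "response"))
```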
The figures don't obviously scream out `overfitting' to me, and the standard errors don't look excessively wide, given the data.

Unless there is a strong reason for using `lo', you could also try the `gam' function in package `mgcv': it attempts to estimate the appropriate degree of smoothing automatically. If you get similar curves using mgcv::gam, then you have some reassurance that you don't have overfitting here.

On Saturday 16 February 2008 22:25, Thomas L Jones, PhD wrote:
> The subject is a Generalized Additive Model. Experts caution us against
> overfitting the data, which can cause inaccurate results. [...]
>
> However, when I tried to generate the standard errors, things went awry.
> (Please see http://tinyurl.com/38ej2t ) There are three curves, seemingly
> the fitted curve and the curves for plus and minus two standard errors. The
> shapes seem okay, but there are large errors in the y values.
>
> Question: Have I overfitted the data?
>
> Feedback?
> Tom
> Thomas L. Jones, PhD, Computer Science
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html and provide commented, minimal,
> self-contained, reproducible code.

--
Simon Wood, Mathematical Sciences, University of Bath, Bath, BA2 7AY UK
+44 1225 386603  www.maths.bath.ac.uk/~sw283
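The mgcv cross-check suggested above can be sketched as follows; as before, `t` and `counts` are hypothetical stand-ins for the unposted per-bin data:

```r
## Sketch: refit with mgcv::gam, which selects the amount of smoothing
## automatically (by GCV in mgcv of that era). A curve similar to the
## loess-based fit is some reassurance against overfitting.
library(mgcv)

set.seed(1)
t <- seq(0.5, 13, length = 156)
counts <- rpois(156, lambda = 5 + 3 * sin(pi * t / 13))

fit2 <- gam(counts ~ s(t), family = poisson)

plot(t, counts)
lines(t, predict(fit2, type = "response"))

## Effective degrees of freedom chosen for s(t): a small value relative
## to the 156 data points indicates a smooth, not overfitted, curve.
summary(fit2)
```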