Corrado
2009-Sep-03 05:56 UTC
[R] goodness of "prediction" using a model (lm, glm, gam, brt, regression tree .... )
Dear R-friends,

How do you test the goodness of prediction of a model when you predict on a set of data DIFFERENT from the training set?

Let me explain: you train your model M (e.g. glm, gam, regression tree, brt) on a set of data A with a response variable Y. You then predict the value of that same response variable Y on a different set of data B (e.g. with predict.glm, predict.gam and so on). Dataset A and dataset B are different in the sense that they contain the same variable, for example temperature, measured at different sites, or over a different interval (e.g. B is a subinterval of A for interpolation, or a different interval for extrapolation). If you have the measured values of Y on the new interval, i.e. B, how do you measure how good the prediction is, that is, how well the model fits Y on B (in other words, how well it predicts)?

In other words:

Y ~ T, data = A for training
Y ~ T, data = B for predicting

I have devised a couple of methods based around 1) the standard deviation and 2) R^2, but I am unhappy with them.

Regards
--
Corrado Topi

Global Climate Change & Biodiversity Indicators
Area 18, Department of Biology
University of York, York, YO10 5YW, UK
Phone: +44 (0) 1904 328645, E-mail: ct529 at york.ac.uk
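[For concreteness, the setup described above might look roughly like this in R. This is a sketch only: the glm call, the variable names Y and T, and the data frames A and B are illustrative, following the notation of the post rather than any real data.]

    ## fit the model on the training data A
    fit <- glm(Y ~ T, data = A)                 # could equally be gam(), rpart(), ...
    ## predict the same response on the independent data B
    pred <- predict(fit, newdata = B, type = "response")
    obs  <- B$Y                                 # measured responses on B
    res  <- obs - pred                          # out-of-sample residuals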
Kingsford Jones
2009-Sep-03 14:06 UTC
[R] goodness of "prediction" using a model (lm, glm, gam, brt, regression tree .... )
There are many ways to measure prediction quality, and what you choose depends on the data and your goals. A common measure for a quantitative response is mean squared error (i.e. 1/n * sum((observed - predicted)^2)), which incorporates both bias and variance. Common terms for what you are looking for are "test error" and "generalization error".

hth,
Kingsford
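[As a rough sketch of this suggestion, the test error could be computed in R from the out-of-sample predictions; pred and obs here refer to the illustrative objects defined after the original question.]

    ## mean squared error on the independent data set B
    mse  <- mean((obs - pred)^2)    # 1/n * sum((observed - predicted)^2)
    rmse <- sqrt(mse)               # on the original scale of Y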
Corrado
2009-Sep-10 17:48 UTC
[R] AIC and goodness of prediction - was: Re: goodness of "prediction" using a model (lm, glm, gam, brt,
Dear Kingsford,

I apologise for breaking the thread, but I thought there would be more people interested.

What you propose is what I am using at the moment: the sum of the squares of the residuals, plus the variance / standard deviation. I am not really satisfied with it. I have also tried using R^2, and it works well .... but some people go a bit wild-eyed when they see a negative R^2 (which is perfectly reasonable when you use R^2 as a measure of goodness of fit for predictions on a dataset different from the training set).

I was then wondering whether it would make sense to use AIC: the k in the formula would still be the number of parameters of the trained model, the "sum of squared residuals" would be sum((predicted - observed)^2) on the test data, and n would be the number of samples in the test dataset. I think it should work well.

What do you / other R list members think?

Regards
--
Corrado Topi

Global Climate Change & Biodiversity Indicators
Area 18, Department of Biology
University of York, York, YO10 5YW, UK
Phone: +44 (0) 1904 328645, E-mail: ct529 at york.ac.uk
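[For illustration, the two quantities discussed in this message could be computed on the test set as follows. The AIC-like value simply plugs the test-set residual sum of squares into the usual Gaussian AIC formula; this is one possible reading of the suggestion, not an established statistic. Objects fit, obs and pred are the illustrative ones from earlier in the thread.]

    ## out-of-sample R^2: can legitimately be negative when the model
    ## predicts worse than the mean of the observed test responses
    rss <- sum((obs - pred)^2)
    tss <- sum((obs - mean(obs))^2)
    r2_test <- 1 - rss / tss

    ## AIC-style value using the test-set RSS (Gaussian likelihood),
    ## with k = number of fitted parameters and n = test-set size
    n <- length(obs)
    k <- length(coef(fit))
    aic_test <- n * log(rss / n) + 2 * k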
jamesmcc
2009-Sep-11 21:41 UTC
[R] goodness of "prediction" using a model (lm, glm, gam, brt, regression tree .... )
I think it's important to say why you're unhappy with your current measures. Are they not capturing aspects of the data that you understand? I typically use several residual measures in conjunction; each has its benefits and drawbacks. I just throw them all in a table.
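[In that spirit, one way to collect several residual measures into a single table is a small helper like the following sketch; the particular set of measures is only an example.]

    ## gather common prediction-error measures for one model / test set
    pred_metrics <- function(obs, pred) {
      err <- obs - pred
      data.frame(MSE  = mean(err^2),
                 RMSE = sqrt(mean(err^2)),
                 MAE  = mean(abs(err)),
                 Bias = mean(err),
                 R2   = 1 - sum(err^2) / sum((obs - mean(obs))^2))
    }
    ## e.g. pred_metrics(B$Y, predict(fit, newdata = B))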