thr3ads.net - R devel - [Rd] plot(<lm>): new behavior in R-2.2.0 alpha [Sep 2005]

If this information is useful, please help other people find it:
Share via:

Martin Maechler

2005-Sep-13 14:17 UTC

[Rd] plot(<lm>): new behavior in R-2.2.0 alpha

As some of you R-devel readers may know, the plot() method for
"lm" objects is based in large parts on contributions by John
Maindonald, subsequently "massaged" by me and other R-core
members.

In the statistics litterature on applied regression, people have
had  diverse oppinions on what (and how many!) plots should be
used for goodness-of-fit / residual diagnostics, and to my
knowledge most people have agreed to want to see one (or more)
version of a Tukey-Anscombe plot {Residuals ~ Fitted} and a QQ
normal plot.
Another consideration was to be somewhat close to what S
(S-plus) was doing.  So we have two versions of residuals vs
fitted, one for checking  E[error] = 0, the other for checking 
Var[error] = constant.  So we got to the first three plots of
plot.lm() about which I don't want to debate at the moment
{though, there's room for improvement even there: e.g., I know of at
 least one case where plot(<lm>) wasn't used because the user
 was missing the qqline() she was so used to in the QQ plot}

The topic of this e-mail is the (default) 4th plot which I had
changed; really prompted by the following:
More than three months ago, John wrote
  http://tolstoy.newcastle.edu.au/R/devel/05/04/0594.html
    (which became a thread of about 20 messages, from Apr.23 -- 29, 2005)

and currently, 
NEWS for R 2.2.0 alpha contains
>> USER-VISIBLE CHANGES
>> 
>>    o plot(<lm object>) uses a new default for the fourth panel
when
>>      'which' is not specified.
>>      ___ may change before release ___
and the header is

plot.lm <- 
function (x, which = c(1:3, 5), 
          caption = c("Residuals vs Fitted", 
	              "Normal Q-Q", "Scale-Location", 
		      "Cook's distance", "Residuals vs Leverage", 
		      "Cook's distance vs Leverage"), 
	   ......... ) {..............}

So we now have 6 possible plots, where 1,2,3 and 5 are the
defaults (and 1,2,3,4 where the old defaults).

For the influential points and combination of 'influential' and
'outlier'
there have been quite a few more proposals in the past. R <= 2.1.x
has been plotting the  Cook's distances vs. observation number, whereas
quite a few people in the past have noted that all influence
measures being more or less complicated functions of residuals
and "hat values" aka "leverages", (R_i, h_{ii}), it would
really
make sense and fit more to the other plots
to plot residuals vs. Leverages --- with the additional idea of
adding *contours* of (equal) Cook's distances to that plot, in
case one would really want to seem them.

In the mean time, this has been *active* in R-devel for quite a
while, and we haven't received any new comments.

One remaining problem I'd like to address is the "balanced AOV"
situation, something probably pretty rare nowadays in real
practice, but common of course in teaching ANOVA.
As you may remember, in a balanced design, all observations have
the same leverages h_{ii}, and the plot  R_i  vs  h_ii is really
not so useful.  In that case,  the cook distances CD_i = c *  R_i ^2
and so  CD_i  vs  i {the old "4-th plot in plot.lm"} is
graphically identical to   R_i^2 vs i.
Now in that case (of identical h_ii's), I think one would really
want  "R_i  vs  i".

Question to the interested parties:

  Should there be an automatism
	 ``when h_ii == const''  {"==" with a bit of numerical fuzz}
  plot a)  R_i   vs i
  or   b)  CD_i  vs i

or should users have to manually use
    plot(<lm>,  which=1:4, ...)
in such a case?

Feedback very welcome, 
particularly, you first look at the examples in help(plot.lm) 
in *R-devel* aka R-2.2.0 alpha.

Martin Maechler, ETH Zurich

John Fox

2005-Sep-13 20:41 UTC

head link

[Rd] plot(<lm>): new behavior in R-2.2.0 alpha

Dear Martin,

A couple of comments on the new plots (numbers 5 and 6): Perhaps some more
thought could be given to the plotted contours for Cook's D (which are 0.5
and 1.0 in the example -- large Cook's Ds). A rule-of-thumb cut-off for this
example is 4/(n - p) = 4/(50 - 5) = 0.089, and the discrepancy will grow
with n. I'm not terribly fond of number 6, since it seems natural to me to
think of the relationship among these quantities as influence on
coefficients = leverage * outlyingness (which corresponds to 5); also note
how in the example, the labels for large residuals overplot. Finally, your
remarks about balanced data are cogent and suggest going with 1:3 in this
case (since R_i vs. i is pretty redundant with the QQ plot). 

Regards,
 John

--------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
Canada L8S 4M4
905-525-9140x23604
http://socserv.mcmaster.ca/jfox 
-------------------------------- 
> -----Original Message-----
> From: Martin Maechler [mailto:maechler at stat.math.ethz.ch] 
> Sent: Tuesday, September 13, 2005 9:18 AM
> To: R-devel at stat.math.ethz.ch
> Cc: John Maindonald; Werner Stahel; John Fox
> Subject: plot(<lm>): new behavior in R-2.2.0 alpha
> 
> As some of you R-devel readers may know, the plot() method 
> for "lm" objects is based in large parts on contributions by 
> John Maindonald, subsequently "massaged" by me and other 
> R-core members.
> 
> In the statistics litterature on applied regression, people 
> have had  diverse oppinions on what (and how many!) plots 
> should be used for goodness-of-fit / residual diagnostics, 
> and to my knowledge most people have agreed to want to see 
> one (or more) version of a Tukey-Anscombe plot {Residuals ~ 
> Fitted} and a QQ normal plot.
> Another consideration was to be somewhat close to what S
> (S-plus) was doing.  So we have two versions of residuals vs 
> fitted, one for checking  E[error] = 0, the other for 
> checking Var[error] = constant.  So we got to the first three plots of
> plot.lm() about which I don't want to debate at the moment 
> {though, there's room for improvement even there: e.g., I 
> know of at  least one case where plot(<lm>) wasn't used 
> because the user  was missing the qqline() she was so used to 
> in the QQ plot}
> 
> The topic of this e-mail is the (default) 4th plot which I 
> had changed; really prompted by the following:
> More than three months ago, John wrote
>   http://tolstoy.newcastle.edu.au/R/devel/05/04/0594.html
>     (which became a thread of about 20 messages, from Apr.23 
> -- 29, 2005)
> 
> and currently,
> NEWS for R 2.2.0 alpha contains
> 
> >> USER-VISIBLE CHANGES
> >> 
> >>    o plot(<lm object>) uses a new default for the fourth
panel when
> >>      'which' is not specified.
> >>      ___ may change before release ___
> 
> and the header is
> 
> plot.lm <-
> function (x, which = c(1:3, 5), 
>           caption = c("Residuals vs Fitted", 
> 	              "Normal Q-Q", "Scale-Location", 
> 		      "Cook's distance", "Residuals vs Leverage",
> 		      "Cook's distance vs Leverage"), 
> 	   ......... ) {..............}
> 
> So we now have 6 possible plots, where 1,2,3 and 5 are the 
> defaults (and 1,2,3,4 where the old defaults).
> 
> For the influential points and combination of 'influential' 
> and 'outlier'
> there have been quite a few more proposals in the past. R <= 
> 2.1.x has been plotting the  Cook's distances vs. observation 
> number, whereas quite a few people in the past have noted 
> that all influence measures being more or less complicated 
> functions of residuals and "hat values" aka
"leverages",
> (R_i, h_{ii}), it would really make sense and fit more to the 
> other plots to plot residuals vs. Leverages --- with the 
> additional idea of adding *contours* of (equal) Cook's 
> distances to that plot, in case one would really want to seem them.
> 
> In the mean time, this has been *active* in R-devel for quite 
> a while, and we haven't received any new comments.
> 
> One remaining problem I'd like to address is the "balanced
AOV"
> situation, something probably pretty rare nowadays in real 
> practice, but common of course in teaching ANOVA.
> As you may remember, in a balanced design, all observations 
> have the same leverages h_{ii}, and the plot  R_i  vs  h_ii 
> is really not so useful.  In that case,  the cook distances 
> CD_i = c *  R_i ^2 and so  CD_i  vs  i {the old "4-th plot in 
> plot.lm"} is
> graphically identical to   R_i^2 vs i.
> Now in that case (of identical h_ii's), I think one would 
> really want  "R_i  vs  i".
> 
> Question to the interested parties:
> 
>   Should there be an automatism
> 	 ``when h_ii == const''  {"==" with a bit of numerical
fuzz}
>   plot a)  R_i   vs i
>   or   b)  CD_i  vs i
> 
> or should users have to manually use
>     plot(<lm>,  which=1:4, ...)
> in such a case?
> 
> Feedback very welcome,
> particularly, you first look at the examples in help(plot.lm) 
> in *R-devel* aka R-2.2.0 alpha.
> 
> Martin Maechler, ETH Zurich
> 
>

John Maindonald

2005-Sep-14 06:09 UTC

head link

[Rd] plot(<lm>): new behavior in R-2.2.0 alpha

Following the >20 messages that Martin mentioned, I had
private discussion with John Fox, which in part lies behind
following questions:

(1) In plot 5, should we have, maybe as an option, vertical
lines at 2hbar and 3hbar, as in the plots produced by the
function that John Fox sent me.  I think this would be a useful
addition, but made no move to add it at that time, considering
it best to put that on a toThink About list.

(2) John also sent code for a plot that place contours of the
covratio on the points that are shown in plot 5.  This could
be added as an option to plot 5.
(The covratio statistic is a measure of the effect of omitting a
point on the variance-covariance matrix.  It is a function of
the residual and the leverage.)  Also there is the possibility
(I am not keen on this) to show Bonferroni critical values for
studentized residuals.

(3) My reaction to the new plot 6 (David Firth's proposal) is
sufficiently similar to John Fox's that I would not use it as a
matter of course.  I think it useful however, precisely because
it does offer a perspective on the information in plot 5 that
is, at first look, startlingly different.

(4) Are there other diagnostics that ought to be included in
stats? (perhaps in a function other than plot.lm(), which risks
being overloaded).  One strong claiment is vif() (variance
inflation factor), of which there are versions both in car and
(written by myself) in DAAG.  John Fox's function does more
than mine.  Thus, assuming that he is willing for it to be taken
across, that should go into stats.

(5) termplot() provides partial residual (component + residual)
plots, which I think extraordinarily useful.  They deserve to be
widely used.
Should partial regression plots also be available?

(6) It should be fairly easy to construct a function that would
examine the distribution of statistics of interest under repeated
bootstrap sampling or simulation.  This can be useful when
with small samples, when it is easy to over-interpret diagnostic
statistics.

(7) There are special issues, not just for aov models, but also
for glm() and (extending the discussion quite a lot) the models
that are fitted by lme()/lmer() [nlme/lme4].

(8) Are there special issues that require attention for large
datasets? [I'm sure there are, but regression diagnostics may
not be the best point of entry into the discussion.]

(9) How about a help(Diagnostics) entry?

(10) Maybe it would be useful to form a (small?) group to look at
what should go into:
  (a) stats
  (b) a specialist diagnostics package
Even if this idea is taken up, some preliminary wider canvassing
of the opinions of members of this list seems desirable.

John Maindonald.

On 14 Sep 2005, at 12:17 AM, Martin Maechler wrote:
> As some of you R-devel readers may know, the plot() method for
> "lm" objects is based in large parts on contributions by John
> Maindonald, subsequently "massaged" by me and other R-core
> members.
>
> In the statistics litterature on applied regression, people have
> had  diverse oppinions on what (and how many!) plots should be
> used for goodness-of-fit / residual diagnostics, and to my
> knowledge most people have agreed to want to see one (or more)
> version of a Tukey-Anscombe plot {Residuals ~ Fitted} and a QQ
> normal plot.
> Another consideration was to be somewhat close to what S
> (S-plus) was doing.  So we have two versions of residuals vs
> fitted, one for checking  E[error] = 0, the other for checking
> Var[error] = constant.  So we got to the first three plots of
> plot.lm() about which I don't want to debate at the moment
> {though, there's room for improvement even there: e.g., I know of at
>  least one case where plot(<lm>) wasn't used because the user
>  was missing the qqline() she was so used to in the QQ plot}
>
> The topic of this e-mail is the (default) 4th plot which I had
> changed; really prompted by the following:
> More than three months ago, John wrote
>   http://tolstoy.newcastle.edu.au/R/devel/05/04/0594.html
>     (which became a thread of about 20 messages, from Apr.23 -- 29,  
> 2005)
>
> and currently,
> NEWS for R 2.2.0 alpha contains
>
>
>>> USER-VISIBLE CHANGES
>>>
>>>    o plot(<lm object>) uses a new default for the fourth
panel when
>>>      'which' is not specified.
>>>      ___ may change before release ___
>>>
>
> and the header is
>
> plot.lm <-
> function (x, which = c(1:3, 5),
>           caption = c("Residuals vs Fitted",
>                   "Normal Q-Q", "Scale-Location",
>               "Cook's distance", "Residuals vs
Leverage",
>               "Cook's distance vs Leverage"),
>        ......... ) {..............}
>
> So we now have 6 possible plots, where 1,2,3 and 5 are the
> defaults (and 1,2,3,4 where the old defaults).
>
> For the influential points and combination of 'influential' and  
> 'outlier'
> there have been quite a few more proposals in the past. R <= 2.1.x
> has been plotting the  Cook's distances vs. observation number,  
> whereas
> quite a few people in the past have noted that all influence
> measures being more or less complicated functions of residuals
> and "hat values" aka "leverages", (R_i, h_{ii}), it
would really
> make sense and fit more to the other plots
> to plot residuals vs. Leverages --- with the additional idea of
> adding *contours* of (equal) Cook's distances to that plot, in
> case one would really want to seem them.
>
> In the mean time, this has been *active* in R-devel for quite a
> while, and we haven't received any new comments.
>
> One remaining problem I'd like to address is the "balanced
AOV"
> situation, something probably pretty rare nowadays in real
> practice, but common of course in teaching ANOVA.
> As you may remember, in a balanced design, all observations have
> the same leverages h_{ii}, and the plot  R_i  vs  h_ii is really
> not so useful.  In that case,  the cook distances CD_i = c *  R_i ^2
> and so  CD_i  vs  i {the old "4-th plot in plot.lm"} is
> graphically identical to   R_i^2 vs i.
> Now in that case (of identical h_ii's), I think one would really
> want  "R_i  vs  i".
>
> Question to the interested parties:
>
>   Should there be an automatism
>      ``when h_ii == const''  {"==" with a bit of
numerical fuzz}
>   plot a)  R_i   vs i
>   or   b)  CD_i  vs i
>
> or should users have to manually use
>     plot(<lm>,  which=1:4, ...)
> in such a case?
>
> Feedback very welcome,
> particularly, you first look at the examples in help(plot.lm)
> in *R-devel* aka R-2.2.0 alpha.
>
> Martin Maechler, ETH Zurich
>
>
John Maindonald             email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Bioinformation Science, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

Werner Stahel

2005-Sep-16 07:37 UTC

head link

[Rd] plot(<lm>): new behavior in R-2.2.0 alpha

Dear Martin, dear Johns

Thanks for including me into your discussion. 

I am a strong supporter of "Residuals vs. Hii"
>> One remaining problem I'd like to address is the "balanced
AOV"
>> situation, ...
In order to keep the plots consistent, I suggest to draw a
histogram. Other alternatives will or can be interesting in the 
general case and therefore are not a suitable substitute for
this plot. 

A plot to be developed may be the following:
Define a distance in the subspace of x-space that is in some way
orthogonal (eg, with respect to the covariance matrix of the x's)
to the fit. Then plot residuals vs. this distance, with
different symbols for small, medium and large fit.
... but this is still a project.

A related project: Daniel (and Wood) introduced a term WSSD, a
distance in x-space. He then studied, for pairs of points, 
difference in residuals as a function of WSSD. If the function
increases, this indicates a lack of fit.

Back to currently available methods:

John Maindonald discusses different contours. I like the
implementation I get currently in R-devel: contours of Cook's
distances, since they are popular and we can then argue that the
plot of D_i vs. i is no more needed.

For most plots, I like to see a smoother along with the points.
I suggest to add the option to include smoothers, not only as an
argument to plot.lm, but even as an option().
I have heared of the intense discussions about options().
With Martin, we arrived at the conclusion that options() should
never influence calculations and results, but is suitable to
adjust outputs (numerical: digits=, graphical: smooth=) to the
user's taste.
>> (4) Are there other diagnostics that ought to be included in
>> stats? (perhaps in a function other than plot.lm(), which risks
>> being overloaded).  One strong claiment is vif() (variance
>> inflation factor),
I clearly support to add either vif or -- equivalent and more
intuitive to me -- R^2_j, the coefficient of determination of 
lm(X_j~.) However
-- this should be included in the coefficient table of print.lm
-- this adds another useless and misleading quantity for dummy
x-variables 
It is therefore quite a different question.

I have my own version of print for my own version of a function
regr(...) that calls lm, glm and other regression functions. 
If you are interested, I can send these functions within a few weeks.
>> (5) termplot() provides partial residual (component + residual)
>> plots, which I think extraordinarily useful.  They deserve to be
>> widely used.
>> Should partial regression plots also be available?
The plot method for my regr objects includes termplots.
I prefer residuals without component effects, but add a
reference line that allows for assessing the component effects.
>> (6) It should be fairly easy to construct a function that would
>> examine the distribution of statistics of interest under repeated
>> bootstrap sampling or simulation.  This can be useful when
>> with small samples, when it is easy to over-interpret diagnostic
>> statistics.
As we focus on plots, my plot method includes the option
(default) to add smooths for 20 simulated datasets (according to
the fitted model). 
>> (8) Are there special issues that require attention for large
>> datasets? [I'm sure there are, but regression diagnostics may
>> not be the best point of entry into the discussion.]
A cynical remark that I like to make about the state of
statistics: 
There is no program that is able to produce a scatterplot of two
variables adequately. 
The functions that I have seen work only for textbook
situations.
Large sample is one situation where they fail, others being
-- multiple points (due to rounding or classification)
-- outliers

This seems to be enough for one message ...

Cheers,

Werner 
----------------- This message was sent by ---------------------------
Werner Stahel                              http://stat.ethz.ch/~stahel
Seminar fuer Statistik                     phone  :    +41 1 632 34 30
ETH-Zentrum, LEO D8                        fax    :    +41 1 632 12 28
CH-8092 Zurich, Switzerland                meet me: Leonhardstr.27, D8

John Fox

2005-Sep-16 12:38 UTC

head link

[Rd] plot(<lm>): new behavior in R-2.2.0 alpha

Dear Werner,

> -----Original Message-----
> From: Werner Stahel [mailto:stahel at stat.math.ethz.ch] 
> Sent: Friday, September 16, 2005 2:37 AM
> To: Martin Maechler
> Cc: R-devel at stat.math.ethz.ch; John Maindonald; Werner 
> Stahel; John Fox
> Subject: Re: plot(<lm>): new behavior in R-2.2.0 alpha
> 
. . .
> 
> For most plots, I like to see a smoother along with the points.
> I suggest to add the option to include smoothers, not only as 
> an argument to plot.lm, but even as an option().
I agree that smoothers are useful.

. . .
> I clearly support to add either vif or -- equivalent and more 
> intuitive to me -- R^2_j, the coefficient of determination of
> lm(X_j~.) However
> -- this should be included in the coefficient table of print.lm
> -- this adds another useless and misleading quantity for 
> dummy x-variables It is therefore quite a different question.
> 
Generalized variance inflation factors (as computed by the vif function in
car) apply to sets of related regressors in a term, such as dummy
regressors.


. . .

Regards,
 John

Apparently Analagous Threads

Search for more reasonably related threads

R devel - Sep 2005 - plot(<lm>): new behavior in R-2.2.0 alpha

[Rd] plot(<lm>): new behavior in R-2.2.0 alpha

[Rd] plot(<lm>): new behavior in R-2.2.0 alpha

[Rd] plot(<lm>): new behavior in R-2.2.0 alpha

[Rd] plot(<lm>): new behavior in R-2.2.0 alpha

[Rd] plot(<lm>): new behavior in R-2.2.0 alpha

Apparently Analagous Threads