None of Andy's comments) are inconsistent with the point that
rlm() and lqs(), if they disagree with lm(), likely offer better
places to start than does lm(), in identifying points that should
be examined as in some sense outliers. All such methods are
to be used, not as crutches, but as sources of insight.
Incidentally, the "rlm" class inherits from "lm", and
plot.lm()
(or, preferably, the plot generic) can be used with "rlm" objects.
This is not the case for "lqs" objects.
John Maindonald.
On 17 Mar 2007, at 10:00 PM, Andy Liaw wrote:
> From: "Liaw, Andy" <andy_liaw@merck.com>
> Date: 17 March 2007 4:21:53 AM
> To: Bert Gunter <gunter.berton@gene.com>,
> ted.harding@manchester.ac.uk, r-help@stat.math.ethz.ch
> Subject: Re: [R] Bad points in regression [Broadcast]
>
>
> (My turn on the soapbox ...)
>
> I'd like to add a bit of caveat to Bert's view. I'd argue
(perhaps
> even
> plead) that robust/resistant procedures be used with care. They
> should
> not be used as a shortcut to avoid careful analysis of data. I
> recalled
> that in my first course on regression, the professor made it clear
> that
> we're fitting models to data, not the other way around. When the
> model
> fits badly to (some of the) the data, do examine and think carefully
> about what happened. Verify that "bad data" are indeed bad,
> instead of
> using statistical criteria to make that judgment. A scientific
> colleague reminded me of this point when I tried to sell him the
> idea of
> robust/resistant methods: Don't use these methods as a crutch to
> stand
> on badly run experiments (or poorly fitted models).
>
> Cheers,
> Andy
>
> From: Bert Gunter
>>
>> (mount soapbox...)
>>
>> While I know the prior discussion represents common practice,
>> I would argue
>> -- perhaps even plead -- that the modern(?? >30 years old
>> now) alternative of robust/resistant estimation be used,
>> especially in the readily available situation of
>> least-squares regression. RSiteSearch("Robust") will bring up
>> numerous possibilities.rrcov and robustbase are at least two
>> packages devoted to this, but the functionality is available
>> in many others (e.g.
>> rlm() in MASS).
>>
>> Bert Gunter
>> Genentech Nonclinical Statistics
>> South San Francisco, CA 94404
>> 650-467-7374
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: r-help-bounces@stat.math.ethz.ch
>> [mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of Ted Harding
>> Sent: Friday, March 16, 2007 6:44 AM
>> To: r-help@stat.math.ethz.ch
>> Subject: Re: [R] Bad points in regression
>>
>> On 16-Mar-07 12:41:50, Alberto Monteiro wrote:
>>> Ted Harding wrote:
>>>>
>>>>> alpha <- 0.3
>>>>> beta <- 0.4
>>>>> sigma <- 0.5
>>>>> err <- rnorm(100)
>>>>> err[15] <- 5; err[25] <- -4; err[50] <- 10 x <-
1:100 y
>> <- alpha +
>>>>> beta * x + sigma * err ll <- lm(y ~ x)
>>>>> plot(ll)
>>>>
>>>> ll is the output of a linear model fiited by lm(), and so
>> has several
>>>> components (see ?lm in the section "Value"), one of
which is
>>>> "residuals" (which can be abbreviated to
"res").
>>>>
>>>> So, in the case of your example,
>>>>
>>>> which(abs(ll$res)>2)
>>>> 15 25 50
>>>>
>>>> extracts the information you want (and the ">2"
was inspired by
>>>> looking at the "residuals" plot from your
"plot(ll)").
>>>>
>>> Ok, but how can I grab those points _in general_? What is the
>>> criterium that plot used to mark those points as bad points?
>>
>> Ahh ... ! I see what you're after. OK, look at the plot
>> method for lm():
>>
>> ?plot.lm
>> ## S3 method for class 'lm':
>> plot(x, which = 1:4,
>> caption = c("Residuals vs Fitted", "Normal Q-Q
plot",
>> "Scale-Location plot", "Cook's distance
plot"),
>> panel = points,
>> sub.caption = deparse(x$call), main = "",
>> ask = prod(par("mfcol")) < length(which) &&
dev.interactive(),
>> ...,
>> id.n = 3, labels.id = names(residuals(x)), cex.id = 0.75)
>>
>>
>> where (see further down):
>>
>> id.n: number of points to be labelled in each plot, starting with
>> the most extreme.
>>
>> and note, in the default parameter-values listing above:
>>
>> id.n = 3
>>
>> Hence, the 3 most extreme points (according to the criterion
>> being plotted in each plot) are marked in each plot.
>>
>> So, for instance3, try
>>
>> plot(ll,id.n=5)
>>
>> and you will get points 10,15,25,28,50. And so on. But that
>> pre-supposes that you know how many points are "exceptional".
>>
>>
>> What is meant by "extreme"is not stated in the help page
>> ?plot.lm, but can be identified by inspecting the code for
>> plot.lm(), which you can see by entering
>>
>> plot.lm
>>
>> In your example, if you omit the line which assigns anomalous
>> values to err[15[, err[25] and err[50], then you are likely
>> to observe that different points get identified on different
>> plots. For instance, I just got the following results for the
>> default id.n=3:
>>
>> [1] Residuals vs Fitted: 41,53,59
>> [2] Standardised Residuals: 41,53,59
>> [3] sqrt(Stand Res) vs Fitted: 41,53,59
>> [4] Cook's Distance: 59,96,97
>>
>>
>> There are several approaches (with somewhat different
>> outcomes) to identifying "outliers". If you apply one of
>> these, you will probably get the identities of the points anyway.
>>
>> Again in the context of your example (where in fact you
>> deliberately set 3 points to have exceptional errors, thus
>> coincidentally the same as the default value 3 of id.n), you
>> could try different values for id.n and inspect the graphs to
>> see whether a given value of id.n marks some points that do
>> not look exceptional relative to the mass of the other points.
>>
>> So, the above plot(ll,id.n=5) gave me one point, "10" on the
>> residuals plot, which apparently belonged to the general
>> distribution of residuals.
>>
>> Hoping this helps,
>> Ted.
>>
>> --------------------------------------------------------------------
>> E-Mail: (Ted Harding) <ted.harding@nessie.mcc.ac.uk>
>> Fax-to-email: +44 (0)870 094 0861
>> Date: 16-Mar-07 Time: 13:43:54
>> ------------------------------ XFMail ------------------------------
[[alternative HTML version deleted]]