Can anyone comment on, or point me to a discussion of, the pros and cons of robust regression versus a more "manual" approach of trimming outliers and/or "normalizing" the data used in a regression analysis?
Ruben Roa
2006-Apr-06 16:09 UTC
[R] pros and cons of "robust regression"? (i.e. rlm vs lm)
Berton Gunter wrote:
> I think it fair to say that we have known since at least the 1970's
> that practically any robust downweighting procedure (see, e.g.,
> "M-estimation") is preferable (more efficient, better continuity
> properties, better estimates) to trimming "outliers" defined by
> arbitrary thresholds.

In the mixture-of-distributions approach of ADMB's robust_regression(x,y,a) command, there is no need to abandon the likelihood function for a more general objective function. The outliers are assumed to come from a second, contaminating distribution with an extra parameter a, and a proper, more complete likelihood function is then used. The mixture-of-distributions approach also seems more interpretable: it maps onto physical mechanisms that generate departures from the distributional assumptions. In a paper on nonlinear models for the growth of certain marine animals, where I used ADMB's robust regression, I argued that the outliers were produced by human error in reading age from certain hard structures in the animals' bodies. This was consistent with the structure of the likelihood, which was a mixture of a normal distribution and a contaminating distribution with fatter tails, operating mostly at higher values of the predictor variable (age).

Ruben
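P.S. For anyone who wants to see the mixture idea concretely, here is a minimal sketch in R fitted with optim(). This is not ADMB's actual robust_regression() implementation: the choice of contaminating distribution (a normal with five times the standard deviation) and the logit parameterization of the mixing proportion are illustrative assumptions only.

    ## negative log-likelihood of a two-component contaminated-normal mixture
    negll <- function(par, x, y) {
      b0 <- par[1]; b1 <- par[2]
      sigma <- exp(par[3])        # log scale keeps sigma positive
      a     <- plogis(par[4])     # mixing proportion of the contaminant, in (0, 1)
      mu  <- b0 + b1 * x
      lik <- (1 - a) * dnorm(y, mu, sigma) +      # "good" observations
                  a  * dnorm(y, mu, 5 * sigma)    # fat-tailed contaminant
      -sum(log(lik))
    }

    set.seed(1)
    x <- 1:50
    y <- 2 + 0.5 * x + rnorm(50)
    y[c(10, 40)] <- y[c(10, 40)] + 15    # two gross outliers
    fit <- optim(c(0, 0, 0, -2), negll, x = x, y = y)
    fit$par[1:2]    # intercept and slope, little affected by the outliers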
Berton Gunter
2006-Apr-06 16:21 UTC
[R] pros and cons of "robust regression"? (i.e. rlm vs lm)
There is a **Huge** literature on robust regression, including many books that you can search for at, e.g., Amazon. I think it fair to say that we have known since at least the 1970's that practically any robust downweighting procedure (see, e.g., "M-estimation") is preferable (more efficient, better continuity properties, better estimates) to trimming "outliers" defined by arbitrary thresholds. An excellent but now probably dated introductory discussion can be found in "UNDERSTANDING ROBUST AND EXPLORATORY DATA ANALYSIS", edited by Hoaglin, Mosteller, and Tukey.

The rub in all this is that the nice small-sample inference results go out the window, though bootstrapping can help with this. Nevertheless, for a variety of reasons, my recommendation is simply to **never** use lm and **always** use rlm (with maybe a few minor caveats). Many would disagree with this, however.

I don't think "normalizing" data, as the term is conventionally used, has anything to do with robust regression, btw.

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning process." - George E. P. Box
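To see the difference on a toy example (rlm() lives in the MASS package, which ships with R; the data below are simulated purely for illustration):

    library(MASS)                  # for rlm()

    set.seed(42)
    x <- 1:30
    y <- 1 + 2 * x + rnorm(30)
    y[30] <- y[30] + 50            # one wild point at the end

    coef(lm(y ~ x))                # least squares: slope pulled up by the outlier
    coef(rlm(y ~ x))               # M-estimate stays near the true slope of 2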
Liaw, Andy
2006-Apr-06 16:56 UTC
[R] pros and cons of "robust regression"? (i.e. rlm vs lm)
To add to Bert's comments:

- "Normalizing" data (e.g., subtracting the mean and dividing by the SD) can help the numerical stability of the computation, but that is mostly unnecessary with modern hardware. As Bert said, it has nothing to do with robustness.

- Instead of _replacing_ lm() with rlm() or another robust procedure, I'd run both. Some scientists view robust procedures that omit data points in an automatic fashion (e.g., by assigning them essentially zero weight), with the result simply trusted, as bad science, and I think they have a point. Using a robust procedure does not free one from examining the data carefully and looking at diagnostics. Careful treatment of outliers is especially important, I think, for data coming from a confirmatory experiment. If the conclusion you draw depends on downweighting or omitting certain data points, you ought to have a very good reason for doing so. It cannot be over-emphasized how important it is not to take outlier deletion lightly. I've seen many cases in which what originally seemed like an outlier turned out to be legitimate data, and omitting it just led to an overly optimistic assessment of variability.

Andy
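A sketch of that "run both, then look" workflow in R, reusing the simulated data from the sketch above (the 0.5 cutoff on the weights is an arbitrary illustrative threshold, not a recommendation):

    library(MASS)

    set.seed(42)
    x <- 1:30
    y <- 1 + 2 * x + rnorm(30)
    y[30] <- y[30] + 50

    fit.ls  <- lm(y ~ x)
    fit.rob <- rlm(y ~ x)

    cbind(ls = coef(fit.ls), robust = coef(fit.rob))  # report both fits
    which(fit.rob$w < 0.5)    # heavily downweighted points -- go examine them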
Liaw, Andy
2006-Apr-06 18:03 UTC
[R] pros and cons of "robust regression"? (i.e. rlm vs lm)
Some people use the two terms more or less interchangeably. "Beneficial" in what sense? If there are no polynomial terms involving a variable, a linear transformation of that variable (by itself) does not change the quality of the fit at all. The scaling is reflected in the coefficient and its SE, but the t-statistics and p-values stay the same, as do the predictions, R-squared, etc. Even when there is a polynomial term, R-squared and the predictions do not change.

HTH,
Andy

From: roger bos
> I'm asking this question purely for my own benefit, not to try to
> correct anyone. The procedure you refer to as "normalization" I have
> always heard referred to as "standardization". Is the former the proper
> term? Also, you say it's not necessary given today's hardware, but
> isn't it beneficial to get all the variables into a similar range? Is
> there any other transformation that you would suggest? I use rlm (and
> "normalization") in the models I run every day, so I was happy to read
> the above comments.
>
> Thanks,
>
> Roger
>
> On 4/6/06, Berton Gunter <gunter.berton at gene.com> wrote:
> > Thanks, Andy. Well said. Excellent points. The final weights from rlm
> > serve this diagnostic purpose, of course.
> >
> > -- Bert
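A quick way to verify the invariance Andy describes, with simulated data:

    set.seed(1)
    x <- rnorm(100, mean = 50, sd = 10)
    y <- 3 + 0.2 * x + rnorm(100)

    f1 <- lm(y ~ x)
    f2 <- lm(y ~ scale(x))         # same model, standardized predictor

    summary(f1)$coefficients       # slope and SE differ between fits...
    summary(f2)$coefficients       # ...but the slope's t value and p-value match
    c(summary(f1)$r.squared, summary(f2)$r.squared)   # identical R-squared
    all.equal(fitted(f1), fitted(f2))                 # identical predictions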
bogdan romocea
2006-Apr-06 19:47 UTC
[R] pros and cons of "robust regression"? (i.e. rlm vs lm)
There are several kinds of standardization, and 'normalization' is only one of them. For some details you could check http://support.sas.com/91doc/getDoc/statug.hlp/stdize_index.htm (see the Details section for the standardization methods). Standardization is required prior to clustering to control for the impact of scale: variables with large variances tend to have more effect on the resulting clusters than those with small variances. I don't know how valuable standardization may be in other areas.

From: roger bos
> The procedure you refer to as "normalization" I have always heard
> referred to as "standardization". Is the former the proper term?
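The scale effect is easy to demonstrate with kmeans() and scale() from base R (simulated data; the variable names and group structure below are made up for illustration):

    set.seed(2)
    g <- rep(1:2, each = 25)
    age    <- rnorm(50, mean = c(30, 55)[g], sd = 4)   # carries the group signal
    income <- rnorm(50, mean = 45000, sd = 15000)      # pure noise, huge scale

    raw <- kmeans(cbind(income, age), centers = 2, nstart = 10)
    std <- kmeans(scale(cbind(income, age)), centers = 2, nstart = 10)

    table(raw$cluster, g)   # raw scale: the noisy income variable dominates
    table(std$cluster, g)   # standardized: the age-based groups are recovered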