Hi Jim,

thank you for your help. :)

My point is that there are outliers and I don't really know how to deal with
that. I need the data frame for a regression, and I have often read that even
a few outliers can change your results very much. In addition, the regression
diagnostics didn't indicate the best results. Yes, and I know it's not in the
spirit of statistics to work in a way that gets you the results you would like
to have ;).

So what is your suggestion?

And if I remove the outliers, my problem is that, as you said, the columns
differ in length. I need the data frame for a regression, so can I remove the
whole column, or is there a call to exclude the data?

JULI
On Sep 12, 2015, at 2:32 AM, Juli wrote:

> And if I remove the outliers, my problem is that, as you said, the columns
> differ in length. I need the data frame for a regression, so can I remove
> the whole column, or is there a call to exclude the data?

Most regression methods have a 'subset' parameter which would allow you to
distort the data to your desired specification. But why not think about
examining a different statistical model or using robust methods? That way you
can keep all your data. (Sounds like you don't really have a lot.)

--
David Winsemius
Alameda, CA, USA
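[For illustration only: a minimal R sketch of the two routes mentioned above.
The data frame 'd', its columns 'y' and 'x', and the 3.5-MAD cut-off are
assumptions made for this example, not part of the thread.]

    ## Hypothetical data frame 'd' with a response 'y' and one predictor 'x'.

    ## Route 1: the 'subset' argument drops the chosen rows from the fit --
    ## here everything farther than 3.5 MADs from the median of x, an
    ## arbitrary cut-off used purely for illustration.
    fit_sub <- lm(y ~ x, data = d,
                  subset = abs(x - median(x)) <= 3.5 * mad(x))

    ## Route 2: keep every row and let a robust fitter downweight unusual
    ## observations instead (rlm() is in the recommended package MASS).
    library(MASS)
    fit_rob <- rlm(y ~ x, data = d)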
... and this, of course, is a nice example of how statistics contributes to
the "irreproducibility crisis" now roiling Science.

Cheers,
Bert

(Quote from a long-ago engineering colleague: "Whenever I see an outlier, I
never know whether to throw it away or patent it.")

Bert Gunter

"Data is not information. Information is not knowledge. And knowledge is
certainly not wisdom."   -- Clifford Stoll

On Sat, Sep 12, 2015 at 9:52 AM, David Winsemius <dwinsemius at comcast.net> wrote:
> Most regression methods have a 'subset' parameter which would allow you to
> distort the data to your desired specification. But why not think about
> examining a different statistical model or using robust methods? That way
> you can keep all your data.
>>>>> Juli <Julianeleuschner at web.de>
>>>>>     on Sat, 12 Sep 2015 02:32:39 -0700 writes:

    > My point is that there are outliers and I don't really know how to
    > deal with that. I need the data frame for a regression and have often
    > read that only a few outliers can change your results very much.
    > [...]
    > So what is your suggestion?

Use robust regression, e.g. MASS::rlm() {part of every R installation}, or a
somewhat better and more sophisticated version, lmrob() from package
'robustbase' {yes, shameless promotion}.

Further:

1) Removing outliers is not at all the best way to deal with such problems
   (intuitively, because it is a *dis*continuous method). Rather, they should
   be downweighted (continuously, as happens with the methods used in rlm()
   or lmrob(), see above).

2) Removing outliers in a *multivariate* setting, if you want to do it in
   spite of 1), by univariate treatment {each column separately, as you do
   here} is often completely insufficient. E.g. the bivariate outlier in

      xy <- cbind(x = c(2, 1:9), y = c(8, 1:9)); plot(xy)

   cannot be found by looking at 'x' and 'y' separately.

3) If, in spite of 1) and 2), you are considering univariate treatment, using
   mean() and sd() for detecting univariate outliers was shown to be
   insufficient more than 50 years ago (*1), and if one looks closer into the
   literature (say "L_1"), even considerably longer ago. Using median() and
   mad() instead is one possibility (*2). Hampel's rule (*3) proposes
   declaring as outliers the observations outside the interval

      median(x) +/- 3.5 * mad(x)

*1 Tukey, J. W. (1960) A survey of sampling from contaminated distributions.
   In Contributions to Probability and Statistics, eds I. Olkin, S. Ghurye,
   W. Hoeffding, W. Madow and H. Mann, pp. 448-485. Stanford: Stanford
   University Press.

*2 Another (less robust, but still infinitely better than mean/sd) approach
   uses median() and IQR(), which is basically/approximately what boxplots
   do to identify outliers.

*3 Frank R. Hampel (1985) The Breakdown Points of the Mean Combined With Some
   Rejection Rules, Technometrics, 27:2, 95-107.
   [ http://dx.doi.org/10.1080/00401706.1985.10488027 ]

   See also section "1.4b. How Well Are Objective and Subjective Methods for
   the Rejection of Outliers Doing in the Context of Robust Estimation?",
   page 62 ff. of Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw
   and Werner A. Stahel (1986) Robust Statistics: The Approach Based on
   Influence Functions. John Wiley & Sons, Inc.
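[A minimal sketch tying points 1)-3) above together, using the toy data from
the post. The helper name hampel_flag() and the data-frame conversion are
illustrative additions, not from the original message; package 'robustbase'
must be installed for lmrob().]

    ## 2) A bivariate outlier that univariate screening cannot see:
    xy <- cbind(x = c(2, 1:9), y = c(8, 1:9))
    plot(xy)                      # the first point sits off the diagonal

    ## 3) Hampel's rule, median(x) +/- 3.5 * mad(x), as a small helper:
    hampel_flag <- function(x, k = 3.5) abs(x - median(x)) > k * mad(x)
    hampel_flag(xy[, "x"])        # all FALSE: 'x' alone raises no alarm
    hampel_flag(xy[, "y"])        # all FALSE: neither does 'y'

    ## 1) Keep every row and let a robust fit downweight the odd one instead:
    library(robustbase)
    fit <- lmrob(y ~ x, data = as.data.frame(xy))
    coef(fit)                                  # close to the slope-1 line
    coef(lm(y ~ x, data = as.data.frame(xy)))  # ordinary LS is pulled by (2, 8)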