Dear R-users,

I have two outlier-related questions.

I.
I have a vector consisting of 69 values.

mean = 0.00086
SD = 0.02152

The shape of EDA graphics (boxplots, density plots) is heavily distorted by outliers. How should I define the interval for outlier exclusion? Is the interval <mean - 2SD, mean + 2SD> a correct approach?

Or should I define 95% (or 99%) limits of agreement for the data interval, and exclude lower and higher values?

II.
How do I extract only those values from a vector which fulfill an interval condition (higher than A and lower than B)?

Rado Bonk
> II.
> How to extract only those values from vector which fulfill the condition
> of interval (higher than A, and lower than B)?

    x[x>A & x<B]

-- 
318 Carr Hall                                bolker at zoo.ufl.edu
Zoology Department, University of Florida    http://www.zoo.ufl.edu/bolker
Box 118525                                   (ph)  352-392-5697
Gainesville, FL 32611-8525                   (fax) 352-392-3704
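
The subsetting idiom in the reply above can be tried out as in the following minimal sketch. The vector and the bounds A and B are made-up illustrative values, not the poster's data. One point worth noting: with plain logical subsetting, a missing value in x yields an NA in the result, so wrapping the condition in which() is one way to drop those positions.

```r
# Hypothetical example data and bounds (not from the original post)
x <- c(-0.08, -0.01, 0.002, NA, 0.015, 0.09)
A <- -0.05
B <- 0.05

# Logical index: TRUE where the value lies strictly between A and B,
# NA where x itself is missing
inside <- x > A & x < B

# Plain logical subsetting keeps an NA for each missing value
with_na <- x[inside]

# which() returns only the positions where the condition is TRUE,
# so missing values are dropped
kept <- x[which(inside)]

print(with_na)
print(kept)
```

If the bounds should be included in the interval, `>=` and `<=` can be used in place of `>` and `<`.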
Hi,

the boxplot is based on the quartiles, which are much less outlier-sensitive than mean and SD, and should therefore not be "heavily distorted by outliers". What you mean is presumably that, because of your outliers, the main bulk of the data appears only as a very small box on the screen.

However, a simple, straightforward method for outlier identification is median +/- 5.2*mad, as suggested by Hampel, Technometrics 27 (1985) 95-107. Outlier identification by use of mean and SD is often bad because these statistics are strongly influenced by the outliers.

    # x is the data vector
    medx <- median(x)
    madx <- mad(x)
    outliers <- (x<medx-5.2*madx) | (x>medx+5.2*madx)
    selected <- x[!outliers]

Best,
Christian

On 20 Feb 2003, Rado Bonk wrote:
> I have two outliers related questions.
> [...]

-- 
Christian Hennig
Seminar fuer Statistik, ETH-Zentrum (LEO), CH-8092 Zuerich (currently)
and Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
hennig at stat.math.ethz.ch, http://stat.ethz.ch/~hennig/
hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
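
The median +/- 5.2*mad rule from the reply above could be wrapped in a small function, as in the sketch below. The name hampel_flag and the example data are hypothetical. One assumption to be aware of: R's mad() multiplies the raw median absolute deviation by constant = 1.4826 by default, and the thread does not say whether the factor 5.2 is meant for the scaled or the raw MAD, so the constant is exposed as an argument here.

```r
# hampel_flag is a hypothetical helper name, not a base R function.
# It returns TRUE for values further than k MADs from the median.
hampel_flag <- function(x, k = 5.2, constant = 1.4826) {
  medx <- median(x, na.rm = TRUE)
  # mad() scales the raw MAD by `constant` (1.4826 is R's default)
  madx <- mad(x, center = medx, constant = constant, na.rm = TRUE)
  abs(x - medx) > k * madx
}

# Made-up data: a tight bulk plus two gross values
x <- c(0.001, -0.002, 0.003, 0.000, -0.001, 0.002, 0.15, -0.12)

flag_mad <- hampel_flag(x)
selected <- x[!flag_mad]

# Interval limits from the robust rule and from mean +/- 2 SD, for comparison
robust_limits <- median(x) + c(-1, 1) * 5.2 * mad(x)
sd_limits     <- mean(x) + c(-1, 1) * 2 * sd(x)

print(x[flag_mad])
print(robust_limits)
print(sd_limits)
```

Printing both sets of limits side by side gives a feel for how much the mean and SD are pulled around by the extreme values compared with the median and MAD.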
On Thu, Feb 20, 2003 at 06:37:48PM -0500, Rado Bonk wrote:
> Dear R-users,
>
> I have two outliers related questions.
>
> I.
> I have a vector consisting of 69 values.
>
> mean = 0.00086
> SD = 0.02152
>
> The shape of EDA graphics (boxplots, density plots) is heavily distorted
> due to outliers. How to define the interval for outliers exception? Is
> <2SD - mean + 2SD> interval a correct approach?

Yikes. There's been a lot of discussion of this over the years; these discussions usually generate more heat than light.

<personal bias>
Throwing away outliers without further investigation is often considered a bad idea. The argument is that you get into a situation where you are rejecting data because it doesn't fit the model, which is a strange approach.

The most famous case of this was satellite data on ozone thickness over Antarctica - the ozone hole was missed for years because of an automatic outlier-rejection routine in the data analysis. If those outliers hadn't been rejected, remedial steps could have been taken sooner, avoiding a lot of damage.

My own work is in industrial process control - if I ignored outliers, I'd make an awful lot of very bad mistakes, and wouldn't have a job for long. Outliers aren't necessarily wrong - sometimes the data is trying to tell you something.
</personal bias>

Robust summaries are another way. Check out the help pages for mad(), IQR(), fivenum().

Having said that, if you want to compare outlier-free data with your raw data to help enlighten you about where those outliers might be coming from, something like this might help...

    ss <- mad(myvec)
    mm <- median(myvec)
    ind <- (myvec > mm - 3*ss & myvec < mm + 3*ss)
    # or
    ind2 <- (myvec > quantile(myvec,0.025) & myvec < quantile(myvec,0.975))
    boxplot(myvec[ind])
    boxplot(myvec[ind2])

Cheers

Jason
-- 
Indigo Industrial Controls Ltd.
64-21-343-545
jasont at indigoindustrial.co.nz
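
In the spirit of investigating outliers rather than silently dropping them, the comparison suggested above might be extended as in the sketch below. The data are made up, and the data frame `flagged` is just one hypothetical way of keeping the positions of the flagged values so they can be traced back to their source. The robust summaries mentioned in the reply are computed alongside.

```r
# Made-up example vector (not the poster's data)
myvec <- c(0.001, -0.002, 0.003, 0.000, -0.001,
           0.002, 0.004, -0.003, 0.15, -0.12)

# Robust summaries: fivenum() gives min, lower hinge, median,
# upper hinge, max; IQR() and mad() are robust measures of spread
five   <- fivenum(myvec)
spread <- c(IQR = IQR(myvec), MAD = mad(myvec), SD = sd(myvec))

# Same median +/- 3*mad rule as in the reply above
ss  <- mad(myvec)
mm  <- median(myvec)
ind <- myvec > mm - 3 * ss & myvec < mm + 3 * ss

# Keep a record of what was flagged instead of just discarding it
flagged <- data.frame(position = which(!ind), value = myvec[!ind])

print(five)
print(spread)
print(flagged)

# Raw and filtered data side by side in one plot
boxplot(list(raw = myvec, filtered = myvec[ind]))
```

Plotting both versions on the same axes makes it easier to see how much of the picture is determined by the flagged points.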