Hi,

this is both a statistical and an R question...

What would be the best way / test to detect an outlying value among a series of 10 to 30 values? For instance, if we have the following dataset: 10, 11, 12, 15, 20, 22, 25, 30, 500, I'd like to have a way to identify the last value as an outlier (in one direction only).

One way would be to calculate abs(mean - median) and, if it is elevated (to what extent?), delete the extreme value and then redo the calculation. But is it valid to do so with so few data? Is (trimmed mean - mean) more efficient? If so, what would be the maximal tolerable value to use as a threshold? (I guess it will be experiment dependent...) Tests for skewness will probably require a larger dataset?

Any suggestions are very welcome!
Thanks for your help,
Philippe Guardiola, MD
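The quantities mentioned in the question are easy to compute in base R. The following is only a minimal sketch for the example dataset; the trimming fraction of 0.2 and the variable names are arbitrary choices made for illustration, and no threshold is implied.

```r
# Example data from the question
x <- c(10, 11, 12, 15, 20, 22, 25, 30, 500)

# Gap between mean and median on the full data
gap_median <- abs(mean(x) - median(x))

# Gap between a trimmed mean and the ordinary mean
# (trim = 0.2 is an illustrative choice, not a recommendation)
gap_trimmed <- abs(mean(x, trim = 0.2) - mean(x))

# "Delete the extreme value, then redo": drop the largest value and recompute
x_clean <- x[-which.max(x)]
gap_median_clean <- abs(mean(x_clean) - median(x_clean))

c(gap_median = gap_median,
  gap_trimmed = gap_trimmed,
  gap_median_clean = gap_median_clean)
```

Comparing the gaps before and after dropping the largest value shows how strongly that single value drives the mean, but it does not by itself say how large a gap is "too large".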
If z is your vector, the following all detect outliers:

    boxplot(z)               # will show the outlier
    plot(lm(z ~ 1))          # the various diagnostic plots show this as well
    library(car)
    outlierTest(lm(z ~ 1))   # tests the most extreme value
                             # (this function was called outlier.test in older versions of car)
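To get the points that a boxplot would draw as outliers without producing a plot, base R's boxplot.stats() can be used. This is a small sketch on the dataset from the question; the vector name z follows the reply above, and coef = 1.5 is simply the function's default whisker length.

```r
# Example data from the question
z <- c(10, 11, 12, 15, 20, 22, 25, 30, 500)

# boxplot.stats() returns the five-number summary used by boxplot()
# and, in $out, the points lying beyond the whiskers
# (more than coef * hinge spread away from the hinges; coef = 1.5 by default)
bs <- boxplot.stats(z, coef = 1.5)

bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # values flagged as outliers
```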
Hi Philippe,

you could consider using the Windsorized mean,

    winds.mean <- function(x, k = 2) {
      y <- x[!is.na(x)]          # drop missing values
      mu <- mean(y)
      stdev <- sd(y)
      upper <- mu + k * stdev    # values beyond these limits are pulled back to them
      lower <- mu - k * stdev
      outliers.up <- y[y > upper]
      outliers.lo <- y[y < lower]
      y[y > upper] <- upper
      y[y < lower] <- lower
      list(mean = mean(y), outliers.up = outliers.up, outliers.lo = outliers.lo)
    }
    ##################
    x <- c(10, 11, 12, 15, 20, 22, 25, 30, 500)
    mean(x)
    winds.mean(x)

I hope this helps.

Best,
Dimitris

----
Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven

Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/16/396887
Fax: +32/16/337015
Web: http://www.med.kuleuven.ac.be/biostat/
     http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm

----- Original Message -----
From: <Phguardiol at aol.com>
To: <r-help at stat.math.ethz.ch>
Sent: Thursday, September 23, 2004 4:22 PM
Subject: [R] detection of outliers

> [...]
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html
On Thu, 23 Sep 2004 Phguardiol at aol.com wrote:

> [...]

You may want to read Davies and Gather, The identification of multiple outliers, JASA 88 (1993), 782-801.

The simplest recommendation is to nominate as outliers all points whose distance from the median is larger than c*mad(data). Choices of c depending on n are given in the above paper. This is somewhat better founded theoretically than the boxplot method recommended by Gabor G., but it is based on the assumption that the distribution of the non-outliers is close to normal and, especially, not strongly skewed (the boxplot method seems to be a bit more robust against skewness).

Christian

***********************************************************************
Christian Hennig
Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
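The median/MAD rule described above can be sketched in a few lines of R. The function name flag_mad_outliers is made up for this illustration, and the default c = 3 is only a placeholder: it is NOT one of the n-dependent values from Davies and Gather, which should be looked up in the paper.

```r
# Flag all points farther than c * mad(x) from the median.
# c = 3 is an illustrative placeholder, not a value from Davies and Gather (1993).
flag_mad_outliers <- function(x, c = 3) {
  x <- x[!is.na(x)]            # drop missing values
  centre <- median(x)
  spread <- mad(x)             # R's mad() rescales by 1.4826 by default,
                               # so it estimates sd under normality
  x[abs(x - centre) > c * spread]
}

x <- c(10, 11, 12, 15, 20, 22, 25, 30, 500)
flagged <- flag_mad_outliers(x)
flagged
```

Because both the median and the MAD are barely affected by a single extreme value, the rule does not suffer from the masking that a mean/sd-based cut-off can show on small samples.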
Hi,

have a look at:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm

It describes Grubbs' Test for Outliers, which is based on the assumption that the data are normally distributed.

Other methods for detecting outliers could be:

Run-Sequence Plot
Histogram
Normal Probability Plot
Box-plot

Best,
Vito
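For the one-sided case asked about (is the largest value an outlier?), the Grubbs statistic and its critical value can be computed by hand in base R. This is a minimal sketch under the normality assumption mentioned above; the function name grubbs_max and the level alpha = 0.05 are choices made for this illustration.

```r
# One-sided Grubbs test for the maximum, computed from its definition.
# Assumes the non-outlying data are approximately normal.
grubbs_max <- function(x, alpha = 0.05) {
  x <- x[!is.na(x)]
  n <- length(x)
  # Test statistic: distance of the maximum from the mean, in sd units
  G <- (max(x) - mean(x)) / sd(x)
  # Critical value from the t distribution with n - 2 degrees of freedom,
  # using the upper alpha / n point for the one-sided test
  t_crit <- qt(alpha / n, df = n - 2, lower.tail = FALSE)
  G_crit <- (n - 1) / sqrt(n) * sqrt(t_crit^2 / (n - 2 + t_crit^2))
  list(G = G, G_crit = G_crit, outlier = G > G_crit)
}

x <- c(10, 11, 12, 15, 20, 22, 25, 30, 500)
res <- grubbs_max(x)
res
```

Note that the test addresses a single suspect value; applying it repeatedly after deleting points changes its error rate.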
Not to oversimplify ...

1. (At least) dozens of books and thousands of papers have been written on this...

2. The most important question is: What is an outlier? (Many smart folks say that the concept is illogical/flawed -- there is no mystical boundary that one crosses to become a statistical pariah; many other smart folks disagree).

3. Equivalently: What is the model with respect to which values are outlying? (with apologies to Winston Churchill's: "That is an indignity up with which I will not put.")

So good advice here is: Beware of good advice about this.

(Of course, I may just be an outlier ...) ;)

Cheers,

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning process." - George E. P. Box
Dimitris Rizopoulos writes, in part:

> Hi Philippe,
>
> you could consider using the Windsorized mean,
>
> winds.mean <- function(x, k=2){

FYI, the tail-shrinking process of Winsorization was brought to the attention of the statistical community by John Tukey. It is named after its originator, Charley Winsor, and not after the House of Windsor.

**********************************************************
Cliff Lunneborg, Professor Emeritus, Statistics & Psychology,
University of Washington, Seattle
cliff at ms.washington.edu