Hello R-users,

Is there a more sophisticated way to identify outliers in a dataset than spotting them in a boxplot? I want to exclude them from further analysis, and I am interested in their positions in my data vector.

Rado

--
Radoslav Bonk M.S.
Dept. of Physical Geography and Geoecology
Faculty of Sciences, Comenius University
Mlynska Dolina 842 15, Bratislava, SLOVAKIA
tel: +421 2 602 96 250
e-mail: rbonk at host.sk
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
First, a bit of retoric: Rejecting data points solely on the grounds of being "statistical outliers" probably should be outlawed. (It sounds like that's what you're trying to do.) You need to investigate these data points so that you understand the reason for their "outlyingness" before you decide whether excluding them makes sense. Exclusion based purely on statistical criteria almost guarantees irreproducible research.

Some explanation: Any "statistical outliers" are defined with reference to a model (formal or conceptual). They reflect lack of fit of the model to the data. Rejecting these data points on statistical grounds means you believe the model more than the data, which is not such a good idea for scientific research.

The "outliers" indicated by boxplots are based on a criterion (something like the upper/lower hinges +/- k*IQR, where k is either 1.5 or 3; see ?boxplot.stats for some definitions). In fact, boxplot.stats gives you the limits that boxplot() uses to identify outliers.

Andy

-----Original Message-----
From: Rado Bonk [mailto:rbonk at host.sk]
Sent: Tuesday, November 26, 2002 10:36 AM
To: r-help at stat.math.ethz.ch
Subject: [R] how to identify the outliers

Hello R-users,

Is there a more sophisticated way to identify outliers in a dataset than spotting them in a boxplot? I want to exclude them from further analysis, and I am interested in their positions in my data vector.

Rado
------------------------------------------------------------------------------
Notice: This e-mail message, together with any attachments, contains information of Merck & Co., Inc. (Whitehouse Station, New Jersey, USA) that may be confidential, proprietary copyrighted and/or legally privileged, and is intended solely for the use of the individual or entity named on this message. If you are not the intended recipient, and have received this message in error, please immediately return this by e-mail and then delete it.
=============================================================================
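To make Andy's pointer to boxplot.stats concrete, here is a minimal sketch in base R. The data vector x and the two planted extreme values are made up for illustration, and coef = 1.5 is simply the default multiplier k. The point is to get the positions of the flagged values so they can be investigated, not to delete them automatically.

```r
# Hypothetical example data: 50 ordinary values plus two planted extremes
set.seed(1)
x <- c(rnorm(50), 8, -7)

# boxplot.stats() returns the five-number summary ($stats) and the values
# lying beyond the whiskers ($out); coef is the multiplier k (default 1.5)
bs <- boxplot.stats(x, coef = 1.5)

# Positions in x of the flagged values, for inspection rather than deletion
flagged <- which(x %in% bs$out)

# The limits themselves: lower/upper hinge -/+ k times the hinge spread
hinges <- bs$stats[c(2, 4)]
fences <- hinges + c(-1, 1) * 1.5 * diff(hinges)

print(flagged)     # indices of the flagged points
print(x[flagged])  # their values
print(fences)      # the limits boxplot() would use with this coef
```

Setting coef = 3 gives the wider limits Andy mentions. Note that the hinges come from fivenum() and can differ slightly from the quartiles returned by quantile().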
Dear Rado,

I do not know what your data look like, but in general you can use robust Mahalanobis distances. That is, compute a robust mean and covariance matrix with cov.rob (method="mcd") in library lqs, and pass these as center and cov to the function mahalanobis. As a cutoff value you can take a large quantile (say 0.999) of the chi^2-distribution with p (the number of your variables) degrees of freedom. For details, see Rousseeuw & van Driessen, cited on the help page for cov.rob.

Christian

--
***********************************************************************
Christian Hennig
Seminar fuer Statistik, ETH-Zentrum (LEO), CH-8092 Zuerich (currently)
and Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
hennig at stat.math.ethz.ch, http://stat.ethz.ch/~hennig/
hennig at math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
#######################################################################
I recommend www.boag.de
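Christian's recipe might look roughly like the sketch below. The data matrix X and its three planted outliers are invented for illustration. One assumption about current R: cov.rob is loaded from MASS, which absorbed the lqs package named in the message. mahalanobis() returns squared distances, which is the quantity to compare against the chi^2 quantile.

```r
library(MASS)  # cov.rob() is in MASS in current R (formerly package 'lqs')

set.seed(42)
# Hypothetical data: 100 bivariate normal points plus 3 planted outliers
X <- rbind(matrix(rnorm(200), ncol = 2),
           matrix(c(6, 6, -7, 7, 8, -6), ncol = 2, byrow = TRUE))
p <- ncol(X)  # number of variables

# Robust location and scatter via the minimum covariance determinant
rob <- cov.rob(X, method = "mcd")

# Squared robust Mahalanobis distance of each row from the robust center
d2 <- mahalanobis(X, center = rob$center, cov = rob$cov)

# Cutoff: the 0.999 quantile of chi-squared with p degrees of freedom
cutoff <- qchisq(0.999, df = p)

# Row positions whose squared distance exceeds the cutoff
flagged <- which(d2 > cutoff)
print(flagged)
```

Because the MCD search uses random subsampling, fixing the seed keeps the result reproducible. For a single vector of data (p = 1) the same calls work on a one-column matrix.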
Sigh... Today seems to be my bad English day...

> First, a bit of retoric:
                  ^^^^^^^
That should be "rhetoric", but "propaganda" is probably closer to what I actually meant.

Sorry for wasting bandwidth...

Andy
Dear Rado,

The first thing I would do is investigate the data points you consider "outliers". Excluding them has consequences: IF the values occur naturally, dropping them means that information is lost; otherwise they can be excluded. However, I don't think you need a sophisticated method for this, since you can run analyses such as stepwise or robust analysis. Exclude the points one at a time and rerun your analysis to see how much each exclusion affects your results, and do this repeatedly. That way you avoid excluding "valuable" data and the important information carried by every single observation you have...

Alternatively, you can read Tukey's book "Exploratory Data Analysis"; I don't have it at hand right now.

Cheers,
Eduwin
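Eduwin's "exclude one by one and rerun" idea can be sketched as a leave-one-out loop. Everything named here is an assumption made for illustration: the data vector x, the suspicious value at position 21, and the use of mean() as a stand-in for whatever analysis is actually being run.

```r
# Hypothetical data: 20 ordinary values and one suspicious value at position 21
set.seed(7)
x <- c(rnorm(20, mean = 10), 25)

# Stand-in for the real analysis; the mean is only an assumption
analysis <- function(v) mean(v)

# Result on the full data
full <- analysis(x)

# Re-run the analysis with each observation left out in turn
loo <- vapply(seq_along(x), function(i) analysis(x[-i]), numeric(1))

# Change in the result caused by dropping each point
impact <- loo - full

# Positions ranked by how strongly their removal shifts the result
ranked <- order(abs(impact), decreasing = TRUE)
print(head(ranked))
print(round(impact[head(ranked)], 3))
```

A point whose removal shifts the result far more than any other is a candidate for closer investigation, which is a different thing from a licence to drop it.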