Alexander Eggel
2010-Aug-09 22:27 UTC
[R] Identification of Outliners and Extraction of Samples
Hello everybody,

I need to know which samples (S1-S6) contain a value that is bigger than the median + five standard deviations of the column it is in. This is just an example. The command should be applied to a data frame which is a lot bigger (over 100 columns). Any solutions? Thank you very much for your help!

> s
  Samples A  B  C  E
1      S1 1  2  3  7
2      S2 4 NA  6  6
3      S3 7  8  9 NA
4      S4 4  5 NA  6
5      S5 2  5  6  7
6      S6 2  3  4  5

This loop works fine for a column without NA values. However, it doesn't work for the other columns. Ideally I would have a loop that I could apply to all columns in "one command".

    o <- data.frame()
    for (i in 1:nrow(s)) {
      dd <- s[i, ]
      if (dd$A >= median(s$A, na.rm=TRUE) + 5 * sd(s$A, na.rm=TRUE)) o <- rbind(o, dd)
    }
Frank Harrell
2010-Aug-09 22:47 UTC
[R] Identification of Outliners and Extraction of Samples
On Mon, 9 Aug 2010, Alexander Eggel wrote:

> Hello everybody,
>
> I need to know which samples (S1-S6) contain a value that is bigger than the
> median + five standard deviations of the column he is in.

Why not the 70th percentile plus 6 times the difference in the 85th and 75th percentiles :-)

Frank

P.S. See

@Article{fin06cal,
  author = {Finney, David J.},
  title = {Calibration guidelines challenge outlier practices},
  journal = {The American Statistician},
  year = 2006,
  volume = 60,
  pages = {309-313},
  annote = {anticoagulant therapy;bias;causation;ethics;objectivity;outliers;guidelines for treatment of outliers;overview of types of outliers;letter to the editor and reply 61:187 May 2007}
}
David Winsemius
2010-Aug-10 01:53 UTC
[R] Identification of Outliners and Extraction of Samples
On Aug 9, 2010, at 6:27 PM, Alexander Eggel wrote:

> Hello everybody,
>
> I need to know which samples (S1-S6) contain a value that is bigger than the
> median + five standard deviations of the column he is in.

Let's look at the more general problem of how to do column-wise calculations (since I suspect there is not much support in this neighborhood for the notion that you have a proper definition of "outlier", and furthermore you have not provided an example where any such outliers exist). Let's just calculate a set of logical vectors that signal whether a value is greater than one sd above the median:

    apply(s[-1], 2, function(x) {x > median(x, na.rm=TRUE) + sd(x, na.rm=TRUE)})

          A     B     C     E
    1 FALSE FALSE FALSE  TRUE
    2 FALSE    NA FALSE FALSE
    3  TRUE  TRUE  TRUE    NA
    4 FALSE FALSE    NA FALSE
    5 FALSE FALSE FALSE  TRUE
    6 FALSE FALSE FALSE FALSE

Each column is passed in turn to the function (as a vector), and the function then calculates median() and sd() with that vector as the first argument. The ">" operator has a vector on the lhs and a scalar on the rhs, but that is perfectly fine and we get the expected results in a logical matrix.

--
David.
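
Building on the column-wise idea above, here is a minimal sketch (not part of the original thread) of how such a logical matrix could be turned back into the rows the question asked for. The helper name flag_high and its k argument are hypothetical. sapply() is used here instead of apply() so the data frame is not first coerced to a matrix, and treating NA comparisons as "not flagged" is an assumption about what is wanted.

```r
# Example data as posted in the original question
s <- data.frame(
  Samples = c("S1", "S2", "S3", "S4", "S5", "S6"),
  A = c(1, 4, 7, 4, 2, 2),
  B = c(2, NA, 8, 5, 5, 3),
  C = c(3, 6, 9, NA, 6, 4),
  E = c(7, 6, NA, 6, 7, 5)
)

# Hypothetical helper: for each column, flag values above median + k * sd.
# Returns a logical matrix with one column per input column.
flag_high <- function(df, k = 1) {
  sapply(df, function(x) x > median(x, na.rm = TRUE) + k * sd(x, na.rm = TRUE))
}

# k = 1 reproduces the "one sd above the median" rule from the reply
flags <- flag_high(s[-1], k = 1)

# Keep a row if any of its columns is flagged; NA is counted as not flagged
keep <- rowSums(flags, na.rm = TRUE) > 0
o <- s[keep, ]
print(as.character(o$Samples))   # "S1" "S3" "S5"

# With the k = 5 rule from the question, no row of this example is selected
o5 <- s[rowSums(flag_high(s[-1], k = 5), na.rm = TRUE) > 0, ]
print(nrow(o5))                  # 0
```

Because the flags are computed for all columns in one call, the same two lines should scale to a data frame with 100+ columns, provided every column after the first is numeric.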