thr3ads.net - R help - [R] Identification of Outliners and Extraction of Samples [Aug 2010]


Alexander Eggel

2010-Aug-09 22:27 UTC

[R] Identification of Outliners and Extraction of Samples

Hello everybody,

I need to know which samples (S1-S6) contain a value that is bigger than the
median + five standard deviations of the column it is in. This is just an
example. The command should be applied to a data frame which is a lot bigger
(over 100 columns). Any solutions? Thank you very much for your help!!!

> s
  Samples  A  B  C  E
1      S1  1  2  3  7
2      S2  4 NA  6  6
3      S3  7  8  9 NA
4      S4  4  5 NA  6
5      S5  2  5  6  7
6      S6  2  3  4  5

This loop works fine for a column without NA values. However it doesn't work
for the other columns. I should have a loop that I could apply to all
columns ideally in "one command".

o <- data.frame()
for (i in 1:nrow(s))
{
       dd <- s[i,]
       if (dd$A >= median(s$A, na.rm=TRUE) + 5 * sd(s$A, na.rm=TRUE))
              o <- rbind(o, dd)
}
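
A likely reason the loop fails on columns with missing values: comparing NA
with a number gives NA, and if (NA) stops with "missing value where
TRUE/FALSE needed". The sketch below rebuilds the example data and wraps the
test in isTRUE() so that missing values count as "not flagged". The name
threshold and the choice of column B are only for illustration.

```r
# Rebuild the example data from the post.
s <- data.frame(
  Samples = c("S1", "S2", "S3", "S4", "S5", "S6"),
  A = c(1, 4, 7, 4, 2, 2),
  B = c(2, NA, 8, 5, 5, 3),
  C = c(3, 6, 9, NA, 6, 4),
  E = c(7, 6, NA, 6, 7, 5)
)

# Cutoff for column B, ignoring missing values.
threshold <- median(s$B, na.rm = TRUE) + 5 * sd(s$B, na.rm = TRUE)

# The comparison itself is NA for the missing entry in row 2.
is.na(s$B[2] >= threshold)   # TRUE

# isTRUE() turns NA into FALSE, so the loop no longer stops at row 2.
o <- data.frame()
for (i in seq_len(nrow(s))) {
  dd <- s[i, ]
  if (isTRUE(dd$B >= threshold)) o <- rbind(o, dd)
}
nrow(o)   # 0: no value in B exceeds median + 5 sd
```

This still handles one column at a time; it only removes the error.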


Frank Harrell

2010-Aug-09 22:47 UTC


On Mon, 9 Aug 2010, Alexander Eggel wrote:
> Hello everybody,
>
> I need to know which samples (S1-S6) contain a value that is bigger than the
> median + five standard deviations of the column it is in. This is just an
Why not the 70th percentile plus 6 times the difference in the 85th 
and 75th percentiles  :-)

Frank

P.S.  See

@Article{fin06cal,
   author  = {Finney, David J.},
   title   = {Calibration guidelines challenge outlier practices},
   journal = {The American Statistician},
   year    = 2006,
   volume  = 60,
   pages   = {309-313},
   annote  = {anticoagulant therapy; bias; causation; ethics; objectivity;
              outliers; guidelines for treatment of outliers; overview of
              types of outliers; letter to the editor and reply 61:187
              May 2007}
}

> example. The command should be applied to a data frame which is a lot bigger
> (over 100 columns). Any solutions? Thank you very much for your help!!!

David Winsemius

2010-Aug-10 01:53 UTC


On Aug 9, 2010, at 6:27 PM, Alexander Eggel wrote:
> Hello everybody,
>
> I need to know which samples (S1-S6) contain a value that is bigger than the
> median + five standard deviations of the column it is in. This is just an
> example. The command should be applied to a data frame which is a lot bigger
> (over 100 columns). Any solutions? Thank you very much for your help!!!
>
Let's look at the more general problem of how to do column-wise  
calculations (since I suspect there is not much support in this  
neighborhood for the notion that you have a proper definition of  
"outlier" and furthermore you have not provided an example where any  
such outliers exist). Let's just calculate a set of logical vectors  
that signal whether a value is greater than one sd above the median:

apply(s[-1], 2, function(x) {x > median(x, na.rm=TRUE) + sd(x, na.rm=TRUE)})

       A     B     C     E
1 FALSE FALSE FALSE  TRUE
2 FALSE    NA FALSE FALSE
3  TRUE  TRUE  TRUE    NA
4 FALSE FALSE    NA FALSE
5 FALSE FALSE FALSE  TRUE
6 FALSE FALSE FALSE FALSE

Each column is passed in turn to the function (as a vector), and the
function then calculates the median() and sd() with that vector as the
first argument. The ">" operator has a vector on the lhs and a scalar
on the rhs, but that is perfectly fine, and we get the expected results
in a logical matrix.
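
To get from that logical matrix back to the sample names asked about in the
original question, one option is to count the flagged values per row. This is
only a sketch: flag_high and its argument k are hypothetical names, sapply()
is swapped in for apply(), and k = 1 is used so that the small example
actually flags something (set k = 5 for the rule in the question).

```r
# Rebuild the example data from the post.
s <- data.frame(
  Samples = c("S1", "S2", "S3", "S4", "S5", "S6"),
  A = c(1, 4, 7, 4, 2, 2),
  B = c(2, NA, 8, 5, 5, 3),
  C = c(3, 6, 9, NA, 6, 4),
  E = c(7, 6, NA, 6, 7, 5)
)

# Hypothetical helper: TRUE where a value exceeds median + k * sd
# of its own column; missing values stay NA.
flag_high <- function(x, k = 5) {
  x > median(x, na.rm = TRUE) + k * sd(x, na.rm = TRUE)
}

# One logical column per data column (rows are samples).
flags <- sapply(s[-1], flag_high, k = 1)

# A sample is reported if any of its non-missing values is flagged.
hits <- rowSums(flags, na.rm = TRUE) > 0
as.character(s$Samples[hits])   # "S1" "S3" "S5"
```

With the default k = 5 no value in this example is flagged, so the result is
empty. The same call works unchanged on a data frame with 100+ columns, as
long as every column after the first is numeric.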

-- 
David.
