aajit75
2011-Oct-22 10:57 UTC
[R] Data frame manipulation by eliminating rows containing extreme values
Dear All,

I have computed the limits for removing extreme values for each variable using the following function:

f <- function(x) {
  quantile(x, c(0.25, 0.75), na.rm = TRUE) -
    matrix(IQR(x, na.rm = TRUE) * c(1.5), nrow = 1) %*% c(-1, 1)
}

# Example:
n  <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- x1 + x2 + runif(n)/10
x4 <- x1 + x2 + x3 + runif(n)/10
x5 <- factor(sample(c('a','b','c'), n, replace = TRUE))
x6 <- 1*(x5 == 'a' | x5 == 'c')
data1 <- cbind(x1, x2, x3, x4, x5, x6)
data2 <- data.frame(data1)
xyz <- lapply(data1, f)

# Now I can eliminate the rows (observations) that contain extreme values
# for each of the variables, one variable at a time, as below:
data2 <- subset(data2, x1 <= xyz$x1[,1] & x1 >= xyz$x1[,2])
data2 <- subset(data2, x2 <= xyz$x2[,1] & x2 >= xyz$x2[,2])

... and so on.

But my data has many more variables (more than 120). Can anybody suggest an efficient way of eliminating the rows containing extreme values?

Thanks in advance!

Regards,
-Ajit
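P.S. To make the question concrete, below is a loop version of the subset() calls above, which I am not sure is the best way. It is only a sketch: it assumes every column of data2 is numeric and that the limits were computed column-wise on the data frame (xyz <- lapply(data2, f)), so that each element of xyz is a 1 x 2 matrix.

# Sketch only: loop over the per-variable limit matrices and keep the rows
# that fall inside every pair of limits (assumes all-numeric data2 and
# xyz <- lapply(data2, f)).
keep <- rep(TRUE, nrow(data2))
for (v in names(xyz)) {
  lim  <- xyz[[v]]                # 1 x 2 matrix of limits for variable v
  keep <- keep & data2[[v]] <= lim[, 1] & data2[[v]] >= lim[, 2]
}
data2 <- data2[keep, ]

Is there a cleaner or more efficient way to do this across 120+ variables?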
David Winsemius
2011-Oct-22 13:50 UTC
[R] Data frame manipulation by eliminating rows containing extreme values
On Oct 22, 2011, at 6:57 AM, aajit75 wrote:

> Dear All,
>
> I have computed the limits for removing extreme values for each variable
> using the following function:
>
> f <- function(x) {
>   quantile(x, c(0.25, 0.75), na.rm = TRUE) -
>     matrix(IQR(x, na.rm = TRUE) * c(1.5), nrow = 1) %*% c(-1, 1)
> }

I think you need to clarify what your expectations are for that function. First you calculate the interquartile range and then you subtract 1.5 times the interquartile range. Exactly how does that identify extreme values? It appears you would be removing substantial amounts of your data.

> # Example:
>
> n  <- 100
> x1 <- runif(n)
> x2 <- runif(n)
> x3 <- x1 + x2 + runif(n)/10
> x4 <- x1 + x2 + x3 + runif(n)/10
> x5 <- factor(sample(c('a','b','c'), n, replace = TRUE))
> x6 <- 1*(x5 == 'a' | x5 == 'c')
> data1 <- cbind(x1, x2, x3, x4, x5, x6)
> data2 <- data.frame(data1)
> xyz <- lapply(data1, f)

Have you looked at the output of that operation? I get a list of 600 elements:

> str(xyz)
List of 600
 $ : num [1, 1:2] 0.315 0.315
 $ : num [1, 1:2] 0.0132 0.0132
 $ : num [1, 1:2] 0.519 0.519
 $ : num [1, 1:2] 0.0917 0.0917
snipped

> # Now I can eliminate the rows (observations) that contain extreme values
> # for each of the variables, one variable at a time, as below:

And now you propose to overwrite data2 not once but twice?

> data2 <- subset(data2, x1 <= xyz$x1[,1] & x1 >= xyz$x1[,2])
> data2 <- subset(data2, x2 <= xyz$x2[,1] & x2 >= xyz$x2[,2])
>
> ... and so on.
>
> But my data has many more variables (more than 120). Can anybody suggest
> an efficient way of eliminating the rows containing extreme values?

The first step would be arriving at a sensible definition for "extreme value". And you should also consider that these are data, and removing "extreme values" is a serious distortion of the data. There needs to be some justification for cutting out the extremes.

--
David Winsemius, MD
West Hartford, CT
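P.S. If, after considering whether trimming is justified at all, you do want the conventional Tukey fences (note the signs: Q1 - 1.5*IQR and Q3 + 1.5*IQR, not the narrower band your function produces), something along these lines might be a starting point. It is only an untested sketch; it assumes every column of data2 is numeric and that there are no NAs in the data.

# Untested sketch: keep only the rows that fall inside the Tukey fences
# (Q1 - 1.5*IQR, Q3 + 1.5*IQR) for every numeric column of data2.
fences <- sapply(data2, function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  c(lower = q[1] - 1.5 * IQR(x, na.rm = TRUE),
    upper = q[2] + 1.5 * IQR(x, na.rm = TRUE))
})
keep <- Reduce(`&`, Map(function(x, lo, hi) x >= lo & x <= hi,
                        data2, fences["lower", ], fences["upper", ]))
data2_trimmed <- data2[keep, ]

That still removes observations, of course, so the caveat about distorting the data stands.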