thr3ads.net - R help - [R] Drop values of one dataframe based on the value of another [Jun 2012]

Sam Albers

2012-Jun-01 18:40 UTC

[R] Drop values of one dataframe based on the value of another

Hello all,

Let me first say that this isn't a question about outliers. I am using
the outlier function from the outliers package, but only because it is
a convenient wrapper for finding the value that has the largest
difference from the sample mean. Where I am running into problems is
that I have several groups, and I want to calculate the "outlier"
within each group. Then I want to create two data.frames, one with the
"outliers" and the other with those values dropped. Both data frames
need to include the additional columns of data present before the
subset. The first case is easy, but I can't seem to figure out how to
do the second. So for example:

library(plyr)
library(outliers)

## A dataframe with some obviously extreme values
dfa <- data.frame(Mins=runif(15, 0,1),
Fac=rep(c("Test1","Test2","Test3"), each=5))
df.out <- data.frame(Mins=c(3,4,5),
Fac=c("Test1","Test2","Test3"))
df <- rbind(dfa, df.out)
df$Meta <- runif(18,4,5); df

## Dataframe with the extreme value
To_remove<-ddply(df, c("Fac"), subset, Mins==outlier(Mins));
To_remove

So now my question is: how can I use this dataframe (To_remove) to
remove all these values from df and create a new dataframe? Given a df
(To_remove) with a list of values, how can I choose all rows of
another dataframe (df) that aren't among the values in the To_remove
dataframe? There is a rm.outlier function in this same package, but I
am having trouble with it and would like to try another approach.

Thanks in advance!

Sam

Ethan Brown

2012-Jun-02 00:20 UTC

Before using ddply, try adding an id variable to uniquely identify each
record (this is a good data integrity practice anyway). Then you can simply
create the new data frame by using all the ids that aren't in your
'To_remove' subset.

Here's the code for your example:

library(plyr)
library(outliers)

## A dataframe with some obviously extreme values
dfa <- data.frame(Mins=runif(15, 0,1),
Fac=rep(c("Test1","Test2","Test3"), each=5))
df.out <- data.frame(Mins=c(3,4,5),
Fac=c("Test1","Test2","Test3"))
df <- rbind(dfa, df.out)
df$Meta <- runif(18,4,5)

##################################################
## add an id variable
df$id <- seq_len(nrow(df))
##################################################

## Dataframe with the extreme value
To_remove<-ddply(df, c("Fac"), subset, Mins==outlier(Mins));
To_remove

##################################################
## create dataframe without ids that are in To_remove
To_keep <- df[!(df$id %in% To_remove$id),]

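## or, more compactly since in this case the ids are row numbers
## (only safe when To_remove has at least one row: negative indexing
## with an empty vector returns zero rows instead of all of df)
To_keep <- df[-To_remove$id,]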
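
For comparison, here is a minimal base R sketch of the same split that
needs neither plyr nor an id column. It is not the outliers package's
method: it assumes the "outlier" in each group is simply the value
farthest from that group's mean, as described in the question, and it
swaps in ave() to build a logical flag per row. The names dev and
is_extreme are made up for illustration, and set.seed() is there only
to make the example reproducible.

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible

## Same example data as above
dfa <- data.frame(Mins = runif(15, 0, 1),
                  Fac = rep(c("Test1", "Test2", "Test3"), each = 5))
df.out <- data.frame(Mins = c(3, 4, 5),
                     Fac = c("Test1", "Test2", "Test3"))
df <- rbind(dfa, df.out)
df$Meta <- runif(18, 4, 5)

## Absolute distance of each value from its own group mean
dev <- ave(df$Mins, df$Fac, FUN = function(x) abs(x - mean(x)))

## TRUE for the row(s) with the largest distance within each group
is_extreme <- dev == ave(dev, df$Fac, FUN = max)

## Both data frames keep every original column
To_remove <- df[is_extreme, ]
To_keep <- df[!is_extreme, ]
```

Because the split uses a logical vector rather than negative row
numbers, To_keep is still all of df when no row is flagged. Ties for
the largest distance within a group would all be flagged.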

Best,
Ethan

P.S. Your email address and Google picture are so epic!

----
statisfactions.com -- the sounds of data and whimsy



