Sam Albers
2012-Jun-01 18:40 UTC
[R] Drop values of one dataframe based on the value of another
Hello all,

Let me first say that this isn't a question about outliers. I am using the outlier function from the outliers package, but only because it is a convenient wrapper for finding the value with the largest difference from the sample mean. Where I am running into problems is that I have several groups and want to calculate the "outlier" within each group. Then I want to create two data.frames, one with the "outliers" and the other with those values dropped. Both data frames need to include the additional columns of data present before the subset. The first case is easy, but I can't figure out how to get the second. So for example:

library(plyr)
library(outliers)

## A dataframe with some obviously extreme values
dfa <- data.frame(Mins=runif(15, 0,1), Fac=rep(c("Test1","Test2","Test3"), each=5))
df.out <- data.frame(Mins=c(3,4,5), Fac=c("Test1","Test2","Test3"))
df <- rbind(dfa, df.out)
df$Meta <- runif(18,4,5); df

## Dataframe with the extreme value
To_remove<-ddply(df, c("Fac"), subset, Mins==outlier(Mins)); To_remove

So now my question is: how can I use this dataframe (To_remove) to remove all these values from df and create a new dataframe? Given a df (To_remove) with a list of values, how can I choose all rows of another dataframe (df) that aren't in the To_remove dataframe? There is a rm.outlier function in this same package, but I am having trouble with it and would like to try another approach.

Thanks in advance!

Sam
Ethan Brown
2012-Jun-02 00:20 UTC
[R] Drop values of one dataframe based on the value of another
Before using ddply, try adding an id variable to uniquely identify each record (this is a good data integrity practice anyway). Then you can simply create the new data frame by using all the ids that aren't in your 'To_remove' subset. Here's the code for your example:

library(plyr)
library(outliers)

## A dataframe with some obviously extreme values
dfa <- data.frame(Mins=runif(15, 0,1), Fac=rep(c("Test1","Test2","Test3"), each=5))
df.out <- data.frame(Mins=c(3,4,5), Fac=c("Test1","Test2","Test3"))
df <- rbind(dfa, df.out)
df$Meta <- runif(18,4,5)

##################################################
## add an id variable
df$id <- 1:nrow(df)
##################################################

## Dataframe with the extreme value
To_remove<-ddply(df, c("Fac"), subset, Mins==outlier(Mins)); To_remove

##################################################
## create dataframe without ids that are in To_remove
To_keep <- df[!(df$id %in% To_remove$id),]

## or, more compactly since in this case the ids are row numbers
## (note: negative indexing returns zero rows if To_remove is ever empty,
## so the %in% version above is the safer of the two),
To_keep <- df[-To_remove$id,]

Best,
Ethan

P.S. Your email address and Google picture are so epic!

----
statisfactions.com -- the sounds of data and whimsy
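A possible variation on the id approach, offered only as a sketch and not as something proposed in this thread: flag the per-group extreme value with a logical vector and split df on that flag, so no id column is needed. The helper is_farthest below is hypothetical (it is not part of the outliers package); it assumes that "outlier" here means the value farthest from the group mean, as described in the question. The set.seed call is also an assumption, added only to make the example reproducible.

```r
set.seed(42)  # assumption: fixed seed so the example is reproducible

## Same layout as the example data in the thread
dfa <- data.frame(Mins = runif(15, 0, 1),
                  Fac = rep(c("Test1", "Test2", "Test3"), each = 5))
df.out <- data.frame(Mins = c(3, 4, 5), Fac = c("Test1", "Test2", "Test3"))
df <- rbind(dfa, df.out)
df$Meta <- runif(18, 4, 5)

## Hypothetical helper (not from the outliers package): TRUE for the
## value(s) farthest from the mean of x
is_farthest <- function(x) {
  d <- abs(x - mean(x))
  d == max(d)
}

## ave() applies the helper within each level of Fac and returns the
## results in the original row order (as 0/1, hence as.logical)
flag <- as.logical(ave(df$Mins, df$Fac, FUN = is_farthest))

To_remove <- df[flag, ]   # one "outlier" row per group, all columns kept
To_keep   <- df[!flag, ]  # everything else, all columns kept
```

Because the split uses a logical flag aligned with the rows of df, both data frames keep the Meta column and the original row order, and To_keep is still correct if no row happens to be flagged.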