Hello! I'm a newcomer to R hoping to replace some convoluted database code with an R script. Unfortunately, I haven't been able to figure out how to implement the following logic.

Essentially, we have a database of transactions that are coded with a geographic locale and a type. These are being loaded into a data.frame with named variables city, type, and price. E.g., trans$city and all that.

We want to calculate mean prices by city and type, AFTER excluding outliers. That is, we want to calculate the mean price in 3 steps:

1. calculate a mean and standard deviation by city and type over all transactions
2. create a subset of the original data frame, excluding transactions that differ from the relevant mean by more than 2 standard deviations
3. calculate a final mean by city and type based on this subset.

I'm stuck on step 2. I would like to do something like the following:

  fs <- list(factor(trans$city), factor(trans$type))
  means <- tapply(trans$price, fs, mean)
  stdevs <- tapply(trans$price, fs, sd)

  filter <- abs(trans$price - means[trans$city, trans$type]) < 2*stdevs[trans$city, trans$type]

  sub <- subset(trans, filter)

The above code doesn't work. What's the correct way to do this?

Thanks,
Josh
Try this, using the built in anscombe data set:

  anscombe[!rowSums(abs(scale(anscombe)) > 2),]

On 7/11/06, Joshua Tokle <jtokle at math.washington.edu> wrote:
> The above code doesn't work. What's the correct way to do this?
>
> Thanks,
> Josh
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
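Spelled out step by step, the anscombe one-liner appears to work as follows. This is only a sketch for reading the expression; the intermediate names (z, far, n_far, kept) are made up for illustration and are not part of the original reply. Note that scale() standardises each column over the whole data set.

```r
# Step-by-step reading of the one-liner on the built-in anscombe data.
# The names z, far, n_far and kept are illustrative only.
z     <- scale(anscombe)          # centre and scale every column over all rows
far   <- abs(z) > 2               # TRUE where a value lies more than 2 SDs from its column mean
n_far <- rowSums(far)             # how many flagged values each row contains
kept  <- anscombe[n_far == 0, ]   # keep only rows with no flagged value

# The compact form selects the same rows: !0 is TRUE, !(any positive count) is FALSE
same <- identical(kept, anscombe[!rowSums(abs(scale(anscombe)) > 2), ])
print(same)
```

A row is dropped as soon as any one of its columns is more than 2 standard deviations from that column's mean.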
Gabor, your solution does not take into account the groups. How about something like:

  iris2 <- iris
  iris2$m <- ave(iris2$Sepal.Length, iris2$Species)
  iris2$s <- ave(iris2$Sepal.Length, iris2$Species, FUN=sd)
  iris2 <- transform(iris2, z= (Sepal.Length-m)/s)
  iris2.2 <- subset(iris2, abs(z) < 2)
  aggregate(iris2.2, list(iris2.2$Species), FUN=mean)

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Gabor Grothendieck
Sent: Tuesday, July 11, 2006 1:06 PM
To: Joshua Tokle
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] R newbie: logical subsets

Try this, using the built in anscombe data set:

  anscombe[!rowSums(abs(scale(anscombe)) > 2),]
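Applied to the transaction data described in the original question, the ave() idea might look like the sketch below. The trans data frame here is invented for illustration (the real data were not posted), as are the names m, s, keep, idx and keep2. The second half is an assumption about why the tapply() attempt fails: indexing means[trans$city, trans$type] with two separate vectors selects a whole block of rows and columns rather than one cell per transaction, whereas a two-column index matrix built with cbind() picks one cell per row.

```r
# Hypothetical stand-in for the real transaction table (values are made up).
set.seed(1)
trans <- data.frame(
  city  = rep(c("Seattle", "Tacoma"), each = 20),
  type  = rep(c("A", "B"), times = 20),
  price = c(rnorm(20, 100, 10), rnorm(20, 200, 20)),
  stringsAsFactors = FALSE
)
trans$price[1] <- 500   # plant one obvious outlier in the Seattle/A group

# Steps 1 and 2 with ave(): group mean and sd, aligned row by row with trans.
# ave() accepts several grouping vectors; FUN defaults to mean.
m    <- ave(trans$price, trans$city, trans$type)
s    <- ave(trans$price, trans$city, trans$type, FUN = sd)
keep <- abs(trans$price - m) < 2 * s
sub  <- trans[keep, ]

# Alternative that stays close to the tapply() attempt: index the
# city-by-type tables with a two-column character matrix, one row per transaction.
fs     <- list(factor(trans$city), factor(trans$type))
means  <- tapply(trans$price, fs, mean)
stdevs <- tapply(trans$price, fs, sd)
idx    <- cbind(as.character(trans$city), as.character(trans$type))
keep2  <- abs(trans$price - means[idx]) < 2 * stdevs[idx]
stopifnot(all(keep == keep2))   # both routes flag the same rows

# Step 3: final means by city and type on the filtered data.
final <- tapply(sub$price, list(sub$city, sub$type), mean)
print(final)
```

Either route gives a logical vector with one entry per transaction, which is what subset() or [ needs for the filtering step.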