Cecilia Carmo
2010-Aug-01 01:39 UTC
[R] remove extreme values or winsorize loop - dataframe
Hi everyone!

#I need a loop or a function that creates an X2 variable that is X1 without the extreme values (or X1 winsorized) by industry and year.

#My reproducible example:
firm<-sort(rep(1:1000,10),decreasing=F)
year<-rep(1998:2007,1000)
# each block of 100 values covers 10 firms x 10 years; repeated 100 times = 10000 rows
industry<-rep(c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10),rep(6,10),rep(7,10),rep(8,10),rep(9,10),
rep(10,10)),100)
X1<-rnorm(10000)
data<-data.frame(firm, industry,year,X1)
data

The way I'm doing this is very hard. I split my sample by industry and year, for each industry and year I calculate the 10% and 90% quantiles, then I create an X2 variable like this:

industry1<-subset(data,data$industry==1)

ind1year1999<-subset(industry1,industry1$year==1999)
q1<-quantile(ind1year1999$X1,probs=0.1,na.rm=TRUE)
q99<-quantile(ind1year1999$X1,probs=0.90,na.rm=TRUE)
ind1year1999winsorized<-transform(ind1year1999,X2=ifelse(X1<q1,q1,ifelse(X1>q99,q99,X1)))

ind1year2000<-subset(industry1,industry1$year==2000)
q1<-quantile(ind1year2000$X1,probs=0.1,na.rm=TRUE)
q99<-quantile(ind1year2000$X1,probs=0.90,na.rm=TRUE)
ind1year2000winsorized<-transform(ind1year2000,X2=ifelse(X1<q1,q1,ifelse(X1>q99,q99,X1)))

I repeat this for all years and industries, and then I merge/bind everything again to have a new dataframe with all the columns of the dataframe "data" plus X2.

Could anyone help me do this in an easier way?

Thanks
Cecília Carmo
Universidade de Aveiro - Portugal
jim holtman
2010-Aug-01 02:10 UTC
[R] remove extreme values or winsorize – loop - dataframe
This will split the data by industry and year and then return only the rows whose X1 lies between the 10% and 90% quantiles of each group (the middle 80%):

# split the data by industry/year
d.s <- split(data, list(data$industry, data$year), drop=TRUE)
result <- lapply(d.s, function(.id){
    # get 10/90% values
    .limit <- quantile(.id$X1, probs=c(.1, .9))
    subset(.id, X1 >= .limit[1] & X1 <= .limit[2])
})

This returns a list of 100 elements, one for each industry/year combination.

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
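The split/lapply approach above trims rows and leaves a list. For the X2 column asked for in the original post, one possible sketch (not something anyone in this thread posted) is to winsorize within groups using base R's ave(), which keeps the original row order. The helper name winsorize_vec, the fixed seed, and the variable name trimmed are assumptions made only for illustration. The last lines also show how the trimmed list could be bound back into a single data frame.

```r
# Rebuild the example data from the original post (10000 rows).
set.seed(1)  # assumption: fixed seed so the sketch is repeatable
firm <- rep(1:1000, each = 10)
year <- rep(1998:2007, times = 1000)
industry <- rep(rep(1:10, each = 10), times = 100)
X1 <- rnorm(10000)
data <- data.frame(firm, industry, year, X1)

# Hypothetical helper: clamp a vector to its own 10% and 90% quantiles.
winsorize_vec <- function(x, probs = c(0.1, 0.9)) {
  lim <- unname(quantile(x, probs = probs, na.rm = TRUE))
  pmin(pmax(x, lim[1]), lim[2])
}

# ave() applies the helper within each industry/year group and returns
# the values in the same order as the rows of `data`.
data$X2 <- ave(data$X1, data$industry, data$year, FUN = winsorize_vec)

# Trimming alternative: drop the extreme rows per group, then bind the
# per-group pieces back into one data frame.
d.s <- split(data, list(data$industry, data$year), drop = TRUE)
trimmed <- do.call(rbind, lapply(d.s, function(.id) {
  .limit <- quantile(.id$X1, probs = c(0.1, 0.9), na.rm = TRUE)
  subset(.id, X1 >= .limit[1] & X1 <= .limit[2])
}))
```

Here data keeps every row and gains X2, while trimmed has fewer rows because the values outside the 10%/90% limits are removed rather than clamped. Which of the two is appropriate depends on whether the extreme observations should be kept in later analysis.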
Glen Barnett
2010-Aug-03 00:00 UTC
[R] remove extreme values or winsorize – loop - dataframe
This might help some:

RSiteSearch("winsorize")