Gavin Rudge
2013-Aug-06 17:17 UTC
[R] creating a quantile variable based on subsets of a dataframe
#some sample data: library(Hmisc) set.seed(33) df<-data.frame(x=(sample(letters[1:10],1000,replace=TRUE)),y=rnorm(1000,mean=20,sd=15)) x is a category from a to J, say a geographical area, into which an observation y falls, y being a score. Now if I want to put my score into quantiles (quintiles in this case) across the whole population of observations and then make a quintile variable I do the following: #make a quintile variable df<- within(df,z<-as.integer(cut2(y,quantile(y,probs=seq(0,1,0.2))))) I'm using cut2 here as I want the extremes of my ranges to be included in the upper and lower bins. So far so good, but I would also like another variable to indicate the quintile of the score within the areas indicated by the x variable, so all of the scores where x=a, binned into quintiles for area a, the same for scores in areas b, c and so on. I see that I could put my quintile variable code into a function and then split my data frame by x, apply the function in each of the ten groups and stitch the whole thing back together again (not sure I could write it though), but is there a much simpler solution? Thanks, GavinR [[alternative HTML version deleted]]
David Winsemius
2013-Aug-06 17:29 UTC
[R] creating a quantile variable based on subsets of a dataframe
On Aug 6, 2013, at 10:17 AM, Gavin Rudge wrote:> #some sample data: > library(Hmisc) > set.seed(33) > df<-data.frame(x=(sample(letters[1:10],1000,replace=TRUE)),y=rnorm(1000,mean=20,sd=15)) > > x is a category from a to J, say a geographical area, into which an observation y falls, y being a score. Now if I want to put my score into quantiles (quintiles in this case) across the whole population of observations and then make a quintile variable I do the following: > > #make a quintile variable > df<- within(df,z<-as.integer(cut2(y,quantile(y,probs=seq(0,1,0.2))))) > > I'm using cut2 here as I want the extremes of my ranges to be included in the upper and lower bins. > > So far so good, but I would also like another variable to indicate the quintile of the score within the areas indicated by the x variable, so all of the scores where x=a, binned into quintiles for area a, the same for scores in areas b, c and so on. > > I see that I could put my quintile variable code into a function and then split my data frame by x, apply the function in each of the ten groups and stitch the whole thing back together again (not sure I could write it though), but is there a much simpler solution? >Generally questions involving the distribution of a single variate grouped within categories where the desired result is as long as the original variate are well handled with th `ave` function:> df$c2.grp <- ave(df$y, df$x, FUN=function(z) cut2(z, quantile(z,probs=seq(0,1,0.2)) ) ) > str(df)'data.frame': 1000 obs. of 3 variables: $ x : Factor w/ 10 levels "a","b","c","d",..: 5 4 5 10 9 6 5 4 1 2 ... $ y : num 15 45.3 29.9 45.2 23.3 ... $ c2.grp: num 2 5 4 5 3 4 2 4 3 2 ... I was a bit surprised that the resulting column in df was numeric rather than factor, but I suspect it was the fact that the levels of the intra-groups splits could not be reconciled. You didn't apparently consider that issue in your problem specification. The result could be "cleaned up" with:> df$c2.grp <- factor(df$c2.grp, labels=paste0("Q", 1:5) ) > with(df, table(x, c2.grp))c2.grp x Q1 Q2 Q3 Q4 Q5 a 22 23 22 22 22 b 19 19 18 19 19 c 21 20 20 20 21 d 20 19 19 19 20 e 19 20 21 20 20 f 21 21 21 21 22 g 21 21 21 21 22 h 19 19 19 19 19 i 18 18 17 18 18 j 20 20 19 20 20 -- David Winsemius Alameda, CA, USA