Reid Bryant
2015-Jan-15 19:42 UTC
[R] Iteratively subsetting data by factor level across multiple variables
Hi R experts!

I would like to have a scripted solution that will iteratively subset data across many variables per factor level of each variable.

To illustrate, if I create a dataframe (df) by:

variation <- c("A","B","C","D")
element1 <- as.factor(c(0,1,0,1))
element2 <- as.factor(c(0,0,1,1))
response <- c(4,2,6,2)
df <- data.frame(variation,element1,element2,response)

I would like a function that would allow me to subset the data into four groups and perform analysis across the groups: one group for each of the two factor levels across two variables. In this example it's fairly easy because I only have two variables with two levels each, but I would like this to be extendable to situations where I am dealing with more than two variables and/or more than two factor levels per variable. I am looking for a result that will mimic the output of the following:

element1_level0 <- subset(df,df$element1=="0")
element1_level1 <- subset(df,df$element1=="1")
element2_level0 <- subset(df,df$element2=="0")
element2_level1 <- subset(df,df$element2=="1")

The purpose would be to perform analysis on the df across each subset. Simplistically this could be represented as follows:

mean(element1_level0$response)
mean(element1_level1$response)
mean(element2_level0$response)
mean(element2_level1$response)

Thanks,
Reid
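One way the four hand-written subsets could be generated in a loop is sketched below. This is only an illustration, not something from the thread: the helper name `subsets_by_level` is hypothetical, and the grouping columns are passed by name because whether `variation` is stored as a factor depends on the R version's `stringsAsFactors` default.

```r
# Rebuild the example data frame from the post.
variation <- c("A", "B", "C", "D")
element1 <- as.factor(c(0, 1, 0, 1))
element2 <- as.factor(c(0, 0, 1, 1))
response <- c(4, 2, 6, 2)
df <- data.frame(variation, element1, element2, response)

# Hypothetical helper: one subset per level of each named grouping variable.
subsets_by_level <- function(data, vars) {
  pieces <- lapply(vars, function(v) {
    parts <- split(data, data[[v]])            # one data frame per level of v
    names(parts) <- paste0(v, "_level", names(parts))
    parts
  })
  do.call(c, pieces)                           # flatten into one named list
}

subsets <- subsets_by_level(df, c("element1", "element2"))
names(subsets)

# The same analysis applied to every subset.
subset_means <- sapply(subsets, function(d) mean(d$response))
subset_means
```

Because the variables and their levels are discovered from the data, the same call should extend to more grouping columns or more levels per column by lengthening the `vars` argument.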
William Dunlap
2015-Jan-15 21:46 UTC
[R] Iteratively subsetting data by factor level across multiple variables
There are lots of ways to do this. You have to decide on how you want to organize the results. Here are two ways that use only core R packages. Many people like the plyr package for this split-data/analyze-parts/combine-results sort of thing.

> df <- data.frame(x=1:27,response=log2(1:27),g1=rep(letters[1:2],len=27),g2=rep(LETTERS[24:26],c(10,10,7)))
> s <- split(seq_len(nrow(df)), df[c("g1","g2")])
> mean(subset(df, df$g1=="a" & df$g2=="Z")$response)
[1] 4.578656
> vapply(s, function(si)mean(df$response[si]), FUN.VALUE=0) # a.Z part is previous result
     a.X      b.X      a.Y      b.Y      a.Z      b.Z
1.976834 2.381378 3.880430 3.976834 4.578656 4.581611
> coef(lm(response~x, data=subset(df, df$g1=="a" & df$g2=="Z"))) # regression example
(Intercept)           x
 3.12905040  0.06040022
> vapply(s, function(si)coef(lm(response ~ x, data=df[si,])), FUN.VALUE=rep(0,2))
                  a.X       b.X        a.Y        b.Y        a.Z        b.Z
(Intercept) 0.0862735 0.6882213 2.40741927 2.50763309 3.12905040 3.13556268
x           0.3781121 0.2821928 0.09820075 0.09182506 0.06040022 0.06025202

For the particular case of fitting the same regression within each part of a partition of the data you can use lm() once, which gives the same numbers organized in a different way:

> coef(lm(response ~ x * (g1:g2) - x - 1, data=df))
   g1a:g2X    g1b:g2X    g1a:g2Y    g1b:g2Y    g1a:g2Z    g1b:g2Z
0.08627350 0.68822126 2.40741927 2.50763309 3.12905040 3.13556268
 x:g1a:g2X  x:g1b:g2X  x:g1a:g2Y  x:g1b:g2Y  x:g1a:g2Z  x:g1b:g2Z
0.37811212 0.28219281 0.09820075 0.09182506 0.06040022 0.06025202

Bill Dunlap
TIBCO Software
wdunlap tibco.com
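The reply splits on the combination of g1 and g2, whereas the original question asks for each variable's levels taken one variable at a time. A minimal sketch of how the same index-splitting idea might be adapted to that case follows; the helper `summarise_each_level` and its arguments are hypothetical names introduced here, and it assumes the analysis returns a single number per subset.

```r
# Example data from the reply.
df <- data.frame(x = 1:27, response = log2(1:27),
                 g1 = rep(letters[1:2], length.out = 27),
                 g2 = rep(LETTERS[24:26], c(10, 10, 7)))

# Hypothetical helper: apply `fun` to the rows belonging to each level of
# each grouping variable separately (no cross-classification).
summarise_each_level <- function(data, vars, fun) {
  out <- lapply(vars, function(v) {
    s <- split(seq_len(nrow(data)), data[[v]])  # row indices per level of v
    data.frame(variable = v,
               level = names(s),
               value = vapply(s, function(si) fun(data[si, , drop = FALSE]),
                              FUN.VALUE = 0),
               stringsAsFactors = FALSE)
  })
  res <- do.call(rbind, out)                    # stack the per-variable results
  rownames(res) <- NULL
  res
}

level_means <- summarise_each_level(df, c("g1", "g2"),
                                    function(d) mean(d$response))
level_means
```

Returning one row per variable/level pair keeps the results in a single data frame regardless of how many levels each variable has. For plain means, `tapply(df$response, df$g1, mean)` gives the same numbers for a single variable.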