thr3ads.net - R help - [R] Iteratively subsetting data by factor level across multiple variables [Jan 2015]

Reid Bryant

2015-Jan-15 19:42 UTC

[R] Iteratively subsetting data by factor level across multiple variables

Hi R experts!

I would like to have a scripted solution that will iteratively subset data
across many variables per factor level of each variable.

To illustrate, if I create a dataframe (df) by:

variation <- c("A","B","C","D")
element1 <- as.factor(c(0,1,0,1))
element2 <- as.factor(c(0,0,1,1))
response <- c(4,2,6,2)
df <- data.frame(variation,element1,element2,response)

I would like a function that would allow me to subset the data into four
groups and perform analysis across the groups: one group for each of the
two factor levels of each of the two variables.  In this example it is
fairly easy because I only have two variables with two levels each, but I
would like this to be extendable to situations where I am dealing with more
than two variables and/or more than two factor levels per variable.  I am
looking for a result that will mimic the output of the following:

element1_level0 <- subset(df,df$element1=="0")
element1_level1 <- subset(df,df$element1=="1")
element2_level0 <- subset(df,df$element2=="0")
element2_level1 <- subset(df,df$element2=="1")

The purpose would be to perform analysis on the df across each subset.
Simplistically this could be represented as follows:

mean(element1_level0$response)
mean(element1_level1$response)
mean(element2_level0$response)
mean(element2_level1$response)

Thanks,
Reid


William Dunlap

2015-Jan-15 21:46 UTC

There are lots of ways to do this.  You have to decide on how you want to
organize the results.  Here are two ways that use only core R packages.
Many people like the plyr package for this
split-data/analyze-parts/combine-results sort of thing.

> df <- data.frame(x=1:27,response=log2(1:27),
+     g1=rep(letters[1:2],len=27),g2=rep(LETTERS[24:26],c(10,10,7)))
> s <- split(seq_len(nrow(df)), df[c("g1","g2")])
> mean(subset(df, df$g1=="a" & df$g2=="Z")$response)
[1] 4.578656
> vapply(s, function(si)mean(df$response[si]), FUN.VALUE=0) # a.Z part is previous result
     a.X      b.X      a.Y      b.Y      a.Z      b.Z
1.976834 2.381378 3.880430 3.976834 4.578656 4.581611
> coef(lm(response~x, data=subset(df, df$g1=="a" & df$g2=="Z"))) # regression example
(Intercept)           x
 3.12905040  0.06040022
> vapply(s, function(si)coef(lm(response ~ x, data=df[si,])),FUN.VALUE=rep(0,2))
                  a.X       b.X        a.Y        b.Y        a.Z        b.Z
(Intercept) 0.0862735 0.6882213 2.40741927 2.50763309 3.12905040 3.13556268
x           0.3781121 0.2821928 0.09820075 0.09182506 0.06040022 0.06025202
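
The split() calls above partition the data by the combination of g1 and g2.
The original question asked for something slightly different: one subset per
level of each variable taken separately. A minimal sketch of that, using the
example data from the question, might look like the following. The helper
name subsets_by_level is made up for illustration and is not from the thread
or from any package.

```r
# Example data exactly as given in the original question.
df <- data.frame(
  variation = c("A", "B", "C", "D"),
  element1  = as.factor(c(0, 1, 0, 1)),
  element2  = as.factor(c(0, 0, 1, 1)),
  response  = c(4, 2, 6, 2)
)

# Hypothetical helper (an assumption, not part of the thread): build one
# subset per level of each named variable and return them as a single
# named list, e.g. "element1_level0", "element1_level1", ...
subsets_by_level <- function(data, vars) {
  pieces <- lapply(vars, function(v) {
    parts <- split(data, data[[v]], drop = TRUE)  # one data frame per level
    names(parts) <- paste0(v, "_level", names(parts))
    parts
  })
  do.call(c, pieces)  # flatten the list of lists into one list
}

subs <- subsets_by_level(df, c("element1", "element2"))

# Apply any analysis to every subset; here, the mean response.
means <- vapply(subs, function(d) mean(d$response), FUN.VALUE = 0)
print(means)
# element1_level0 element1_level1 element2_level0 element2_level1
#               5               2               3               4
```

Because the variables are passed by name, the same call should extend to more
variables or to factors with more than two levels without changes.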


For the particular case of fitting the same linear model to every part of a
partition of the data (a group mean being the intercept-only special case),
you can use lm() once, which gives the same coefficients as the per-group
regressions, organized in a different way:
> coef(lm(response ~ x * (g1:g2) - x - 1, data=df))
   g1a:g2X    g1b:g2X    g1a:g2Y    g1b:g2Y    g1a:g2Z    g1b:g2Z
0.08627350 0.68822126 2.40741927 2.50763309 3.12905040 3.13556268
 x:g1a:g2X  x:g1b:g2X  x:g1a:g2Y  x:g1b:g2Y  x:g1a:g2Z  x:g1b:g2Z
0.37811212 0.28219281 0.09820075 0.09182506 0.06040022 0.06025202
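
If only the group means are wanted, the same single-fit idea can be sketched
with an intercept-free model that has one coefficient per g1:g2 cell. The
variable names cell_means_lm and cell_means_tapply below are illustrative,
and the tapply() cross-check is an addition, not something from the thread.

```r
# Same example data as in the console session (len= spelled out in full).
df <- data.frame(x = 1:27, response = log2(1:27),
                 g1 = rep(letters[1:2], length.out = 27),
                 g2 = rep(LETTERS[24:26], c(10, 10, 7)))

# One coefficient per g1:g2 cell; with no intercept and no main effects,
# each coefficient is the mean response in that cell.
cell_means_lm <- coef(lm(response ~ g1:g2 - 1, data = df))

# The same means computed directly, returned as a g1-by-g2 matrix.
cell_means_tapply <- tapply(df$response, list(g1 = df$g1, g2 = df$g2), mean)

print(cell_means_lm)
print(cell_means_tapply)
```

The two results should agree cell for cell; the "a", "Z" entry should match
the 4.578656 obtained with subset() earlier in this message.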



Bill Dunlap
TIBCO Software
wdunlap tibco.com

