I have a large dataset grouped by a factor and I want to perform a regression on each data subset based on this factor. There are many ways to do this, posted here and elsewhere. I have tried several. However I found one method posted on the R wiki which works exactly as I want, and I like the elegance and simplicity of the solution, but I don't understand how it works. Its all in the formula specification. Problem is also that I want to extend it from a single variable case to a multivariable case and haven't really got a clue how to go about it. Hope someone can help. # The code below creates a data.frame comprising two marginally noisy straight lines and performs a # regression on each based on the grp factor. This is a single variable case based on R wiki one liner. # y = x for grp A # y = 2 + 5x for grp B set.seed(5) ind <- as.vector(runif(50)) d1 <- data.frame(x=ind, y=ind + rnorm(50,0,0.1), grp=rep("A", 50)) d2 <- data.frame(x=ind, y=2+5*ind+rnorm(50,0,0.1), grp=rep("B", 50)) data1 <- rbind(d1,d2) fits <- lm(y ~ grp:x + grp - 1, data=data1) summary(fits) with(data1, plot(x,y)) abline(coef(fits)[c(1,3)], col="red") abline(coef(fits)[c(2,4)], col="blue") Output: Call: lm(formula = y ~ grp:x + grp - 1, data = data1) Residuals: Min 1Q Median 3Q Max -0.206414 -0.057582 -0.001376 0.062737 0.242111 Coefficients: Estimate Std. Error t value Pr(>|t|) grpA -0.02935 0.02811 -1.044 0.299 grpB 1.96200 0.02811 69.803 <2e-16 *** grpA:x 1.06276 0.04921 21.596 <2e-16 *** grpB:x 5.09351 0.04921 103.502 <2e-16 *** --- We can see that the intercepts of are close to 0 and 2 and slopes close to 1 and 5 as suggested by the data creation equation. In fact these results are identical to running lm individually on each subset of data with formula y ~ x. What I don't understand is how this result comes about from the formula specification y ~ grp:x + grp -1 . Can someone offer an explanation? In addition I want to apply the same idea for a multivariate data set. The code below shows a multivariate dataset which I hope to fit by group. # The code below creates a data.frame comprising three marginally noisy planes. I want to perform a # regression on each based on the grp factor. # y = x1 + x2 for grp A # y = 2 + 2x1 + 4x2 for grp B # y = 2 + 2x1 + 4x2 + 6x1x2 for grp C ind <- matrix(runif(100), ncol=2, dimnames=list(c(), c("x1","x2"))) d1 <- data.frame(ind, y=ind[,"x1"]+ind[,"x2"]+rnorm(100,0,0.1), grp=rep("A", 100)) d2 <- data.frame(ind, y=2+2*ind[,"x1"]+4*ind[,"x2"]+rnorm(100,0,0.1), grp=rep("B", 100)) d3 <- data.frame(ind, y=2+2*ind[,"x1"]+4*ind[,"x2"]+6*ind[,"x1"]*ind[,"x2"]+rnorm(100,0,0.1),grp=rep("C", 100)) data2 <- rbind(d1,d2,d3) # visualise ifelse(require(rgl), with(data2, plot3d(x1,x2,y)), print("RGL not found")) fits <- lm(y ~ grp:x1 + grp:x2 + grp - 1, data=data2) summary(fits) Output: Call: lm(formula = y ~ grp:x1 + grp:x2 + grp - 1, data = data2) Residuals: Min 1Q Median 3Q Max -1.338348 -0.092069 -0.002844 0.094813 1.259094 Coefficients: Estimate Std. Error t value Pr(>|t|) grpA -0.01943 0.08442 -0.230 0.818 grpB 1.98345 0.08442 23.495 < 2e-16 *** grpC 0.51002 0.08442 6.042 4.66e-09 *** grpA:x1 1.03376 0.11769 8.784 < 2e-16 *** grpB:x1 2.02563 0.11769 17.212 < 2e-16 *** grpC:x1 4.83101 0.11769 41.050 < 2e-16 *** grpA:x2 1.02371 0.09664 10.593 < 2e-16 *** grpB:x2 4.00069 0.09664 41.396 < 2e-16 *** grpC:x2 7.22331 0.09664 74.742 < 2e-16 *** --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 Residual standard error: 0.2838 on 291 degrees of freedom Multiple R-squared: 0.9973, Adjusted R-squared: 0.9972 F-statistic: 1.199e+04 on 9 and 291 DF, p-value: < 2.2e-16 Whilst there seems to be some correlation between coefficients and terms that doesn't apply in all cases. I've tried many other combinations but noe that gave me coefficients similar to the defining equations. Bottom line is I don't understand what the formula specification should be for the multivariate version. Thanks for any help given. Chris -- View this message in context: http://www.nabble.com/Regression-by-groups-questions-tp24070481p24070481.html Sent from the R help mailing list archive at Nabble.com.
Hi, It seems like your multivariate regression works for groups A and B. For group C, values are calculated using an interaction between x1 and x2. Try to include it in the regression model to find the right coeffs: y ~ grp:x1:x2 + grp:x1 + grp:x2 + grp - 1 It works for me. Notice in summary() that the coefficients for the x1:x2 interaction are not significantly different from 0 in groups A and B, which is what we expected. HTH, Xavier ----- Mail Original ----- De: "Chris Friedl" <cfriedalek at gmail.com> ?: r-help at r-project.org Envoy?: Mercredi 17 Juin 2009 12h07:08 GMT +01:00 Amsterdam / Berlin / Berne / Rome / Stockholm / Vienne Objet: [R] Re gression by groups questions I have a large dataset grouped by a factor and I want to perform a regression on each data subset based on this factor. There are many ways to do this, posted here and elsewhere. I have tried several. However I found one method posted on the R wiki which works exactly as I want, and I like the elegance and simplicity of the solution, but I don't understand how it works. Its all in the formula specification. Problem is also that I want to extend it from a single variable case to a multivariable case and haven't really got a clue how to go about it. Hope someone can help. # The code below creates a data.frame comprising two marginally noisy straight lines and performs a # regression on each based on the grp factor. This is a single variable case based on R wiki one liner. # y = x for grp A # y = 2 + 5x for grp B set.seed(5) ind <- as.vector(runif(50)) d1 <- data.frame(x=ind, y=ind + rnorm(50,0,0.1), grp=rep("A", 50)) d2 <- data.frame(x=ind, y=2+5*ind+rnorm(50,0,0.1), grp=rep("B", 50)) data1 <- rbind(d1,d2) fits <- lm(y ~ grp:x + grp - 1, data=data1) summary(fits) with(data1, plot(x,y)) abline(coef(fits)[c(1,3)], col="red") abline(coef(fits)[c(2,4)], col="blue") Output: Call: lm(formula = y ~ grp:x + grp - 1, data = data1) Residuals: Min 1Q Median 3Q Max -0.206414 -0.057582 -0.001376 0.062737 0.242111 Coefficients: Estimate Std. Error t value Pr(>|t|) grpA -0.02935 0.02811 -1.044 0.299 grpB 1.96200 0.02811 69.803 <2e-16 *** grpA:x 1.06276 0.04921 21.596 <2e-16 *** grpB:x 5.09351 0.04921 103.502 <2e-16 *** --- We can see that the intercepts of are close to 0 and 2 and slopes close to 1 and 5 as suggested by the data creation equation. In fact these results are identical to running lm individually on each subset of data with formula y ~ x. What I don't understand is how this result comes about from the formula specification y ~ grp:x + grp -1 . Can someone offer an explanation? In addition I want to apply the same idea for a multivariate data set. The code below shows a multivariate dataset which I hope to fit by group. # The code below creates a data.frame comprising three marginally noisy planes. I want to perform a # regression on each based on the grp factor. # y = x1 + x2 for grp A # y = 2 + 2x1 + 4x2 for grp B # y = 2 + 2x1 + 4x2 + 6x1x2 for grp C ind <- matrix(runif(200), ncol=2, dimnames=list(c(), c("x1","x2"))) d1 <- data.frame(ind, y=ind[,"x1"]+ind[,"x2"]+rnorm(100,0,0.1), grp=rep("A", 100)) d2 <- data.frame(ind, y=2+2*ind[,"x1"]+4*ind[,"x2"]+rnorm(100,0,0.1), grp=rep("B", 100)) d3 <- data.frame(ind, y=2+2*ind[,"x1"]+4*ind[,"x2"]+6*ind[,"x1"]*ind[,"x2"]+rnorm(100,0,0.1),grp=rep("C", 100)) data2 <- rbind(d1,d2,d3) # visualise ifelse(require(rgl), with(data2, plot3d(x1,x2,y)), print("RGL not found")) fits <- lm(y ~ grp:x1 + grp:x2 + grp - 1, data=data2) summary(fits) Output: Call: lm(formula = y ~ grp:x1 + grp:x2 + grp - 1, data = data2) Residuals: Min 1Q Median 3Q Max -1.338348 -0.092069 -0.002844 0.094813 1.259094 Coefficients: Estimate Std. Error t value Pr(>|t|) grpA -0.01943 0.08442 -0.230 0.818 grpB 1.98345 0.08442 23.495 < 2e-16 *** grpC 0.51002 0.08442 6.042 4.66e-09 *** grpA:x1 1.03376 0.11769 8.784 < 2e-16 *** grpB:x1 2.02563 0.11769 17.212 < 2e-16 *** grpC:x1 4.83101 0.11769 41.050 < 2e-16 *** grpA:x2 1.02371 0.09664 10.593 < 2e-16 *** grpB:x2 4.00069 0.09664 41.396 < 2e-16 *** grpC:x2 7.22331 0.09664 74.742 < 2e-16 *** --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 Residual standard error: 0.2838 on 291 degrees of freedom Multiple R-squared: 0.9973, Adjusted R-squared: 0.9972 F-statistic: 1.199e+04 on 9 and 291 DF, p-value: < 2.2e-16 Whilst there seems to be some correlation between coefficients and terms that doesn't apply in all cases. I've tried many other combinations but noe that gave me coefficients similar to the defining equations. Bottom line is I don't understand what the formula specification should be for the multivariate version. Thanks for any help given. Chris -- View this message in context: http://www.nabble.com/Regression-by-groups-questions-tp24070481p24070481.html Sent from the R help mailing list archive at Nabble.com. ______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
pasting reply by Xavier into this thread for the sake of others who might come across it in their R travels. Thanks Xavier Hi, It seems like your multivariate regression works for groups A and B. For group C, values are calculated using an interaction between x1 and x2. Try to include it in the regression model to find the right coeffs: y ~ grp:x1:x2 + grp:x1 + grp:x2 + grp - 1 It works for me. Notice in summary() that the coefficients for the x1:x2 interaction are not significantly different from 0 in groups A and B, which is what we expected. HTH, Xavier Chris Friedl wrote:> > I have a large dataset grouped by a factor and I want to perform a > regression on each data subset based on this factor. There are many ways > to do this, posted here and elsewhere. I have tried several. However I > found one method posted on the R wiki which works exactly as I want, and I > like the elegance and simplicity of the solution, but I don't understand > how it works. Its all in the formula specification. Problem is also that I > want to extend it from a single variable case to a multivariable case and > haven't really got a clue how to go about it. Hope someone can help. > > # The code below creates a data.frame comprising two marginally noisy > straight lines and performs a > # regression on each based on the grp factor. This is a single variable > case based on R wiki one liner. > # y = x for grp A > # y = 2 + 5x for grp B > set.seed(5) > ind <- as.vector(runif(50)) > d1 <- data.frame(x=ind, y=ind + rnorm(50,0,0.1), grp=rep("A", 50)) > d2 <- data.frame(x=ind, y=2+5*ind+rnorm(50,0,0.1), grp=rep("B", 50)) > data1 <- rbind(d1,d2) > fits <- lm(y ~ grp:x + grp - 1, data=data1) > summary(fits) > with(data1, plot(x,y)) > abline(coef(fits)[c(1,3)], col="red") > abline(coef(fits)[c(2,4)], col="blue") > > Output: > Call: > lm(formula = y ~ grp:x + grp - 1, data = data1) > > Residuals: > Min 1Q Median 3Q Max > -0.206414 -0.057582 -0.001376 0.062737 0.242111 > > Coefficients: > Estimate Std. Error t value Pr(>|t|) > grpA -0.02935 0.02811 -1.044 0.299 > grpB 1.96200 0.02811 69.803 <2e-16 *** > grpA:x 1.06276 0.04921 21.596 <2e-16 *** > grpB:x 5.09351 0.04921 103.502 <2e-16 *** > --- > > We can see that the intercepts of are close to 0 and 2 and slopes close to > 1 and 5 as suggested by the data creation equation. In fact these results > are identical to running lm individually on each subset of data with > formula y ~ x. What I don't understand is how this result comes about from > the formula specification y ~ grp:x + grp -1 . Can someone offer an > explanation? > > > In addition I want to apply the same idea for a multivariate data set. The > code below shows a multivariate dataset which I hope to fit by group. > > # The code below creates a data.frame comprising three marginally noisy > planes. I want to perform a > # regression on each based on the grp factor. > # y = x1 + x2 for grp A > # y = 2 + 2x1 + 4x2 for grp B > # y = 2 + 2x1 + 4x2 + 6x1x2 for grp C > ind <- matrix(runif(100), ncol=2, dimnames=list(c(), c("x1","x2"))) > d1 <- data.frame(ind, y=ind[,"x1"]+ind[,"x2"]+rnorm(100,0,0.1), > grp=rep("A", 100)) > d2 <- data.frame(ind, y=2+2*ind[,"x1"]+4*ind[,"x2"]+rnorm(100,0,0.1), > grp=rep("B", 100)) > d3 <- data.frame(ind, > y=2+2*ind[,"x1"]+4*ind[,"x2"]+6*ind[,"x1"]*ind[,"x2"]+rnorm(100,0,0.1),grp=rep("C", > 100)) > data2 <- rbind(d1,d2,d3) > # visualise > ifelse(require(rgl), with(data2, plot3d(x1,x2,y)), print("RGL not found")) > fits <- lm(y ~ grp:x1 + grp:x2 + grp - 1, data=data2) > summary(fits) > > Output: > > Call: > lm(formula = y ~ grp:x1 + grp:x2 + grp - 1, data = data2) > > Residuals: > Min 1Q Median 3Q Max > -1.338348 -0.092069 -0.002844 0.094813 1.259094 > > Coefficients: > Estimate Std. Error t value Pr(>|t|) > grpA -0.01943 0.08442 -0.230 0.818 > grpB 1.98345 0.08442 23.495 < 2e-16 *** > grpC 0.51002 0.08442 6.042 4.66e-09 *** > grpA:x1 1.03376 0.11769 8.784 < 2e-16 *** > grpB:x1 2.02563 0.11769 17.212 < 2e-16 *** > grpC:x1 4.83101 0.11769 41.050 < 2e-16 *** > grpA:x2 1.02371 0.09664 10.593 < 2e-16 *** > grpB:x2 4.00069 0.09664 41.396 < 2e-16 *** > grpC:x2 7.22331 0.09664 74.742 < 2e-16 *** > --- > Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > > Residual standard error: 0.2838 on 291 degrees of freedom > Multiple R-squared: 0.9973, Adjusted R-squared: 0.9972 > F-statistic: 1.199e+04 on 9 and 291 DF, p-value: < 2.2e-16 > > Whilst there seems to be some correlation between coefficients and terms > that doesn't apply in all cases. I've tried many other combinations but > noe that gave me coefficients similar to the defining equations. Bottom > line is I don't understand what the formula specification should be for > the multivariate version. > > Thanks for any help given. > > Chris > >-- View this message in context: http://www.nabble.com/Regression-by-groups-questions-tp24070481p24091659.html Sent from the R help mailing list archive at Nabble.com.