thr3ads.net - R help - [R] Re gression by groups questions [Jun 2009]

If this information is useful, please help other people find it:
Share via:

Chris Friedl

2009-Jun-17 10:07 UTC

[R] Re gression by groups questions

I have a large dataset grouped by a factor and I want to perform a regression
on each data subset based on this factor. There are many ways to do this,
posted here and elsewhere. I have tried several. However I found one method
posted on the R wiki which works exactly as I want, and I like the elegance
and simplicity of the solution,  but I don't understand how it works. Its
all in the formula specification. Problem is also that I want to extend it
from a single variable case to a multivariable case and haven't really got a
clue how to go about it. Hope someone can help.

# The code below creates a data.frame comprising two marginally noisy
straight lines and performs a 
# regression on each based on the grp factor. This is a single variable case
based on R wiki one liner.
# y = x		for grp A
# y = 2 + 5x	for grp B
set.seed(5)
ind <- as.vector(runif(50))
d1 <- data.frame(x=ind, y=ind + rnorm(50,0,0.1), grp=rep("A", 50))
d2 <- data.frame(x=ind, y=2+5*ind+rnorm(50,0,0.1), grp=rep("B",
50))
data1 <- rbind(d1,d2)
fits <- lm(y ~ grp:x + grp - 1, data=data1)
summary(fits)
with(data1, plot(x,y))
abline(coef(fits)[c(1,3)], col="red")
abline(coef(fits)[c(2,4)], col="blue")

Output:
Call:
lm(formula = y ~ grp:x + grp - 1, data = data1)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.206414 -0.057582 -0.001376  0.062737  0.242111 

Coefficients:
       Estimate Std. Error t value Pr(>|t|)    
grpA   -0.02935    0.02811  -1.044    0.299    
grpB    1.96200    0.02811  69.803   <2e-16 ***
grpA:x  1.06276    0.04921  21.596   <2e-16 ***
grpB:x  5.09351    0.04921 103.502   <2e-16 ***
---

We can see that the intercepts of are close to 0 and 2 and slopes close to 1
and 5 as suggested by the data creation equation. In fact these results are
identical to running lm individually on each subset of data with formula y ~
x. What I don't understand is how this result comes about from the formula
specification y ~ grp:x + grp -1 . Can someone offer an explanation?


In addition I want to apply the same idea for a multivariate data set. The
code below shows a multivariate dataset which I hope to fit by group.

# The code below creates a data.frame comprising three marginally noisy
planes. I want to perform a 
# regression on each based on the grp factor. 
# y = x1 + x2 			for grp A
# y = 2 + 2x1 + 4x2 		for grp B
# y = 2 + 2x1 + 4x2 + 6x1x2  for grp C
ind <- matrix(runif(100), ncol=2, dimnames=list(c(),
c("x1","x2")))
d1 <- data.frame(ind,
y=ind[,"x1"]+ind[,"x2"]+rnorm(100,0,0.1),
grp=rep("A",
100))
d2 <- data.frame(ind,
y=2+2*ind[,"x1"]+4*ind[,"x2"]+rnorm(100,0,0.1),
grp=rep("B", 100))
d3 <- data.frame(ind,
y=2+2*ind[,"x1"]+4*ind[,"x2"]+6*ind[,"x1"]*ind[,"x2"]+rnorm(100,0,0.1),grp=rep("C",
100))
data2 <- rbind(d1,d2,d3)
# visualise
ifelse(require(rgl), with(data2, plot3d(x1,x2,y)), print("RGL not
found"))
fits <- lm(y ~ grp:x1 + grp:x2 + grp - 1, data=data2)
summary(fits)

Output:

Call:
lm(formula = y ~ grp:x1 + grp:x2 + grp - 1, data = data2)

Residuals:
      Min        1Q    Median        3Q       Max 
-1.338348 -0.092069 -0.002844  0.094813  1.259094 

Coefficients:
        Estimate Std. Error t value Pr(>|t|)    
grpA    -0.01943    0.08442  -0.230    0.818    
grpB     1.98345    0.08442  23.495  < 2e-16 ***
grpC     0.51002    0.08442   6.042 4.66e-09 ***
grpA:x1  1.03376    0.11769   8.784  < 2e-16 ***
grpB:x1  2.02563    0.11769  17.212  < 2e-16 ***
grpC:x1  4.83101    0.11769  41.050  < 2e-16 ***
grpA:x2  1.02371    0.09664  10.593  < 2e-16 ***
grpB:x2  4.00069    0.09664  41.396  < 2e-16 ***
grpC:x2  7.22331    0.09664  74.742  < 2e-16 ***
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 

Residual standard error: 0.2838 on 291 degrees of freedom
Multiple R-squared: 0.9973,     Adjusted R-squared: 0.9972 
F-statistic: 1.199e+04 on 9 and 291 DF,  p-value: < 2.2e-16 

Whilst there seems to be some correlation between coefficients and terms
that doesn't apply in all cases. I've tried many other combinations but
noe
that gave me coefficients similar to the defining equations. Bottom line is
I don't understand what the formula specification should be for the
multivariate version.

Thanks for any help given.

Chris

-- 
View this message in context:
http://www.nabble.com/Regression-by-groups-questions-tp24070481p24070481.html
Sent from the R help mailing list archive at Nabble.com.

xavier.chardon at free.fr

2009-Jun-18 07:25 UTC

head link

[R] Re gression by groups questions

Hi,

It seems like your multivariate regression works for groups A and B. For group
C, values are calculated using an interaction between x1 and x2. Try to include
it in the regression model to find the right coeffs:

y ~ grp:x1:x2 + grp:x1 + grp:x2 + grp - 1

It works for me. Notice in summary() that the coefficients for the x1:x2
interaction are not significantly different from 0 in groups A and B, which is
what we expected.

HTH,

Xavier



----- Mail Original -----
De: "Chris Friedl" <cfriedalek at gmail.com>
?: r-help at r-project.org
Envoy?: Mercredi 17 Juin 2009 12h07:08 GMT +01:00 Amsterdam / Berlin / Berne /
Rome / Stockholm / Vienne
Objet: [R] Re gression by groups questions


I have a large dataset grouped by a factor and I want to perform a regression
on each data subset based on this factor. There are many ways to do this,
posted here and elsewhere. I have tried several. However I found one method
posted on the R wiki which works exactly as I want, and I like the elegance
and simplicity of the solution,  but I don't understand how it works. Its
all in the formula specification. Problem is also that I want to extend it
from a single variable case to a multivariable case and haven't really got a
clue how to go about it. Hope someone can help.

# The code below creates a data.frame comprising two marginally noisy
straight lines and performs a 
# regression on each based on the grp factor. This is a single variable case
based on R wiki one liner.
# y = x		for grp A
# y = 2 + 5x	for grp B
set.seed(5)
ind <- as.vector(runif(50))
d1 <- data.frame(x=ind, y=ind + rnorm(50,0,0.1), grp=rep("A", 50))
d2 <- data.frame(x=ind, y=2+5*ind+rnorm(50,0,0.1), grp=rep("B",
50))
data1 <- rbind(d1,d2)
fits <- lm(y ~ grp:x + grp - 1, data=data1)
summary(fits)
with(data1, plot(x,y))
abline(coef(fits)[c(1,3)], col="red")
abline(coef(fits)[c(2,4)], col="blue")

Output:
Call:
lm(formula = y ~ grp:x + grp - 1, data = data1)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.206414 -0.057582 -0.001376  0.062737  0.242111 

Coefficients:
       Estimate Std. Error t value Pr(>|t|)    
grpA   -0.02935    0.02811  -1.044    0.299    
grpB    1.96200    0.02811  69.803   <2e-16 ***
grpA:x  1.06276    0.04921  21.596   <2e-16 ***
grpB:x  5.09351    0.04921 103.502   <2e-16 ***
---

We can see that the intercepts of are close to 0 and 2 and slopes close to 1
and 5 as suggested by the data creation equation. In fact these results are
identical to running lm individually on each subset of data with formula y ~
x. What I don't understand is how this result comes about from the formula
specification y ~ grp:x + grp -1 . Can someone offer an explanation?


In addition I want to apply the same idea for a multivariate data set. The
code below shows a multivariate dataset which I hope to fit by group.

# The code below creates a data.frame comprising three marginally noisy
planes. I want to perform a 
# regression on each based on the grp factor. 
# y = x1 + x2 			for grp A
# y = 2 + 2x1 + 4x2 		for grp B
# y = 2 + 2x1 + 4x2 + 6x1x2  for grp C
ind <- matrix(runif(200), ncol=2, dimnames=list(c(),
c("x1","x2")))
d1 <- data.frame(ind,
y=ind[,"x1"]+ind[,"x2"]+rnorm(100,0,0.1),
grp=rep("A",
100))
d2 <- data.frame(ind,
y=2+2*ind[,"x1"]+4*ind[,"x2"]+rnorm(100,0,0.1),
grp=rep("B", 100))
d3 <- data.frame(ind,
y=2+2*ind[,"x1"]+4*ind[,"x2"]+6*ind[,"x1"]*ind[,"x2"]+rnorm(100,0,0.1),grp=rep("C",
100))
data2 <- rbind(d1,d2,d3)
# visualise
ifelse(require(rgl), with(data2, plot3d(x1,x2,y)), print("RGL not
found"))
fits <- lm(y ~ grp:x1 + grp:x2 + grp - 1, data=data2)
summary(fits)

Output:

Call:
lm(formula = y ~ grp:x1 + grp:x2 + grp - 1, data = data2)

Residuals:
      Min        1Q    Median        3Q       Max 
-1.338348 -0.092069 -0.002844  0.094813  1.259094 

Coefficients:
        Estimate Std. Error t value Pr(>|t|)    
grpA    -0.01943    0.08442  -0.230    0.818    
grpB     1.98345    0.08442  23.495  < 2e-16 ***
grpC     0.51002    0.08442   6.042 4.66e-09 ***
grpA:x1  1.03376    0.11769   8.784  < 2e-16 ***
grpB:x1  2.02563    0.11769  17.212  < 2e-16 ***
grpC:x1  4.83101    0.11769  41.050  < 2e-16 ***
grpA:x2  1.02371    0.09664  10.593  < 2e-16 ***
grpB:x2  4.00069    0.09664  41.396  < 2e-16 ***
grpC:x2  7.22331    0.09664  74.742  < 2e-16 ***
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 

Residual standard error: 0.2838 on 291 degrees of freedom
Multiple R-squared: 0.9973,     Adjusted R-squared: 0.9972 
F-statistic: 1.199e+04 on 9 and 291 DF,  p-value: < 2.2e-16 

Whilst there seems to be some correlation between coefficients and terms
that doesn't apply in all cases. I've tried many other combinations but
noe
that gave me coefficients similar to the defining equations. Bottom line is
I don't understand what the formula specification should be for the
multivariate version.

Thanks for any help given.

Chris

-- 
View this message in context:
http://www.nabble.com/Regression-by-groups-questions-tp24070481p24070481.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Chris Friedl

2009-Jun-18 12:01 UTC

head link

[R] Re gression by groups questions

pasting reply by Xavier into this thread for the sake of others who might
come across it in their R travels. Thanks Xavier

Hi,

It seems like your multivariate regression works for groups A and B. For
group C, values are calculated using an interaction between x1 and x2. Try
to include it in the regression model to find the right coeffs:

y ~ grp:x1:x2 + grp:x1 + grp:x2 + grp - 1

It works for me. Notice in summary() that the coefficients for the x1:x2
interaction are not significantly different from 0 in groups A and B, which
is what we expected.

HTH,

Xavier


Chris Friedl wrote:> 
> I have a large dataset grouped by a factor and I want to perform a
> regression on each data subset based on this factor. There are many ways
> to do this, posted here and elsewhere. I have tried several. However I
> found one method posted on the R wiki which works exactly as I want, and I
> like the elegance and simplicity of the solution,  but I don't
understand
> how it works. Its all in the formula specification. Problem is also that I
> want to extend it from a single variable case to a multivariable case and
> haven't really got a clue how to go about it. Hope someone can help.
> 
> # The code below creates a data.frame comprising two marginally noisy
> straight lines and performs a 
> # regression on each based on the grp factor. This is a single variable
> case based on R wiki one liner.
> # y = x		for grp A
> # y = 2 + 5x	for grp B
> set.seed(5)
> ind <- as.vector(runif(50))
> d1 <- data.frame(x=ind, y=ind + rnorm(50,0,0.1), grp=rep("A",
50))
> d2 <- data.frame(x=ind, y=2+5*ind+rnorm(50,0,0.1),
grp=rep("B", 50))
> data1 <- rbind(d1,d2)
> fits <- lm(y ~ grp:x + grp - 1, data=data1)
> summary(fits)
> with(data1, plot(x,y))
> abline(coef(fits)[c(1,3)], col="red")
> abline(coef(fits)[c(2,4)], col="blue")
> 
> Output:
> Call:
> lm(formula = y ~ grp:x + grp - 1, data = data1)
> 
> Residuals:
>       Min        1Q    Median        3Q       Max 
> -0.206414 -0.057582 -0.001376  0.062737  0.242111 
> 
> Coefficients:
>        Estimate Std. Error t value Pr(>|t|)    
> grpA   -0.02935    0.02811  -1.044    0.299    
> grpB    1.96200    0.02811  69.803   <2e-16 ***
> grpA:x  1.06276    0.04921  21.596   <2e-16 ***
> grpB:x  5.09351    0.04921 103.502   <2e-16 ***
> ---
> 
> We can see that the intercepts of are close to 0 and 2 and slopes close to
> 1 and 5 as suggested by the data creation equation. In fact these results
> are identical to running lm individually on each subset of data with
> formula y ~ x. What I don't understand is how this result comes about
from
> the formula specification y ~ grp:x + grp -1 . Can someone offer an
> explanation?
> 
> 
> In addition I want to apply the same idea for a multivariate data set. The
> code below shows a multivariate dataset which I hope to fit by group.
> 
> # The code below creates a data.frame comprising three marginally noisy
> planes. I want to perform a 
> # regression on each based on the grp factor. 
> # y = x1 + x2 			for grp A
> # y = 2 + 2x1 + 4x2 		for grp B
> # y = 2 + 2x1 + 4x2 + 6x1x2  for grp C
> ind <- matrix(runif(100), ncol=2, dimnames=list(c(),
c("x1","x2")))
> d1 <- data.frame(ind,
y=ind[,"x1"]+ind[,"x2"]+rnorm(100,0,0.1),
> grp=rep("A", 100))
> d2 <- data.frame(ind,
y=2+2*ind[,"x1"]+4*ind[,"x2"]+rnorm(100,0,0.1),
> grp=rep("B", 100))
> d3 <- data.frame(ind,
>
y=2+2*ind[,"x1"]+4*ind[,"x2"]+6*ind[,"x1"]*ind[,"x2"]+rnorm(100,0,0.1),grp=rep("C",
> 100))
> data2 <- rbind(d1,d2,d3)
> # visualise
> ifelse(require(rgl), with(data2, plot3d(x1,x2,y)), print("RGL not
found"))
> fits <- lm(y ~ grp:x1 + grp:x2 + grp - 1, data=data2)
> summary(fits)
> 
> Output:
> 
> Call:
> lm(formula = y ~ grp:x1 + grp:x2 + grp - 1, data = data2)
> 
> Residuals:
>       Min        1Q    Median        3Q       Max 
> -1.338348 -0.092069 -0.002844  0.094813  1.259094 
> 
> Coefficients:
>         Estimate Std. Error t value Pr(>|t|)    
> grpA    -0.01943    0.08442  -0.230    0.818    
> grpB     1.98345    0.08442  23.495  < 2e-16 ***
> grpC     0.51002    0.08442   6.042 4.66e-09 ***
> grpA:x1  1.03376    0.11769   8.784  < 2e-16 ***
> grpB:x1  2.02563    0.11769  17.212  < 2e-16 ***
> grpC:x1  4.83101    0.11769  41.050  < 2e-16 ***
> grpA:x2  1.02371    0.09664  10.593  < 2e-16 ***
> grpB:x2  4.00069    0.09664  41.396  < 2e-16 ***
> grpC:x2  7.22331    0.09664  74.742  < 2e-16 ***
> ---
> Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 
> 
> Residual standard error: 0.2838 on 291 degrees of freedom
> Multiple R-squared: 0.9973,     Adjusted R-squared: 0.9972 
> F-statistic: 1.199e+04 on 9 and 291 DF,  p-value: < 2.2e-16 
> 
> Whilst there seems to be some correlation between coefficients and terms
> that doesn't apply in all cases. I've tried many other combinations
but
> noe that gave me coefficients similar to the defining equations. Bottom
> line is I don't understand what the formula specification should be for
> the multivariate version.
> 
> Thanks for any help given.
> 
> Chris
> 
> 
-- 
View this message in context:
http://www.nabble.com/Regression-by-groups-questions-tp24070481p24091659.html
Sent from the R help mailing list archive at Nabble.com.

Possibly Parallel Threads

Search for more possibly parallel threads

R help - Jun 2009 - Re gression by groups questions

[R] Re gression by groups questions

[R] Re gression by groups questions

[R] Re gression by groups questions

Possibly Parallel Threads