thr3ads.net - R help - [R] linear regression for grouped data [Dec 2010]

Entropi ntrp

2010-Dec-29 02:23 UTC

[R] linear regression for grouped data

Hi,
I have been examining a large data set and need to run a simple linear
regression on data that is grouped by the values of a particular
attribute. For instance, consider three columns, ID, x, and y: I need to
regress y on x for each distinct value of ID. Specifically, for the data
below, I should run the regression 4 times, once for each of the 4 values
of ID (76, 111, 121, 168). The challenge is that the ID vector has a
length of around 20000, so the regression must be done automatically for
each distinct value of ID.

   ID      x     y
   76  36476  15.8
   76  36493  66.9
   76  36579  65.6
  111  35465  10.3
  111  35756   4.8
  121  38183  16
  121  38184  15
  121  38254   9.6
  121  38255   7
  168  37727  21.9
  168  37739  29.7
  168  37746  97.4

I was wondering whether there is an easy way in R to group data by the
values of ID so that linear regression can be done for each group. Or is
the only way to construct 'for' or 'while' loops in which a matrix is
generated for each distinct value of ID, storing the corresponding values
of x and y by scanning the entire ID vector?

Thanks in advance,

Yasin


David Winsemius

2010-Dec-29 02:31 UTC

[R] linear regression for grouped data

Let's say that is a data frame named "indat". Try:

  lapply(split(indat, as.factor(indat$ID)), function(df) lm(y ~ x, data = df))
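
A minimal sketch of how that one-liner might be used end to end, assuming the posted values have been typed into a data frame called indat (the name suggested above). The sapply() step that collects the coefficients is an illustration added here, not part of the original reply.

```r
# Example values from the original post, entered by hand (columns ID, x, y)
indat <- data.frame(
  ID = c(76, 76, 76, 111, 111, 121, 121, 121, 121, 168, 168, 168),
  x  = c(36476, 36493, 36579, 35465, 35756, 38183, 38184, 38254, 38255,
         37727, 37739, 37746),
  y  = c(15.8, 66.9, 65.6, 10.3, 4.8, 16, 15, 9.6, 7, 21.9, 29.7, 97.4)
)

# One lm fit per distinct ID; the result is a named list of "lm" objects
fits <- lapply(split(indat, as.factor(indat$ID)),
               function(df) lm(y ~ x, data = df))

# Collect intercept and slope of every group into a matrix (one column per ID)
coefs <- sapply(fits, coef)
print(t(coefs))
```

Each element of fits is an ordinary lm object, so fits[["121"]] can be passed to summary(), residuals(), or plot() like any single fit.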
-- 

David Winsemius, MD
West Hartford, CT

Bill.Venables at csiro.au

2010-Dec-29 03:33 UTC

[R] linear regression for grouped data

library(nlme)
lmList(y ~ x | factor(ID), myData)

This gives a list of fitted model objects. 
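
A short sketch of what this could look like with the posted numbers, assuming they are stored in a data frame named myData as in the call above. Here ID is converted to a factor when the data frame is built, so the grouping term is written simply as | ID; that is a choice made for this illustration.

```r
library(nlme)  # recommended package, normally installed together with R

# Example values from the original post, entered by hand
myData <- data.frame(
  ID = factor(c(76, 76, 76, 111, 111, 121, 121, 121, 121, 168, 168, 168)),
  x  = c(36476, 36493, 36579, 35465, 35756, 38183, 38184, 38254, 38255,
         37727, 37739, 37746),
  y  = c(15.8, 66.9, 65.6, 10.3, 4.8, 16, 15, 9.6, 7, 21.9, 29.7, 97.4)
)

# One regression of y on x per level of ID
fit <- lmList(y ~ x | ID, data = myData)

# coef() on an lmList gives one row of estimates per group
cf <- coef(fit)
print(cf)
```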


entropy

2010-Dec-29 06:15 UTC

[R] linear regression for grouped data

Thanks a lot for the quick responses.
I have some additional questions related to this topic. In fact, my
intention was to be able to answer questions like: what percentage of the
regressions have p-values below a certain threshold, what do the
residuals look like, what do the plots of y vs. x look like, etc.
I tried the following commands and found that the second line (and
similar ones) does not work for extracting certain statistics.

regress=lapply(split(egfr, as.factor(egfr$P_ID)), function(df)
{anova(lm(VALUE ~ LAB_DT, data=df)) })
regress[1]$residuals; regress[1]$fstatistic[1]

So, is it possible to record the statistics of each regression, such as
the p-value, F-value, residuals, etc., as a vector?

Thanks,
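
Two things probably explain why the second line returns NULL: regress[1] is a one-element list (regress[[1]] would give the element itself), and anova() returns a table that has no residuals or fstatistic components; those belong to the lm object and its summary(). A hedged sketch of one way to collect the statistics follows. The data frame egfr is a hypothetical stand-in filled with the example values from the first post, and get_stats is a helper name invented for this illustration.

```r
# Hypothetical stand-in for the poster's data (columns P_ID, LAB_DT, VALUE)
egfr <- data.frame(
  P_ID   = c(76, 76, 76, 111, 111, 121, 121, 121, 121, 168, 168, 168),
  LAB_DT = c(36476, 36493, 36579, 35465, 35756, 38183, 38184, 38254, 38255,
             37727, 37739, 37746),
  VALUE  = c(15.8, 66.9, 65.6, 10.3, 4.8, 16, 15, 9.6, 7, 21.9, 29.7, 97.4)
)

# Keep the lm objects themselves rather than their anova tables
fits <- lapply(split(egfr, as.factor(egfr$P_ID)),
               function(df) lm(VALUE ~ LAB_DT, data = df))

# Overall F statistic and its p-value, taken from summary.lm
get_stats <- function(m) {
  f <- summary(m)$fstatistic          # named vector: value, numdf, dendf
  # Groups with no residual degrees of freedom (e.g. two points) get NA
  if (is.null(f) || f["dendf"] < 1) return(c(F = NA_real_, p = NA_real_))
  c(F = unname(f["value"]),
    p = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)))
}
stats <- t(sapply(fits, get_stats))   # one row per P_ID, columns F and p
print(stats)

# Residuals remain available per group, here for the first ID
res_first <- residuals(fits[[1]])

# Share of regressions with a p-value below a threshold (NA groups ignored)
share <- mean(stats[, "p"] < 0.05, na.rm = TRUE)
```

Storing the fitted models and deriving everything else from them keeps residuals, fitted values, and test statistics available from a single list.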



Dennis Murphy

2010-Dec-29 06:16 UTC

[R] linear regression for grouped data

Hi:

There are some advantages to taking a plyr approach to this type of problem.
The basic idea is to fit a linear model to each subgroup and save the
results in a list, from which you can extract what you want piece by piece.

library(plyr)

# One of those SAS style data sets...
> df <- data.frame(matrix(scan(), ncol = 3, byrow = TRUE))
1: 76 36476 15.8  76 36493 66.9  76 36579 65.6  111 35465 10.3  111 35756 4.8
16: 121 38183 16  121 38184 15  121 38254 9.6  121 38255 7  168 37727 21.9  168
32: 37739 29.7  168 37746 97.4
37:
Read 36 items

# A little cleanup:
names(df) <- c('ID', 'x', 'y')
df$ID <- factor(df$ID)

# Fit a linear model to each sub-data frame identified by ID
# and send the results to a list object

# dlply takes a data frame as input and outputs a list
# the grouping variable is ID
# the argument d in the function is the sub-data frame of a given ID
lr1 <- dlply(df, .(ID), function(d) lm(y ~ x, data = d))

# So you can do things like:

# Grab the model coefficients
# (input is a list, output is a data frame)
> ldply(lr1, function(m) m$coef)
   ID  (Intercept)           x
1  76  -11699.9999  0.32176123
2 111     680.6007 -0.01890034
3 121    3900.5051 -0.10174534
4 168 -136322.4296  3.61371841

# export the R^2 values
> ldply(lr1, function(m) summary(m)$r.squared)
   ID        V1
1  76 0.3718840
2 111 1.0000000
3 121 0.9367437
4 168 0.6993811

# Extract the residuals and predicted values to another list
> llply(lr1, function(m) cbind(m$resid, m$fitted))
$`76`
        [,1]     [,2]
1 -20.762884 36.56288
2  24.867175 42.03282
3  -4.104291 69.70429

$`111`
  [,1] [,2]
4    0 10.3
5    0  4.8

$`121`
        [,1]      [,2]
6  0.4371678 15.562832
7 -0.4610869 15.461087
8  1.2610869  8.338913
9 -1.2371678  8.237168

$`168`
        [,1]     [,2]
10   9.57509 12.32491
11 -25.98953 55.68953
12  16.41444 80.98556

# Plot the residuals vs. fitted values for each model (don't blink :)
# the _ means that no object is returned; the plot is a side effect
l_ply(lr1, function(d) plot(resid(d) ~ fitted(d)))

These are just some examples; clearly, there is a lot more one could do with
this type of structure.

HTH,
Dennis


Michael Dewey

2010-Dec-30 13:39 UTC

[R] linear regression for grouped data

The advantage of using lmList from nlme is that
a) it gives you access to a range of functions already written to 
operate on such objects
b) you can easily write your own extractor function and then call it 
using lapply

If you do it yourself you can still do (b) but you lose (a)
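
As a rough illustration of point (b), the sketch below applies a user-written extractor to every fit in an lmList with lapply. The data frame dat holds the example values from the first post, and get_r2 is a hypothetical helper name chosen for this example.

```r
library(nlme)  # recommended package, normally installed together with R

# Example values from the original post, entered by hand
dat <- data.frame(
  ID = factor(c(76, 76, 76, 111, 111, 121, 121, 121, 121, 168, 168, 168)),
  x  = c(36476, 36493, 36579, 35465, 35756, 38183, 38184, 38254, 38255,
         37727, 37739, 37746),
  y  = c(15.8, 66.9, 65.6, 10.3, 4.8, 16, 15, 9.6, 7, 21.9, 29.7, 97.4)
)

fit <- lmList(y ~ x | ID, data = dat)

# Hypothetical extractor: R-squared of a single lm fit
get_r2 <- function(m) summary(m)$r.squared

# Apply the extractor to each group's fit and flatten to a named vector
r2 <- unlist(lapply(fit, get_r2))
print(round(r2, 4))
```

The same pattern should work for any quantity that can be computed from a single lm object, such as a slope, a p-value, or a residual standard error.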

Michael Dewey
http://www.aghmed.fsnet.co.uk
