Hi,

I have been examining a large data set and need to do simple linear regression on data that are grouped by the values of a particular attribute. For instance, consider three columns, ID, x and y: I need to regress y on x for each distinct value of ID. Specifically, for the set of data corresponding to each of the 4 values of ID (76, 111, 121, 168) in the data below, I should invoke linear regression 4 times. The challenge is that the length of the ID vector is around 20000, and therefore linear regression must be done automatically for each distinct value of ID.

 ID      x     y
 76  36476  15.8
 76  36493  66.9
 76  36579  65.6
111  35465  10.3
111  35756   4.8
121  38183  16
121  38184  15
121  38254   9.6
121  38255   7
168  37727  21.9
168  37739  29.7
168  37746  97.4

I was wondering whether there is an easy way to group data based on the values of ID in R so that linear regression can be done easily for each group determined by each value of ID. Or is the only way to construct loops with 'for' or 'while', in which a matrix is generated for each distinct value of ID that stores the corresponding values of x and y by screening the entire ID vector?

Thanks in advance,

Yasin
On Dec 28, 2010, at 9:23 PM, Entropi ntrp wrote:

> [...] The challenge is that, the length of the ID vector is around 20000
> and therefore linear regression must be done automatically for each
> distinct value of ID.

Let's say that is a data frame named "indat". Try:

    lapply(split(indat, as.factor(indat$ID)), function(df) {lm(y ~ x, data=df)} )

-- 
David Winsemius, MD
West Hartford, CT
library(nlme)
lmList(y ~ x | factor(ID), myData)

This gives a list of fitted model objects.

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Entropi ntrp
Sent: Wednesday, 29 December 2010 12:24 PM
To: r-help at r-project.org
Subject: [R] linear regression for grouped data

[...]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
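For readers who want to try the lmList suggestion directly, here is a minimal sketch using the twelve rows from the original post. The data frame name `myData` follows the reply above; converting ID to a factor before the call (rather than inside the formula) is an assumption made here for simplicity, not something the reply prescribes.

```r
library(nlme)  # recommended package, normally installed alongside R

# Sample rows copied from the original post
myData <- data.frame(
  ID = c(76, 76, 76, 111, 111, 121, 121, 121, 121, 168, 168, 168),
  x  = c(36476, 36493, 36579, 35465, 35756, 38183, 38184, 38254, 38255,
         37727, 37739, 37746),
  y  = c(15.8, 66.9, 65.6, 10.3, 4.8, 16, 15, 9.6, 7, 21.9, 29.7, 97.4)
)

# Assumption: make the grouping variable a factor up front
myData$ID <- factor(myData$ID)

# One lm fit of y on x per level of ID
fits <- lmList(y ~ x | ID, data = myData)

# coef() on an lmList returns one row of coefficients per group
cf <- coef(fits)
print(cf)
```

With only two observations, group 111 is fitted exactly, so its standard errors are not meaningful; groups that small may deserve to be filtered out before fitting.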
Thanks a lot for the quick responses. I have some additional questions related to this topic. In fact, my intention was to be able to answer questions such as: what percentage of the regressions have p-values below a certain threshold, what do the residuals look like, what do the plots of y vs. x look like, etc. I tried the following commands and found that the second line (and similar ones) does not work for extracting certain statistics.

    regress=lapply(split(egfr, as.factor(egfr$P_ID)), function(df) {anova(lm(VALUE ~ LAB_DT, data=df)) })
    regress[1]$residuals; regress[1]$fstatistic[1]

So, is it possible to record statistics of each regression, such as the p-value, F-value, residuals, etc., as a vector?

Thanks,
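One likely reason the second line fails is that `anova()` returns a table rather than a fitted model, and `regress[1]` is a one-element list rather than the element itself (that needs `[[1]]`). A minimal sketch of one way to collect the statistics follows, assuming the sample data from the original post in place of the `egfr` data, which is not shown in the thread. The helper name `f_test` and the 0.05 threshold are illustrative choices, not part of anyone's reply.

```r
# Sample rows copied from the original post (stand-in for the egfr data)
indat <- data.frame(
  ID = c(76, 76, 76, 111, 111, 121, 121, 121, 121, 168, 168, 168),
  x  = c(36476, 36493, 36579, 35465, 35756, 38183, 38184, 38254, 38255,
         37727, 37739, 37746),
  y  = c(15.8, 66.9, 65.6, 10.3, 4.8, 16, 15, 9.6, 7, 21.9, 29.7, 97.4)
)

# Keep the lm objects themselves instead of their anova() tables
fits <- lapply(split(indat, indat$ID), function(d) lm(y ~ x, data = d))

# Hypothetical helper: overall F statistic and its p-value for one fit
f_test <- function(m) {
  fs <- summary(m)$fstatistic          # named vector: value, numdf, dendf
  p  <- pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)
  c(F = unname(fs["value"]), p = unname(p))
}

# One row per ID; groups with no residual degrees of freedom give NaN
stats <- t(sapply(fits, f_test))

# Residuals stay in a list because group sizes differ
res <- lapply(fits, residuals)

# Share of regressions with p below the (illustrative) threshold
share <- mean(stats[, "p"] < 0.05, na.rm = TRUE)
```

Single elements are then reached with double brackets, for example `res[["121"]]` or `stats["121", "p"]`.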
Hi:

There are some advantages to taking a plyr approach to this type of problem. The basic idea is to fit a linear model to each subgroup and save the results in a list, from which you can extract what you want piece by piece.

library(plyr)

# One of those SAS style data sets...
> df <- data.frame(matrix(scan(), ncol = 3, byrow = TRUE))
1: 76 36476 15.8 76 36493 66.9 76 36579 65.6 111 35465 10.3 111 35756 4.8
16: 121 38183 16 121 38184 15 121 38254 9.6 121 38255 7 168 37727 21.9 168
32: 37739 29.7 168 37746 97.4
37:
Read 36 items

# A little cleanup:
names(df) <- c('ID', 'x', 'y')
df$ID <- factor(df$ID)

# Fit a linear model to each sub-data frame identified by ID
# and send the results to a list object
# dlply takes a data frame as input and outputs a list
# the grouping variable is ID
# the argument d in the function is the sub-data frame of a given ID
lr1 <- dlply(df, .(ID), function(d) lm(y ~ x, data = d))

# So you can do things like:
# Grab the model coefficients
# (input is a list, output is a data frame)
> ldply(lr1, function(m) m$coef)
   ID  (Intercept)           x
1  76  -11699.9999  0.32176123
2 111     680.6007 -0.01890034
3 121    3900.5051 -0.10174534
4 168 -136322.4296  3.61371841

# export the R^2 values
> ldply(lr1, function(m) summary(m)$r.squared)
   ID        V1
1  76 0.3718840
2 111 1.0000000
3 121 0.9367437
4 168 0.6993811

# Extract the residuals and predicted values to another list
> llply(lr1, function(m) cbind(m$resid, m$fitted))
$`76`
        [,1]     [,2]
1 -20.762884 36.56288
2  24.867175 42.03282
3  -4.104291 69.70429

$`111`
  [,1] [,2]
4    0 10.3
5    0  4.8

$`121`
        [,1]      [,2]
6  0.4371678 15.562832
7 -0.4610869 15.461087
8  1.2610869  8.338913
9 -1.2371678  8.237168

$`168`
        [,1]     [,2]
10   9.57509 12.32491
11 -25.98953 55.68953
12  16.41444 80.98556

# Plot the residuals vs. fitted values for each model (don't blink :)
# the _ means that no object is returned; the plot is a side effect
l_ply(lr1, function(d) plot(resid(d) ~ fitted(d)))

These are just some examples; clearly, there is a lot more one could do with this type of structure.

HTH,
Dennis
At 02:23 29/12/2010, Entropi ntrp wrote:
>[...] I was wondering whether there is an easy way to group data based on the
>values of ID in R so that linear regression can be done easily for each
>group determined by each value of ID.

The advantage of using lmList from nlme is that
a) it gives you access to a range of functions already written to operate on such objects
b) you can easily write your own extractor function and then call it using lapply

If you do it yourself you can still do (b), but you lose (a).

Michael Dewey
http://www.aghmed.fsnet.co.uk
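Point (b) can be sketched as follows, assuming the sample data from the original post and ID stored as a factor. The extractor name `get_r2` is hypothetical, and R-squared is only one example of a statistic one might pull out; `sapply` is used here as a convenience wrapper around `lapply`.

```r
library(nlme)  # recommended package, normally installed alongside R

# Sample rows copied from the original post
myData <- data.frame(
  ID = factor(c(76, 76, 76, 111, 111, 121, 121, 121, 121, 168, 168, 168)),
  x  = c(36476, 36493, 36579, 35465, 35756, 38183, 38184, 38254, 38255,
         37727, 37739, 37746),
  y  = c(15.8, 66.9, 65.6, 10.3, 4.8, 16, 15, 9.6, 7, 21.9, 29.7, 97.4)
)

fits <- lmList(y ~ x | ID, data = myData)

# Hypothetical extractor: R-squared of one component fit.
# lmList stores NULL for a group whose fit failed, so guard against that.
get_r2 <- function(m) {
  if (is.null(m)) return(NA_real_)
  summary(m)$r.squared
}

# An lmList is a list of lm fits, so the extractor can be mapped over it
r2 <- sapply(fits, get_r2)
```

The result is a named numeric vector with one entry per ID, which makes threshold questions (for example `mean(r2 > 0.5, na.rm = TRUE)`) a one-liner.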