Hi:
On Fri, Jan 7, 2011 at 7:04 AM, wangwallace <talenttree@gmail.com> wrote:
>
> hey, folks,
>
> I have two very simple questions. I am not quite sure if I can do this
> using
> R.
>
> so, I am analyzing a large data frame with thousands of variables. For
> example:
>
> Dependent variables: d1, d2, d3 (i.e., there are three dependent variables)
>
> Independent variables: s1, s2, s3, ......s1000 (i.e., there are 1000
> independent variables)
>
> now I want to run simple linear regression analyses of dependent variables
> on independent variables. A laborious way to do this is running 1000 linear
> regression models for each dependent variable separately. This would take a
> lot of time. My question is:
>
> 1) is there a easy way to run all 1000 linear regression analyses for each
> dependent variable at once?
>
Yes.
> 2) after running all 1000 linear regression analyses at once, how can I
> write 3000 regression weights (1000 regression weights for each dependent
> variable) and its significance level in to a file (e.g., csv, excel, ect).
>
Define 'weights'. Do you mean the parameter estimates?
Here's a simplified example, but the output is a separate list object for
each response. Each component of each list is an lm object, produced from
the lm() function.
# Generate some fake data: three responses and eight covariates
df <- data.frame(y1 = rnorm(50), y2 = rnorm(50), y3 = rnorm(50),
x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50),
x4 = rpois(50, 30), x5 = rpois(50, 20), x6 = rpois(50,
10),
x7 = runif(50), x8 = runif(50))
# Create a vector of covariate names
xs <- paste('x', 1:8, sep = '')
# Initialize a list whose length is that of the vector xs
rl1 <- vector('list', 8)
rl2 <- vector('list', 8)
rl3 <- vector('list', 8)
# The following loop regresses all three responses individually on a single
covariate x[i]
# and exports the results to separate lists for each response
# The first statement creates a formula with the name of the i-th covariate
# The second statement does the regression and assigns the output to the
i-th
# component of the list corresponding to the j-th response (j = 1, 2, 3)
for(i in 1:8) { fm1<- as.formula(paste('y1', xs[i], sep =
'~'))
fm2 <- as.formula(paste('y2', xs[i], sep =
'~'))
fm3 <- as.formula(paste('y3', xs[i], sep =
'~'))
rl1[[i]] <- lm(fm1, data = df)
rl2[[i]] <- lm(fm2, data = df)
rl3[[i]] <- lm(fm3, data = df)
}
# The print method of lm() applied to the first component is
rl1[[1]]
Each component of each list will resemble the output object you would get
from running lm() on a single response with one explanatory variable at a
time. In each list, there are as many components as there are covariates.
You can extract all sorts of things from each component of the list; I
prefer using
the ldply() function from package plyr, but there are other ways with base
functions.
Here are some examples:
library(plyr)
# R^2 values:
ldply(rl1, function(x) summary(x)$r.squared)
# Model coefficients:
ldply(rl1, function(x) coef(x))
# p-values of significance tests for intercept and slope
ldply(rl1, function(x) summary(x)$coefficients[, 4])
# residuals from each model
res1 <- t(ldply(rl1, function(x) resid(x))) # produces a matrix
Some comments:
1. If you want to run multivariate regression on the response matrix [y1 y2
y3],
you only need one list object for output. Substitute 'cbind(y1, y2,
y3)'
for
yi when creating the formula. Only one call is needed for multivariate
regression rather than three for the univariate regressions. However, you
would then need to be more careful about output structures and pulling
them
together over list components. For instance, in a multivariate
regression, the
coefficients are output as a matrix rather than a vector, so to combine
them
all, you would need to use a 3D array rather than a data frame.
2. The 'x' in the anonymous functions in ldply() corresponds to a
generic
list
component. In this context, x is an lm object, so anything you could do
with
the output of an lm object could be mapped componentwise with ldply. This
approach is very convenient if you want to pick off individual output
pieces
(e.g., R^2 or the root MSE) for all the regressions at once.
3. To associate the covariates with rows of output in each generated data
frame above, you could use, e.g.,
r2y1 <- ldply(rl1, function(x) summary(x)$r.squared)
rownames(r2y1) <- xs
In the residual example, the data frame is 8 x 50; using the t() function
[for transpose], the result is a 50 x 8 matrix whose columns correspond
to
the individual covariates, so in this case
colnames(res1) <- xs
4. Make sure you organize your code and desired output in advance; with 1000
covariates, things could get messy if you're not careful.
HTH,
Dennis
> Many thanks in advance!!
> --
> View this message in context:
>
http://r.789695.n4.nabble.com/how-to-run-linear-regression-models-at-once-tp3179256p3179256.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
[[alternative HTML version deleted]]