thr3ads.net - R help - [R] Regression with many independent variables [Feb 2011]


Matthew Douglas

2011-Feb-28 20:32 UTC

[R] Regression with many independent variables

Hi,

I am trying to use lm() on some data. The code works fine, but I would
like a more efficient way to write it.

The data looks like this (the data is very sparse with a few 1s, -1s
and the rest 0s):
> head(adj0708)
      MARGIN Poss P235 P247 P703 P218 P430 P489 P83 P307 P337 ....
1   64.28571   29    0    0    0    0    0    0   0    0    0    0    0    0    0
2 -100.00000    6    0    0    0    0    0    0   0    1    0    0    0    0    0
3  100.00000    4    0    0    0    0    0    0   0    1    0    0    0    0    0
4  -33.33333    7    0    0    0    0    0    0   0    0    0    0    0    0    0
5  200.00000    2    0    0    0    0    0    0   0    0    0    0   -1    0    0
6  -83.33333   12    0   -1    0    0    0    0   0    0    0    0    0    0    0

adj0708 is actually a 35657x341 data set. Each column after "Poss" is
an independent variable, the dependent variable is "MARGIN" and it is
weighted by "Poss"


The regression is below:
fit.adj0708 <- lm( adj0708$MARGIN~adj0708$P235 + adj0708$P247 +
adj0708$P703 + adj0708$P430 + adj0708$P489 + adj0708$P218 +
adj0708$P605 + adj0708$P337 + .... +
adj0708$P510,weights=adj0708$Poss)

I have two questions:

1. Is there a way to condense how I write the independent variables
in the lm() call, instead of having such a long line of code (I have 339
independent variables, to be exact)?
2. I would like to pair the variables and look at a regression on the
interactions between two independent variables. I think it would look
something like this....
fit.adj0708 <- lm( adj0708$MARGIN~adj0708$P235:adj0708$P247 +
adj0708$P703:adj0708$P430 + adj0708$P489:adj0708$P218 +
adj0708$P605:adj0708$P337 + ....,weights=adj0708$Poss)
but there will be 339 choose 2 combinations, so a lot of independent
variables! Is there a more efficient way of writing this code?

Thanks,
Matt

Greg Snow

2011-Feb-28 21:30 UTC


Don't put the name of the dataset in the formula; use the data argument to
lm to provide it. A single period (".") on the right-hand side of
the formula represents all the columns in the data set that are not on the
left-hand side (you can then use "-" to remove any other columns that
you don't want included on the RHS).

For example:
> lm(Sepal.Width ~ . - Sepal.Length, data=iris)
Call:
lm(formula = Sepal.Width ~ . - Sepal.Length, data = iris)

Coefficients:
      (Intercept)       Petal.Length        Petal.Width  Speciesversicolor  
           3.0485             0.1547             0.6234            -1.7641  
 Speciesvirginica  
          -2.1964  
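
Applied to data shaped like the original poster's, the same idea might look like the sketch below. The data frame here is a small synthetic stand-in (the name adj, the three P columns and the row count are made up for illustration); only MARGIN, Poss and the use of Poss as a weight come from the question.

```r
# Synthetic stand-in for adj0708; sizes and values are invented for illustration.
set.seed(1)
n <- 200
adj <- data.frame(
  MARGIN = rnorm(n, sd = 50),
  Poss   = sample(1:30, n, replace = TRUE),
  P235   = sample(c(-1, 0, 1), n, replace = TRUE),
  P247   = sample(c(-1, 0, 1), n, replace = TRUE),
  P703   = sample(c(-1, 0, 1), n, replace = TRUE)
)

# "." expands to every column of `data` except MARGIN; "- Poss" keeps the
# weight column out of the predictors. weights = Poss is looked up in `data`.
fit_short <- lm(MARGIN ~ . - Poss, data = adj, weights = Poss)

# The same model written out term by term, for comparison.
fit_long <- lm(MARGIN ~ P235 + P247 + P703, data = adj, weights = Poss)

# Compare the two sets of coefficients.
print(all.equal(coef(fit_short), coef(fit_long)))
```

Without the "- Poss" term, Poss would enter the model as a predictor as well as a weight, because "." picks up every column that is not on the left-hand side.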


But, are you sure that a regression model with 339 predictors will be
meaningful?
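
On the second question, the formula operator ^2 expands to all main effects plus all pairwise interactions, so the pairs do not need to be typed out. The sketch below again uses a small synthetic data frame (adj, preds and the four P columns are hypothetical names for illustration) and then counts how many coefficients the full 339-variable version would ask for.

```r
# Synthetic stand-in for adj0708; sizes and values are invented for illustration.
set.seed(2)
n <- 300
adj <- data.frame(
  MARGIN = rnorm(n, sd = 50),
  Poss   = sample(1:30, n, replace = TRUE),
  P235   = sample(c(-1, 0, 1), n, replace = TRUE),
  P247   = sample(c(-1, 0, 1), n, replace = TRUE),
  P703   = sample(c(-1, 0, 1), n, replace = TRUE),
  P218   = sample(c(-1, 0, 1), n, replace = TRUE)
)

# Drop the weight column so that "." covers only the P columns.
preds <- adj[, setdiff(names(adj), "Poss")]

# .^2 = every main effect plus every two-way interaction (P235:P247, ...).
fit_int <- lm(MARGIN ~ .^2, data = preds, weights = adj$Poss)
print(names(coef(fit_int)))

# Coefficient count for the real problem: intercept + 339 main effects
# + choose(339, 2) pairwise interactions, compared with 35657 rows.
p <- 339
n_terms <- 1 + p + choose(p, 2)
print(n_terms)
print(n_terms > 35657)
```

With 339 predictors this formula asks for more coefficients than there are rows, so lm() could not estimate all of them and would report the rest as NA. With data this sparse, many interaction columns are also likely to be all zeros, which is another reason to ask whether the model is meaningful.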

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

