thr3ads.net - R help - [R] question about the degrees of freedom [May 2010]


serdal ozusaglam

2010-May-03 13:19 UTC

[R] question about the degrees of freedom

Dear R users,


I think I have a simple question, which I will explain with an example.

I have several 2-digit industry codes that I want to use for a by-industry
analysis, but I think there is a problem with the degrees of freedom!

For example, when I run my analysis without any 2-digit industry code, I get the
following summary (I have 146574 observations in total):

> abc<-lm(lnQ~lnC+lnM+lnL+lnE+eco+inno, data=ds)
> summary(abc)
Call:
lm(formula = lnQ ~ lnC + lnM + lnL + lnE + eco + inno, data = ds)

Residuals:
      Min        1Q    Median        3Q       Max 
-11.01340  -0.17637  -0.02217   0.14974   7.79005 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 0.8870369  0.0050646  175.144   <2e-16 ***
lnC         0.0658922  0.0006549  100.614   <2e-16 ***
lnM         0.8027478  0.0006549 1225.764   <2e-16 ***
lnL         0.0173622  0.0004025   43.138   <2e-16 ***
lnE         0.0657710  0.0006745   97.516   <2e-16 ***
ecoTRUE     0.0101649  0.0045892    2.215   0.0268 *  
innoTRUE    0.0945100  0.0030317   31.174   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.294 on 146160 degrees of freedom
  (407 observations deleted due to missingness)
Multiple R-squared: 0.9705,     Adjusted R-squared: 0.9705 
F-statistic: 8.027e+05 on 6 and 146160 DF,  p-value: < 2.2e-16 

As we can see from the last row, there are 146160 DF (407 observations deleted);
this is OK!

But now I want to use just one of the industries, let's say the 11th industry.
First, I create the dummy for this industry:

> ind1=(ind_2d==11) # so here R is supposed to consider just the 11th industry!!
> abc<-lm(lnQ~lnC+lnM+lnL+lnE+eco+inno+ind, data=ds)
> summary(abc)
Call:
lm(formula = lnQ ~ lnC + lnM + lnL + lnE + eco + inno + ind, 
    data = ds)

Residuals:
      Min        1Q    Median        3Q       Max 
-11.03392  -0.17647  -0.02301   0.14901   7.74957 

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  0.8980397  0.0050451  178.001  < 2e-16 ***
lnC          0.0672255  0.0006523  103.065  < 2e-16 ***
lnM          0.7990819  0.0006579 1214.596  < 2e-16 ***
lnL          0.0171633  0.0004004   42.870  < 2e-16 ***
lnE          0.0670030  0.0006716   99.770  < 2e-16 ***
ecoTRUE      0.0162249  0.0045672    3.552 0.000382 ***
innoTRUE     0.0966967  0.0030160   32.062  < 2e-16 ***
indTRUE     -0.1251466  0.0031509  -39.717  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.2924 on 146159 degrees of freedom
  (407 observations deleted due to missingness)
Multiple R-squared: 0.9709,     Adjusted R-squared: 0.9709 
F-statistic: 6.957e+05 on 7 and 146159 DF,  p-value: < 2.2e-16 

But as we can see, it again counted all the industries, so the DF is 146159!!!


So I just wonder: where did I make a mistake? Or is there no mistake at all, and
I just misunderstood the DF issue?

Any answer would be appreciated.
Thanks in advance






 		 	   		  

Ista Zahn

2010-May-03 14:38 UTC


[R] question about the degrees of freedom

Hi Serdal,
There is a lot of confusion here (how much is yours and how much is
mine remains to be seen). See specific comments in line.

On Mon, May 3, 2010 at 9:19 AM, serdal ozusaglam
<saint-filth at hotmail.com> wrote:
>
> Dear R users,
>
>
> I think i have a simple question which i want to explain by an example;
>
> i have several 2-digit industry codes that i want to use for conducting
> by-industry analysis but i think there is a problem with the degrees of freedom!
>
> for example, when i do my analysis without any 2-digit industry code, i got
> the following summary (i have 146574 observations in total):
>> abc<-lm(lnQ~lnC+lnM+lnL+lnE+eco+inno, data=ds)
>> summary(abc)
>
> Call:
> lm(formula = lnQ ~ lnC + lnM + lnL + lnE + eco + inno, data = ds)
>
> Residuals:
>       Min        1Q    Median        3Q       Max
> -11.01340  -0.17637  -0.02217   0.14974   7.79005
>
> Coefficients:
>              Estimate Std. Error  t value Pr(>|t|)
> (Intercept) 0.8870369  0.0050646  175.144   <2e-16 ***
> lnC         0.0658922  0.0006549  100.614   <2e-16 ***
> lnM         0.8027478  0.0006549 1225.764   <2e-16 ***
> lnL         0.0173622  0.0004025   43.138   <2e-16 ***
> lnE         0.0657710  0.0006745   97.516   <2e-16 ***
> ecoTRUE     0.0101649  0.0045892    2.215   0.0268 *
> innoTRUE    0.0945100  0.0030317   31.174   <2e-16 ***
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 0.294 on 146160 degrees of freedom
>   (407 observations deleted due to missingness)
> Multiple R-squared: 0.9705,     Adjusted R-squared: 0.9705
> F-statistic: 8.027e+05 on 6 and 146160 DF,  p-value: < 2.2e-16
>
> as we can see from the last row there are 146160 DF (407 deleted) this is
> ok!
>
>
Usually it is better to make a small example that demonstrates your
issue. I have no idea what these variables are, which makes it harder to
diagnose your problem.
>
>
> but when i want to use for example just one of the industry lets say just
> the 11th industry
> 1st:  i create the dummy for this industry such as;
>
>
>> ind1=(ind_2d==11)# so here the R supposed to consider just the 11th
>> industry!!
This makes no sense to me. What are you trying to do here? What is
ind_2d? Are you trying to subset your data.frame? If so, see ?subset,
or ?"["
>> abc<-lm(lnQ~lnC+lnM+lnL+lnE+eco+inno+ind, data=ds)
>> summary(abc)
>
> Call:
> lm(formula = lnQ ~ lnC + lnM + lnL + lnE + eco + inno + ind,
>     data = ds)
>
> Residuals:
>       Min        1Q    Median        3Q       Max
> -11.03392  -0.17647  -0.02301   0.14901   7.74957
>
> Coefficients:
>               Estimate Std. Error  t value Pr(>|t|)
> (Intercept)  0.8980397  0.0050451  178.001  < 2e-16 ***
> lnC          0.0672255  0.0006523  103.065  < 2e-16 ***
> lnM          0.7990819  0.0006579 1214.596  < 2e-16 ***
> lnL          0.0171633  0.0004004   42.870  < 2e-16 ***
> lnE          0.0670030  0.0006716   99.770  < 2e-16 ***
> ecoTRUE      0.0162249  0.0045672    3.552 0.000382 ***
> innoTRUE     0.0966967  0.0030160   32.062  < 2e-16 ***
> indTRUE     -0.1251466  0.0031509  -39.717  < 2e-16 ***
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 0.2924 on 146159 degrees of freedom
>   (407 observations deleted due to missingness)
> Multiple R-squared: 0.9709,     Adjusted R-squared: 0.9709
> F-statistic: 6.957e+05 on 7 and 146159 DF,  p-value: < 2.2e-16
>
> but as we can see it again counted in all the industries! so the DF is
> 146159!!!
>
>
> So i just wonder, where do i made mistake, or there is no mistake at all,
> and i just misunderstood the DF issue?
I think the misunderstanding runs deeper than that. Try creating a
minimal example, and clearly stating a) what you are trying to
accomplish, b) what you tried, and c) what doesn't work as you expect.

Best,
Ista
>
> Any answer would be appreciated
> thanks in advance


-- 
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org

serdal ozusaglam

2010-May-03 15:43 UTC


[R] question about the degrees of freedom

> Hi Serdal,
> There is a lot of confusion here (how much is yours and how much is
> mine remains to be seen). See specific comments in line.

Also inline comments.

>
> On Mon, May 3, 2010 at 9:19 AM, serdal ozusaglam
> <saint-filth@hotmail.com> wrote:
>>
>> Dear R users,
>>
>>
>> I think i have a simple question which i want to explain by an  
>> example;
>>
>> i have several 2-digit industry codes that i want to use for  
>> conducting by-industry analysis but i think there is a problem with  
>> the degrees of freedom!
>>
>> for example, when i do my analysis without any 2-digit industry  
>> code, i got the following summary (i have 146574 observations in  
>> total):
>>> abc<-lm(lnQ~lnC+lnM+lnL+lnE+eco+inno, data=ds)
>>> summary(abc)
>>
>> Call:
>> lm(formula = lnQ ~ lnC + lnM + lnL + lnE + eco + inno, data = ds)
>>
>> Residuals:
>>      Min        1Q    Median        3Q       Max
>> -11.01340  -0.17637  -0.02217   0.14974   7.79005
>>
>> Coefficients:
>>             Estimate Std. Error  t value Pr(>|t|)
>> (Intercept) 0.8870369  0.0050646  175.144   <2e-16 ***
>> lnC         0.0658922  0.0006549  100.614   <2e-16 ***
>> lnM         0.8027478  0.0006549 1225.764   <2e-16 ***
>> lnL         0.0173622  0.0004025   43.138   <2e-16 ***
>> lnE         0.0657710  0.0006745   97.516   <2e-16 ***
>> ecoTRUE     0.0101649  0.0045892    2.215   0.0268 *
>> innoTRUE    0.0945100  0.0030317   31.174   <2e-16 ***
>> ---
>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>
>> Residual standard error: 0.294 on 146160 degrees of freedom
>>  (407 observations deleted due to missingness)
>> Multiple R-squared: 0.9705,     Adjusted R-squared: 0.9705
>> F-statistic: 8.027e+05 on 6 and 146160 DF,  p-value: < 2.2e-16
>>
>> as we can see from the last row there are 146160 DF (407 deleted)  
>> this is ok!
>>
>>
>
> Usually it is better to make a small example that demonstrates your
> issue. I have no idea what these variable are which makes it harder to
> diagnose your problem.
>
>>
>>
>> but when i want to use for example just one of the industry lets  
>> say just the 11th industry
>> 1st:  i create the dummy for this industry such as;
>>
>>
>>> ind1=(ind_2d==11)# so here the R supposed to consider just the  
>>> 11th industry!!
>
> This makes no sense to me. What are you trying to do here? What is
> ind_2d? Are you trying to subset your data.frame? If so, see ?subset,
> or ?"[" 
Serdal is just making a logical indicator variable.
 >
>>> abc<-lm(lnQ~lnC+lnM+lnL+lnE+eco+inno+ind, data=ds)
>>> summary(abc)
>>
>> Call:
>> lm(formula = lnQ ~ lnC + lnM + lnL + lnE + eco + inno + ind,
>>    data = ds)
>>
>> Residuals:
>>      Min        1Q    Median        3Q       Max
>> -11.03392  -0.17647  -0.02301   0.14901   7.74957
>>
>> Coefficients:
>>              Estimate Std. Error  t value Pr(>|t|)
>> (Intercept)  0.8980397  0.0050451  178.001  < 2e-16 ***
>> lnC          0.0672255  0.0006523  103.065  < 2e-16 ***
>> lnM          0.7990819  0.0006579 1214.596  < 2e-16 ***
>> lnL          0.0171633  0.0004004   42.870  < 2e-16 ***
>> lnE          0.0670030  0.0006716   99.770  < 2e-16 ***
>> ecoTRUE      0.0162249  0.0045672    3.552 0.000382 ***
>> innoTRUE     0.0966967  0.0030160   32.062  < 2e-16 ***
>> indTRUE     -0.1251466  0.0031509  -39.717  < 2e-16 ***
>> ---
>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>
>> Residual standard error: 0.2924 on 146159 degrees of freedom
>>  (407 observations deleted due to missingness)
>> Multiple R-squared: 0.9709,     Adjusted R-squared: 0.9709
>> F-statistic: 6.957e+05 on 7 and 146159 DF,  p-value: < 2.2e-16
>>
>> but as we can see it again counted in all the industries! so the DF  
>> is 146159!!!
>>
>>
>> So i just wonder, where do i made mistake, or there is no mistake  
>> at all, and i just misunderstood the DF issue?
>
> I think the misunderstanding runs deeper than that. Try creating a
> minimal example, and clearly stating a) what you are trying to
> accomplish, b) what you tried, and c) what doesn't work as you expect. 
I, too, was puzzled by the OP's reaction. Serdal added a single
logical predictor variable to an existing model that already had two
such variables, and as a result his degrees of freedom in the model
increased by one and the degrees of freedom in the residuals decreased
by one. Where is the problem? And why wasn't this question posed even
earlier, at the point of addition of the "eco" and "inno" variables? He
perhaps was expecting that the degrees of freedom in the model would
increase by the number of records for which "ind" is TRUE, but that is
not the way ordinary regression works. Perhaps he
should do some reading on mixed effects modeling? Or perhaps that is
what his professor or supervisor is hoping he will learn by assigning
this task? Or perhaps he needs to learn to use the anova() function?
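
To make the degrees-of-freedom bookkeeping above concrete, here is a minimal sketch on simulated data. The data frame `d` and its columns are made up for illustration and are not the poster's data; only the final arithmetic uses the numbers reported in the thread. It shows that adding one logical predictor costs one residual degree of freedom and leaves the number of observations unchanged.

```r
# Simulated data; all names here are hypothetical, not the poster's variables.
set.seed(1)
n <- 1000
d <- data.frame(
  x    = rnorm(n),
  eco  = runif(n) < 0.3,   # logical predictor
  inno = runif(n) < 0.5,   # logical predictor
  ind  = runif(n) < 0.1    # logical indicator for one industry
)
d$y <- 1 + 0.5 * d$x + 0.1 * d$eco + 0.2 * d$inno - 0.1 * d$ind +
  rnorm(n, sd = 0.3)

fit0 <- lm(y ~ x + eco + inno, data = d)        # 4 coefficients
fit1 <- lm(y ~ x + eco + inno + ind, data = d)  # 5 coefficients

df.residual(fit0)  # n - 4
df.residual(fit1)  # n - 5: one less, because one coefficient was added
nobs(fit1)         # still n: every row is used, whatever the value of ind

# The same bookkeeping with the numbers from the thread:
# total observations - rows dropped for missingness - coefficients
146574 - 407 - 7   # 146160, the first model
146574 - 407 - 8   # 146159, the model with the extra indicator

# anova() compares the nested models and reports the 1-df difference.
anova(fit0, fit1)
```

The residual degrees of freedom depend on how many coefficients are estimated, not on how many rows have the indicator set to TRUE.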
 


------

I think there is much more of a problem with my regression than I thought.

Dear David, I just want to ask a question to clarify something about dummy
variables, with a small example.
Let's say RD is the dummy variable that I created to mark whether a firm has
reported an R&D expenditure:
RD=(researchexp>0)
and let's say I know that 100 firms out of 1000 have reported such an
expenditure, and I basically want to see the effect of this variable on the log
of total output (lnQ)

by writing this command:
abc<-lm(lnQ~RD, data=mydata)
Isn't R supposed to consider just the 100 firms that reported R&D
expenditure?
If so, why doesn't it? (I tried just now and it doesn't.)
If it is not supposed to do this, which way should I follow, or where am I
making a mistake?

Looking forward to your response.
Thank you!
serdal
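
A possible way to see the difference between the two readings of the question, sketched on simulated data: `mydata`, `researchexp`, `RD` and `lnQ` reuse the names from the message above, but the values are invented. A dummy used as a predictor keeps all firms in the fit and estimates the difference between the two groups; restricting the fit to the R&D firms is a separate step, done here with the `subset` argument of `lm()`.

```r
# Simulated stand-in for the poster's data: 100 of 1000 firms report R&D.
set.seed(2)
n <- 1000
mydata <- data.frame(researchexp = c(runif(100, 1, 10), rep(0, 900)))
mydata$RD  <- mydata$researchexp > 0
mydata$lnQ <- 2 + 0.3 * mydata$RD + rnorm(n, sd = 0.5)

# RD as a predictor: all 1000 firms are used. The RDTRUE coefficient is
# the difference in mean lnQ between firms with and without R&D.
fit_all <- lm(lnQ ~ RD, data = mydata)
nobs(fit_all)         # 1000
df.residual(fit_all)  # 998 = 1000 - 2 coefficients
coef(fit_all)["RDTRUE"]

# Only the R&D firms: restrict the rows with subset=. RD is constant
# within this subset, so it cannot also be a predictor; this fit is
# intercept-only.
fit_rd <- lm(lnQ ~ 1, data = mydata, subset = RD)
nobs(fit_rd)          # 100
df.residual(fit_rd)   # 99
```

For the by-industry analysis in the original post, the analogous call would presumably be the first model with `subset = ind_2d == 11` added, assuming `ind_2d` is a column of `ds`.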


>
> Best,
> Ista
>-- 
David Winsemius, MD
West Hartford, CT
  		 	   		  
