Sorry.

Let me try again then.

I am trying to find "significant" predictors from a list of about 44
independent variables. So I started with all 44 variables, ran
drop1(sep22lr, test="Chisq"), and then dropped the variable with the
highest p value from the run. Then I reran the drop1.
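For concreteness, the fit behind that call was presumably set up along
these lines (a sketch only; the data frame name "dat" is a hypothetical
stand-in, not something given in this thread):

    sep22lr <- glm(MIN_Mstocked ~ ORG_CODE + BECLBL08 + PEM_SScat +
                     SOIL_MST_1 + SOIL_NUTR + cE + cN,  # ... all 44 predictors
                   family = binomial, data = dat)       # "dat" is hypothetical
    drop1(sep22lr, test = "Chisq")  # LRT p value for dropping each term singly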
Model:
MIN_Mstocked ~ ORG_CODE + BECLBL08 + PEM_SScat + SOIL_MST_1 +
SOIL_NUTR + cE + cN + cELEV + cDIAM_125 + cCRCLS + cCULM_125 +
cSPH + cAGE + cVRI_NONPINE + cVRI_nonpineCFR + cVRI_BLEAF +
cvol_125 + cstrDST_SW + cwaterDST_SW + cSEEDSRCE_SW + cMAT +
cMWMT + cMCMT + cTD + cMAP + cMSP + cAHM + cSHM + cMATMAP +
cddless0 + cddless18 + cddgrtr0 + cddgrtr18 + cNFFD + cbFFP +
ceFFP + cPAS + cDD5_100 + cEXT_Cold + cS_INDX
Df Deviance AIC LRT Pr(Chi)
<none> 814.21 938.21
ORG_CODE 4 824.97 940.97 10.76 0.0294100 *
BECLBL08 9 845.61 951.61 31.41 0.0002519 ***
PEM_SScat 10 829.11 933.11 14.90 0.1357580
SOIL_MST_1 1 814.63 936.63 0.43 0.5135094
SOIL_NUTR 2 818.49 938.49 4.28 0.1175411
cE 1 814.37 936.37 0.16 0.6886085
cN 1 814.40 936.40 0.20 0.6566765
cELEV 1 814.35 936.35 0.14 0.7044864
cDIAM_125 1 817.98 939.98 3.78 0.0519554 .
cCRCLS 1 819.32 941.32 5.11 0.0237598 *
cCULM_125 1 816.17 938.17 1.97 0.1606846
cSPH 1 816.62 938.62 2.41 0.1204141
cAGE 1 815.92 937.92 1.72 0.1902314
cVRI_NONPINE 1 818.04 940.04 3.84 0.0501149 .
cVRI_nonpineCFR 1 821.17 943.17 6.96 0.0083197 **
cVRI_BLEAF 1 818.78 940.78 4.58 0.0324286 *
cvol_125 1 814.67 936.67 0.47 0.4949495
cstrDST_SW 1 814.63 936.63 0.42 0.5169757
cwaterDST_SW 1 814.75 936.75 0.55 0.4592643
cSEEDSRCE_SW 1 817.73 939.73 3.53 0.0604234 .
cMAT 1 814.27 936.27 0.06 0.8002333
cMWMT 1 814.49 936.49 0.28 0.5942246
cMCMT 1 819.39 941.39 5.18 0.0228425 *
cTD 1 816.20 938.20 1.99 0.1580332
cMAP 1 814.25 936.25 0.04 0.8386626
cMSP 1 818.41 940.41 4.20 0.0404411 *
cAHM 1 815.66 937.66 1.46 0.2276311
cSHM 1 819.95 941.95 5.75 0.0165227 *
cMATMAP 1 814.91 936.91 0.71 0.4001878
cddless0 1 818.04 940.04 3.83 0.0502153 .
cddless18 1 817.81 939.81 3.60 0.0576931 .
cddgrtr0 1 816.64 938.64 2.44 0.1184235
cddgrtr18 1 815.77 937.77 1.57 0.2104958
cNFFD 1 815.38 937.38 1.18 0.2782582
cbFFP 1 814.39 936.39 0.18 0.6677481
ceFFP 1 820.22 942.22 6.01 0.0141863 *
cPAS 1 814.21 936.21 0.01 0.9347654
cDD5_100 1 814.79 936.79 0.58 0.4447531
cEXT_Cold 1 816.99 938.99 2.78 0.0954512 .
cS_INDX 1 815.21 937.21 1.01 0.3157208
I then systematically reran drop1, removing the variable with the
highest p value (least significant) from each resulting model, until
only variables significant at the 0.10 level remained.
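Written out mechanically, that manual loop amounts to the sketch below
(shown only to make the procedure concrete; the replies further down
explain why p values obtained this way shouldn't be trusted):

    fit <- sep22lr
    repeat {
      d1 <- drop1(fit, test = "Chisq")
      p  <- d1[[ncol(d1)]][-1]                 # Pr(Chi) column, <none> row skipped
      if (all(p < 0.10, na.rm = TRUE)) break   # stop once everything is "significant"
      worst <- rownames(d1)[-1][which.max(p)]  # least significant remaining term
      fit   <- update(fit, as.formula(paste(". ~ . -", worst)))
    }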
Model:
MIN_Mstocked ~ ORG_CODE + BECLBL08 + PEM_SScat + SOIL_NUTR +
cSEEDSRCE_SW + cMSP + ceFFP + cEXT_Cold
Df Deviance AIC LRT Pr(Chi)
<none> 884.20 946.20
ORG_CODE 4 916.38 970.38 32.18 1.757e-06 ***
BECLBL08 9 940.66 984.66 56.46 6.418e-09 ***
PEM_SScat 11 906.20 946.20 22.00 0.0243795 *
SOIL_NUTR 2 894.19 952.19 9.99 0.0067557 **
cSEEDSRCE_SW 1 894.41 954.41 10.21 0.0013983 **
cMSP 1 896.97 956.97 12.77 0.0003516 ***
ceFFP 1 928.50 988.50 44.30 2.812e-11 ***
cEXT_Cold 1 923.35 983.35 39.15 3.921e-10 ***
I didn't create any kind of dummy or factor variables for my categorical
data (at least, not on purpose).
With the remaining 8 variables, I ran a logistic regression (glm)
against my dependent variable (MIN_Mstocked). When I do a summary of
the glm, I get the usual table of estimate, std. error, z value, and
Pr(>|z|)... BUT some coefficients are missing from the list. None of
the categorical variables is complete: some are missing only one
category, while others are missing 4 or 5 categories.
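The final fit and summary were presumably along these lines (again a
sketch, with "dat" standing in for the real data frame):

    fit8 <- glm(MIN_Mstocked ~ ORG_CODE + BECLBL08 + PEM_SScat + SOIL_NUTR +
                  cSEEDSRCE_SW + cMSP + ceFFP + cEXT_Cold,
                family = binomial, data = dat)
    summary(fit8)  # produces a coefficient table like the one below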
e.g.
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.324e+02 1.363e+03 -0.097 0.922611
ORG_CODE[T.DLA] -1.504e+01 1.363e+03 -0.011 0.991192
ORG_CODE[T.DMO] -1.494e+01 1.363e+03 -0.011 0.991253
ORG_CODE[T.DPG] -1.766e+01 1.363e+03 -0.013 0.989658
ORG_CODE[T.DVA] -1.841e+01 1.363e+03 -0.014 0.989220
BECLBL08[T.SBS dw 2] -6.733e-01 5.903e-01 -1.141 0.254033
BECLBL08[T.SBS dw 3] -1.094e+00 5.714e-01 -1.914 0.055586 .
BECLBL08[T.SBS mc 2] 1.573e-01 5.004e-01 0.314 0.753211
BECLBL08[T.SBS mc 3] 1.402e+00 5.824e-01 2.408 0.016043 *
BECLBL08[T.SBS mk 1] -2.388e+00 7.529e-01 -3.172 0.001514 **
BECLBL08[T.SBS mw] -1.672e+01 1.393e+03 -0.012 0.990425
BECLBL08[T.SBS vk] -1.614e+01 1.243e+03 -0.013 0.989640
BECLBL08[T.SBS wk 1] -3.640e+00 8.174e-01 -4.453 8.48e-06 ***
BECLBL08[T.SBS wk 3] -1.838e+01 1.363e+03 -0.013 0.989240
PEM_SScat[T.B] -1.815e+01 3.956e+03 -0.005 0.996339
PEM_SScat[T.C] 1.998e-01 3.925e-01 0.509 0.610792
PEM_SScat[T.D] -2.314e-01 3.215e-01 -0.720 0.471621
PEM_SScat[T.E] 5.581e-01 3.433e-01 1.626 0.104020
PEM_SScat[T.F] -1.113e+00 5.782e-01 -1.926 0.054153 .
PEM_SScat[T.G] 1.780e-01 4.420e-01 0.403 0.687150
PEM_SScat[T.H] 1.670e+01 3.956e+03 0.004 0.996633
PEM_SScat[T.I] 2.751e-01 9.313e-01 0.295 0.767705
PEM_SScat[T.J] -2.623e-01 9.693e-01 -0.271 0.786649
PEM_SScat[T.K] -1.862e+01 3.956e+03 -0.005 0.996244
PEM_SScat[T.L] -1.661e+01 1.211e+03 -0.014 0.989056
SOIL_NUTR[T.C] -1.119e+00 3.781e-01 -2.960 0.003073 **
SOIL_NUTR[T.D] -7.912e-02 9.049e-01 -0.087 0.930320
cSEEDSRCE_SW -1.512e-03 4.930e-04 -3.066 0.002170 **
cMSP 1.808e-02 5.304e-03 3.409 0.000652 ***
ceFFP 2.889e-01 4.662e-02 6.196 5.80e-10 ***
cEXT_Cold -1.880e+00 3.330e-01 -5.647 1.63e-08 ***
There should be a PEM_SScat[T.A]; it is the most prevalent level of
that variable.

ORG_CODE is missing more than 6 categories from the list.

SOIL_NUTR should have a [T.B].
Does that help?
-----Original Message-----
From: Kevin E. Thorpe [mailto:kevin.thorpe at utoronto.ca]
Sent: Saturday, September 27, 2008 6:21 AM
To: Darin Brooks
Cc: r-help at r-project.org
Subject: Re: [R] logistic regression
Darin Brooks wrote:
> Good afternoon
>
> I have what I hope is a simple logistic regression issue.
>
> I started with 44 independent variables and then used the drop1,
> test="chisq" to reduce the list to 8 significant independent
variables.
>
> drop1(sep22lr, test="Chisq") and wound up with this model:
>
> Model: MIN_Mstocked ~ ORG_CODE + BECLBL08 + PEM_SScat + SOIL_NUTR +
> cSEEDSRCE_SW + cMSP + ceFFP + cEXT_Cold
>
> 4 of the remaining variables are categorical and 4 are continuous.
>
> However, when I run a glm and then a summary on the glm - some of the
> categorical data is missing from the output.
>
> The PEM_SScat is missing only one variable ... the BECLBL08 is missing
> several variables ... the ORG_CODE is missing 4 .. and the SOIL_NUTR
> is missing 1 variable.
>
> It seems arbitrary as to the number of variables missing. Is there
> something wrong with my syntax in calling the logistic model? Am I not
> understanding the inputs correctly?
>
> Any help would be appreciated.
>
I'm not sure I fully understand your question. It sounds like you created
your own dummy variables for your categorical variables. Did you? Or did
you use factor variables for your categorical variables?
If the latter, then I REALLY don't understand your question.
Kevin
--
Kevin E. Thorpe
Biostatistician/Trialist, Knowledge Translation Program Assistant Professor,
Dalla Lana School of Public Health University of Toronto
email: kevin.thorpe at utoronto.ca  Tel: 416.864.5776  Fax: 416.864.6057
Kevin E. Thorpe wrote:

Darin Brooks wrote:
> Does that help?

Yes. I don't see a problem, however. First, your variables are
"factors", which means there will be one fewer coefficient than
categories: one level is a reference group, which probably explains
PEM_SScat and SOIL_NUTR each "missing" one coefficient.

For ORG_CODE, there were 4 DF in the starting model and 4 DF in the
final model, with 4 coefficients. So the 6 missing categories appear to
have been missing from the start. What do you expect for ORG_CODE?
What does, say, summary(ORG_CODE) give you?

Are you aware of the dangers of stepwise model fitting? It is a
commonly recurring theme on this list.

Kevin
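To make Kevin's summary(ORG_CODE) check concrete (a sketch; "dat" is
again a hypothetical data frame name):

    summary(dat$ORG_CODE)  # counts per level; a zero count means no coefficient
    levels(dat$PEM_SScat)  # the first level ("A") is the reference category
                           # under treatment contrasts, so it gets no row

    dat$PEM_SScat <- relevel(dat$PEM_SScat, ref = "B")  # pick a different baseline
    dat$ORG_CODE  <- droplevels(dat$ORG_CODE)           # discard empty levels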
Frank E Harrell Jr wrote:

Darin Brooks wrote:
> I am trying to find "significant" predictors from a list of about 44
> independent variables.

Why? What is wrong with insignificant predictors?

> With the remaining 8 variables, I tried to run a logistic regression
> (glm) against my dependent variable (MIN_Mstocked).

Estimates from this model (and especially standard errors and P-values)
will be invalid because they do not take into account the stepwise
procedure above that was used to torture the data until they confessed.

Frank

--
Frank E Harrell Jr
Professor and Chair, Department of Biostatistics
School of Medicine, Vanderbilt University
Dieter Menne wrote:

Frank E Harrell Jr <f.harrell <at> vanderbilt.edu> writes:
> Estimates from this model (and especially standard errors and
> P-values) will be invalid because they do not take into account the
> stepwise procedure above that was used to torture the data until they
> confessed.

Please book this as a fortune.

Dieter
On 27-Sep-08 21:45:23, Dieter Menne wrote:
> Please book this as a fortune.

Seconded!
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Date: 27-Sep-08  Time: 23:30:19
--------------------------------------------------------------------
Darin Brooks wrote:

Glad you were amused.

I assume that "booking this as a fortune" means that this was an
idiotic way to model the data?

MARS? Boosted Regression Trees? Any of these a better choice to
extract significant predictors (from a list of about 44) for a measured
dependent variable?
Frank E Harrell Jr wrote:

Darin Brooks wrote:
> Glad you were amused.
>
> I assume that "booking this as a fortune" means that this was an
> idiotic way to model the data?

Dieter was nominating this for the "fortunes" package in R. (Thanks
Dieter)

> MARS? Boosted Regression Trees? Any of these a better choice to
> extract significant predictors (from a list of about 44) for a
> measured dependent variable?

Or use a data reduction method (principal components, variable
clustering, etc.) or redundancy analysis (to remove individual
predictors before examining associations with Y), or fit the full model
using penalized maximum likelihood estimation. lasso and lasso-like
methods are also worth pursuing.

Cheers
Frank
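One way to pursue the lasso suggestion is the glmnet package (a sketch;
"dat" is a hypothetical data frame assumed to hold only the response
and the 44 predictors):

    library(glmnet)
    X <- model.matrix(MIN_Mstocked ~ ., data = dat)[, -1]  # expand factors; drop intercept
    y <- dat$MIN_Mstocked
    cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)  # alpha = 1 gives the lasso
    coef(cvfit, s = "lambda.1se")  # coefficients at a conservatively chosen penalty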
--- On Sat, 9/27/08, Dieter Menne <dieter.menne at menne-biomed.de> wrote:
> Please book this as a fortune.

Here, here! I vote yes.
Greg Snow wrote:

Frank E Harrell Jr wrote:
> Or use a data reduction method (principal components, variable
> clustering, etc.) or redundancy analysis (to remove individual
> predictors before examining associations with Y), or fit the full
> model using penalized maximum likelihood estimation. lasso and
> lasso-like methods are also worth pursuing.

Frank (and any others who want to share an opinion):

What are your thoughts on model averaging as part of the above list?

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111
Frank E Harrell Jr wrote:

Greg Snow wrote:
> What are your thoughts on model averaging as part of the above list?

Model averaging has good performance but no advantage over fitting a
single complex model using penalized maximum likelihood estimation.

Frank
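Frank's penalized maximum likelihood suggestion can be tried with his
rms package (a sketch; "dat" is again hypothetical, and pentrace is
used here to choose the penalty):

    library(rms)
    full <- lrm(MIN_Mstocked ~ ORG_CODE + BECLBL08 + PEM_SScat +
                  SOIL_MST_1 + SOIL_NUTR + cE + cN,  # ... all 44 predictors
                data = dat, x = TRUE, y = TRUE)
    pens <- pentrace(full, seq(0, 20, by = 2))    # scan candidate penalties
    pen  <- update(full, penalty = pens$penalty)  # refit at the chosen penalty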