I have a huge data set with thousands of variables and one binary variable. I know that most of the variables are correlated and are not good predictors... but...

It is very hard to start modeling with such a huge dataset. What would be your suggestion? How do I make a first cut... how do I eliminate most of the variables without ignoring potential interactions... for example, maybe variable A is not a good predictor and variable B is not a good predictor either, but maybe A and B together are a good predictor...

Any suggestion is welcome
The only solution I can see is fitting all possible 2-factor models with the interaction enabled and then assessing whether the interaction term is significant... any more ideas?
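A minimal sketch of that pairwise screen, assuming the data sit in a data frame dat with the binary outcome in a column y (both names are placeholders, not from the original post): each pair of predictors is fit with and without its interaction, and the two logistic models are compared with a likelihood-ratio test. With thousands of variables the number of pairs is enormous and the unadjusted p-values suffer from severe multiplicity, so this only illustrates the idea.

    ## Sketch only: dat and y are placeholder names.
    vars  <- setdiff(names(dat), "y")
    pairs <- combn(vars, 2)                 # every 2-variable combination
    pval  <- apply(pairs, 2, function(v) {
      m.main <- glm(reformulate(v, response = "y"),
                    data = dat, family = binomial)
      m.int  <- glm(reformulate(paste(v, collapse = " * "), response = "y"),
                    data = dat, family = binomial)
      ## likelihood-ratio test for the interaction term
      anova(m.main, m.int, test = "Chisq")[2, "Pr(>Chi)"]
    })
    screen <- data.frame(t(pairs), p.interaction = pval)
    head(screen[order(screen$p.interaction), ], 20)  # smallest (unadjusted) p-values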
milicic.marko,

I think you could start with rpart(binary_variable ~ .). This will show you a set of variables with which to start a model, and starting cutoff values for the continuous variables.

--
Bernardo Rangel Tura, M.D, MPH, Ph.D
National Institute of Cardiology
Brazil
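A minimal sketch of this suggestion, again with the placeholder names dat and y; the splits in the fitted tree are what would be read off as candidate variables and cutoffs.

    ## Sketch only: dat and y are placeholder names.
    library(rpart)

    dat$y <- factor(dat$y)          # factor response gives a classification tree
    tree  <- rpart(y ~ ., data = dat)
    printcp(tree)                   # variables actually used and the complexity table
    plot(tree); text(tree)          # the displayed splits suggest candidate cutoffs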
Bernardo Rangel Tura wrote:
> I think you could start with rpart(binary_variable ~ .). This will show you
> a set of variables with which to start a model, and starting cutoff values
> for the continuous variables.

I cannot imagine a worse way to formulate a regression model. Reasons include:

1. Results of recursive partitioning are not trustworthy unless the sample size exceeds 50,000 or the signal-to-noise ratio is extremely high.

2. The type I error of tests from the final regression model will be extraordinarily inflated.

3. False interactions will appear in the model.

4. The cutoffs so chosen will not replicate and in effect assume that covariate effects are discontinuous and piecewise flat. The use of cutoffs results in a huge loss of information and power and makes the analysis arbitrary and impossible to interpret (e.g., a high covariate value : low covariate value odds ratio or mean difference is a complex function of all the covariate values in the sample).

5. The model will not validate in new data.

Frank

--
Frank E Harrell Jr, Professor and Chair
Department of Biostatistics, School of Medicine
Vanderbilt University
Professor Frank,

Thank you for your explanation.

Well, if my first idea is wrong, what is your opinion of the following approach?

1- Run a PCA on the data, excluding the binary variable
2- Put the principal components into a logistic model
3- Afterwards, map the principal components back to the original variables (only if that is interesting for milicic.marko)

If this approach is wrong too, what is your approach?

--
Bernardo Rangel Tura, M.D, MPH, Ph.D
National Institute of Cardiology
Brazil
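A rough sketch of those three steps, with the same placeholder names dat and y, assuming all predictors are numeric; the number of components kept here is arbitrary.

    ## Sketch only: dat, y and the choice of 10 components are placeholders.
    X   <- scale(dat[, setdiff(names(dat), "y")])       # step 1: PCA on predictors only
    pca <- prcomp(X)
    k   <- 10                                            # components to keep (arbitrary)
    fit <- glm(dat$y ~ pca$x[, 1:k], family = binomial)  # step 2: PCs in a logistic model
    summary(fit)
    pca$rotation[, 1:k]   # step 3: loadings relate each PC back to the original variables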
Hi Bernardo,

If there is a large number of potential predictors and no previous knowledge to guide the modeling, principal components (PC) is often an excellent way to proceed. The first few PCs can be put into the model. The result is not always very interpretable, but you can "decode" the PCs by using stepwise regression or recursive partitioning (which are safer in this context because the stepwise methods are not exposed to the Y variable). You can also add PCs in a stepwise fashion in the pre-specified order of variance explained.

There are many variations on this theme, including nonlinear principal components (e.g., the transcan function in the Hmisc package), which may explain more variance of the predictors.

Frank

--
Frank E Harrell Jr, Professor and Chair
Department of Biostatistics, School of Medicine
Vanderbilt University
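Continuing the earlier sketch (placeholder objects X, the numeric predictor matrix, and pca <- prcomp(X)), one way to "decode" a component with recursive partitioning; the binary outcome never enters, so this step is not exposed to Y. The transcan call is commented out and its interface is an assumption; check ?transcan.

    ## Sketch only, reusing the placeholder objects X and pca from the previous sketch.
    library(rpart)

    dX     <- as.data.frame(X)
    dX$PC1 <- pca$x[, 1]                # first principal component as the response
    tree   <- rpart(PC1 ~ ., data = dX)
    printcp(tree)                       # which original variables drive PC1

    ## Nonlinear principal components via Hmisc::transcan (interface assumed):
    # library(Hmisc)
    # tc <- transcan(X, transformed = TRUE)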
It would not be possible to answer your original question until you specify your goal.

Is it to develop a model with external validity that will generalize to new data? (You are not likely to succeed if you are starting with a "boil the ocean" approach with 44,000+ covariates and millions of records.) This is the point Prof. Harrell is making. Or is it to reduce a large dataset to a tractable predictor formula that only interpolates your dataset?

If the former, you will need external modeling information to select the "wheat from the chaff" in your excessive predictor set.

Assuming it is the latter, then almost any approach that ends up with a tractable model (one that has no meaning other than interpolation of this specific dataset) will be useful. For this, regression trees or even stepwise regression would work. The algorithm must be very simple and computationally efficient. This is the area of data mining approaches.

I would suggest you start by looking at covariate patterns to find out where the scarcity lies. These will end up being high-leverage data points.

Another place to start is common sense: thousands of covariates cannot all contain independent information of value. Try to cluster them and pick the best representative from each cluster based on expert knowledge. You may solve your problem quickly that way.
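A hedged sketch of that clustering idea in base R, assuming a numeric predictor matrix X (a placeholder name): variables whose absolute pairwise correlations are high land in the same cluster, and an expert then picks one representative per cluster. Hmisc::varclus offers a similar, more polished display.

    ## Sketch only: X is a placeholder numeric matrix of predictors; k = 50 is arbitrary.
    d   <- as.dist(1 - abs(cor(X, use = "pairwise.complete.obs")))  # dissimilarity = 1 - |r|
    hc  <- hclust(d, method = "average")
    plot(hc)                          # dendrogram of the covariates
    grp <- cutree(hc, k = 50)         # cut into 50 clusters (arbitrary)
    head(split(colnames(X), grp), 3)  # inspect the first few clusters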
===============================================================
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS    e-mail: ral at lcfltd.com
Least Cost Formulations, Ltd.              URL: http://lcfltd.com/
824 Timberlake Drive                       Tel: 757-467-0954
Virginia Beach, VA 23464-3239              Fax: 757-467-2947
"Vere scire est per causas scire"
================================================================