Hi There,

I am trying to fit a logit model to some data in a CSV file in R. Here is my code:

Prepared_Data = read.csv("Prepared_Data.csv", header=TRUE)
Prepared_Data
attach(Prepared_Data)
lrfit <- glm(C3~A1*B2*D4*E5, family = binomial)
anova(lrfit, test="Chisq")
write.csv(anova(lrfit, test="Chisq"), file="CWModelA.csv")
shell.exec("CWModelA.csv")

I am unsure how many methods there are of choosing a suitable model; however, I was hoping to fit the full/saturated model and keep only the significant terms as my final model. My first question, therefore: is there a better way to fit a model to some data? Is there a function or way of getting R to print the optimum model?

My CSV file, when opened in Excel, contains approximately 3500 rows x 27 columns. I can only seem to run anova() on the saturated/full model including the first four columns/factors. If I take any more into consideration (e.g. C3~A1*B2*D4*E5*F6*G7), R stops responding and I have to force quit. Why is this? How can I get around it, as I need to include all 27 columns?

Any advice or constructive criticism is appreciated - even if it means I have to start from scratch.

Many Thanks,

AJC
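A side note on the code above: attach() is easy to get wrong (stale copies of the columns stay on the search path), and the same fit can be written with glm()'s data argument instead. A minimal sketch, assuming the CSV really has columns named C3, A1, B2, D4 and E5 as in the question:

## sketch: same fit without attach(); assumes the column names above
Prepared_Data <- read.csv("Prepared_Data.csv", header = TRUE)
lrfit <- glm(C3 ~ A1 * B2 * D4 * E5, family = binomial,
             data = Prepared_Data)   ## data= replaces attach()
anova(lrfit, test = "Chisq")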
Abigail Clifton <abigailclifton <at> me.com> writes:

> I am trying to fit a logit model to some data in a CSV file in R.

It would be helpful to link back to your previous question:
http://thread.gmane.org/gmane.comp.lang.r.general/259353

> Here is my code:
>
> Prepared_Data = read.csv("Prepared_Data.csv", header=TRUE)
> Prepared_Data
> attach(Prepared_Data)
> lrfit <- glm(C3~A1*B2*D4*E5, family = binomial)
> anova(lrfit, test="Chisq")
> write.csv(anova(lrfit, test="Chisq"), file="CWModelA.csv")
> shell.exec("CWModelA.csv")

This is still not a reproducible example, although it's a little closer. Did you read the "recommended reading" in my previous answer?

> I am unsure as to how many methods there are of choosing a suitable
> model,

Lots, and it depends very much on why you are doing the analysis in the first place. Are you (1) trying to find a good predictive model? (2) Looking for interesting patterns in the data? (3) Trying to test hypotheses about which predictors have a significant effect on the outcome? (4) Trying to partition the variance explained by different predictors?

> however, I was hoping to fit the full/saturated model and choose the
> significant terms only as my final model.

In general this is a poor choice for goal #1 above, not necessarily bad for #2, absolutely terrible for #3, and irrelevant for #4. I'm guessing you are interested in the best predictive model, since you mentioned something in your previous message about working out the probability of default on loan applications. I would say your best bet is to use penalized approaches: see the glmnet package, and library("sos"); findFn("lasso").

> My first question therefore: is there a better way to fit a model to
> some data? Is there a function or way of getting R to print the
> optimum model?
>
> My CSV file, when opened in Excel, contains approximately 3500 rows
> x 27 columns. I can only seem to run anova() on the saturated/full
> model including the first four columns/factors. If I take any more
> into consideration (e.g. C3~A1*B2*D4*E5*F6*G7), R stops responding
> and I have to force quit. Why is this? How can I get around it, as I
> need to include all 27 columns?

For continuous predictors, the number of parameters in the saturated model grows as 2^n; 2^27 is more than 134 million, so you probably don't want to do that. It's potentially even worse for categorical predictors, where the count is the product of the numbers of levels (prod(sapply(d, nlevels))): e.g. 3^27 > 7*10^12 for 27 three-level predictors. It's still not sufficiently clear why you're having a problem, because you haven't given enough information: in the example I gave in my previous answer I used 7 continuous variables, i.e. 128 parameters, without too much difficulty, but if you had (say) 5 levels for each of 7 predictors you would be trying to estimate 5^7 = 78125 parameters ...

Bottom line: it may simply not be reasonable to fit the saturated model. Hard-core machine-learning approaches (and *maybe* the penalized regression approaches) might be able to handle a few thousand predictors for n = 3500, but a model with tens of thousands of parameters (or more) feels somewhat crazy. (Someone else is welcome to tell me how this could be done.)

Ben Bolker
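To make those parameter counts concrete, here is a small sketch with simulated three-level factors (hypothetical data, not the poster's CSV): the column count of the saturated design matrix equals the product of the numbers of levels, exactly as described above.

## hypothetical example: n three-level factors, 100 rows
n <- 5
d <- as.data.frame(lapply(seq_len(n), function(i)
       factor(sample(c("a", "b", "c"), 100, replace = TRUE),
              levels = c("a", "b", "c"))))
names(d) <- paste0("f", seq_len(n))

prod(sapply(d, nlevels))             ## 3^5 = 243 parameters in the saturated model
ncol(model.matrix(~ .^5, data = d))  ## 243 columns, matching the product
2^27                                 ## > 134 million for 27 two-level predictors

With 27 such columns the same product dwarfs n = 3500 rows, which is why glm() on the full interaction grinds to a halt.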
On 12-03-30 12:40 PM, Clifton, Abigail J. wrote:

> Hi again!
>
> Thanks very much for the code, it appears to work! Finally, I want to
> extract the coefficients and tried coef(g1), which works. However,
> there only appear to be intercepts/coefficients for 'V22N' out of
> thousands of possibilities, which are all displayed as dots/NaN. Is
> there a way of getting more coefficients - perhaps by changing lambda
> or something like that? Is it also possible to print the final
> 'model'?

I'm afraid I'm out of time right now -- cc'ing to r-help in case someone else has the time and energy to help. All I can suggest is that you spend some time reading through the documentation for the package: start with help(package="glmnet"), browse through all the help pages, run the examples, etc. Unfortunately there is no general-purpose vignette for that package. An entire book on the subject is available online at http://www-stat.stanford.edu/~tibs/ElemStatLearn/ , but that won't provide quick answers ...

Ben Bolker

> Kind regards,
>
> Abigail
>
> -----Original Message-----
> From: Ben Bolker <bbolker at gmail.com>
> Sender: r-help-bounces at r-project.org
> Date: Fri, 30 Mar 2012 02:58:04
> To: <r-help at stat.math.ethz.ch>
> Subject: Re: [R] How to improve, at all, a simple GLM code
>
> Abigail Clifton <abigailclifton <at> me.com> writes:
>
>> I am wanting to find a good predictive model, yes. It's part of a
>> project, so if I have time after finding the model I may want to
>> find some patterns, but it's not a priority. I just want the model
>> for now (I need the coefficients above all).
>>
>> It's all categorical data; I categorised any continuous data before
>> I started trying to fit the glm.
>
> That's not necessarily a good idea (categorising often loses power
> relative to fitting something like an additive model), but OK.
>
>> I was unsure of how to get the csv file to you; however, I have
>> uploaded it and it should be available for download from here:
>> http://www.filedropper.com/prepareddata
>
> Here's how far I got:
>
> Prepared_Data <- na.omit(read.csv("Prepared_Data.csv", header=TRUE))
> pd <- Prepared_Data[,-3]  ## data minus response variable
>
> ## how many levels per variable?
> lev <- sapply(pd, function(x) length(unique(x)))
>
> ## total parameters for n variables
> par(las=1, bty="l")
> plot(cumprod(lev), log="y")
>
> library(Matrix)
> m <- sparse.model.matrix(~.^2, data=pd)  ## slower than model.matrix
> ncol(m)  ## 8352 columns (!!)
>
> library(glmnet)
> g1 <- glmnet(m, Prepared_Data$C3, family="binomial")
>
> This doesn't appear to work properly yet (I get funny values), but
> it's the direction I would go ...
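One possible way forward on the open coefficient question, as a sketch only (not tested on the thread's data; it assumes the m and Prepared_Data objects from the code above): the dots in the printed coef(g1) output are coefficients that the lasso penalty has shrunk exactly to zero, not NaN. coef() and predict() on a glmnet fit accept a penalty value s, and cv.glmnet() can choose it by cross-validation rather than guesswork:

library(glmnet)

## g1 as fitted above; s = 0.01 is an arbitrary illustrative lambda
coef(g1, s = 0.01)           ## coefficients at one point on the path

## let cross-validation pick lambda instead:
cv <- cv.glmnet(m, Prepared_Data$C3, family = "binomial")
coef(cv, s = "lambda.min")   ## sparse vector; '.' entries are exact zeros
predict(cv, newx = m, s = "lambda.min",
        type = "response")   ## fitted probabilities, the final 'model' in use

Smaller values of s penalize less and so leave more non-zero coefficients, which addresses the "only V22N appears" symptom described above.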