Hi there,

I am trying to fit a logit model to some data in a CSV file in R. Here is my code:

Prepared_Data = read.csv("Prepared_Data.csv", header=TRUE)
Prepared_Data
attach(Prepared_Data)
lrfit <- glm(C3 ~ A1*B2*D4*E5, family = binomial)
anova(lrfit, test="Chisq")
write.csv(anova(lrfit, test="Chisq"), file="CWModelA.csv")
shell.exec("CWModelA.csv")

I am unsure how many methods there are for choosing a suitable model; however, I was hoping to fit the full/saturated model and keep only the significant terms as my final model. My first question, therefore: is there a better way to fit a model to some data? Is there a function or way of getting R to print the optimum model?

My CSV file, when opened in Excel, contains approximately 3500 rows x 27 columns. I can only seem to run anova() on the saturated/full model including the first four columns/factors. If I take any more into consideration (e.g. C3 ~ A1*B2*D4*E5*F6*G7), R stops responding and I have to force quit. Why is this? How can I get around it, as I need to include all 27 columns?

Any advice or constructive criticism is appreciated, even if it means I have to start from scratch.

Many thanks,

AJC
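As an aside, the same fit can be written without attach(), which is a frequent source of subtle scoping bugs, by passing the data frame directly to glm(). A minimal sketch, assuming the column names above:

Prepared_Data <- read.csv("Prepared_Data.csv", header = TRUE)
lrfit <- glm(C3 ~ A1 * B2 * D4 * E5, family = binomial, data = Prepared_Data)
anova(lrfit, test = "Chisq")   ## sequential chi-squared tests, as before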
Abigail Clifton <abigailclifton <at> me.com> writes:

> I am trying to fit a logit model to some data in a CSV file in R.

It would be helpful to link back to your previous question:
http://thread.gmane.org/gmane.comp.lang.r.general/259353

> Here is my code:
>
> Prepared_Data = read.csv("Prepared_Data.csv", header=TRUE)
> Prepared_Data
> attach(Prepared_Data)
> lrfit <- glm(C3 ~ A1*B2*D4*E5, family = binomial)
> anova(lrfit, test="Chisq")
> write.csv(anova(lrfit, test="Chisq"), file="CWModelA.csv")
> shell.exec("CWModelA.csv")

This is still not a reproducible example, although it's a little closer. Did you read the "recommended reading" in my previous answer?

> I am unsure how many methods there are for choosing a suitable model,

Lots, and it depends very much on why you are doing the analysis in the first place. Are you (1) trying to find a good predictive model? (2) looking for interesting patterns in the data? (3) trying to test hypotheses about which predictors have a significant effect on the outcome? (4) trying to partition the variance explained by different predictors?

> however, I was hoping to fit the full/saturated model and keep only
> the significant terms as my final model.

In general this is a poor choice for goal #1 above, not necessarily bad for #2, absolutely terrible for #3, and irrelevant for #4. I'm guessing you are interested in the best predictive model, since you mentioned something in your previous message about working out the probability of default on loan applications. I would say your best bet is a penalized approach (see the glmnet package, and library("sos"); findFn("lasso")).

> My first question, therefore: is there a better way to fit a model
> to some data? Is there a function or way of getting R to print the
> optimum model?
>
> My CSV file, when opened in Excel, contains approximately 3500 rows
> x 27 columns. I can only seem to run anova() on the saturated/full
> model including the first four columns/factors. If I take any more
> into consideration (e.g. C3 ~ A1*B2*D4*E5*F6*G7), R stops
> responding and I have to force quit. Why is this? How can I get
> around it, as I need to include all 27 columns?

For continuous predictors, the number of parameters in the saturated model grows as 2^n; 2^27 is more than 134 million, so you probably don't want to do that. It's potentially even worse for categorical predictors, where the count is the product of the numbers of levels, e.g. 3^27 > 7*10^12 for 27 three-level predictors. It's still not clear why you're running into trouble, because you haven't given enough information: in the example in my previous answer I used 7 continuous variables, i.e. 128 parameters, without much difficulty, but if you had (say) 5 levels for each of 7 predictors, you would be trying to estimate 5^7 = 78125 parameters ...

Bottom line: it may simply not be reasonable to fit the saturated model. Hard-core machine-learning approaches (and *maybe* the penalized regression approaches) might be able to handle a few thousand predictors for n = 3500, but a model with tens of thousands of parameters (or more) feels somewhat crazy. (Someone else is welcome to tell me how this could be done.)

Ben Bolker
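To put numbers on that growth, a minimal sketch (the level counts here are hypothetical, not taken from the actual file):

lev <- rep(3, 27)  ## suppose all 27 predictors were three-level factors
prod(lev)          ## 3^27, about 7.6e12 saturated-model parameters
2^27               ## about 1.3e8 terms if all 27 predictors were continuous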
On 12-03-30 12:40 PM, Clifton, Abigail J. wrote:

> Hi again!
>
> Thanks very much for the code, it appears to work! Finally, I want
> to extract the coefficients and tried coef(g1), which works.
> However, there only appear to be intercepts/coefficients for 'V22N'
> out of thousands of possibilities; the rest are all displayed as
> dots/NaN. Is there a way of getting more coefficients - perhaps by
> changing lambda or something like that? Is it also possible to
> print the final 'model'?

I'm afraid I'm out of time right now -- cc'ing to r-help in case someone else has the time and energy to help. All I can suggest is that you spend some time reading through the documentation for the package: start with help(package="glmnet"), browse through all the help pages, run the examples, etc. Unfortunately there is no general-purpose vignette for that package ... an entire book on the subject is available online at http://www-stat.stanford.edu/~tibs/ElemStatLearn/ , but that won't provide quick answers ...

Ben Bolker

> Kind regards,
>
> Abigail
>
> -----Original Message-----
> From: Ben Bolker <bbolker at gmail.com>
> Sender: r-help-bounces at r-project.org
> Date: Fri, 30 Mar 2012 02:58:04
> To: <r-help at stat.math.ethz.ch>
> Subject: Re: [R] How to improve, at all, a simple GLM code
>
> Abigail Clifton <abigailclifton <at> me.com> writes:
>
>> I am wanting to find a good predictive model, yes. It's part of a
>> project, so if I have time after finding the model I may want to
>> look for some patterns, but that's not a priority. I just want the
>> model for now (I need the coefficients above all).
>>
>> It's all categorical data; I categorised any continuous data
>> before I started trying to fit the glm.
>
> That's not necessarily a good idea (categorising often loses power
> relative to fitting something like an additive model), but OK.
>
>> I was unsure of how to get the csv file to you; however, I have
>> uploaded it and it should be available for download here:
>> http://www.filedropper.com/prepareddata
>
> Here's how far I got:
>
> Prepared_Data <- na.omit(read.csv("Prepared_Data.csv", header=TRUE))
> pd <- Prepared_Data[,-3]  ## data minus response variable
>
> ## how many levels per variable?
> lev <- sapply(pd, function(x) length(unique(x)))
>
> ## total parameters for n variables
> par(las=1, bty="l")
> plot(cumprod(lev), log="y")
>
> library(Matrix)
> m <- sparse.model.matrix(~.^2, data=pd)  ## slower than model.matrix
> ncol(m)  ## 8352 columns (!!)
>
> library(glmnet)
> g1 <- glmnet(m, Prepared_Data$C3, family="binomial")
>
> This doesn't appear to work properly yet (I get funny values), but
> it's the direction I would go ...
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
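Picking up where that left off (untested on this particular data set): glmnet fits a whole path of lambda values, and at strong penalties most coefficients are shrunk to exactly zero, which is why coef(g1) shows mostly dots. cv.glmnet() selects a lambda by cross-validation, and coef() at that value returns the surviving terms. A minimal sketch, assuming m and Prepared_Data from the code quoted above:

library(glmnet)
set.seed(101)                    ## cross-validation folds are randomized
cv <- cv.glmnet(m, Prepared_Data$C3, family = "binomial")
plot(cv)                         ## binomial deviance along the lambda path
b <- coef(cv, s = "lambda.min")  ## sparse column matrix of coefficients
nz <- which(as.numeric(b) != 0)  ## rows with nonzero estimates
data.frame(term = rownames(b)[nz], coef = b[nz, 1])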