I would like to run a logistic regression on some factor variables (main effects and eventually an interaction) that are very sparse. I have a moderately large dataset, ~100k observations with 1500 factor levels for one variable (X1) and 600 for another (X2), creating ~19000 levels for the interaction (X1:X2).

I would like to take advantage of the sparseness in these factors to avoid using glm. Actually, glm is not an option given the size of the design matrix.

I have looked through the Matrix package as well as other packages without much help.

Is there some option, some modification of glm, some way that it will recognize a sparse matrix and avoid large matrix inversions?

-Robin
On 05/22/2010 02:19 PM, Robin Jeffries wrote:
> I would like to run a logistic regression on some factor variables (main
> effects and eventually an interaction) that are very sparse. I have a
> moderately large dataset, ~100k observations with 1500 factor levels for one
> variable (x1) and 600 for another (X2), creating ~19000 levels for the
> interaction (X1:X2).
>
> I would like to take advantage of the sparseness in these factors to avoid
> using GLM. Actually glm is not an option given the size of the design
> matrix.
>
> I have looked through the Matrix package as well as other packages without
> much help.
>
> Is there some option, some modification of glm, some way that it will
> recognize a sparse matrix and avoid large matrix inversions?
>
> -Robin

Robin,

It is doubtful that fixed effects are appropriate for your situation, but if you do want to use them, there is experimental code in the lrm function in the rms package to handle "strat" (strata) factors that makes use of the sparse matrix representation. I am not sure whether it handles more than one factor, and you'll have to play with the code to make sure this method is activated. Take a look at lrm.fit.strat.s, which comes with the source package, then see what is needed in lrm to use it.

Frank

--
Frank E Harrell Jr
Professor and Chairman
School of Medicine
Department of Biostatistics
Vanderbilt University
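To make the "sparse matrix representation" mentioned above concrete, here is a minimal sketch of building a sparse main-effects design matrix with sparse.model.matrix() from the Matrix package and comparing its size with the dense model.matrix() result. The data are simulated and much smaller than the ~100k rows in the original post, and the names d, x1, x2, Xs and Xd are placeholders for illustration. This only shows how the design matrix can be stored; it is not the lrm "strat" code, and glm() itself will not accept such a matrix.

```r
library(Matrix)

set.seed(1)
n <- 2000   # small stand-in for the ~100k observations in the post

# Hypothetical data: two factors with many levels each
d <- data.frame(
  x1 = factor(sample(150, n, replace = TRUE)),
  x2 = factor(sample(60,  n, replace = TRUE))
)

# Sparse main-effects design matrix (same default contrasts as model.matrix)
Xs <- sparse.model.matrix(~ x1 + x2, data = d)

# Dense equivalent, built here only for comparison
Xd <- model.matrix(~ x1 + x2, data = d)

dim(Xs)                               # same dimensions as the dense matrix
print(object.size(Xs), units = "Kb")  # storage for nonzero entries only
print(object.size(Xd), units = "Kb")  # storage for every entry
```

Each row has at most three nonzero entries (intercept plus one indicator per factor), so the sparse form grows with the number of observations rather than with observations times levels.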
As Frank mentioned in his reply, expecting to estimate tens of thousands of fixed-effects parameters in a logistic regression is optimistic. You could start with a generalized linear mixed model instead:

    library(lme4)
    fm1 <- glmer(resp ~ 1 + (1|f1) + (1|f2) + (1|f1:f2), mydata, binomial)

If you have difficulty with that, it might be best to switch the discussion to the R-SIG-Mixed-Models at R-project.org mailing list.

On Sat, May 22, 2010 at 2:19 PM, Robin Jeffries <rjeffries at ucla.edu> wrote:
> I would like to run a logistic regression on some factor variables (main
> effects and eventually an interaction) that are very sparse. I have a
> moderately large dataset, ~100k observations with 1500 factor levels for one
> variable (x1) and 600 for another (X2), creating ~19000 levels for the
> interaction (X1:X2).
>
> I would like to take advantage of the sparseness in these factors to avoid
> using GLM. Actually glm is not an option given the size of the design
> matrix.
>
> I have looked through the Matrix package as well as other packages without
> much help.
>
> Is there some option, some modification of glm, some way that it will
> recognize a sparse matrix and avoid large matrix inversions?
>
> -Robin
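For anyone who wants to try the mixed-model suggestion before applying it to real data, the following is a rough, self-contained sketch on simulated data. The variable names resp, f1, f2 and mydata follow the call in the reply; the sample size, numbers of levels, and the standard deviations used to simulate the random intercepts are arbitrary assumptions, and glmer may report convergence or singular-fit messages on data like these.

```r
library(lme4)

set.seed(42)
n  <- 2000                                   # far smaller than the real data
f1 <- factor(sample(50, n, replace = TRUE))  # stand-in for the 1500-level factor
f2 <- factor(sample(20, n, replace = TRUE))  # stand-in for the 600-level factor

# Simulate random intercepts for each factor (assumed values, for illustration)
b1  <- rnorm(nlevels(f1), sd = 1)
b2  <- rnorm(nlevels(f2), sd = 0.5)
eta <- -0.5 + b1[as.integer(f1)] + b2[as.integer(f2)]

mydata <- data.frame(resp = rbinom(n, 1, plogis(eta)), f1 = f1, f2 = f2)

# Random intercepts for both factors and for their interaction
fm1 <- glmer(resp ~ 1 + (1 | f1) + (1 | f2) + (1 | f1:f2),
             data = mydata, family = binomial)

VarCorr(fm1)        # estimated variance components
head(ranef(fm1)$f1) # predicted intercepts for the first few levels of f1
```

The model estimates one variance per grouping term instead of one coefficient per level, which is why it remains feasible with thousands of levels.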