Hi all. Sorry for the long email. I have been trying to find someone local to
work on this with me, without much luck. I went in to our local stats
consulting service here, and the guy there told me that I already know more
about model selection than he does. :-< He pointed me towards another
professor who can perhaps help, but that prof is busy until mid-June, so I want
to get as much figured out as I can before I eventually meet with him. And that
prof may turn out not to be much help anyway, in which case all I've got is
you folks. :->
So I've got a big dataset (300,000 rows) in which the dependent variable
is binary, and there are five independent variables, each continuous, and each
uniformly distributed between 0 and 1. I'm trying to use binary logistic
regression to get an explanatory model with decent predictive value (but being
simple, for explanation, is more important than being optimally predictive).
Considering the squares of those variables as well (but no higher powers; this list recently talked me out of those because of the collinearity problems that come with polynomial fits), plus many interactions among the variables and their squares, I have a full model formula of 116 terms. Most of those terms are of small effect, although they are often still significant in a glm fit, since 300,000 rows is a lot of data. So I have a
model selection problem: I want to find a simple explanatory model containing
just the terms of large effect that are most important in getting an
understanding of what strongly influences the dependent variable.
I was previously trying to do model selection using a step-down approach with
a large per-term penalty (much larger than the standard BIC penalty) to force
more terms to drop out. I tried a wide range of penalty values, got a corresponding set of models with different numbers of terms, computed the correct prediction rate for each of those models, and basically chose the
simplest model that still gave me a pretty good prediction rate (for some
subjective definition of "pretty good"). Typically the model I chose
would have about 20 terms, out of the original 116. (A standard BIC step-down
would retain more like 100 terms, with only a very slightly better prediction
rate.)
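In case it helps to be concrete, what I was doing looks roughly like the following sketch (the penalty value of 50 is just a placeholder; I actually tried a whole range of values, and k = log(nrow(df)) would give the standard BIC penalty):
full <- glm(f, data = df, family = binomial)  # f is my 116-term formula
reduced <- step(full, direction = "backward", k = 50)  # step-down with a large per-term penalty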
So that was working OK, but in discussions with the folks on this list (thanks
everybody for your help!), I have been exploring using the lasso for this
instead, to avoid the problems with step-down, gain the benefits of shrinkage,
and so forth; in principle it should be better than my homegrown model selection
procedure. I've been reading about the lasso in Tibshirani (1996) and in
The Elements of Statistical Learning. I'm using glmnet() at present; I have
seen that lars() also exists, but I don't understand its documentation as
well, so I'm starting with glmnet. So there's one question:
1. Is my choice of glmnet() ok? On what basis should I choose glmnet() vs.
lars()?
The lasso wants variables to be centered and scaled, I gather. glmnet() can
do this for me, but I want to understand exactly which variables the fit is done on, so that I can interpret the coefficients properly; for that reason I'd rather do it myself. (I'm also concerned that glmnet() might not do it
correctly, since it doesn't know that some of the terms in the formula are
squares and interactions.) So I'm passing standardize=FALSE to glmnet(),
and I'm doing my own scaling beforehand, like this (where df is my dataframe):
df$Cx <- as.numeric(scale(df$x))  # an independent variable, centered and scaled; as.numeric() drops the matrix structure that scale() returns
df$Cx_sq <- df$Cx ^ 2  # the square of that scaled variable; I do not rescale it
I do this for each of the five independent variables. So the variables
themselves are centered and scaled, while the squared versions of the variables
are exactly the square of the scaled variable; I do not scale them again. So
question 2:
2. Is the way I'm scaling the variables before calling glmnet() correct? Or
should the squares themselves be centered and scaled?
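Just to make question 2 concrete, the alternative I have in mind would be:
df$Cx_sq <- as.numeric(scale(df$Cx ^ 2))  # center and scale the squared term itself
rather than squaring the already-scaled variable as above.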
Having scaled the variables in this way, I then construct a model matrix and
call glmnet() (where f is the 116-term formula and df is my dataframe):
library(glmnet)
mf <- model.frame(f, df)  # build the model frame from the 116-term formula
mm <- model.matrix(formula(f), mf)[, -1]  # design matrix, with the intercept column dropped
lasso <- glmnet(mm, y = df$outcome, family = "binomial",
                standardize = FALSE)
I do it this way because glmnet() doesn't support being passed a formula
and a dataframe. I think this is doing the right thing. The model.matrix()
call constructs new columns for all of the interactions in the formula, which of
course act as separate independent variables in the regression. One worry I
have is that those interaction columns, like the squares discussed above, are not themselves
centered and scaled. If there's an interaction between, say, Cx and Cy,
then the model matrix column for Cx:Cy is of course just the product of the Cx
column and the Cy column, and so it is not centered/scaled. I don't know if
this is correct or not. So question 3:
3. Is my model matrix correct, or do I have a problem with the scale of the
interaction variables?
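To make question 3 concrete as well: I suppose I could instead standardize every column of the model matrix after it has been built, so that the squares and the interaction columns are centered and scaled along with everything else, something like the sketch below; I don't know whether that would be the right thing to do, which is really what I'm asking.
mm_std <- scale(mm)  # center and scale every column, including squares and interactions
lasso_std <- glmnet(mm_std, y = df$outcome, family = "binomial",
                    standardize = FALSE)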
Having run glmnet(), I now have lots of coefficients for the different values
of lambda. One thing that seems problematic is that the coefficient for an
interaction (Cx:Cy, say) will become non-zero while the coefficients for the
underlying terms Cx and Cy are still zero. Of course glmnet() has no way of
preventing this, since it doesn't know the formula or which matrix columns
are interactions, etc. From my previous work on model selection via step-down,
my understanding is that this state of affairs should not be allowed. Using
step-down, Cx and Cy will not be dropped as long as Cx:Cy is retained, for
example. As I understand it, this is for essentially the same conceptual
reasons that one ought to retain the intercept in a regression even if it is
non-significant; the slope of the regression depends upon it, and so dropping it
is not correct. However, I don't see any way that I can tell glmnet() about
these relationships, since it doesn't have a formula interface. So:
4. Is it a problem that the lasso fit gives non-zero coefficients for
interactions whose underlying terms have zero coefficients?
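To show what I mean, here's how I'm noticing this, by pulling out the non-zero coefficients at one particular lambda (the value 0.01 is just an arbitrary example):
cf <- as.matrix(coef(lasso, s = 0.01))  # coefficients at one lambda value
nonzero <- rownames(cf)[cf[, 1] != 0]  # names of the terms with non-zero coefficients
nonzero  # an interaction like "Cx:Cy" can appear here even when "Cx" and "Cy" do not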
And finally comes the actual model selection problem. The best lambda value
given by cv.glmnet() is for a model that contains most of the 116 terms; I
suppose that, as with the BIC step-down, this is because 300,000 rows is a lot
of data, and so most of those terms are in fact well-supported by the data. But
I want a simple explanatory model with only the terms of large effect, as
described before. I'm thinking of finding the prediction rate for each lambda value used by glmnet (sketched below, after question 5), and using the same subjective "simplest model with a pretty good prediction rate" criterion that I was using before. But
I'm wondering:
5. Is there any way to choose a simple explanatory model, smaller than the best
predictive model supported by the data, that is less arbitrary / subjective?
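For what it's worth, the prediction-rate-per-lambda idea mentioned above would be something like the sketch below; I realize this is only the in-sample rate, and a held-out set (or cv.glmnet's own error measures) would be better:
pred <- predict(lasso, newx = mm, type = "class")  # one column of predicted classes per lambda
accuracy <- colMeans(pred == as.character(df$outcome))  # in-sample correct prediction rate per lambda
path <- data.frame(lambda = lasso$lambda, nterms = lasso$df, accuracy = accuracy)
path  # then pick the simplest model with a "pretty good" accuracy, as before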
And finally, I've got sort of a philosophical question that I'm
struggling with that makes me wonder if the lasso is really the right tool for
my job in the first place. As lambda decreases, the coefficients of terms diverge from zero, one by one, until all terms have non-zero coefficients and you've arrived at the full model. The funny thing is, at most values of lambda, some term or another in my formula will have only just recently diverged from zero, and so that term will have a very small coefficient: the shrinkage
aspect of the lasso will have shrunk it almost, but not quite, to zero. This is
weird, given that my goal is a simple explanatory model; having terms with very
small but non-zero coefficients is contrary to the goals of my statistical
analysis. The lasso is considered to be better for producing explanatory models
than pure shrinkage methods such as ridge regression, because coefficients do eventually go to zero, instead
of just getting smaller and smaller; but still, I wonder if I ought to be using
some more extreme method of model selection that, like the step-down I was using
before, either keeps a term in or drops it, with no in-between. On the other
hand, I gather that the shrinkage that the lasso does is considered to be a good
thing. I don't really understand enough about all this to know what to do.
So:
6. Should I be looking at a method other than the lasso, so that I would not get
some terms with very small coefficients? Or should I use the lasso, and just
ignore, in my explanation of the model, terms with tiny coefficients? Or is
there some other way of handling this problem?
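One other thought related to question 6, and I have no idea whether it is statistically kosher, so please tell me if it isn't: I could use the lasso only to choose which terms to keep, and then refit an ordinary unpenalized glm on just those terms, so that the coefficients I report and interpret are not shrunken. Roughly:
cf <- as.matrix(coef(lasso, s = 0.01))  # again, s = 0.01 is just an example value
keep <- setdiff(rownames(cf)[cf[, 1] != 0], "(Intercept)")  # the selected terms
refit_df <- data.frame(outcome = df$outcome, mm[, keep, drop = FALSE])  # note ":" in column names becomes "."
refit <- glm(outcome ~ ., data = refit_df, family = binomial)
summary(refit)  # unshrunken coefficients for just the selected terms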
Thanks to anybody who has even read this far, and many many thanks for any
responses!
Ben Haller
McGill University
http://biology.mcgill.ca/grad/ben/