Richard A. O'Keefe
2005-Dec-08 22:43 UTC
[R] logistic regression with constrained coefficients?
I am trying to automatically construct a distance function from a training set in order to use it to cluster another data set. The variables are nominal. One variable is a "class" variable having two values; it is kept separate from the others.

I have a method which constructs a distance matrix for the levels of a nominal variable in the context of the other variables.

I want to construct a linear combination of these which gives me a distance between whole cases that is well associated with the class variable, in that "combined distance between two cases large => they most likely belong to different classes."

So from my training set I construct a set of

    (d1(x1,y1), ..., dn(xn,yn), x_class != y_class)

rows bound together as a data frame (actually I construct it by columns), and then the obvious thing to try was

    glm(different.class ~ ., family = binomial(), data = distance.frame)

The thing is that this gives me both positive and negative coefficients, whereas the linear combination is only guaranteed to be a metric if the coefficients are all non-negative.

There are four fairly obvious ways to deal with that:
(1) Just force the negative coefficients to 0 and hope. This turns out to work rather well, but still...
(2) Keep all the coefficients but take max(0, linear combination of distances). This turns out to work rather well, but still...
(3) Drop the variables with negative coefficients from the model, refit, and iterate until no negative coefficients remain. This can hardly be said to work; sometimes nearly all the variables are dropped.
(4) Use a version of glm() that will let me constrain the coefficients to be non-negative.

I *have* searched the R-help archives, and I see that the question about logistic regression with constrained coefficients has come up before, but it didn't really get a satisfactory answer. I've also searched the documentation of more contributed packages than I could possibly understand.

There is obviously some way to do this using R's general non-linear optimisation functions. However, I don't know how to formulate logistic regression that way; one possible formulation is sketched below.

This whole thing is heuristic. I am not hell-bent on (ab?)using logistic regression this way. It was just an obvious thing to try. Suggestions for other means to the same end will be welcome.
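A minimal sketch of one such formulation: minimise the binomial negative log-likelihood directly and impose box constraints so the slope coefficients stay non-negative. The objects distance.frame and different.class are the ones described above; the function name neg.loglik and the starting-value trick are illustrative, not from the original post.

    ## Bernoulli negative log-likelihood for a logistic model:
    ## sum over cases of log(1 + exp(eta)) - y * eta, with eta = X %*% beta.
    ## (A numerically robust version would guard against very large eta.)
    neg.loglik <- function(beta, X, y) {
        eta <- drop(X %*% beta)
        sum(log1p(exp(eta)) - y * eta)
    }

    ## Design matrix (intercept + the distance columns) and 0/1 response.
    X <- model.matrix(different.class ~ ., data = distance.frame)
    y <- as.numeric(distance.frame$different.class)

    ## Use the unconstrained glm() fit for starting values, pushed into
    ## the feasible region.
    start <- coef(glm(different.class ~ ., family = binomial(),
                      data = distance.frame))
    start[-1] <- pmax(start[-1], 0)

    ## L-BFGS-B allows box constraints: intercept free, slopes >= 0.
    fit <- optim(start, neg.loglik, X = X, y = y,
                 method = "L-BFGS-B",
                 lower = c(-Inf, rep(0, ncol(X) - 1)))
    fit$par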
Prof Brian Ripley
2005-Dec-08 23:02 UTC
[R] logistic regression with constrained coefficients?
On Fri, 9 Dec 2005, Richard A. O'Keefe wrote:
> [question quoted in full above]

There is a worked example in MASS (the book) p.445, including adding constraints.

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
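Not the book's code, but the same general idea of adding constraints to a directly maximised likelihood can also be sketched with nlminb(), which accepts box constraints as well; this reuses neg.loglik, X, y and start from the sketch after the original question.

    ## Same objective and constraints as before, via nlminb().
    fit2 <- nlminb(start, neg.loglik, X = X, y = y,
                   lower = c(-Inf, rep(0, ncol(X) - 1)))
    fit2$par    # slope estimates constrained to be >= 0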