Gabrielle Perron
2017-Mar-27 15:23 UTC
[R] R glm function ignores some predictor variables
Hi,

This is my first time using this mailing list. I have looked at the posting guide, but please do let me know if I should be doing something differently.

Here is my question. I apologize in advance for not being able to provide example data: I am using very large tables, and what I am trying to do works fine with simpler examples, so a small example would not reproduce the problem. It has always worked for me until now, so I am just trying to get your ideas on what might be the issue. If there is any way I could provide more information, do let me know.

I have a vector corresponding to a response variable and a table of predictor variables. The response vector is numeric; the predictor variables (columns of the table) are binary (0s and 1s).

I am running the glm function (multivariate linear regression) using the response vector and the table of predictors:

    fit <- glm(response ~ as.matrix(predictors), na.action=na.exclude)
    coeff <- as.vector(coef(summary(fit))[,4])[-1]

When I have done this in the past, I would extract the vector of regression coefficients to use it for further analysis.

The problem is that the regression now returns a vector of coefficients that is missing some values. Essentially, some predictor variables are not given a coefficient at all by glm, but there are no error messages.

The summary of the model looks normal, except that some predictor variables are missing, as I mentioned. Most other predictors have values assigned (coefficient, p-value, etc.). About 30 predictors out of 200 are missing from the model.

I have tried using different response variables (vectors), but I get the same issue, although the missing predictors vary depending on the response vector.

Any ideas on what might be going on? I think this can happen if some variables have zero variance, but I have checked that. There are also no NA values and no missing values in the tables.

What could cause glm to ignore/remove some predictor variables?

Any suggestion is welcome!

Thank you,

Gabrielle
Hi Gabrielle,

With that number of binary predictors it would be no surprise if some were linear combinations of others.

Jim

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
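One way to check the linear-combination explanation is to compare the rank of the design matrix with its number of columns. The sketch below uses made-up data (the names `g`, `predictors`, `X` and the columns `a` to `d` are hypothetical, not taken from the original tables) and assumes the model includes an intercept, as the glm call in the question does.

```r
# Hypothetical data: three mutually exclusive 0/1 indicators plus one unrelated one.
n <- 60
g <- rep(1:3, length.out = n)
predictors <- cbind(
  a = as.integer(g == 1),
  b = as.integer(g == 2),
  c = as.integer(g == 3),                        # a + b + c is a column of ones
  d = rep(c(0L, 1L, 1L, 0L, 1L), length.out = n)
)

# Design matrix as the model sees it: intercept column plus the predictors.
X <- cbind(Intercept = 1, predictors)

# A rank below the column count means at least one column is redundant.
q <- qr(X)
n_redundant <- ncol(X) - q$rank

# Columns pivoted past the rank are the candidates for being left without a coefficient.
dropped <- if (n_redundant > 0) {
  colnames(X)[q$pivot[(q$rank + 1):ncol(X)]]
} else {
  character(0)
}

print(c(columns = ncol(X), rank = q$rank))
print(dropped)
```

None of the columns here has zero variance, yet the matrix is rank deficient, which is why a variance check alone does not rule this out.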
> On 27 Mar 2017, at 17:23 , Gabrielle Perron <gabrielle.perron at mail.mcgill.ca> wrote:
>
> This is my first time using this mailing list. I have looked at the posting guide, but please do let me know if I should be doing something differently.

Avoid sending in HTML. It's not really bad here except for excessive inter-paragraph spacing, but the autoconversion to plain text can make posts almost unreadable.

---

It looks like some of your predictors are equal to linear combinations of other predictors. E.g., if you have dummies for mutually exclusive groups, they will sum to a vector of ones, which is already in the model for the intercept term, so one must be removed. (R has a better interface to this sort of thing, using factor variables, but that is another story.) With many predictors, this can happen in less obvious ways as well.

For plain multiple regression, I think most would use lm() and not glm(). It shouldn't make much of a difference, but some details may differ.

-pd

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
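The dummy-variable case described above can be reproduced in a few lines. This is only a sketch on simulated data (`g`, `y`, `dummies` and the other names are hypothetical), using lm() as suggested; glm() treats aliased columns the same way. It also shows why a vector taken from coef(summary(fit)) comes out shorter: coef(fit) keeps an NA for each aliased column, while the summary table leaves those rows out.

```r
set.seed(42)
n <- 60
g <- rep(1:3, length.out = n)          # hypothetical group labels
y <- c(1, 2, 4)[g] + rnorm(n)          # hypothetical response

# One 0/1 dummy per group: together they sum to the intercept column.
dummies <- cbind(a = as.integer(g == 1),
                 b = as.integer(g == 2),
                 c = as.integer(g == 3))

fit <- lm(y ~ dummies)

full <- coef(fit)            # one entry per column, NA where the column is aliased
tab  <- coef(summary(fit))   # rows for aliased columns are left out

# p-values padded back to full length, so positions still match the predictors.
pvals <- setNames(rep(NA_real_, length(full)), names(full))
pvals[rownames(tab)] <- tab[, 4]

print(full)
print(pvals[-1])             # drop the intercept, keep NA placeholders

# The factor interface picks a reference level, so nothing is aliased.
fit_factor <- lm(y ~ factor(g))
print(coef(fit_factor))
```

Calling alias(fit) on such a model reports which columns are linear combinations of the others, which may help with the less obvious cases among 200 predictors.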