Hello R-helpers,

I'd like a function that, given an arbitrary formula and a data frame,
returns the residuals of the dependent variable and maintains all NA
values.

Here's an example that will give me what I want if my formula is
y~x1+x2+x3 and my data frame is df:

resid(lm(y~x1+x2+x3, data=df, na.action=na.exclude))

Here's the catch: I do not want my function to ever fail due to a factor
with only one level. A one-level factor may appear because 1) the user
passed it in, or 2) (more common) only one level of a factor is left
after na.exclude removes the rows containing NA values.

Here is the error I would get above if one of the terms was a factor
with one level:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels

Instead of giving me an error, I'd like the function to do just what
lm() normally does when it sees a variable with no variance: ignore the
variable (coefficient is NA) and continue to regress out all the other
variables. Thus if 'x2' is a factor with one level in the above example,
I'd like the function to return the result of:

resid(lm(y~x1+x3, data=df, na.action=na.exclude))

Can anyone provide me a straightforward recommendation for how to do
this? I feel like it should be easy, but I'm honestly stuck, and my
Google searching for this hasn't gotten anywhere. The key is that I'd
like the solution to be generic enough to work with an arbitrary linear
formula, and not substantially kludgy (like trying every combination of
regression terms until one works), as I'll be running this a lot on big
data sets and don't want my computation time swamped by running
unnecessary regressions or checking the number of factor levels after
removing NAs.

Thanks in advance!
--Robert

PS. The Google search feature in the R-help archives appears to be down:
http://tolstoy.newcastle.edu.au/R/
Robert McGehee <rmcgehee <at> gmail.com> writes:

> Hello R-helpers,
> I'd like a function that given an arbitrary formula and a data frame
> returns the residual of the dependent variable, and maintains all
> NA values.
>
> Here's an example that will give me what I want if my formula is y~x1+x2+x3
> and my data frame is df:
>
> resid(lm(y~x1+x2+x3, data=df, na.action=na.exclude))
>
> Here's the catch, I do not want my function to ever fail due to a factor
> with only one level. A one-level factor may appear because 1) the user
> passed it in, or 2) (more common) only one factor in a term is left after
> na.exclude removes the other NA values.
>

[snip to try to make Gmane happy]

> Can anyone provide me a straight forward recommendation for how
> to do this?

The only approach I can think of is to screen for single-level factors
yourself and remove these factors from the formula. It's a little
tricky; you can't call model.frame() with a single-level factor (that's
where the error comes from), and you have to strip out NA values
yourself so you can see which factors end up with only a single level
after NA removal.
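The screening approach described above could be sketched roughly as follows. This is only an illustration, not code posted in the thread: the function name resid_drop_single_level is made up, and it assumes that every variable in the formula is a plain column of the data frame (no `.` shorthand, and no transformations such as log(x) that could introduce extra NAs).

```r
# Hypothetical helper; the name and design are assumptions, not from the thread.
resid_drop_single_level <- function(formula, data) {
  vars <- all.vars(formula)
  # Rows lm() would keep: complete cases on the variables in the formula.
  keep <- complete.cases(data[, vars, drop = FALSE])
  used <- data[keep, vars, drop = FALSE]

  # Factor or character columns with fewer than two distinct values
  # left after NA removal.
  single <- vapply(used, function(v) {
    (is.factor(v) || is.character(v)) && length(unique(v)) < 2L
  }, logical(1))
  bad <- names(used)[single]

  tt <- terms(formula)
  labels <- attr(tt, "term.labels")
  if (length(bad) > 0 && length(labels) > 0) {
    # Rows of the "factors" matrix are variables, columns are terms;
    # a nonzero entry means the variable appears in that term.
    fac <- attr(tt, "factors")
    hit <- colSums(fac[bad, , drop = FALSE] > 0) > 0
    if (any(hit)) {
      rhs <- paste(". ~ . -", paste(labels[hit], collapse = " - "))
      formula <- update(formula, as.formula(rhs))
    }
  }
  resid(lm(formula, data = data, na.action = na.exclude))
}

# Example data: level "B" of x2 occurs only in the row where x1 is NA.
df <- data.frame(y  = c(0, 2, 4, 6, 8),
                 x1 = c(1, 1, 2, 2, NA),
                 x2 = factor(c("A", "A", "A", "A", "B")))
r <- resid_drop_single_level(y ~ x1 + x2, df)
r
```

The check uses unique() on the observed values rather than levels(), because levels() would still report "B" after its only row has been dropped. The sketch makes one pass over the data and fits a single regression, so it avoids refitting over combinations of terms.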
> On Mar 10, 2016, at 2:00 PM, Robert McGehee <rmcgehee at gmail.com> wrote:
>
> Hello R-helpers,
> I'd like a function that given an arbitrary formula and a data frame
> returns the residual of the dependent variable, and maintains all NA values.

What does "maintains all NA values" actually mean?

> Here's an example that will give me what I want if my formula is y~x1+x2+x3
> and my data frame is df:
>
> resid(lm(y~x1+x2+x3, data=df, na.action=na.exclude))
>
> Here's the catch, I do not want my function to ever fail due to a factor
> with only one level. A one-level factor may appear because 1) the user
> passed it in, or 2) (more common) only one factor in a term is left after
> na.exclude removes the other NA values.
>
> Here is the error I would get

From what code?

> above if one of the terms was a factor with one level:
> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
> contrasts can be applied only to factors with 2 or more levels

Unable to create that error with the actions you describe but do not
actually offer in coded form:

> dfrm <- data.frame(y=rnorm(10), x1=rnorm(10), x2=TRUE, x3=rnorm(10))
> lm(y~x1+x2+x3, dfrm)

Call:
lm(formula = y ~ x1 + x2 + x3, data = dfrm)

Coefficients:
(Intercept)           x1       x2TRUE           x3
   -0.16274     -0.30032           NA     -0.09093

> resid(lm(y~x1+x2+x3, data=dfrm, na.action=na.exclude))
          1           2           3           4           5           6
-0.16097245  0.65408508 -0.70098223 -0.15360434  1.26027872  0.55752239
          7           8           9          10
-0.05965653 -2.17480605  1.42917190 -0.65103650

> Instead of giving me an error, I'd like the function to do just what lm()
> normally does when it sees a variable with no variance, ignore the variable
> (coefficient is NA) and continue to regress out all the other variables.
> Thus if 'x2' is a factor with one variable in the above example, I'd like
> the function to return the result of:
> resid(lm(y~x1+x3, data=df, na.action=na.exclude))
> Can anyone provide me a straight forward recommendation for how to do this?
> I feel like it should be easy, but I'm honestly stuck, and my Google
> searching for this hasn't gotten anywhere. The key is that I'd like the
> solution to be generic enough to work with an arbitrary linear formula, and
> not substantially kludgy (like trying ever combination of regressions terms
> until one works) as I'll be running this a lot on big data sets and don't
> want my computation time swamped by running unnecessary regressions or
> checking for number of factors after removing NAs.
>
> Thanks in advance!
> --Robert
>
> PS. The Google search feature in the R-help archives appears to be down:
> http://tolstoy.newcastle.edu.au/R/

It's working for me.

> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA
Here's an example for clarity:

> df <- data.frame(y=c(0,2,4,6,8), x1=c(1,1,2,2,NA),
+                  x2=factor(c("A","A","A","A","B")))
> resid(lm(y~x1+x2, data=df, na.action=na.exclude))
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels

Note that the x2 factor variable contains two levels, but the "B" level
is excluded in the regression due to the NA value in x1. Hence the
error.

Instead of the above error, I would like a function that returns the
residuals of the regression without the offending term, which in this
case would be equivalent to:

> resid(lm(y~x1, data=df, na.action=na.exclude))
 1  2  3  4  5
-1  1 -1  1 NA

Note the 5th term returns an NA as there is an NA in the x1 independent
variable, which was what I had meant by "maintain NAs".

I'm currently leaning towards rewriting model.matrix.default so that it
removes offending terms rather than giving an error, but if someone has
done this already (or something more elegant), that would of course be
preferred :)

--Robert
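To make "maintain NAs" concrete, here is a small sketch comparing na.omit with na.exclude on the x1-only model, using the y and x1 columns from Robert's example (repeated so the snippet runs on its own). The names r_omit and r_excl are illustrative only.

```r
df <- data.frame(y = c(0, 2, 4, 6, 8), x1 = c(1, 1, 2, 2, NA))

# na.omit: the incomplete row is dropped and the residual vector is shorter.
r_omit <- resid(lm(y ~ x1, data = df, na.action = na.omit))

# na.exclude: the incomplete row is still left out of the fit, but resid()
# pads the result with NA so it lines up with the rows of df.
r_excl <- resid(lm(y ~ x1, data = df, na.action = na.exclude))

length(r_omit)   # 4
length(r_excl)   # 5
r_excl
```

With na.exclude the residuals can be assigned straight back into the data frame as a new column, which is presumably why it matters when the function is applied to many data sets.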
> -----Original Message-----
> From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of David
> Winsemius
> Sent: Thursday, March 10, 2016 4:39 PM
> To: Robert McGehee
> Cc: r-help at r-project.org
> Subject: Re: [R] Regression with factor having 1 level
>

[snip]

I agree that what is wanted is not clear. However, if dfrm is created
with x2 as a factor, then you get the error message that the OP mentions
when you run the regression.

> dfrm <- data.frame(y=rnorm(10), x1=rnorm(10), x2=as.factor(TRUE), x3=rnorm(10))
> lm(y~x1+x2+x3, dfrm, na.action=na.exclude)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels

Dan

Daniel Nordlund, PhD
Research and Data Analysis Division
Services & Enterprise Support Administration
Washington State Department of Social and Health Services