Liaw, Andy
2005-Feb-15 03:41 UTC
[R] using poly in a linear regression in the presence of NA f ails (despite subsetting them out)
This smells like a bug to me. The error is triggered by the line: variables <- eval(predvars, data, env) inside model.frame.default(). At that point, na.action has not been applied, so poly() ended being called on data that still contains missing values. The qr() that issued the error is for generating the orthogonal basis when evaluating poly(), not for fitting the linear model itself. Essentially, calling model.frame(y ~ poly(x, 2), data=data.frame(x=c(NA, 1:3), y=rnorm(4)), na.action=na.omit) would show the same error. The obvious workaround is to omit cases with NAs before calling lm(). Andy> From: Markus J?ntti > > I ran into a to me surprising result on running lm with an orthogonal > polynomial among the predictors. > > The lm command resulted in > > Error in qr(X) : NA/NaN/Inf in foreign function call (arg 1) > Error during wrapup: > > despite my using a "subset" in the call to get rid of NA's. > > poly is apparently evaluated before any NA's are subsetted out > of the data. > > Example code (attached to this e-mail as as a script): > > ## generate some data > > n <- 50 > > x <- runif(n) > > a0 <- 10 > > a1 <- .5 > > sigma.e <- 1 > > y <- a0 + a1*x + rnorm(n)*sigma.e > > tmp.d <- data.frame(y, x) > > rm(list=c("n", "x", "a0", "a1", "sigma.e", "y")) > > > > print(lm.1 <- lm(y ~ poly(x, 2), data = tmp.d) > + > + ## now make a few NA's > + > + tmp.d$x[1:2] <- rep(NA, 2) > Error: syntax error > Error during wrapup: > > > > ## this fails, just as it should > > print(lm.1 <- lm(y ~ poly(x, 2), data = tmp.d)) > > Call: > lm(formula = y ~ poly(x, 2), data = tmp.d) > > Coefficients: > (Intercept) poly(x, 2)1 poly(x, 2)2 > 10.380 -0.242 -1.441 > > > > > ## these also fail, but should not? > > > > print(lm.2 <- lm(y ~ poly(x, 2), data = tmp.d, subset = !is.na(x))) > > Call: > lm(formula = y ~ poly(x, 2), data = tmp.d, subset = !is.na(x)) > > Coefficients: > (Intercept) poly(x, 2)1 poly(x, 2)2 > 10.380 -0.242 -1.441 > > > print(lm.3 <- lm(y ~ poly(x, 2), data = tmp.d, na.action = > na.omit)) > > Call: > lm(formula = y ~ poly(x, 2), data = tmp.d, na.action = na.omit) > > Coefficients: > (Intercept) poly(x, 2)1 poly(x, 2)2 > 10.380 -0.242 -1.441 > > > > > ## but this works > > > > print(lm.3 <- lm(y ~ poly(x, 2), data = subset(tmp.d, subset = > !is.na(x)))) > > Call: > lm(formula = y ~ poly(x, 2), data = subset(tmp.d, subset = !is.na(x))) > > Coefficients: > (Intercept) poly(x, 2)1 poly(x, 2)2 > 10.380 -0.242 -1.441 > > -------------------- > > The documentation of lm is *not* misleading at this point, saying that > > subset an optional vector specifying a subset of > observations to be > used in the fitting process. > > which implies that data are subsetted once lm.fit is called. > All the same, this behavior is a little unexpected to me. > Is it to be considered a feature, that is, does it produce beneficial > side effects which explain why it works as it does? > > Regards, > > Markus > > I am running R on a Debian testing system with kernel 2.6.10 and > > > version > _ > platform i386-pc-linux-gnu > arch i386 > os linux-gnu > system i386, linux-gnu > status > major 2 > minor 0.1 > year 2004 > month 11 > day 15 > language R > -- > Markus Jantti > Abo Akademi University > markus.jantti at iki.fi > http://www.iki.fi/~mjantti >
Prof Brian Ripley
2005-Feb-15 07:19 UTC
[R] using poly in a linear regression in the presence of NA f ails (despite subsetting them out)
Andy, I don't think it is a bug. The problem is that poly(x, 2) depends on the possible set of x values, and so needs to know all of them, unlike e.g. log(x) which is observation-by-observation. Silently omitting missing values is not a good idea in such cases, especially if the values are missing in other variables (which is what na.action is likely to do). I would say models with poly, ns, bs etc are inadvisable in the presence of missing values in their argument. We could make poly() give an informative message, though. Brian On Mon, 14 Feb 2005, Liaw, Andy wrote:> This smells like a bug to me. The error is triggered by the line: > > variables <- eval(predvars, data, env) > > inside model.frame.default(). At that point, na.action has not been > applied, so poly() ended being called on data that still contains missing > values. The qr() that issued the error is for generating the orthogonal > basis when evaluating poly(), not for fitting the linear model itself. > > Essentially, calling > > model.frame(y ~ poly(x, 2), data=data.frame(x=c(NA, 1:3), y=rnorm(4)), > na.action=na.omit) > > would show the same error. The obvious workaround is to omit cases with NAs > before calling lm(). > > Andy > >> From: Markus J?ntti >> >> I ran into a to me surprising result on running lm with an orthogonal >> polynomial among the predictors. >> >> The lm command resulted in >> >> Error in qr(X) : NA/NaN/Inf in foreign function call (arg 1) >> Error during wrapup: >> >> despite my using a "subset" in the call to get rid of NA's. >> >> poly is apparently evaluated before any NA's are subsetted out >> of the data. >> >> Example code (attached to this e-mail as as a script): >> > ## generate some data >> > n <- 50 >> > x <- runif(n) >> > a0 <- 10 >> > a1 <- .5 >> > sigma.e <- 1 >> > y <- a0 + a1*x + rnorm(n)*sigma.e >> > tmp.d <- data.frame(y, x) >> > rm(list=c("n", "x", "a0", "a1", "sigma.e", "y")) >> > >> > print(lm.1 <- lm(y ~ poly(x, 2), data = tmp.d) >> + >> + ## now make a few NA's >> + >> + tmp.d$x[1:2] <- rep(NA, 2) >> Error: syntax error >> Error during wrapup: >> > >> > ## this fails, just as it should >> > print(lm.1 <- lm(y ~ poly(x, 2), data = tmp.d)) >> >> Call: >> lm(formula = y ~ poly(x, 2), data = tmp.d) >> >> Coefficients: >> (Intercept) poly(x, 2)1 poly(x, 2)2 >> 10.380 -0.242 -1.441 >> >> > >> > ## these also fail, but should not? >> > >> > print(lm.2 <- lm(y ~ poly(x, 2), data = tmp.d, subset = !is.na(x))) >> >> Call: >> lm(formula = y ~ poly(x, 2), data = tmp.d, subset = !is.na(x)) >> >> Coefficients: >> (Intercept) poly(x, 2)1 poly(x, 2)2 >> 10.380 -0.242 -1.441 >> >> > print(lm.3 <- lm(y ~ poly(x, 2), data = tmp.d, na.action >> na.omit)) >> >> Call: >> lm(formula = y ~ poly(x, 2), data = tmp.d, na.action = na.omit) >> >> Coefficients: >> (Intercept) poly(x, 2)1 poly(x, 2)2 >> 10.380 -0.242 -1.441 >> >> > >> > ## but this works >> > >> > print(lm.3 <- lm(y ~ poly(x, 2), data = subset(tmp.d, subset >> !is.na(x)))) >> >> Call: >> lm(formula = y ~ poly(x, 2), data = subset(tmp.d, subset = !is.na(x))) >> >> Coefficients: >> (Intercept) poly(x, 2)1 poly(x, 2)2 >> 10.380 -0.242 -1.441 >> >> -------------------- >> >> The documentation of lm is *not* misleading at this point, saying that >> >> subset an optional vector specifying a subset of >> observations to be >> used in the fitting process. >> >> which implies that data are subsetted once lm.fit is called. >> All the same, this behavior is a little unexpected to me. >> Is it to be considered a feature, that is, does it produce beneficial >> side effects which explain why it works as it does? >> >> Regards, >> >> Markus >> >> I am running R on a Debian testing system with kernel 2.6.10 and >> >> > version >> _ >> platform i386-pc-linux-gnu >> arch i386 >> os linux-gnu >> system i386, linux-gnu >> status >> major 2 >> minor 0.1 >> year 2004 >> month 11 >> day 15 >> language R >> -- >> Markus Jantti >> Abo Akademi University >> markus.jantti at iki.fi >> http://www.iki.fi/~mjantti >> > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html >-- Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
Liaw, Andy
2005-Feb-15 13:30 UTC
[R] using poly in a linear regression in the presence of NA f ails (despite subsetting them out)
My apologies: It's another case of me not thinking statistically... It may also help those of us whose brains run at slow clock speeds to have ?poly, ?bs and ?ns mention how they react to NAs. Best, Andy> From: Prof Brian Ripley > > Andy, > > I don't think it is a bug. The problem is that poly(x, 2) > depends on the > possible set of x values, and so needs to know all of them, unlike > e.g. log(x) which is observation-by-observation. Silently omitting > missing values is not a good idea in such cases, especially > if the values > are missing in other variables (which is what na.action is > likely to do). > > I would say models with poly, ns, bs etc are inadvisable in > the presence > of missing values in their argument. We could make poly() give an > informative message, though. > > Brian > > > On Mon, 14 Feb 2005, Liaw, Andy wrote: > > > This smells like a bug to me. The error is triggered by the line: > > > > variables <- eval(predvars, data, env) > > > > inside model.frame.default(). At that point, na.action has not been > > applied, so poly() ended being called on data that still > contains missing > > values. The qr() that issued the error is for generating > the orthogonal > > basis when evaluating poly(), not for fitting the linear > model itself. > > > > Essentially, calling > > > > model.frame(y ~ poly(x, 2), data=data.frame(x=c(NA, 1:3), > y=rnorm(4)), > > na.action=na.omit) > > > > would show the same error. The obvious workaround is to > omit cases with NAs > > before calling lm(). > > > > Andy > > > >> From: Markus J?ntti > >> > >> I ran into a to me surprising result on running lm with an > orthogonal > >> polynomial among the predictors. > >> > >> The lm command resulted in > >> > >> Error in qr(X) : NA/NaN/Inf in foreign function call (arg 1) > >> Error during wrapup: > >> > >> despite my using a "subset" in the call to get rid of NA's. > >> > >> poly is apparently evaluated before any NA's are subsetted out > >> of the data. > >> > >> Example code (attached to this e-mail as as a script): > >> > ## generate some data > >> > n <- 50 > >> > x <- runif(n) > >> > a0 <- 10 > >> > a1 <- .5 > >> > sigma.e <- 1 > >> > y <- a0 + a1*x + rnorm(n)*sigma.e > >> > tmp.d <- data.frame(y, x) > >> > rm(list=c("n", "x", "a0", "a1", "sigma.e", "y")) > >> > > >> > print(lm.1 <- lm(y ~ poly(x, 2), data = tmp.d) > >> + > >> + ## now make a few NA's > >> + > >> + tmp.d$x[1:2] <- rep(NA, 2) > >> Error: syntax error > >> Error during wrapup: > >> > > >> > ## this fails, just as it should > >> > print(lm.1 <- lm(y ~ poly(x, 2), data = tmp.d)) > >> > >> Call: > >> lm(formula = y ~ poly(x, 2), data = tmp.d) > >> > >> Coefficients: > >> (Intercept) poly(x, 2)1 poly(x, 2)2 > >> 10.380 -0.242 -1.441 > >> > >> > > >> > ## these also fail, but should not? > >> > > >> > print(lm.2 <- lm(y ~ poly(x, 2), data = tmp.d, subset = > !is.na(x))) > >> > >> Call: > >> lm(formula = y ~ poly(x, 2), data = tmp.d, subset = !is.na(x)) > >> > >> Coefficients: > >> (Intercept) poly(x, 2)1 poly(x, 2)2 > >> 10.380 -0.242 -1.441 > >> > >> > print(lm.3 <- lm(y ~ poly(x, 2), data = tmp.d, na.action > >> na.omit)) > >> > >> Call: > >> lm(formula = y ~ poly(x, 2), data = tmp.d, na.action = na.omit) > >> > >> Coefficients: > >> (Intercept) poly(x, 2)1 poly(x, 2)2 > >> 10.380 -0.242 -1.441 > >> > >> > > >> > ## but this works > >> > > >> > print(lm.3 <- lm(y ~ poly(x, 2), data = subset(tmp.d, subset > >> !is.na(x)))) > >> > >> Call: > >> lm(formula = y ~ poly(x, 2), data = subset(tmp.d, subset = > !is.na(x))) > >> > >> Coefficients: > >> (Intercept) poly(x, 2)1 poly(x, 2)2 > >> 10.380 -0.242 -1.441 > >> > >> -------------------- > >> > >> The documentation of lm is *not* misleading at this point, > saying that > >> > >> subset an optional vector specifying a subset of > >> observations to be > >> used in the fitting process. > >> > >> which implies that data are subsetted once lm.fit is called. > >> All the same, this behavior is a little unexpected to me. > >> Is it to be considered a feature, that is, does it produce > beneficial > >> side effects which explain why it works as it does? > >> > >> Regards, > >> > >> Markus > >> > >> I am running R on a Debian testing system with kernel 2.6.10 and > >> > >> > version > >> _ > >> platform i386-pc-linux-gnu > >> arch i386 > >> os linux-gnu > >> system i386, linux-gnu > >> status > >> major 2 > >> minor 0.1 > >> year 2004 > >> month 11 > >> day 15 > >> language R > >> -- > >> Markus Jantti > >> Abo Akademi University > >> markus.jantti at iki.fi > >> http://www.iki.fi/~mjantti > >> > > > > ______________________________________________ > > R-help at stat.math.ethz.ch mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide! > http://www.R-project.org/posting-guide.html > > > > -- > Brian D. > Ripley, ripley at stats.ox.ac.uk > Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ > University of Oxford, Tel: +44 1865 272861 (self) > 1 South Parks Road, +44 1865 272866 (PA) > Oxford OX1 3TG, UK Fax: +44 1865 272595 >
John Fox
2005-Feb-18 21:27 UTC
[Rd] RE: [R] using poly in a linear regression in the presence of NAf ails (despite subsetting them out)
Dear Andy, Brian, and Markus, I've moved this to r-devel because the issue is a bit esoteric. I apologize for joining the discussion so late, but didn't have time earlier in the week to formulate these ideas. I believe that I understand and appreciate Brian's point, but think that the issue isn't entirely clear-cut. It seems to me that two kinds of questions arise with missing data. The deeper ones are statistical but there are also nontrivial mechanical issues such as the one here. Whenever a term in a model is parametrized with a data-dependent basis, there's a potential for problems and confusion -- manifested, for example, in the issue of "safe" prediction. I don't think that these problems are unique to missing data. On the other hand, the basis selected for the subspace spanned by a term shouldn't be the most important consideration. That is, when models are equivalent -- as, for example lm(y ~ x + I(x^2)) and lm(y ~ poly(x, 2)), an argument could be made for treating them similarly. Brian's point that NAs, say, in x2, can influence the basis for poly(x1, 2) is disquieting, but note that this can happen now if there are no NAs in x1. The point, therefore, doesn't really justify the current behaviour of poly(). Indeed, if there are NAs in x2 but not in x1, the columns representing poly(x1, 2) won't be orthogonal in the subset of cases used in the model fit (though they do provide a correct basis for the term). Consider another example -- lm(y ~ f, subset = f != "a"), where f is a factor with levels "a", "b", and "c", and where there are NAs in f. Here the basis for the f term is data dependent, in that the baseline level is taken as "b" in the absence of "a", yet NAs don't cause an error. Having poly(), bs(), and ns() report a more informative error message certainly is preferable to the current situation, but an alternative would be to have them work and perhaps report a warning. If the object is to protect people from stepping into traps because of missing data, then an argument could be made for having the default na.action be na.fail, as in S-PLUS. (I wouldn't advocate this, by the way.) Perhaps I'm missing some consideration here, but isn't it clearest to allow the data, subset, and na.action arguments to determine the data in which the formula is evaluated? Regards, John -------------------------------- John Fox Department of Sociology McMaster University Hamilton, Ontario Canada L8S 4M4 905-525-9140x23604 http://socserv.mcmaster.ca/jfox --------------------------------> -----Original Message----- > From: r-help-bounces@stat.math.ethz.ch > [mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of Liaw, Andy > Sent: Tuesday, February 15, 2005 8:31 AM > To: 'Prof Brian Ripley' > Cc: r-help@stat.math.ethz.ch; 'Markus Jantti' > Subject: RE: [R] using poly in a linear regression in the > presence of NAf ails (despite subsetting them out) > > My apologies: It's another case of me not thinking > statistically... It may also help those of us whose brains > run at slow clock speeds to have ?poly, ?bs and ?ns mention > how they react to NAs. > > Best, > Andy > > > > From: Prof Brian Ripley > > > > Andy, > > > > I don't think it is a bug. The problem is that poly(x, 2) > depends on > > the possible set of x values, and so needs to know all of > them, unlike > > e.g. log(x) which is observation-by-observation. Silently omitting > > missing values is not a good idea in such cases, especially if the > > values are missing in other variables (which is what na.action is > > likely to do). > > > > I would say models with poly, ns, bs etc are inadvisable in the > > presence of missing values in their argument. We could make poly() > > give an informative message, though. > > > > Brian > > > > > > On Mon, 14 Feb 2005, Liaw, Andy wrote: > > > > > This smells like a bug to me. The error is triggered by the line: > > > > > > variables <- eval(predvars, data, env) > > > > > > inside model.frame.default(). At that point, na.action > has not been > > > applied, so poly() ended being called on data that still > > contains missing > > > values. The qr() that issued the error is for generating > > the orthogonal > > > basis when evaluating poly(), not for fitting the linear > > model itself. > > > > > > Essentially, calling > > > > > > model.frame(y ~ poly(x, 2), data=data.frame(x=c(NA, 1:3), > > y=rnorm(4)), > > > na.action=na.omit) > > > > > > would show the same error. The obvious workaround is to > > omit cases with NAs > > > before calling lm(). > > > > > > Andy > > > > > >> From: Markus J?ntti > > >> > > >> I ran into a to me surprising result on running lm with an > > orthogonal > > >> polynomial among the predictors. > > >> > > >> The lm command resulted in > > >> > > >> Error in qr(X) : NA/NaN/Inf in foreign function call > (arg 1) Error > > >> during wrapup: > > >> > > >> despite my using a "subset" in the call to get rid of NA's. > > >> > > >> poly is apparently evaluated before any NA's are > subsetted out of > > >> the data. > > >> > > >> Example code (attached to this e-mail as as a script): > > >> > ## generate some data > > >> > n <- 50 > > >> > x <- runif(n) > > >> > a0 <- 10 > > >> > a1 <- .5 > > >> > sigma.e <- 1 > > >> > y <- a0 + a1*x + rnorm(n)*sigma.e tmp.d <- data.frame(y, x) > > >> > rm(list=c("n", "x", "a0", "a1", "sigma.e", "y")) > > >> > > > >> > print(lm.1 <- lm(y ~ poly(x, 2), data = tmp.d) > > >> + > > >> + ## now make a few NA's > > >> + > > >> + tmp.d$x[1:2] <- rep(NA, 2) > > >> Error: syntax error > > >> Error during wrapup: > > >> > > > >> > ## this fails, just as it should > > >> > print(lm.1 <- lm(y ~ poly(x, 2), data = tmp.d)) > > >> > > >> Call: > > >> lm(formula = y ~ poly(x, 2), data = tmp.d) > > >> > > >> Coefficients: > > >> (Intercept) poly(x, 2)1 poly(x, 2)2 > > >> 10.380 -0.242 -1.441 > > >> > > >> > > > >> > ## these also fail, but should not? > > >> > > > >> > print(lm.2 <- lm(y ~ poly(x, 2), data = tmp.d, subset > > !is.na(x))) > > >> > > >> Call: > > >> lm(formula = y ~ poly(x, 2), data = tmp.d, subset = !is.na(x)) > > >> > > >> Coefficients: > > >> (Intercept) poly(x, 2)1 poly(x, 2)2 > > >> 10.380 -0.242 -1.441 > > >> > > >> > print(lm.3 <- lm(y ~ poly(x, 2), data = tmp.d, na.action > > >> na.omit)) > > >> > > >> Call: > > >> lm(formula = y ~ poly(x, 2), data = tmp.d, na.action = na.omit) > > >> > > >> Coefficients: > > >> (Intercept) poly(x, 2)1 poly(x, 2)2 > > >> 10.380 -0.242 -1.441 > > >> > > >> > > > >> > ## but this works > > >> > > > >> > print(lm.3 <- lm(y ~ poly(x, 2), data = subset(tmp.d, subset > > >> !is.na(x)))) > > >> > > >> Call: > > >> lm(formula = y ~ poly(x, 2), data = subset(tmp.d, subset > > !is.na(x))) > > >> > > >> Coefficients: > > >> (Intercept) poly(x, 2)1 poly(x, 2)2 > > >> 10.380 -0.242 -1.441 > > >> > > >> -------------------- > > >> > > >> The documentation of lm is *not* misleading at this point, > > saying that > > >> > > >> subset an optional vector specifying a subset of > > >> observations to be > > >> used in the fitting process. > > >> > > >> which implies that data are subsetted once lm.fit is called. > > >> All the same, this behavior is a little unexpected to me. > > >> Is it to be considered a feature, that is, does it produce > > beneficial > > >> side effects which explain why it works as it does? > > >> > > >> Regards, > > >> > > >> Markus > > >> > > >> I am running R on a Debian testing system with kernel 2.6.10 and > > >> > > >> > version > > >> _ > > >> platform i386-pc-linux-gnu > > >> arch i386 > > >> os linux-gnu > > >> system i386, linux-gnu > > >> status > > >> major 2 > > >> minor 0.1 > > >> year 2004 > > >> month 11 > > >> day 15 > > >> language R > > >> -- > > >> Markus Jantti > > >> Abo Akademi University > > >> markus.jantti@iki.fi > > >> http://www.iki.fi/~mjantti > > >> > > > > > > ______________________________________________ > > > R-help@stat.math.ethz.ch mailing list > > > https://stat.ethz.ch/mailman/listinfo/r-help > > > PLEASE do read the posting guide! > > http://www.R-project.org/posting-guide.html > > > > > > > -- > > Brian D. > > Ripley, ripley@stats.ox.ac.uk > > Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ > > University of Oxford, Tel: +44 1865 272861 (self) > > 1 South Parks Road, +44 1865 272866 (PA) > > Oxford OX1 3TG, UK Fax: +44 1865 272595 > > > > ______________________________________________ > R-help@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! > http://www.R-project.org/posting-guide.html
Possibly Parallel Threads
- using poly in a linear regression in the presence of NA f ails (despite subsetting them out)
- using poly in a linear regression in the presence of NA fails (despite subsetting them out)
- poly() in lm() leads to wrong coefficients (but correct residuals)
- editing Sweave files in xemacs with ess (noweb), auctex and reftex
- Bug in poly() (PR#11243)