Ravi Varadhan
2015-Oct-21 14:53 UTC
[R] Linear regression with a rounded response variable
Hi,

I am dealing with a regression problem where the response variable, time (seconds) to walk 15 ft, is rounded to the nearest integer. I do not care for the regression coefficients per se; my main interest is in getting the prediction equation for walking speed, given the predictors (age, height, sex, etc.), where the predictions will be real numbers, not integers. The hope is that these predictions will provide unbiased estimates of the "unrounded" walking speed. This sounds like a measurement error problem, where the measurement error is due to rounding and hence would be uniformly distributed on (-0.5, 0.5).

Are there any canonical approaches for handling this type of problem? What is wrong with just doing the standard linear regression?

I googled and saw that this question was asked by someone else in a stackexchange post, but it was unanswered. Any suggestions?

Thank you,
Ravi

Ravi Varadhan, Ph.D. (Biostatistics), Ph.D. (Environmental Engg)
Associate Professor, Department of Oncology
Division of Biostatistics & Bioinformatics
Sidney Kimmel Comprehensive Cancer Center
Johns Hopkins University
550 N. Broadway, Suite 1111-E
Baltimore, MD 21205
410-502-2619
Hi Ravi,

Thanks for this interesting question. My thoughts are given below.

If you believe the rounding error is indeed uniformly distributed, then the problem is equivalent to adding a uniform random error on (-0.5, 0.5) to every observation, in addition to the usual normal error. The new error term is then the sum of a normal and a uniform error and is no longer normally distributed. Intuitively, the impact of this newly added term depends on the relative scale of the original normal and the new uniform error terms.

To see the exact impact, you can simulate sets of new response variables by adding uniform errors from (-0.5, 0.5) to the original response variables, refit the model, and compare the results.

I wish I had a more theoretical answer, but I hope this helps.

Best,
Xu

Xu Tian, Ph.D.
Senior Statistician
Validus Research
New York, NY 10005
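A minimal sketch of the kind of simulation suggested above, in base R. The predictor (`age`), the coefficients, and the error SD are made-up values for illustration only; nothing here comes from the actual walking-time data. It fits `lm()` to the same simulated times before and after rounding and compares the coefficients.

```r
set.seed(1)
n   <- 5000
age <- runif(n, 60, 90)                   # hypothetical predictor
y0  <- 2 + 0.05 * age + rnorm(n, sd = 1)  # hypothetical unrounded time (seconds)
y   <- round(y0)                          # what would actually be recorded

fit0 <- lm(y0 ~ age)  # fit to the unrounded response
fit1 <- lm(y  ~ age)  # fit to the rounded response

# change in the coefficients attributable to rounding the response
coef_diff <- coef(fit1) - coef(fit0)
print(rbind(unrounded = coef(fit0), rounded = coef(fit1)))
print(coef_diff)
```

Shrinking the error SD relative to the rounding unit (for example `sd = 0.2`) is the case where one would expect rounding to matter more.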
Charles C. Berry
2015-Oct-21 17:57 UTC
[R] Linear regression with a rounded response variable
On Wed, 21 Oct 2015, Ravi Varadhan wrote:

> Hi, I am dealing with a regression problem where the response variable,
> time (second) to walk 15 ft, is rounded to the nearest integer. I do
> not care for the regression coefficients per se, but my main interest is
> in getting the prediction equation for walking speed, given the
> predictors (age, height, sex, etc.), where the predictions will be real
> numbers, and not integers. The hope is that these predictions should
> provide unbiased estimates of the "unrounded" walking speed. These
> sounds like a measurement error problem, where the measurement error is
> due to rounding and hence would be uniformly distributed (-0.5, 0.5).

Not the usual "measurement error model" problem, though, where the errors are in X and not independent of XB.

Look back at the proof of the unbiasedness of least squares under the Gauss-Markov setup. The errors in Y need to have expectation zero.

From your description (but see caveat below) this is true of walking *time*, but not exactly true of walking *speed* (modulo the usual assumptions if they apply to time). In fact if E(epsilon) = 0 were true of unrounded time, it would not be true of unrounded speed (and vice versa).

> Are there any canonical approaches for handling this type of a problem?

Work out the bias analytically? Parametric bootstrap? Data augmentation and friends?

> What is wrong with just doing the standard linear regression?

Well, what do the actual values look like?

If half the subjects have a value of 5 seconds and the rest are split between 4 and 6, your assertion that rounding induces an error of dunif(epsilon,-0.5,0.5) is surely wrong (more positive errors in the 6 second group and more negative errors in the 4 second group under any plausible model).

HTH,

Chuck
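The two points above can be illustrated with a small simulation. This is only a sketch under assumed values: true times are drawn from a normal distribution with mean 5 and SD 0.5 (chosen so that most recorded values are 4, 5, or 6), and the rounding error is defined as recorded minus true time.

```r
set.seed(2)
y0  <- rnorm(1e5, mean = 5, sd = 0.5)  # hypothetical true times (seconds)
y   <- round(y0)                       # recorded times
err <- y - y0                          # rounding error: recorded minus true

# mean rounding error within each recorded value: not zero away from the centre
err_by_value <- tapply(err, y, mean)
print(err_by_value)

# averaged over all subjects the error is close to zero
print(mean(err))

# time versus speed over 15 ft: the mean of the reciprocal is not the
# reciprocal of the mean, so zero-mean errors in time do not carry over to speed
speed_gap <- mean(15 / y0) - 15 / mean(y0)
print(speed_gap)
```

Under these assumptions the average error is negative in the 4 second group and positive in the 6 second group, so the error is not uniform on (-0.5, 0.5) within a recorded value.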
Gabor Grothendieck
2015-Oct-21 20:11 UTC
[R] Linear regression with a rounded response variable
This could be modeled directly using Bayesian techniques. Consider the Bayesian version of the following model where we only observe y and X. y0 is not observed.

y0 <- X b + error
y <- round(y0)

The following code is based on modifying the code in the README of the CRAN rcppbugs R package.

library(rcppbugs)
set.seed(123)

# set up the test data - y and X are observed but not y0
# (note: NR is defined but not used below; X has 10 rows and
# no error term is added to y0 in this test data)
NR <- 1e2L
NC <- 2L
X <- cbind(1, rnorm(10))
y0 <- X %*% 1:2
y <- round(y0)

# for comparison run a normal linear model w/ lm.fit using X and y
lm.res <- lm.fit(X,y)
print(coef(lm.res))
## x1 x2
## 0.9569366 1.9170808

# RCppBugs Model
b <- mcmc.normal(rnorm(NC),mu=0,tau=0.0001)
tau.y <- mcmc.gamma(sd(as.vector(y)),alpha=0.1,beta=0.1)
y.hat <- deterministic(function(X,b) { round(X %*% b) }, X, b)
y.lik <- mcmc.normal(y,mu=y.hat,tau=tau.y,observed=TRUE)
m <- create.model(b, tau.y, y.hat, y.lik)

# run the Bayesian model based on y and X
cat("running model...\n")
runtime <- system.time(ans <- run.model(m, iterations=1e5L, burn=1e4L, adapt=1e3L, thin=10L))
print(apply(ans[["b"]],2,mean))
## [1] 0.9882485 2.0009989

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
peter salzman
2015-Oct-21 20:15 UTC
[R] Linear regression with a rounded response variable
Here is one thought:

If you plug your numbers into any kind of regression you will get predictions that are real numbers and not necessarily integers, so it may be that your predictions are good enough with this approximate value of Y. You could test this by randomly perturbing your data by up to +-0.5 and comparing the results with the original result.

Let me add another idea:

If data are not fully observed, this falls under the umbrella of censored data; in this case you have interval censoring. If you see 5, then the observation is in the interval [4.5, 5.5]. I'm not familiar with the field, but I'd search for 'regression with interval censoring'.

peter

--
Peter Salzman, PhD
Department of Biostatistics and Computational Biology
University of Rochester
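One way the interval-censoring idea could be tried in R is with `survreg()` from the survival package, using `Surv(lower, upper, type = "interval2")` and a Gaussian error distribution. This is a sketch on simulated data; the predictor `age`, the coefficients, and the error SD are assumptions made up for illustration.

```r
library(survival)  # recommended package shipped with standard R installations

set.seed(3)
n   <- 2000
age <- runif(n, 60, 90)                     # hypothetical predictor
y0  <- 2 + 0.05 * age + rnorm(n, sd = 0.5)  # hypothetical unrounded times
y   <- round(y0)                            # recorded times

# each recorded value y is only known to lie in [y - 0.5, y + 0.5]
d <- data.frame(y = y, age = age, lo = y - 0.5, hi = y + 0.5)

# Gaussian linear model fitted to the interval-censored response
fit_ic <- survreg(Surv(lo, hi, type = "interval2") ~ age,
                  data = d, dist = "gaussian")

# ordinary least squares on the rounded values, for comparison
fit_ls <- lm(y ~ age, data = d)

print(coef(fit_ic))
print(coef(fit_ls))
print(fit_ic$scale)  # estimated SD of the error on the unrounded scale
```

Predictions from either fit are real-valued; the interval-censored fit additionally estimates the error SD on the unrounded scale.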
Hi Ravi,

And remember that the vanilla rounding procedure is biased upward. That is, an observation of 5 actually may have ranged from 4.5 to 5.4.

Jim
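Whether ties are rounded upward depends on the rounding rule, and the thread does not say how the walking times were rounded when they were recorded. As a small illustration of the difference, R's own `round()` sends exact halves to the even integer, whereas the "always round halves up" rule can be written as `floor(x + 0.5)`:

```r
x <- c(0.5, 1.5, 2.5, 3.5, 4.5)

half_even <- round(x)        # R's round(): ties go to the even integer
half_up   <- floor(x + 0.5)  # schoolbook rule: ties always go up

print(half_even)  # 0 2 2 4 4
print(half_up)    # 1 2 3 4 5
```

The two rules differ only at exact ties, so the size of any upward bias depends on how often a time falls exactly on a half.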
peter dalgaard
2015-Oct-22 00:11 UTC
[R] Linear regression with a rounded response variable
> On 21 Oct 2015, at 19:57 , Charles C. Berry <ccberry at ucsd.edu> wrote:
>
> On Wed, 21 Oct 2015, Ravi Varadhan wrote:
>
>> [snippage]
>
> If half the subjects have a value of 5 seconds and the rest are split between 4 and 6, your assertion that rounding induces an error of dunif(epsilon,-0.5,0.5) is surely wrong (more positive errors in the 6 second group and more negative errors in the 4 second group under any plausible model).

Yes, and I think that the suggestion in another post to look at censored regression is more in the right direction.

In general, I'd expect the bias caused by rounding the response to be quite small, except at very high granularity. I did a few small experiments with the simplest possible linear model: estimating a mean based on highly rounded data,

> y <- round(rnorm(1e2,pi,.5))
> mean(y)
[1] 3.12
> table(y)
y
 2  3  4  5
13 63 23  1

Or, using a bigger sample:

> mean(round(rnorm(1e8,pi,.5)))
[1] 3.139843

in which there is a visible bias, but quite a small one:

> pi - 3.139843
[1] 0.001749654

At lower granularity (sd=1 instead of .5), the bias has almost disappeared.

> mean(round(rnorm(1e8,pi,1)))
[1] 3.141577

If the granularity is increased sufficiently, you _will_ see a sizeable bias (because almost all observations will be round(pi)==3):

> mean(round(rnorm(1e8,pi,.1)))
[1] 3.00017

A full ML fit (with known sigma=.5) is pretty easily done:

> library(stats4)
> mll <- function(mu)-sum(log(pnorm(y+.5,mu, .5)-pnorm(y-.5, mu, .5)))
> mle(mll,start=list(mu=3))

Call:
mle(minuslogl = mll, start = list(mu = 3))

Coefficients:
      mu
3.122069

> mean(y)
[1] 3.12

As you see, the difference is only 0.002.

A small simulation (1000 repl.) gave (r[1,]==MLE ; r[2,]==mean)

> summary(r[1,]-r[2,])
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-0.004155  0.000702  0.001495  0.001671  0.002554  0.006860

so the corrections relative to the crude mean stay within one unit in the 2nd place. Notice that the corrections are pretty darn close to cancelling out the bias.
-pd

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
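The code behind the 1000-replicate simulation is not shown in the post. One way it could be reproduced is sketched below; the helper `mle_rounded()` is a hypothetical name, and `optimize()` is swapped in for `stats4::mle()` purely to keep the loop short. As in the post, `r[1,]` holds the MLE and `r[2,]` the crude mean, with samples of 100 drawn from a normal with mean pi and known sigma 0.5.

```r
# interval-likelihood MLE of mu for data rounded to integers, sigma known
# (hypothetical helper; same likelihood as mll() in the post)
mle_rounded <- function(y, sigma) {
  nll <- function(mu) {
    -sum(log(pnorm(y + 0.5, mu, sigma) - pnorm(y - 0.5, mu, sigma)))
  }
  optimize(nll, interval = range(y) + c(-0.5, 0.5), tol = 1e-8)$minimum
}

set.seed(4)
r <- replicate(1000, {
  y <- round(rnorm(100, pi, 0.5))
  c(mle_rounded(y, sigma = 0.5), mean(y))  # row 1: MLE, row 2: crude mean
})

d <- r[1, ] - r[2, ]     # correction of the MLE relative to the crude mean
print(summary(d))
print(rowMeans(r) - pi)  # average error of the MLE and of the crude mean
```

Different seeds will give different summaries, so the numbers need not match those quoted in the post.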