thr3ads.net - R help - [R] Linear regression with a rounded response variable [Oct 2015]


Ravi Varadhan

2015-Oct-21 14:53 UTC

[R] Linear regression with a rounded response variable

Hi,
I am dealing with a regression problem where the response variable, time
(seconds) to walk 15 ft, is rounded to the nearest integer.  I do not care about
the regression coefficients per se; my main interest is in getting the
prediction equation for walking speed, given the predictors (age, height, sex,
etc.), where the predictions will be real numbers, not integers.  The hope
is that these predictions will provide unbiased estimates of the
"unrounded" walking speed. This sounds like a measurement error
problem, where the measurement error is due to rounding and hence would be
uniformly distributed on (-0.5, 0.5).

Are there any canonical approaches for handling this type of a problem? What is
wrong with just doing the standard linear regression?

I googled and saw that this question was asked by someone else in a
stackexchange post, but it was unanswered.  Any suggestions?

Thank you,
Ravi

Ravi Varadhan, Ph.D. (Biostatistics), Ph.D. (Environmental Engg)
Associate Professor,  Department of Oncology
Division of Biostatistics & Bioinformatics
Sidney Kimmel Comprehensive Cancer Center
Johns Hopkins University
550 N. Broadway, Suite 1111-E
Baltimore, MD 21205
410-502-2619



Victor Tian

2015-Oct-21 16:21 UTC


[R] Linear regression with a rounded response variable

Hi Ravi,

Thanks for this interesting question. My thoughts are given below.

If you believe the rounding error is indeed uniformly distributed, then the
problem is equivalent to adding a uniform random error on (-0.5, 0.5) to
every observation in addition to the usual normal error, which makes the
new error term the sum of a normal and a uniform variable (a convolution
rather than a normal distribution).

Intuitively, the impact of this newly added term depends on the relative
scale of the original normal and the new uniform error terms. To see the
exact impact, you can simulate sets of new response variables by adding
uniform errors from (-0.5, 0.5) to the original response variables and see
the results.
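
A minimal sketch of such a simulation, under assumed values: the sample size, coefficients, and error standard deviation below are illustrative and do not come from the thread. It rounds a simulated response to the nearest integer (rather than adding uniform noise) and compares the least-squares fit with the fit on the unrounded values.

```r
set.seed(1)
n  <- 5000
x  <- rnorm(n)
y0 <- 5 + 0.5 * x + rnorm(n, sd = 1)   # unobserved, unrounded response (assumed model)
y  <- round(y0)                         # what is actually recorded

fit_true    <- lm(y0 ~ x)
fit_rounded <- lm(y ~ x)

# Coefficients side by side
coef_cmp <- rbind(unrounded = coef(fit_true), rounded = coef(fit_rounded))
print(coef_cmp)

# Residual standard deviations: rounding inflates the error variance
sd_cmp <- c(unrounded = sigma(fit_true), rounded = sigma(fit_rounded))
print(sd_cmp)
```

With an error scale this large relative to the rounding unit, the two sets of coefficients should be close and the main visible effect is a somewhat larger residual standard deviation. Shrinking `sd` makes the rounding matter more.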

I wish I could have more theoretical answers and hope this helps as well.

Best,
Xu

Xu Tian, Ph.D.
Senior Statistician
Validus Research
New York, NY 10005


Charles C. Berry

2015-Oct-21 17:57 UTC


[R] Linear regression with a rounded response variable

On Wed, 21 Oct 2015, Ravi Varadhan wrote:
> Hi, I am dealing with a regression problem where the response variable,
> time (second) to walk 15 ft, is rounded to the nearest integer.  I do
> not care for the regression coefficients per se, but my main interest is
> in getting the prediction equation for walking speed, given the
> predictors (age, height, sex, etc.), where the predictions will be real
> numbers, and not integers.  The hope is that these predictions should
> provide unbiased estimates of the "unrounded" walking speed. These
> sounds like a measurement error problem, where the measurement error is
> due to rounding and hence would be uniformly distributed (-0.5, 0.5).

Not the usual "measurement error model" problem, though, where the errors
are in X and not independent of XB.

Look back at the proof of the unbiasedness of least squares under the 
Gauss-Markov setup. The errors in Y need to have expectation zero.
From your description (but see caveat below) this is true of walking *time*,
but not exactly true of walking *speed* (modulo the usual assumptions, if
they apply to time). In fact, if E(epsilon) = 0 were true of unrounded time,
it would not be true of unrounded speed (and vice versa).

> Are there any canonical approaches for handling this type of a problem?

Work out the bias analytically? Parametric bootstrap? Data augmentation
and friends?

> What is wrong with just doing the standard linear regression?

Well, what do the actual values look like?

If half the subjects have a value of 5 seconds and the rest are split 
between 4 and 6, your assertion that rounding induces an error of 
dunif(epsilon,-0.5,0.5) is surely wrong (more positive errors in the 6 
second group and more negative errors in the 4 second group under any 
plausible model).
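
A quick way to see this, as a sketch under an assumed distribution of true times (normal with mean 5 and sd 0.6, chosen only for illustration), with the rounding error defined as recorded minus true value:

```r
set.seed(3)
y0  <- rnorm(1e5, mean = 5, sd = 0.6)   # assumed true times
y   <- round(y0)                         # recorded times
err <- y - y0                            # rounding error: recorded minus true

# Average rounding error within each recorded value
err_by_group <- tapply(err, y, mean)
print(round(err_by_group, 3))
```

The rounding error averages out overall, but within the 4-second and 6-second groups it is systematically negative and positive respectively, so it is not independent of the recorded value.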


HTH,

Chuck

Gabor Grothendieck

2015-Oct-21 20:11 UTC


[R] Linear regression with a rounded response variable

This could be modeled directly using Bayesian techniques. Consider the
Bayesian version of the following model where we only observe y and X.  y0
is not observed.

   y0 <- X b + error
   y <- round(y0)

The following code is based on modifying the code in the README of the CRAN
rcppbugs R package.


library(rcppbugs)
set.seed(123)

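# set up the test data - y and X are observed but not y0
# (NR is defined but not used below: X has 10 rows, and y0 carries no
# error term, so rounding is the only source of noise in y)
NR <- 1e2L
NC <- 2L
X <- cbind(1, rnorm(10))
y0 <- X %*% 1:2
y <- round(y0)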

# for comparison run a normal linear model w/ lm.fit using X and y
lm.res <- lm.fit(X,y)
print(coef(lm.res))
##        x1        x2
## 0.9569366 1.9170808

# RCppBugs Model
b <- mcmc.normal(rnorm(NC),mu=0,tau=0.0001)
tau.y <- mcmc.gamma(sd(as.vector(y)),alpha=0.1,beta=0.1)
y.hat <- deterministic(function(X,b) { round(X %*% b) }, X, b)
y.lik <- mcmc.normal(y,mu=y.hat,tau=tau.y,observed=TRUE)
m <- create.model(b, tau.y, y.hat, y.lik)

# run the Bayesian model based on y and X
cat("running model...\n")
runtime <- system.time(ans <- run.model(m, iterations=1e5L, burn=1e4L,
adapt=1e3L, thin=10L))
print(apply(ans[["b"]],2,mean))
## [1] 0.9882485 2.0009989




-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


peter salzman

2015-Oct-21 20:15 UTC


[R] Linear regression with a rounded response variable

here is one thought:

if you plug your numbers into any kind of regression you will get
predictions that are real numbers and not necessarily integers. it may be
that your predictions are good enough with this approximate value of Y. you
could test this by randomly perturbing your data by up to +- 0.5 and
comparing the results with the original result.

let me add another idea:

if data are not fully observed, this falls under the umbrella of censored
data; in this case you have interval censoring. if you see 5 then the
observation is in the interval [4.5, 5.5].
i'm not familiar with the field but i'd search for 'regression with
interval censoring'
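
One possible way to fit such a model in R is `survreg()` from the survival package with a Gaussian error distribution and `Surv(..., type = "interval2")`. The sketch below uses simulated data with assumed coefficients and error sd, none of which come from the thread.

```r
library(survival)

set.seed(4)
n  <- 2000
x  <- rnorm(n)
y0 <- 5 + 0.5 * x + rnorm(n, sd = 0.4)  # unobserved times (assumed model)
y  <- round(y0)                          # recorded to the nearest second

# Each recorded value y is treated as the interval [y - 0.5, y + 0.5]
fit_ic <- survreg(Surv(y - 0.5, y + 0.5, type = "interval2") ~ x,
                  dist = "gaussian")
fit_ls <- lm(y ~ x)                      # ordinary least squares on rounded y

print(rbind(interval_censored = coef(fit_ic), least_squares = coef(fit_ls)))
print(c(sigma_ic = fit_ic$scale, sigma_ls = sigma(fit_ls)))
```

The interval-censored fit also estimates the error sd of the unrounded response, whereas the least-squares residual sd absorbs the rounding noise.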


peter




-- 
Peter Salzman, PhD
Department of Biostatistics and Computational Biology
University of Rochester


Jim Lemon

2015-Oct-21 20:25 UTC


[R] Linear regression with a rounded response variable

Hi Ravi,
And remember that the vanilla rounding procedure is biased upward. That is,
an observation of 5 actually may have ranged from 4.5 to 5.4.

Jim


peter dalgaard

2015-Oct-22 00:11 UTC


[R] Linear regression with a rounded response variable

> On 21 Oct 2015, at 19:57 , Charles C. Berry <ccberry at ucsd.edu> wrote:
>
> On Wed, 21 Oct 2015, Ravi Varadhan wrote:
>
>> [snippage]
>
> If half the subjects have a value of 5 seconds and the rest are split
> between 4 and 6, your assertion that rounding induces an error of
> dunif(epsilon,-0.5,0.5) is surely wrong (more positive errors in the 6 second
> group and more negative errors in the 4 second group under any plausible model).

Yes, and I think that the suggestion in another post to look at censored
regression is more in the right direction.

In general, I'd expect the bias caused by rounding the response to be quite
small, except at very high granularity. I did a few small experiments with the
simplest possible linear model: estimating a mean based on highly rounded data,

> y <- round(rnorm(1e2,pi,.5))
> mean(y)
[1] 3.12
> table(y)
y
 2  3  4  5 
13 63 23  1 

Or, using a bigger sample:

> mean(round(rnorm(1e8,pi,.5)))
[1] 3.139843

in which there is a visible bias, but quite a small one:

> pi - 3.139843
[1] 0.001749654

At lower granularity (sd=1 instead of .5), the bias has almost disappeared.

> mean(round(rnorm(1e8,pi,1)))
[1] 3.141577

If the granularity is increased sufficiently, you _will_ see a sizeable bias
(because almost all observations will be round(pi)==3):

> mean(round(rnorm(1e8,pi,.1)))
[1] 3.00017
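
The same biases can also be worked out without simulation, since the expectation of a rounded normal variable is a sum over integers k of k times the probability of landing in [k - 0.5, k + 0.5). A sketch follows; the helper name `expected_rounded_mean` and its `half_width` truncation are illustrative choices, not something from the thread.

```r
# Exact expectation of round(Y) for Y ~ N(mu, sigma):
# sum over integers k of k * P(k - 0.5 < Y < k + 0.5),
# truncated to integers within half_width of mu
expected_rounded_mean <- function(mu, sigma, half_width = 50) {
  k <- seq(floor(mu) - half_width, ceiling(mu) + half_width)
  sum(k * (pnorm(k + 0.5, mu, sigma) - pnorm(k - 0.5, mu, sigma)))
}

sds  <- c(1, 0.5, 0.1)
bias <- sapply(sds, function(s) expected_rounded_mean(pi, s) - pi)
print(data.frame(sd = sds, bias = bias))
```

The computed values should agree with the simulated means above up to Monte Carlo error.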


A full ML fit (with known sigma=.5) is pretty easily done:

> library(stats4)
> mll <- function(mu) -sum(log(pnorm(y+.5, mu, .5) - pnorm(y-.5, mu, .5)))
> mle(mll, start=list(mu=3))

Call:
mle(minuslogl = mll, start = list(mu = 3))

Coefficients:
      mu 
3.122069 
> mean(y)
[1] 3.12

As you see, the difference is only 0.002. 

A small simulation (1000 repl.) gave (r[1,]==MLE ; r[2,]==mean)

> summary(r[1,]-r[2,])
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.004155  0.000702  0.001495  0.001671  0.002554  0.006860 

so the corrections relative to the crude mean stay within one unit in the 2nd
place. Notice  that the corrections are pretty darn close to cancelling out the
bias.
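
The simulation code was not posted. One way it might be set up is sketched below, assuming samples of size 100 from N(pi, 0.5) as in the earlier example, and using `optimize()` in place of `mle()` for speed. This is a reconstruction under those assumptions, not the original code, and only the name `r` is borrowed from the post.

```r
set.seed(5)
sigma <- 0.5  # treated as known

one_rep <- function(n = 100) {
  y <- round(rnorm(n, pi, sigma))
  # negative log-likelihood of the rounded (interval-censored) observations
  nll <- function(mu) {
    -sum(log(pnorm(y + 0.5, mu, sigma) - pnorm(y - 0.5, mu, sigma)))
  }
  c(mle  = optimize(nll, interval = c(2, 4), tol = 1e-8)$minimum,
    mean = mean(y))
}

r <- replicate(1000, one_rep())   # 2 x 1000 matrix: row 1 = MLE, row 2 = mean
d <- r[1, ] - r[2, ]              # correction applied by the MLE
print(summary(d))
```

The average of `d` can then be compared with the bias of the crude mean found in the large-sample experiment.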

-pd
-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
