On Mon, 2007-02-19 at 09:58 -0500, Pierre Lapointe wrote:
> Hello,
>
> I have a particular situation where a single "wrong" observation is
> impacting the results of a traditional regression to the point that
> betas become unreliable. I need a way to calculate the most likely
> betas. Here's an example:
>
> set.seed(1)
> unknownbeta <- matrix(seq(100,500,100),25,5,byrow=TRUE)
> x <-matrix(runif(25*5),25)
> y <- rowSums(unknownbeta*x)
> summary(lm(y~0+x)) #gets back the unknown betas.
>
> #Now, let's introduce a single wrong data.
>
> unknownbeta[25,5] <-100
> y <- rowSums(unknownbeta*x)
> summary(lm(y~0+x)) #every beta changes.
>
> I need to find out what are the most likely betas in the second
> example. There is no obvious way to know that row 25 has wrong input.
> I would even be happy if the conclusion was that x1:x4 are 100, 200,
> 300 and 400 and that x5 is zero.
>
> Thanks

It is not clear what you mean by a "wrong" observation. Are the data
completely bad because they were improperly collected? Is this an
observation that has correct data, but is an "outlier" relative to the
other observations? Is the observation missing data, where values can be
reasonably imputed?

Are you in a setting where the observation MUST be included in the
regression rather than be deleted? For example, an "Intent to Treat"
analysis in a clinical trial?

Depending upon the context, your options may range from simply removing
the single observation from the regression, to some form of weighting
of the observations, to a robust regression methodology, among others.
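
If simple deletion is acceptable in your setting, one way to look for a
candidate row is to refit with each observation left out in turn. Below
is a minimal sketch in base R, assuming your example data and that at
most one row is bad. The names loo_rss, suspect, keep and fit are only
illustrative.

```r
# Hypothetical sketch: leave-one-out refits on the example data from the question.
set.seed(1)
unknownbeta <- matrix(seq(100, 500, 100), 25, 5, byrow = TRUE)
x <- matrix(runif(25 * 5), 25)
unknownbeta[25, 5] <- 100               # the single "wrong" row
y <- rowSums(unknownbeta * x)

# Residual sum of squares of the no-intercept fit with row i left out.
loo_rss <- sapply(seq_len(nrow(x)), function(i) {
  sum(lm.fit(x[-i, , drop = FALSE], y[-i])$residuals^2)
})

# Row whose removal gives the smallest residual sum of squares.
suspect <- which.min(loo_rss)

# Refit without the suspect row (assumes deletion is acceptable).
keep <- seq_len(nrow(x)) != suspect
fit <- lm(y ~ 0 + x, subset = keep)
coef(fit)
```

On your noise-free example this should single out row 25, because the
other 24 rows are fit exactly once it is dropped. With real, noisy data
the contrast will be far less clear-cut, and robust fits such as rlm()
or lqs() in the MASS package may be a better starting point.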

This is not strictly an R question, but one of methodology, and the
answer may well depend upon "community" standards and prior work within
your particular discipline.

HTH,
Marc Schwartz