On 21 May 2015, at 12:49 , Martin Maechler <maechler at lynne.stat.math.ethz.ch> wrote:

>>>>>> peter dalgaard <pdalgd at gmail.com>
>>>>>>     on Thu, 21 May 2015 11:03:05 +0200 writes:

>> On 21 May 2015, at 10:35 , Martin Maechler <maechler at lynne.stat.math.ethz.ch> wrote:
>>>
>>>> I noticed that the 3.2.1 release cycle is about to start. Is there any
>>>> chance that this fix will make it into the next version of R?
>>>>
>>>> This bug is fairly serious: getting the wrong variance estimate leads to
>>>> the wrong log-likelihood and the wrong AIC, BIC, etc., which can and does
>>>> lead to suboptimal model selection. If it's not fixed, this issue will
>>>> affect every student taking our time series course in Fall 2015 (and
>>>> probably lots of other students in other time series courses). When I
>>>> taught time series in Spring 2015, I had to teach students how to work
>>>> around the bug, which wasted class time and shook student confidence in R.
>>>> It'd be great if we didn't have to deal with this issue next semester.
>>>>
>>>> Again, the fix is trivial:
>>>>
>>>> --- a/src/library/stats/R/arima.R
>>>> +++ b/src/library/stats/R/arima.R
>>>> @@ -211,8 +211,10 @@ arima <- function(x, order = c(0L, 0L, 0L),
>>>>          if(fit$rank == 0L) {
>>>>              ## Degenerate model. Proceed anyway so as not to break old code
>>>>              fit <- lm(x ~ xreg - 1, na.action = na.omit)
>>>> +            n.used <- sum(!is.na(resid(fit))) - length(Delta)
>>>> +        } else {
>>>> +            n.used <- sum(!is.na(resid(fit)))
>>>>          }
>>>> -        n.used <- sum(!is.na(resid(fit))) - length(Delta)
>>>>          init0 <- c(init0, coef(fit))
>>>>          ses <- summary(fit)$coefficients[, 2L]
>>>>          parscale <- c(parscale, 10 * ses)
>>>>
>>>
>>> Yes, such a change *is* small in the source code.
>>> But we have to be sure about its desirability.
>>>
>>> In another post about this you mention "REML", and I think we
>>> really are discussing whether variance estimates should use a
>>> denominator of 'n' or 'n - p' in this case.
>>>
>>>> The patch that introduced the bug (
>>>> https://github.com/wch/r-source/commit/32f633885a903bc422537dc426644f743cc645e0
>>>> ) was designed to change the initialization for the optimization routine.
>>>
>>>> The proposed fix leaves the deliberate part of the patch unchanged (it
>>>> preserves the value of "init0").
>>>
>>> I can confirm this... a change introduced in R 3.0.2.
>>>
>>> I'm about to commit changes ... after also adding a proper
>>> regression test.
>>>

>> Be careful here!  I was just about to say that the diagnosis is dubious, and that the patch could very well be wrong!!

>> AFAICT, the issue is that n.used got changed from being based on lm(x~...) to lm(dx~...) where dx is the differenced series. Now that surely loses one observation in arima(.,1,.), most likely unintentionally, but it is not at all clear that the fix is not to subtract length(Delta) -- that code has been there long before the changes in 3.0.2.

> well... yes, but as you say, in the case of the original lm() fit
> the resulting residuals, and hence is.na(resid(.)), would have
> been longer....

>> I'd expect that a safer fix would be to add back the orders of the two differencing operations.

> What I did check before replying is that the patch *does* revert to 'R <= 3.0.1'
> behavior for simple 'xreg' cases.
>
> I do see changes in the S.E.s of the regression coefficients, as
> they are expected.
>
> The few cases I've looked at were all giving results compatible
> with R <= 3.0.1 (or triggered the bug which was fixed in R 3.0.2),
> but I am happy for other examples where the
> degrees of freedom should be computed differently, e.g., by
> taking into account the differencing orders as you suggest.
>
> Seeing how relatively easy it still is to get the internal call
> to optim() to produce an error, I do wonder if there are such
> extensively tested arima(*, xreg = .) examples.
>
> If we do not get more suggestions here, I'd like to commit to
> R-devel only.
> This would still not mean that this is going to
> be in R 3.2.1 ... though it would be nice if others confirmed or
> helped with more references.

Hmm: Delta comes from the following computation at the start of arima()

    "%+%" <- function(a, b) .Call(C_TSconv, a, b)
    ....
    Delta <- 1.
    for(i in seq_len(order[2L])) Delta <- Delta %+% c(1., -1.)
    for(i in seq_len(seasonal$order[2L]))
        Delta <- Delta %+% c(1, rep.int(0, seasonal$period-1), -1)
    Delta <- - Delta[-1L]
    nd <- order[2L] + seasonal$order[2L]
    n.used <- sum(!is.na(x)) - length(Delta)

and C_TSconv is defined in C code to have a result of length(a) + length(b) - 1.

So length(Delta) increases by 1 for each order of ordinary differencing and by the number of periods for each seasonal differencing.

So length(Delta) really is the number of observations that get "lost" by differencing. That makes sense, at least in the complete-data case. I still worry that it could get things wrong if there are NAs in the middle of the series, since, say,

    > x
     [1]  5  4  3  5 NA  5  4  6  4  3
    > diff(x)
    [1] -1 -1  2 NA NA -1  2 -2 -1
    > diff(x,,2)
    [1]  0  3 NA NA NA  3 -4  1

I suspect that what we really need is

    fitI <- lm(x ~ xreg - 1, na.action = na.omit)
    fit <- if(length(dx) > ncol(dxreg))
               lm(dx ~ dxreg - 1, na.action = na.omit)
           else list(rank = 0L)
    if(fit$rank == 0L) {
        ## Degenerate model. Proceed anyway so as not to break old code
        fit <- fitI
    }
    n.used <- sum(!is.na(resid(fitI))) - length(Delta)
    init0 <- c(init0, coef(fit))

At least that would be the conservative change to get n.used identical to what it was in 3.0.1.

> Martin

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk   Priv: PDalgd at gmail.com
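[To make the NA worry above concrete, here is a small R sketch using the same series as in the example; the variable names are illustrative, not from arima() itself. Each interior NA contaminates up to d extra values of the d-th differenced series, so the complete-case count can fall below sum(!is.na(x)) - length(Delta).]

    ## Same series as above, with one interior NA
    x <- c(5, 4, 3, 5, NA, 5, 4, 6, 4, 3)
    d <- 2                                  # two orders of ordinary differencing
    len.Delta <- d                          # length(Delta) for this case
    sum(!is.na(x)) - len.Delta              # 9 - 2 = 7 obs "expected" to survive
    sum(!is.na(diff(x, differences = d)))   # but only 5 actually do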
Thanks for your help, Martin and Peter.

I tried taking the value of 'n.used' from the C functions (ARIMA_CSS and ARIMA_Like) instead of computing n.used on the R side. Here is a patch, in case you're interested:

https://github.com/patperry/r-source/commit/8fed79a6d2d558ef34738624a2a4f9e795bcf8b9

I don't recommend applying this new patch without further follow-up. The patch highlights some strange behavior in the ARIMA_Like function, which sometimes gives missing values the same weights as observations, resulting in values of n.used that are too high. See the commit message for more details.

Patrick

On Thursday, May 21, 2015 at 8:36 AM, peter dalgaard wrote:
> I suspect that what we really need is
>
>     fitI <- lm(x ~ xreg - 1, na.action = na.omit)
>     fit <- if(length(dx) > ncol(dxreg))
>                lm(dx ~ dxreg - 1, na.action = na.omit)
>            else list(rank = 0L)
>     if(fit$rank == 0L) {
>         ## Degenerate model. Proceed anyway so as not to break old code
>         fit <- fitI
>     }
>     n.used <- sum(!is.na(resid(fitI))) - length(Delta)
>     init0 <- c(init0, coef(fit))
>
> At least that would be the conservative change to get n.used identical to what it was in 3.0.1

Along the same lines, here's a solution that avoids the extra call to lm():

    fit <- if(length(dx) > ncol(dxreg))
               lm(dx ~ dxreg - 1, na.action = na.omit)
           else list(rank = 0L)
    if(fit$rank == 0L) {
        ## Degenerate model. Proceed anyway so as not to break old code
        fit <- lm(x ~ xreg - 1, na.action = na.omit)
    }
    isna <- apply(is.na(xreg), 1, any) | is.na(x)
    n.used <- sum(!isna) - length(Delta)
    init0 <- c(init0, coef(fit))
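[A quick R sketch, on made-up toy data, checking that the row-wise NA mask counts the same cases as the residuals of the undifferenced lm() fit with na.omit, i.e. that the two approaches give the same n.used; the data and variable names are hypothetical.]

    set.seed(1)
    x <- rnorm(10); x[5] <- NA                      # one NA in the response
    xreg <- cbind(const = 1, trend = 1:10)
    xreg[3, 2] <- NA                                # one NA in a regressor row
    fitI <- lm(x ~ xreg - 1, na.action = na.omit)   # resid(fitI) keeps only complete cases
    isna <- apply(is.na(xreg), 1, any) | is.na(x)
    stopifnot(sum(!isna) == sum(!is.na(resid(fitI))))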
>>>>> Patrick Perry <pperry at stern.nyu.edu>
>>>>>     on Wed, 27 May 2015 23:19:09 -0400 writes:

{@PP, you forgot this part:}

>>>>> peter dalgaard <pdalgd at gmail.com>
>>>>>     on Thu, 21 May 2015 14:36:03 +0200 writes:

>> I suspect that what we really need is
>>
>>     fitI <- lm(x ~ xreg - 1, na.action = na.omit)
>>     fit <- if(length(dx) > ncol(dxreg))
>>                lm(dx ~ dxreg - 1, na.action = na.omit)
>>            else list(rank = 0L)
>>     if(fit$rank == 0L) {
>>         ## Degenerate model. Proceed anyway so as not to break old code
>>         fit <- fitI
>>     }
>>     n.used <- sum(!is.na(resid(fitI))) - length(Delta)
>>     init0 <- c(init0, coef(fit))
>>
>> At least that would be the conservative change to get n.used identical to what it was in 3.0.1

(Sorry for not taking up the thread ..)

That's definitely conservative and hence safest from that point of view.
On the other hand, to me, it did look a bit strange or "ugly" to always
perform two different lm()s and then use the coefficients of one and the
residuals of the other.

> Along the same lines, here's a solution that avoids the extra call to lm():

>     fit <- if(length(dx) > ncol(dxreg))
>                lm(dx ~ dxreg - 1, na.action = na.omit)
>            else list(rank = 0L)
>     if(fit$rank == 0L) {
>         ## Degenerate model. Proceed anyway so as not to break old code
>         fit <- lm(x ~ xreg - 1, na.action = na.omit)
>     }
>     isna <- apply(is.na(xreg), 1, any) | is.na(x)
>     n.used <- sum(!isna) - length(Delta)
>     init0 <- c(init0, coef(fit))

That is indeed nicer ... and with logic closer to the current code.

I have very slightly changed it, e.g., using anyNA(.), and tested it with
some more examples... and this does look good to me in the sense that it
is "internally more consistent".

To get this going and "exposed to CRAN", I'm committing it (with the
regression tests and other necessary "entries") to R-devel only, but with
the intent to port to R-patched in a couple of days.

Martin
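[For concreteness, one plausible shape of the anyNA(.) variant mentioned above; this is a guess by the editor, and the actually committed code may differ. anyNA(v) is equivalent to any(is.na(v)) but avoids the intermediate logical vector.]

    isna <- apply(xreg, 1, anyNA) | is.na(x)
    n.used <- sum(!isna) - length(Delta)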