Javier López-de-Lacalle
2016-Nov-02  21:48 UTC
[R] Number of used observations in stats::arima with NAs
When missing observations (NAs) are present in the data and the regular or
differencing filters are applied in
stats::arima, the number of used observations (nobs) reported by nobs() is
bigger than the actual number of non-NA observations. This value is used to
compute the covariance matrix of parameter estimates and by BIC(), so this is
relevant for model selection.
Example: the same number of observations are used in both fits below, but nobs()
reports different values:
    x <- round(log(AirPassengers),2)
    length(x)
    #[1] 144
    x[c(36,52)] <- NA
    # fit ARIMA(0,1,0) to the original series
    fit1 <- stats::arima(x, order=c(0,1,0), include.mean=FALSE,
method="ML")
    nobs(fit1)
    #[1] 141
    # fit ARIMA(0,0,0) to the differenced series
    fit2 <- stats::arima(diff(x), order=c(0,0,0), include.mean=FALSE,
method="ML")
    nobs(fit2)
    #[1] 139
In "stats::arima", nobs is obtained as:
    n.used <- sum(!is.na(x)) - length(Delta)
where "length(Delta)" is the number of observations missed due to
differencing. The original data in "x" are used instead of the
differenced data. This ignores the fact that a missing observation generates two
NAs in the differenced series:
    diff(c(1,2,3,NA,5,6,7,8))
    #[1]  1  1 NA NA  1  1  1
"stats::arima" counts the NAs in the original data, but this does not
always match the number of NAs in the differenced series, which is the series
that is used to compute the log-likelihood. Although there is only *one* NA
observation, the contribution to the log-likelihood of *two* observations will
be missed due to this NA, hence the number of contributions to the likelihood is
not "sum(!is.na(x)) - length(Delta)" but
"sum(!is.na(diff(x)))".
The more NAs are in the data, the bigger will be the effect of this issue.
If the argument "order[2]=1", "seasonal$order[2]=0" and the
NA is the first observation, then my comment above is not an issue. It is not
either an issue when "seasonal$order[2]=1" and the NAs are in the
first "seasonal$period" observations. But in general, I believe the
way nobs is computed requires adjusting.
I would propose (reusing code from stats::arima):
    dx <- x
    if(d > 0L) {
      dx <- diff(dx, 1L, order[2L])
    }
    if(seasonal$period > 1L & seasonal$order[2L] > 0) {
      dx <- diff(dx, seasonal$period, seasonal$order[2L])    }
    n.used <- sum(!is.na(dx))
If the argument "xreg" is NULL, "dx" is not needed and the
following may be more efficient than the above option (I checked it with just a
few examples, so the above may be a safer option):
    d <- order[2L]
    D <- seasonal$order[2L]
    S <- seasonal$period
    n.used <- length(x) - sum(is.na(x)[seq_len(S)]) -       2*(d + D) *
sum(is.na(x)[-seq_len(S)]) - length(Delta)
If "xreg" is not NULL, "dx" is required in other part of the
code, so it is convenient to use it here as well; replacing "x" by
"dx" and "xreg" by "dxreg":
    isna <- is.na(dx) | apply(dxreg, 1L, anyNA)
    n.used <- sum(!isna)
Are there any reasons for counting the number of available observations in the
original series rather than in the differenced series? Should this be fixed?
Thanks
Javier L?pez-de-Lacalle
https://jalobe.com