eugen pircalabelu
2007-Dec-11 19:40 UTC
[R] question regarding arima function and predicted values
Good evening!

I have a question regarding the forecast package and time series analysis. My syntax:

x<-c(253, 252, 275, 275, 272, 254, 272, 252, 249, 300, 244,
258, 255, 285, 301, 278, 279, 304, 275, 276, 313, 292, 302,
322, 281, 298, 305, 295, 286, 327, 286, 270, 289, 293, 287,
267, 267, 288, 304, 273, 264, 254, 263, 265, 278)
library(forecast)
arima(x, order=c(1,1,2), seasonal=list(order=c(0,1,0), period=12))->l
auto.arima(x)->k
sd(l$resid)
sd(k$resid)
predict(l,n.ahead=1)
predict(k,n.ahead=1)

1. I understand that auto.arima will find the best time series model by choosing the smallest AIC, BIC and AICc among competing models, but my model has a smaller AIC than the one auto.arima selects. Yet the sd of the residuals for my model is somehow bigger, as is the se for the predicted value. Why? Am I missing something? Which of these two models would you choose, and why?

2. This question is more theoretical.

m<-sample(c(10:20),10,replace=T)
f<-sample(c(10:20),10,replace=T)
t<-m+f
s<-rbind(m,f,t)
s

Let's say I have a panel sample at my disposal and consider m to be the monthly average quantity of juice consumption for the male part of the sample, f the monthly average quantity of juice consumption for the female part of the sample, and t the average quantity of juice consumption for the whole sample.

For the mean of the whole sample I have a confidence interval of, say, +/-2 each month (say I have a sample of 2000 individuals). If I try to come up with a confidence interval only for the male population (which in my sample is, say, 1000), it would certainly be bigger, because I now have a male sample of only 1000 for determining the mean consumption of the whole male population. So my confidence interval is bigger for mean male consumption than for the whole sample (because N declines from 2000 to 1000).

Now if I tried to predict the next month's consumption for both my time series (male and whole sample), the prediction would not "care" that when establishing the mean consumption I used first 2000 people and then 1000. Am I right?

Imagine that each month (of the 10 that I sampled above) has such a confidence interval of +/-3. How would a future prediction incorporate this fact: that my mean consumption is not measured via a census but using a sample, and that the number is an estimate of the real consumption, within a confidence interval?

Is there a good reference text for incorporating the confidence intervals of past values when determining future values?

Thank you and have a great day!
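The sample-size reasoning in question 2 can be made concrete with a little arithmetic. The following is only a minimal sketch, assuming a normal-approximation interval and assuming that the per-person standard deviation in the male subsample equals that of the whole sample (neither assumption is stated in the post). Under those assumptions the half-width scales as 1/sqrt(n), so halving n from 2000 to 1000 widens +/-2 to roughly +/-2.83. The variable names are illustrative.

```r
# Half-width of a 95% interval for a mean: z * sd / sqrt(n)
z <- qnorm(0.975)

half_width_all <- 2      # +/-2 for the whole sample, as in the post
n_all  <- 2000           # whole-sample size, as in the post
n_male <- 1000           # male subsample size, as in the post

# Per-person standard deviation implied by the whole-sample interval
# (ASSUMPTION: the same sd also applies to the male subsample)
sd_implied <- half_width_all * sqrt(n_all) / z

# Resulting half-width for the male subsample
half_width_male <- z * sd_implied / sqrt(n_male)
half_width_male          # equals 2 * sqrt(2), about 2.83
```

If the male and whole-sample standard deviations differ, the ratio changes accordingly; the sketch only illustrates the 1/sqrt(n) scaling.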
Pfaff, Bernhard Dr.
2007-Dec-12 09:53 UTC
[R] question regarding arima function and predicted values
>Good evening!
>
>I have a question regarding forecast package and time series analysis.
>My syntax:
>
>x<-c(253, 252, 275, 275, 272, 254, 272, 252, 249, 300, 244,
>258, 255, 285, 301, 278, 279, 304, 275, 276, 313, 292, 302,
>322, 281, 298, 305, 295, 286, 327, 286, 270, 289, 293, 287,
>267, 267, 288, 304, 273, 264, 254, 263, 265, 278)
>library(forecast)
>arima(x, order=c(1,1,2), seasonal=list(order=c(0,1,0), period=12))->l
>auto.arima(x)->k
>sd(l$resid)
>sd(k$resid)
>predict(l,n.ahead=1)
>predict(k,n.ahead=1)
>
>1. I understand that auto.arima will find the best time series
>model choosing the smaller AIC, BIC and AICc from competing
>models, but my model finds a smaller AIC than that of the
>auto.arima. but the sd of the residuals for my model is
>somehow bigger.
>Why? Am I missing something?
>Now the sd of the residuals for my model is somehow bigger, as
>well as the se for the predicted value. What model would you
>choose between this two and why?
>

Hello Eugen,

in a nutshell, I would use neither of these models, but an ARIMA(1, 0, 1) model (i.e. an ARMA(1, 1)) fitted to log(x).

Now, to your questions. If you use the "trace = TRUE" argument in auto.arima(), you will see that your model specification (l) is not tested. Why is this? Because you supply a plain vector, and its frequency is 1 (i.e. frequency(x) returns 1). If you now look at the code of auto.arima(), it is clear that seasonal differences are not tested for in that case. Try this instead:

x <- ts(x, frequency = 12)
k <- auto.arima(x, D = 1, trace = TRUE)
logLik(k)
k$aic

This yields an ARIMA(1, 0, 1)(2, 1, 0)[12] as the "optimal" model specification, which gives an even "better" result than your l model. However, the results you report for l and k can be attributed to over-fitting / over-differencing. If you examine your series more closely:

plot(x)
acf(x)
pacf(x)
library(urca)
ur.kpss(x)
plot(ur.za(x))

i.e. if you follow the traditional approach for the identification stage of the Box-Jenkins procedure, you will detect that

1) The series does not seem to be stationary with respect to its variance, but it is not "trending".
2) The ACF and PACF taper off slowly; neither shows a single spike, nor does the PACF give any hint of seasonality.
3) Your series is stationary with a structural break.

Therefore, one can use the log-transform of x for variance stabilisation and specify an ARIMA(1, 0, 1) model:

xl <- log(x)
m <- arima(xl, order=c(1, 0, 1))
m

Best,
Bernhard

>2. This question is more theoretical
>
> m<-sample(c(10:20),10,replace=T)
> f<-sample(c(10:20),10,replace=T)
> t<-m+f
> s<-rbind(m,f,t)
> s
>
>Let's say I have a panel sample at disposal and consider m to
>be the monthly average quantity of juice consumption for the
>male part of the sample and f to be the monthly average
>quantity of juice consumption for the female part of the
>sample, and t the average quantity of juice consumption for
>the whole sample. For the mean of the whole sample i have a
>confidence interval of say +/-2 each month (say I have a
>sample of 2000 individuals). If I try to come up with a
>confidence interval only for the male population (which in my
>sample is say 1000) it would certainly be bigger, because i
>now have a male sample of 1000 for determining the mean
>consumption for the whole male population. So my confidence
>interval is bigger for mean male consumption than for the
>whole sample (because N declines from 2000 to 1000). Now if I
>tried to predict the next month's consumption for both my
>time series (male and whole sample) the prediction would not
>"care" that when establishing the
>mean consumption i used first 2000 people and then 1000. Am I right?
>Imagine that each month (from 10 that I sampled above) has
>such a confidence interval of +/-3. Now how would a future
>prediction incorporate this fact: that my mean
>consumption is not measured via a Census, but using a sample,
>and that the number is an estimation of the real consumption,
>within a confidence interval?
>Is there a good reference text for this incorporation of the
>confidence interval of past values in determining the future
>values ?
>
>Thank you and have a great day!
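As a follow-up to the reply's suggestion, here is a minimal base-R sketch (not part of the original exchange) that fits the poster's seasonal model and the suggested ARIMA(1, 0, 1) on log(x), then inspects two things. First, the number of observations that actually enters each fit: first plus seasonal differencing with period 12 discards 13 values, so the two likelihoods, and hence the AICs, are computed on different data and are not directly comparable. That is a plausible contributor to the "smaller AIC but larger residual sd" puzzle, though it is offered here as an assumption rather than something the reply states. Second, a one-step forecast from the log model mapped back to the original scale. The interval uses a normal approximation with 1.96, and exponentiating gives an interval for the level (the point value is closer to a median than a mean).

```r
x <- c(253, 252, 275, 275, 272, 254, 272, 252, 249, 300, 244,
       258, 255, 285, 301, 278, 279, 304, 275, 276, 313, 292, 302,
       322, 281, 298, 305, 295, 286, 327, 286, 270, 289, 293, 287,
       267, 267, 288, 304, 273, 264, 254, 263, 265, 278)

# Poster's model: d = 1 and D = 1 (period 12) drop 1 + 12 = 13 values
l <- arima(x, order = c(1, 1, 2),
           seasonal = list(order = c(0, 1, 0), period = 12))

# Suggested model on the log scale: no differencing, all values are used
m <- arima(log(x), order = c(1, 0, 1))

# Observations used in each fit (the 'nobs' component of an arima fit)
c(seasonal = l$nobs, log_arima = m$nobs)

# One-step forecast from m, back-transformed to the original scale
p  <- predict(m, n.ahead = 1)
fc <- exp(c(lower = p$pred[1] - 1.96 * p$se[1],
            point = p$pred[1],
            upper = p$pred[1] + 1.96 * p$se[1]))
fc
```

Because the fits differ both in differencing and in the scale of the response, comparing them on out-of-sample forecast error, measured on the original scale, is usually a safer basis than comparing their AIC values.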