Sarima Arima
Continuing the previous time series discussion, here I will discuss one of the time series analysis methods that has been popular from the 1990s until now, namely ARIMA and SARIMA. ARIMA stands for autoregressive integrated moving average. SARIMA is the same as ARIMA with the word seasonal added in front. As the name suggests, ARIMA combines an autoregressive part and a moving average part, and the model we build later rests on these two components. Whether the model is pure autoregressive, pure moving average, or a combination of both is determined from the patterns in the ACF and PACF.
ARIMA and SARIMA are time series analysis methods, just like the trend, moving average, and naive analyses that I have explained before. An important thing to consider when analyzing time series data is the accuracy of the model. Although ARIMA and SARIMA seem more modern than other methods because their equations are complicated and look sophisticated, the models obtained must still be compared with other analyses. A good model is the one with the smallest MSE or MSD value. If you use ARIMA or SARIMA and the MSE obtained is still no better than that of a trend or moving average model, it is like using a chainsaw to cut the grass in your yard.
There is no best tool, only the right tool!
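To make the comparison idea concrete, here is a minimal Python sketch that computes the MSE of two simple methods on the same series and keeps the one with the smaller value. The numbers are made up for illustration, and the function names are my own, not something from Minitab.

```python
def mse(actual, fitted):
    """Mean squared error between observed and fitted values."""
    if len(actual) != len(fitted):
        raise ValueError("actual and fitted must have the same length")
    return sum((a - f) ** 2 for a, f in zip(actual, fitted)) / len(actual)


def naive_fit(series):
    """Naive method: the fitted value for time t is the observation at t - 1."""
    return series[:-1]


def moving_average_fit(series, window):
    """Fitted value for time t is the mean of the previous `window` observations."""
    return [sum(series[t - window:t]) / window for t in range(window, len(series))]


# Made-up data, for illustration only.
y = [12.0, 14.0, 13.0, 15.0, 16.0, 15.0, 17.0, 18.0]

mse_naive = mse(y[1:], naive_fit(y))
mse_ma3 = mse(y[3:], moving_average_fit(y, 3))

# Keep whichever method has the smaller MSE.
best = min([("naive", mse_naive), ("moving average (3)", mse_ma3)], key=lambda m: m[1])
print(best[0], round(best[1], 3))
```

On this toy series the naive method wins, which is exactly the point: the simpler tool is sometimes the right one. Note that the two methods are scored on slightly different numbers of points here, which is acceptable for a sketch but worth aligning in a real comparison.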
Stages of Sarima Arima
The stages we go through in an ARIMA or SARIMA analysis are:
Identify whether the data we will process contains a trend or a seasonal pattern. ARIMA and SARIMA require the processed data to be stationary. Stationary, in my terms, means that the data fluctuates around a roughly constant level, showing no upward or downward trend. If it turns out that the data shows a trend, we do differencing: first differencing or, if needed, second differencing. The goal? To make the data stationary.
Examples of stationary and non-stationary data graphs
Once we have confirmed that the data is stationary, we start estimating the ARIMA or SARIMA model. The estimation is done by observing the ACF and PACF. What are these? ACF stands for Autocorrelation Function; its pattern is what we read the moving average (MA) order from. PACF stands for Partial Autocorrelation Function; its pattern is what we read the autoregressive (AR) order from. To determine whether the data contains AR or MA elements, look at the pattern or behavior of the ACF and PACF. The key thing in this section is to understand which patterns count as cutting off and which count as dying down.
A cut-off pattern occurs when the values drop to around 0 within the first few lags, so the plot is seen to fall away drastically (cut off). Dying down usually means the values decrease slowly toward 0. There is no fixed rule for how many lags count as the initial lags; some sources use the first 5, some the first 10. But usually we can already tell from the plot whether the pattern decreases slowly or is cut off.
Usually in the cut-off pattern the |T| value is already insignificant at lag 2 or 3, while in the dying-down pattern the |T| value stays significant through the early lags. A lag is significant when |T| > 2 at non-seasonal lags (ARIMA) and |T| > 1.25 at seasonal lags (SARIMA).
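If you want to see where the autocorrelation and |T| values come from, here is a small Python sketch. It assumes the usual sample autocorrelation and a standard error of sqrt((1 + 2 * sum of earlier squared autocorrelations) / n), which I believe matches what Minitab reports, but treat that as my assumption. The function names are mine.

```python
import math


def acf(series, max_lag):
    """Sample autocorrelations r_1 .. r_max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]


def acf_t_values(r, n):
    """T value at each lag: r_k divided by its approximate standard error."""
    t_values = []
    cumulative = 0.0  # running sum of squared autocorrelations at earlier lags
    for r_k in r:
        se = math.sqrt((1.0 + 2.0 * cumulative) / n)
        t_values.append(r_k / se)
        cumulative += r_k * r_k
    return t_values


# A series with a clear upward trend, made up for illustration.
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
r = acf(y, 3)
t = acf_t_values(r, len(y))

for lag, (r_k, t_k) in enumerate(zip(r, t), start=1):
    flag = "significant" if abs(t_k) > 2 else "not significant"
    print(lag, round(r_k, 3), round(t_k, 3), flag)
```

For this trending series the lag 1 autocorrelation is 0.7 and its |T| value is above 2, which is the kind of value you would check against the threshold described above.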
The ACF and PACF patterns also reflect the stationarity of the data. Data that is not stationary usually has a dying-down pattern with significant |T| values at almost all lags. So if both the ACF and the PACF die down with significant |T| values, go back to the identification stage and difference the data.
Example of a dying down pattern
Example of cut off pattern
If you already understand differencing, then, whichever software you use, you already understand most of this ARIMA and SARIMA technique. Next we run the software and determine the model, and then of course we evaluate it. Model evaluation is the same as for other time series techniques, namely the MSE (mean square error) reported by Minitab. In addition, an ARIMA analysis also checks that the iterations converge, that the forecasting residuals are random, that the model is the simplest one available, that the estimated parameters are significantly different from zero, and that the stationarity conditions are met.
Confused? Let’s practice directly to determine the model.
Once the model is known, the last step, depending on the research objectives, is forecasting. Forecasting is not only done for the future; the model can also be used to compute what a value should have been in a period we already observed. The purpose? To determine the error for a particular case, for example the climate data that I used in the article The Effect of Climate Change on Food Crop Production.
These are the stages you go through in the ARIMA and SARIMA methods. The explanation so far is mostly definitions and may not feel concrete yet, so rather than repeat it I will walk through a step-by-step ARIMA example.
Let’s go straight to the data and the practice. Oh yeah, you can also download the data I used if you want to follow along and learn with me in this tutorial. Please download the raw data here:
An ARIMA model is written as ARIMA (p, d, q). p is the autoregressive (AR) order, which we read from the PACF. d is the number of times the data is differenced. And q is the moving average (MA) order, which we read from the ACF. I don’t explain how the equations work; you can find them through a search engine and use them as a reference. This article is a tutorial on how to process data using ARIMA and SARIMA. In addition, I am still unsure how to write the rather complicated symbols and formulas here.
We use a time series plot to see whether this data has a trend pattern or not. Choose Stat – Time Series – Time Series Plot.
Select Simple, then enter the variable X3 in the Series box. The result is as follows:
It can be seen that the X3 variable has a trend element, so we have to difference the data. But before that, what is differencing? How can it be understood manually?
Differencing is a data processing step that calculates the difference between Yt and Yt-1, that is, each observation minus the one before it. Because what we then process is the differenced data, the trend element will usually disappear. However, there are cases where the trend has not disappeared after differencing. In that case we difference the data that has already been differenced. This gives a difference value of 2, meaning the data is differenced twice. What if the trend is still there? So far I haven’t encountered a case that needs more than two differences. But if that happens, you can try differencing again.
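To see differencing done by hand, here is a minimal Python sketch. The function name and the numbers are mine, chosen only to illustrate; Minitab does the same arithmetic for you.

```python
def difference(series, lag=1):
    """Return Y_t - Y_(t-lag) for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Made-up data with an accelerating upward trend.
y = [3, 5, 8, 12, 17]

dif_1 = difference(y)      # first differencing
dif_2 = difference(dif_1)  # second differencing (difference of the differences)

print(dif_1)  # [2, 3, 4, 5]  still trending upward
print(dif_2)  # [1, 1, 1]     now constant, the trend is gone
```

Each round of differencing shortens the series by one observation, which is why the second result has only three values.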
Back to the data. We do the differencing. You can do it manually or with the help of Minitab. Since I have Minitab open, I used it for this step. Choose Stat – Time Series – Differences.
Enter the variable X3 into the Series box by double-clicking on the variable, fill in C2 under Store differences in, and enter the value 1 in the Lag box. This value means each observation is differenced against the one 1 lag before. Then click OK.
The differences will appear in column C2, which I name dif_1. Next, we will see which of the ACF and PACF patterns is dying down and which is cut off. What is dying down? I explained it above… 🙂
Click Stat – Time Series – Autocorrelation. Enter the variable dif_1 into the Series box and fill in 15 for the number of lags (it can be 20 or 10, as long as it does not exceed the amount of data you have), then click OK. The ACF plot will appear.
Click Stat – Time Series – Partial Autocorrelation. Enter the variable dif_1 into the Series box and fill in 15 for the number of lags (it can be 20 or 10, as long as it does not exceed the amount of data you have). Do not choose the default number of lags, because it usually displays only about 5 lags, which is too few to show the pattern. Click OK. The PACF plot will appear.
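For the curious, the partial autocorrelations can be computed from the ordinary autocorrelations with the Durbin-Levinson recursion. The Python sketch below is one way to do that; I am assuming the T value is the partial autocorrelation times sqrt(n), which is the common approximation, and the function names and data are mine.

```python
import math


def acf(series, max_lag):
    """Sample autocorrelations r_1 .. r_max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]


def pacf(series, max_lag):
    """Partial autocorrelations at lags 1 .. max_lag (Durbin-Levinson)."""
    r = [1.0] + acf(series, max_lag)  # r[k] is the autocorrelation at lag k
    phi_prev = []                     # coefficients of the order k - 1 fit
    result = []
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = r[1]
            phi = [phi_kk]
        else:
            num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
            phi_kk = num / den
            phi = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
            phi.append(phi_kk)
        result.append(phi_kk)
        phi_prev = phi
    return result


# Made-up differenced data, for illustration only.
dif_1 = [2.0, -1.0, 3.0, -2.0, 1.0, 0.0, -1.0, 2.0, -3.0, 1.0, 2.0, -2.0]
partial = pacf(dif_1, 5)
t_values = [p * math.sqrt(len(dif_1)) for p in partial]

for lag, (p, t) in enumerate(zip(partial, t_values), start=1):
    print(lag, round(p, 3), round(t, 3))
```

At lag 1 the partial autocorrelation always equals the ordinary autocorrelation, which is a handy sanity check on any output.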
The resulting image is as follows:
On the left is the ACF, and on the right is the PACF. The ACF drops off right away and is cut off by the third lag, while the PACF looks like it is dying down because it only becomes insignificant at the fifth lag. We can conclude that the ACF cuts off and the PACF has a dying-down pattern.
The basis used to determine the model is as follows:
If the ACF shows a dying-down pattern and the PACF shows a cutoff, then it can be said that the ARIMA model is pure AR.
If the ACF shows a cut-off pattern and the PACF shows dying down, then it can be said that the ARIMA model is pure MA.
If the ACF and PACF show signs of dying down, it can be said that the ARIMA model is a combination of AR and MA.
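The three rules above can be written as a small lookup. This Python sketch is only a restatement of those rules; the function name and the pattern labels are my own, and it deliberately returns nothing useful for combinations the rules do not cover.

```python
def identify_model(acf_pattern, pacf_pattern):
    """Map ACF/PACF patterns ('dying down' or 'cut off') to a model type."""
    rules = {
        ("dying down", "cut off"): "pure AR",
        ("cut off", "dying down"): "pure MA",
        ("dying down", "dying down"): "combination of AR and MA",
    }
    return rules.get((acf_pattern, pacf_pattern), "not covered by these rules")


# The case in this tutorial: ACF cuts off, PACF dies down.
print(identify_model("cut off", "dying down"))  # pure MA
```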
Recall the ARIMA (p, d, q) notation. Since the pattern in this tutorial points to pure MA, we set p = 0, d = 1 (we did the first differencing), and q = 1.
You could also try ARIMA (0, 1, 2), which is pure MA as well, and then look at its MSE value. We are now at the model estimation stage. We can try several models by increasing the AR or MA order, then compare their MSE values and other indicators. However, it is not recommended to increase or decrease both AR and MA at the same time.
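Back to the data, we will fit the ARIMA (0,1,1) model. Click Stat – Time Series – ARIMA. Then fill in the Series box, enter 0 for autoregressive, 1 for difference, and 1 for moving average, and click OK. Keep in mind that the difference setting makes Minitab difference the series itself, so d = 1 applies to the original X3; entering the already differenced dif_1 together with a difference of 1 would difference the data twice.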
The result we get is:
1. The residuals are already random. This is shown by the Ljung-Box statistic, which has a p-value > 0.05.
2. The conditions of invertibility and stationarity are met. This is indicated by the coefficient obtained (in this case, the MA coefficient) being less than 1 in absolute value. The values of the MA and AR coefficients must be < 1.
3. The iteration process has reached convergence, indicated by the sentence “relative change in each estimate less than 0.0010.”
4. The MSE obtained is 7.903. Whether this is the lowest MSE, we will check by comparing it with other models.
5. Is it the simplest model? We can check by fitting ARIMA without differencing. That model cannot be estimated because the iterations do not converge. This means the ARIMA model with one difference is the simplest model; we just need to estimate more models to settle point number 4.
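For point 1, the Ljung-Box statistic Q can be computed directly from the residual autocorrelations as Q = n(n + 2) times the sum of r_k squared over (n - k). The Python sketch below computes only Q; turning it into a p-value needs a chi-square table or software, so I leave that step out. The residuals are made up and the function name is mine.

```python
def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic using residual autocorrelations up to max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [e - mean for e in residuals]
    denom = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total


# Residuals that alternate in sign are clearly NOT random (made-up example).
alternating = [1.0, -1.0] * 10
q = ljung_box_q(alternating, 1)
print(round(q, 2))
```

A large Q, like the one this alternating series produces, means the residuals are still autocorrelated. A small Q, with a p-value above 0.05, is what we want to see from a good model.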
To settle point number 4, I ran ARIMA (0,1,2) and ARIMA (1,1,1). The results obtained are:
For the first of these, the resulting MSE is lower at 6.066, but the estimation does not converge and the conditions of invertibility or stationarity are not met because the MA coefficient is > 1.
For the second, the resulting MSE is also smaller, namely 7.364, but the estimation does not converge. So we can conclude that ARIMA (0,1,1) is the simplest model that meets the criteria for model evaluation.
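The selection logic used here, lowest MSE among the models that pass the checks, can be sketched in Python as follows. The MSE values and pass/fail flags are the ones reported in this tutorial; pairing 6.066 with ARIMA (0,1,2) and 7.364 with ARIMA (1,1,1) follows the order in which I ran them and is an assumption, and None marks a check that was not reported.

```python
# Results as reported in the text; model-to-result pairing is an assumption.
candidates = [
    {"model": "ARIMA (0,1,1)", "mse": 7.903, "converged": True, "invertible": True},
    {"model": "ARIMA (0,1,2)", "mse": 6.066, "converged": False, "invertible": False},
    {"model": "ARIMA (1,1,1)", "mse": 7.364, "converged": False, "invertible": None},
]


def choose_model(models):
    """Pick the lowest-MSE model among those that converged and are invertible."""
    valid = [m for m in models if m["converged"] and m["invertible"]]
    if not valid:
        return None
    return min(valid, key=lambda m: m["mse"])


chosen = choose_model(candidates)
print(chosen["model"])  # ARIMA (0,1,1)
```

A lower MSE alone is not enough: a model that fails convergence or invertibility is dropped before the MSE values are compared.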
In the last stage we obtain the forecast values. Set up ARIMA (0,1,1) as before, but before clicking OK, click Forecast first. Fill in the number of forecasts requested, the observation from which forecasting starts (Origin), and the columns in which to store the results (Storage: forecasts, lower limits, and upper limits).
And the results are as follows:
The forecast, lower limit, and upper limit values can also be seen in the worksheet columns that we filled in.
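To give a feel for what the software does at this step, here is a rough Python sketch of point forecasts from an ARIMA (0,1,1) model without a constant, written as Yt - Yt-1 = et - theta * et-1. The theta value and the data are hypothetical, and I assume the starting residual is 0; Minitab estimates theta and handles the start-up differently, so its numbers will not match this sketch exactly.

```python
def arima011_forecast(series, theta, steps=1):
    """Point forecasts from Y_t - Y_(t-1) = e_t - theta * e_(t-1), no constant."""
    e_prev = 0.0  # assumed starting residual
    for t in range(1, len(series)):
        w_t = series[t] - series[t - 1]  # first difference at time t
        e_prev = w_t + theta * e_prev    # one-step residual at time t
    one_step = series[-1] - theta * e_prev
    # Beyond one step the expected difference is zero, so the forecast stays flat.
    return [one_step] * steps


# Hypothetical data and a hypothetical theta, for illustration only.
y = [10.0, 12.0, 11.0, 13.0]
print(arima011_forecast(y, theta=0.5, steps=3))  # [12.0, 12.0, 12.0]
```

With theta = 0 the forecast is simply the last observation, which is the naive method again. The lower and upper limits that Minitab stores are not computed in this sketch.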
What distinguishes SARIMA from ARIMA is seasonality. For that, we also have to know how often the seasonal pattern repeats. Is it every 3 months? Every year? And so on. This determines the steps we use later.
Example of seasonal data:
It can be seen that the graph repeats the pattern within 12 lags, or it can be said that it repeats the annual pattern.
In principle, the SARIMA model is the same as the ARIMA model; what distinguishes it is the differencing process, which is split into a regular difference and a seasonal difference.
I have data that can be downloaded here if you want to learn with me.
The data has n = 60 observations. I picked it at random and suspected it contained seasonal elements, so it is suitable for this exercise.
The first step is to look at the ACF. I entered the number of lags as 35. I obtained the following:
It appears that the data is not stationary. So I took the first regular difference, stored it in column C2, and named it difreg_1. Then I looked at the ACF pattern of difreg_1. I got the following:
Looking at the |T| values, most are already < 2, but lags 12 and 24 have values > 2. This is the sign that we have to do seasonal differencing. To do it, in the lag box of the difference dialog, where we previously entered 1, we now enter 12. The value 12 matches the seasonal cycle in this data; for other data it could be quarterly and so on. You can determine the cycle from the |T| values of this regular difference.
When doing the seasonal difference, the variable that is differenced is difreg_1, with a lag value of 12. The result is called the first regular difference and first seasonal difference. I put it in column C3 and name it difmus_1. Next, I look at the ACF pattern of the difmus_1 variable again. The result is:
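The two differencing steps can be reproduced by hand. In this Python sketch I build a made-up series with a straight-line trend plus a pattern that repeats every 12 periods, then apply the regular difference (lag 1) followed by the seasonal difference (lag 12). The variable names follow my worksheet columns; the data itself is invented for illustration.

```python
def difference(series, lag=1):
    """Return Y_t - Y_(t-lag) for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


season = 12
pattern = [5, 3, 0, -2, -4, -6, -5, -3, 0, 2, 4, 6]  # repeats every 12 periods

# Made-up data: linear trend plus the repeating seasonal pattern, n = 60.
x = [2 * t + pattern[t % season] for t in range(60)]

difreg_1 = difference(x, lag=1)              # first regular difference
difmus_1 = difference(difreg_1, lag=season)  # first seasonal difference

print(len(difreg_1), len(difmus_1))  # 59 47
print(set(difmus_1))                 # {0}
```

The regular difference removes the trend but leaves the repeating pattern; the lag 12 difference then removes the pattern too. Real data will of course leave noise behind rather than exact zeros, and each step costs observations: 1 for the regular difference and 12 for the seasonal one.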
Whether a pattern is dying down or cut off is determined for SARIMA as follows:
the pattern is said to cut off when:
- the correlation coefficient is insignificant at lag 2 or less for non-seasonal lags. It is said to be insignificant if |t| < 2 for non-seasonal lags, and
- it is insignificant at the seasonal lag + 2 or less for seasonal lags. It is said to be insignificant if |t| < 1.25.
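The two thresholds can be applied mechanically once you have the |t| values. This Python sketch flags significant lags using 2 for non-seasonal lags and 1.25 for seasonal lags; treating only exact multiples of the seasonal length as seasonal lags is my simplifying assumption, and the t values are made up.

```python
def significant_lags(t_values, season, nonseasonal_cut=2.0, seasonal_cut=1.25):
    """Return the lags whose |t| exceeds the threshold; t_values[0] is lag 1."""
    flagged = []
    for lag, t in enumerate(t_values, start=1):
        # Assumption: a lag is seasonal only if it is an exact multiple of `season`.
        cut = seasonal_cut if lag % season == 0 else nonseasonal_cut
        if abs(t) > cut:
            flagged.append(lag)
    return flagged


# Made-up t values for lags 1..24.
t_values = [1.5, -2.3, 0.4, 0.2, -0.6, 0.1, 0.3, -0.2, 0.5, 0.1, -0.4, 1.5,
            0.2, 0.1, -0.3, 0.4, 0.0, 0.2, -0.1, 0.3, 0.1, -0.2, 0.2, 1.0]

print(significant_lags(t_values, season=12))  # [2, 12]
```

Notice that the same value 1.5 is ignored at lag 1 but flagged at lag 12, because the seasonal threshold is lower.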
Judging from these characteristics, the ACF pattern is dying down. This means that there is an AR element in the seasonal part.
Then we look at the PACF of variable X (the data before differencing), then the PACF of difreg_1 and of difmus_1. The results I obtained, in that order, are:
Judging from the PACF pattern, the PACF cuts off, because |t| at the seasonal lag + 2 or less is already insignificant (< 1.25).
The estimated model is: [ARIMA (0,1,0) (1,1,0)12]
The first bracket holds the non-seasonal orders, and the second bracket holds the seasonal orders with lag 12. To fit it, click Stat – Time Series – ARIMA.
Then tick Fit seasonal model and fill in the dialog as shown below:
And the results are as follows:
The MS or MSE value obtained is 9.409. Then we try the comparison model [ARIMA (0,1,0) (1,1,1)12]; its MS value is greater and the estimation does not converge. So the first model remains the better one, namely
[ARIMA (0,1,0) (1,1,0)12]
Next, please explore with the data you have. Thank you.