Important! These are Regression’s requirements: Linearity and collinearity

The term multiple regression is certainly familiar to anyone who works with statistics. We look at the computer output, find the F test and t test, check the p-values, and then interpret those results in light of the topic at hand.

We also know the role of R-squared, which tells us how much of the variation in Y the model explains, as opposed to the part left to variables outside the model (the error).

However, I think many people still use the regression tool and skip the steps that come before interpreting the SPSS output. They usually go straight to entering the data, run the procedure, and read the results without ever checking whether the data is appropriate for regression in the first place. Various sources list several assumptions that must be met for multiple regression. In this article, I will start with two of them: the linearity test and the multicollinearity test. Incidentally, I used both in the article entitled “Price Elasticity and the Effect of Imports on Soybean Production”.

Linearity and collinearity: Linearity test

Multiple regression is basically a linear function. If you don’t believe me, look at the standard pattern of multiple regression:

Y = a + b1X1 + b2X2 +…

From this basic form, it is clear that multiple regression is a linear function. The data we use must therefore also follow a linear relationship, in keeping with the purpose of multiple regression.
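To make the form Y = a + b1X1 + b2X2 concrete, here is a minimal sketch in Python with NumPy (my choice for illustration; the article itself uses SPSS). The data is made up and noise-free, so the fit simply recovers the coefficients used to build Y.

```python
import numpy as np

# Hypothetical illustration data, not from any real study.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y = 1.0 + 2.0 * X1 + 0.5 * X2  # built from a = 1, b1 = 2, b2 = 0.5

# Design matrix: a column of ones for the intercept a, then X1 and X2.
design = np.column_stack([np.ones_like(X1), X1, X2])

# Ordinary least-squares estimates of (a, b1, b2).
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
a, b1, b2 = coef
print(a, b1, b2)
```

With real data the estimates will not match any "true" values exactly, and the leftover differences between Y and the fitted values are the errors discussed in the rest of this article.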

The linearity test determines whether a linear equation is suitable for the data at hand. The data passes the test if nR² < χ² (the chi-square table value), where n is the number of observations. The R² here comes from a second regression on new variables: X is the square of the original X, and Y is the error (residual) of the original regression model.

The value of R² is obtained from: R² = SSR / SST.
SSR is the sum of squared deviations between the predicted values and the mean of the actual values. SST is the sum of squared deviations between the actual values and the mean of the actual values. Neither is difficult to compute. For SST, take the mean of Y, subtract it from each Y value, square each result, and sum. For SSR, regress the data first, then subtract the mean of Y from each predicted value, square each result, and sum. Note that squaring and summing the errors (the differences between predicted and actual values) gives the error sum of squares, SSE, not SSR; in a regression with an intercept the two are linked by SSR = SST - SSE.
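The sums of squares described here can be checked by hand on a tiny data set. The following is a sketch in Python with NumPy, using five made-up points; the function name `sums_of_squares` is my own, not something from SPSS.

```python
import numpy as np

def sums_of_squares(y, y_hat):
    """Return (SSR, SST, SSE) for actual values y and predictions y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    y_mean = y.mean()
    ssr = np.sum((y_hat - y_mean) ** 2)  # predicted vs. mean of actual
    sst = np.sum((y - y_mean) ** 2)      # actual vs. mean of actual
    sse = np.sum((y - y_hat) ** 2)       # actual vs. predicted (errors)
    return ssr, sst, sse

# Hypothetical illustration data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Simple linear fit with an intercept; polyfit returns (slope, intercept).
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

ssr, sst, sse = sums_of_squares(y, y_hat)
r2 = ssr / sst
print(ssr, sst, sse, r2)
```

For this data SSR is 3.6, SST is 6.0 and SSE is 2.4, so R² is 0.6 and SSR + SSE adds up to SST, as it should for a least-squares fit with an intercept.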
As for the χ² (chi-square) table, I assume everyone has one and can read it.
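The nR² test can be sketched in a few lines of Python with NumPy. Several things here are my assumptions, not statements from the text above: the auxiliary regression keeps the original X next to X squared (as is usual for Lagrange-multiplier-type tests), the chi-square value uses 1 degree of freedom at the 5% level (3.841, one added term), and the function name `nr2_linearity_test` and the data are invented for illustration.

```python
import numpy as np

CHI2_CRIT_DF1_5PCT = 3.841  # chi-square table value, 1 df, 5% level

def nr2_linearity_test(x, y, crit=CHI2_CRIT_DF1_5PCT):
    """Return (n * R^2, passes) where passes means n * R^2 < crit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)

    # Step 1: fit the linear model and keep its errors (residuals).
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta

    # Step 2: auxiliary regression of the residuals on 1, x and x squared.
    # Keeping x in this regression is an assumption of this sketch.
    Z = np.column_stack([np.ones(n), x, x ** 2])
    gamma, *_ = np.linalg.lstsq(Z, resid, rcond=None)
    aux_error = resid - Z @ gamma

    # Step 3: R^2 of the auxiliary regression, then n * R^2.
    sst = np.sum((resid - resid.mean()) ** 2)
    r2 = 1.0 - np.sum(aux_error ** 2) / sst
    stat = n * r2
    return stat, bool(stat < crit)

# Hypothetical illustration data.
x = np.arange(1.0, 11.0)
wiggle = np.array([0.5, -0.5] * 5)       # deterministic "noise"
lin_stat, lin_ok = nr2_linearity_test(x, 2.0 + 3.0 * x + wiggle)
quad_stat, quad_ok = nr2_linearity_test(x, x ** 2)
print(lin_stat, lin_ok)    # linear data passes
print(quad_stat, quad_ok)  # quadratic data fails
```

Note that the function divides by the residual sum of squares, so it is not meant for data that a straight line fits perfectly.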
Then what do we do if our data is not linear?
One option is “ironing” the data: take the natural logarithm (Ln) of the Y data and the X data, then run the linearity test again. This only works when all values are positive.
Before going that far, though, plot a histogram of the Y values to see how the data is distributed. There may be observations that should be removed because they do not follow the same pattern as the rest of the data.
Or, if the data is a time series, you can split it into periods and fit a separate model for each, for example every 10 or 15 years.
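To see why taking Ln of both Y and X can straighten out a curve, consider made-up data that follows a power law. This is a sketch in Python with NumPy; the numbers are invented, and real data will only become approximately linear.

```python
import numpy as np

# Hypothetical illustration data following Y = 2 * X^1.5 (curved in X).
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = 2.0 * x ** 1.5

# "Ironing": natural log of both variables (all values must be positive).
ln_x = np.log(x)
ln_y = np.log(y)

# In logs the relationship is a straight line:
# ln Y = ln 2 + 1.5 * ln X
slope, intercept = np.polyfit(ln_x, ln_y, 1)
print(slope, intercept, np.log(2.0))
```

The fitted slope comes out as 1.5 and the intercept as ln 2, which is the straight-line form that the regression can handle.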


Linearity and collinearity: Multicollinearity test

The multicollinearity test checks whether there is a strong correlation between the independent variables. If two independent variables are strongly correlated, one of them is enough to represent both in the regression equation.

For example, if we use the land area variable, we should not also use land productivity as another variable. Land productivity is defined as production divided by land area, so the two variables are tied together by formula. The production variable is a better companion for land area. Avoid using variables that are computed from other variables in the model!
Multicollinearity can be detected from the VIF (variance inflation factor) value. As a rule of thumb, an independent variable is considered free of multicollinearity if its VIF is ≤ 10.
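For readers who want to see where the VIF number comes from, here is a sketch in Python with NumPy. It uses the standard definition VIF = 1 / (1 - R²), where R² comes from regressing one independent variable on all the others. The function name `vif` and the data are my own illustration, not SPSS output.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    values = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on an intercept plus the other columns.
        Z = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
        sse = np.sum((target - Z @ coef) ** 2)
        sst = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - sse / sst
        values.append(1.0 / (1.0 - r2))
    return np.array(values)

# Hypothetical illustration data: x2 is almost a copy of x1, x3 is unrelated.
x1 = np.arange(1.0, 9.0)
x2 = x1 + np.array([0.1, -0.1] * 4)
x3 = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)  # x1 and x2 far above 10, x3 well below 10
```

If one variable is an exact linear combination of the others, 1 - R² becomes zero and the VIF is infinite, so this sketch is only meant for data without perfect collinearity.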
The steps in SPSS are:

Select Analyze > Regression > Linear.

Enter the Y (dependent) and X (independent) variables.

Click the Statistics button.

Tick Estimates, Durbin-Watson, Model fit, and Collinearity diagnostics.

Let SPSS do the work and look at the results. The VIF appears in the Coefficients table under Collinearity Statistics. If the VIF value is below 10, there is no multicollinearity problem.

Then the question is: what if there is a multicollinearity problem? This matters especially for those who like to use many variables, sometimes dozens, where it is hard to work out the relationships between variables by logic alone. We can run a correlation analysis on the independent variables, for example Pearson or Spearman in the SPSS Bivariate correlation dialog, depending on whether the data is parametric or nonparametric. If the correlation between two independent variables is strong, as shown by a correlation value above 0.75, we should eliminate one of them.
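The screening step can be sketched in Python with NumPy as follows. This uses the Pearson correlation; for a Spearman version you would rank the data first. The function name `strong_pairs`, the variable names, and the data are invented for illustration, and comparing the absolute value against 0.75 (so that strong negative correlations are also flagged) is my assumption.

```python
import numpy as np

def strong_pairs(X, names, threshold=0.75):
    """List pairs of columns whose |Pearson correlation| exceeds threshold."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    pairs = []
    k = corr.shape[0]
    for i in range(k):
        for j in range(i + 1, k):
            # Absolute value: an assumption, so r = -0.9 is flagged too.
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Hypothetical illustration data: x2 is almost a copy of x1, x3 is unrelated.
x1 = np.arange(1.0, 9.0)
x2 = x1 + np.array([0.1, -0.1] * 4)
x3 = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

pairs = strong_pairs(np.column_stack([x1, x2, x3]), ["x1", "x2", "x3"])
print(pairs)  # only the (x1, x2) pair is flagged
```

Each flagged pair is a candidate for dropping one of its two variables, which is the remedy described above.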

Thank you.

