Eliminate Variables in Regression. Regression is among the most popular statistical techniques for explaining the relationship between independent variables and a dependent variable, either simultaneously or individually. If you use it regularly, you are no doubt familiar with R-square (or adjusted R-square), the indicator that expresses the goodness of the model a regression produces, whether multiple, linear, binary logistic, or another variant. All of them use r-square as an indicator of goodness of fit, although the terms sometimes differ.
R-square indicates how well the independent variables, taken together, predict the dependent variable. The remainder (1 minus r-square) is the error that cannot be explained by the variables in the regression model. Naturally, we hope for an r-square close to 1 (100%), so that the research succeeds and we can go straight to interpreting the results and writing the discussion.
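As a quick illustration, r-square can be computed by hand from the residual and total sums of squares. This is a minimal Python sketch; the observations and predictions below are made-up numbers, not the training data:

```python
def r_squared(y, y_hat):
    """R-square = 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained error
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)              # total variation
    return 1 - ss_res / ss_tot

# Fictitious observed values and regression predictions
y     = [10, 12, 15, 19, 24]
y_hat = [10.5, 12.0, 15.5, 18.0, 24.0]
print(r_squared(y, y_hat))  # about 0.988, i.e. 98.8%
```

The closer the predictions track the observations, the smaller the unexplained portion (1 minus r-square) becomes.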
The r-square criterion, also commonly called the coefficient of determination, is bounded by the values 0.75, 0.5, and 0.25. An r-square between 0.75 and 1 indicates a strong model, a value between 0.5 and 0.75 a medium model, and a value between 0.25 and 0.5 a weak model.
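These cut-offs can be written as a small helper function. The label for values below 0.25 is my own wording, since the criterion itself leaves that range unclassified:

```python
def model_strength(r2):
    """Classify a coefficient of determination by the 0.75 / 0.5 / 0.25 cut-offs."""
    if r2 >= 0.75:
        return "strong"
    if r2 >= 0.50:
        return "medium"
    if r2 >= 0.25:
        return "weak"
    return "below the weak threshold"  # not covered by the criterion

print(model_strength(0.878))  # strong
```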
Many researchers and students are frustrated when they encounter a small r-square value, meaning the resulting model is weak. Several tricks are commonly used to push the r-square at least into the medium class, or if possible into the strong range. One of them is the elimination technique, commonly known as variable unloading, which determines which variables have a real effect on the dependent variable.
Why should it be eliminated?
The purpose of variable elimination is not arbitrary. One of the main reasons is to detect relationships among the independent variables themselves, commonly called multicollinearity. When many independent variables are used, some of them are likely to be strongly related to each other. This degrades the goodness of the model, so the r-square value will be low.
The elimination technique therefore works by removing candidate variables one combination at a time and recording the resulting r-square. We then carry the combination with the highest r-square into the next step, without forgetting to check the classical assumptions.
The problem arises when we use more than 5 or 10 variables, or even dozens. How many combinations would we have to enter into the multiple regression equation and then check one by one for goodness of fit? Very tiring.
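To see how tiring this gets: with k candidate variables there are 2^k - 1 non-empty subsets to try. For the nine variables used later in this article:

```python
from itertools import combinations

predictors = [f"X{i}" for i in range(1, 10)]  # X1 .. X9
subsets = [c for r in range(1, len(predictors) + 1)
           for c in combinations(predictors, r)]
print(len(subsets))  # 2**9 - 1 = 511 candidate models
```

Fitting and checking 511 regressions by hand is exactly the work the tools below automate.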
When I was a student, I eliminated variables one by one and made all possible combinations of the variables I used during my research. Apparently, after I graduated, I realized that there are features of analytical tools in both Minitab and SPSS that help us directly find the best combination of the variables we use.
The benefits of using these tools are time savings, high accuracy, and help in decision-making. Entering a combination of variables takes quite a long time because you have to try all combinations to find out which combination is the most appropriate. You may forget to try a combination of variables that turns out to have the highest r-square value of the combinations you have done before. This makes the accuracy of your model questionable. How much time would you save by not doing these experiments? After knowing the results, you can immediately decide whether you need to collect data again, by entering new variables, for example, or whether you will continue the research with the existing results.
I will explain some of the variable elimination techniques offered by Minitab and SPSS. As for which one is better and simpler, you can decide for yourself after reading this article. Both programs have their own advantages and disadvantages. Minitab, in my opinion, is simple and fast at producing models, and the software is also relatively lighter than SPSS. However, you may need to run the analysis in SPSS if you want more in-depth results after getting an overview in Minitab.
The data I used is fictitious data that I made myself. You can download the training data here:
We will practice eliminating variables in regression. Here I use nine independent variables, including dummy or category variables (0 and 1).
Eliminate Variables in Regression (Minitab)
Copy the training data to the worksheet in Minitab; for comparison, we will regress all the variables first. Click stat – regression – regression – fit regression model. Enter y in the response column, then put X3 into categorical predictors because it is a dummy (binary) variable. Enter the rest into the continuous predictors column. Press OK.
And the result was:
It can be seen that the model has an R-square value of 87.77% or an R-sq (adj) of 84.95%. Is there another combination that is better than this one?
We will try to use another alternative combination: click stat, regression, regression, and best subsets.
Enter the Y variable in the response variable and other variables (independent variables) into the free predictors. For the predictors in the all models column, you can enter variables that must be present in this equation, meaning these variables cannot be discarded. For this time, leave it blank. Then click OK.
You will see the results in the session.
From the nine independent variables, Minitab offers 17 models, and we can choose which one to use. The best model achieves an r-sq of 87.8% and an r-sq (adj) of 86.0%. You should rely on r-sq (adj), because it is relatively more stable against the addition or removal of variables in the model. Thus, the best model is the one containing the independent variables X1, X3, X4, X6, and X8. You can regress this combination as in the previous step.
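Adjusted r-square is more stable because it penalizes the model for every extra predictor: R²adj = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. A quick sketch (n = 40 is my own assumption for illustration, not a figure taken from the training data):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-square: penalize R-square for the p predictors,
    given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# With R-sq = 87.8% and 5 predictors, an assumed n = 40 yields an
# R-sq (adj) of about 86.0%, consistent with the Minitab output above.
print(round(adjusted_r2(0.878, 40, 5), 3))
```

Adding a predictor always raises r-sq, but it only raises r-sq (adj) if the variable explains enough to offset the penalty, which is why r-sq (adj) is the better yardstick for comparing subsets of different sizes.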
Now, what if there are variables that must not be discarded? Say one of them is the main variable in your research discussion. In that case, we use a Minitab function to force this variable into every equation. This is one of the advantages of Minitab that is not available in SPSS. For this exercise, suppose the variable that must not be discarded is X2; the settings are as follows:
The steps are the same as before; only the X2 variable moves to the predictors in all models column. Click OK. The results are as follows:
It can be seen that by forcing the variable X2 into the equation, there are 15 candidate equations with different R-sq (adj) values. The largest R-sq (adj) is 85.7%, and it is shared by two combinations: X1, X3, X4, X6, X8 and X1, X3, X4, X6, X7, X8. Both combinations, of course, also contain X2.
That was easy, wasn’t it? Imagine if you had to eliminate them one by one and test your combinations. Of course, it’s not as simple as this.
Eliminate Variables in Regression (SPSS)
SPSS offers several different techniques; I will explain three of them here: backward, forward, and stepwise.
Copy the Excel data to the SPSS worksheet. Don't forget to set the name of each variable in the variable view, with X3 set to nominal and the others to scale, according to the characteristics of the variables.
Backward
This regression technique finds the best model by first entering all the variables; SPSS then eliminates the insignificant variables one by one, refitting after each removal, until it arrives at a model that adequately represents the data. The colloquial term is the backward walk technique. The method: click analyze – regression – linear.
Enter the Y variable in the dependent column and the other variables in the independent column, then choose method: backward. (For an ordinary regression without variable elimination, you would use the enter method.) Then click OK.
The result was:
It can be seen that SPSS eliminated variable X5 in the second regression run, then eliminated X9, X2, X7, and X6 in the subsequent runs, respectively. The elimination is based on the probability-of-F criterion.
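The loop SPSS runs can be sketched as follows. The p-values here come from a toy function I invented for illustration; in a real run they are recomputed from each refitted model. With these fictitious values the loop drops X5 first, then X2, mirroring the behavior described above:

```python
ALPHA_REMOVE = 0.05  # a variable is removed while its p-value is not below this

def backward_eliminate(variables, p_value_fn):
    """Repeatedly drop the least significant predictor until every
    remaining predictor is significant."""
    current = list(variables)
    while current:
        pvals = p_value_fn(current)          # p-value of each predictor in the model
        worst = max(current, key=pvals.get)  # least significant predictor
        if pvals[worst] < ALPHA_REMOVE:
            break                            # everything left is significant
        current.remove(worst)
    return current

# Fictitious fixed p-values purely for illustration; real software refits
# the regression after every removal, so the p-values change each round.
toy_p = {"X1": 0.01, "X2": 0.30, "X3": 0.02, "X4": 0.01, "X5": 0.80, "X8": 0.03}
kept = backward_eliminate(list(toy_p), lambda cur: {v: toy_p[v] for v in cur})
print(kept)  # X5 goes first (p = 0.80), then X2; X1, X3, X4, X8 remain
```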
Then the table below explains the summary model that comes out of this backward technique:
It can be seen that the final result has an R-sq (adj) value of 85.7% with the combination of X4, X8, X3, and X1.
The table below shows the ANOVA table, with the significance of F for each combination.
It can be seen that all the models are significant below 0.05. The next table is the T test, i.e., the value of each coefficient.
Due to the limitations of my laptop screen, I cannot display the whole table. A T value is significant if its sig value is < 0.05, meaning the variable has a significant effect on the dependent variable. For example, in model 1, the significant variables are X1, X3, X4, and X8; the second model, the third, and so on are read the same way. The collinearity statistics indicate whether there is multicollinearity, that is, a relationship between the independent variables. Multicollinearity occurs if the VIF value is > 10 or the tolerance is < 0.1.
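For reference, both diagnostics come from the auxiliary regression of each predictor X_j on all the other predictors. A minimal sketch, where r2_j stands for that auxiliary r-square:

```python
def tolerance(r2_j):
    """Tolerance of predictor j, where r2_j is the R-square of
    regressing X_j on all the other predictors."""
    return 1.0 - r2_j

def vif(r2_j):
    """Variance inflation factor: the reciprocal of the tolerance."""
    return 1.0 / tolerance(r2_j)

# The two rules of thumb coincide: an auxiliary R-square of 0.9
# gives VIF = 10 and tolerance = 0.1 at the same time.
print(vif(0.9), tolerance(0.9))
```

So the closer a predictor comes to being fully explained by its colleagues, the larger its VIF and the smaller its tolerance.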
Forward
This technique is the opposite of the backward technique: SPSS starts from zero, entering one by one the variables considered significant in influencing the dependent variable, then gradually adds other variables until the best model is found.
The method is almost the same as the backward method, except that when entering the variables we choose the forward method.
The result was:
In the forward method, the combination selected as the best model is X4, X3, and X8, with an R-sq (adj) of 84.8%. SPSS also provides the ANOVA table, the coefficients, and multicollinearity diagnostics for both the included and the excluded variables. These tables are read the same way as in the backward explanation above.
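The forward loop is the mirror image of the backward one: start empty and keep adding the candidate with the smallest entry p-value while it stays under 0.05. Again with fictitious p-values, chosen so the toy run picks X4, then X3, then X8, like the order described above:

```python
ALPHA_ENTER = 0.05  # a candidate enters only if its p-value is below this

def forward_select(candidates, entry_p_fn):
    """Add the most significant remaining candidate until none qualifies."""
    chosen, remaining = [], list(candidates)
    while remaining:
        pvals = entry_p_fn(chosen, remaining)  # entry p-value per candidate
        best = min(remaining, key=pvals.get)
        if pvals[best] >= ALPHA_ENTER:
            break                              # no remaining candidate qualifies
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Fictitious fixed entry p-values; real software recomputes them after
# every addition.
toy_entry = {"X1": 0.20, "X2": 0.40, "X3": 0.005, "X4": 0.001, "X8": 0.010}
picked = forward_select(list(toy_entry),
                        lambda chosen, rem: {v: toy_entry[v] for v in rem})
print(picked)  # X4 enters first, then X3, then X8
```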
Stepwise
This method is very similar to the forward method. Both use the probability-of-F <= 0.05 criterion to select the variables that enter the model; the difference is that stepwise can also remove a variable from the model again if its probability of F rises to >= 0.1.
That is, under the forward method a variable that has entered the model stays in it while the search continues for the next variable. Stepwise, by contrast, keeps reconsidering the combination, especially the variables selected at the beginning.
In simple terms: if X1 is selected first by the forward method, X1 can never be eliminated again. With stepwise, X1 may still be eliminated as new variables enter, because each new combination is rechecked against the F criterion.
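Stepwise therefore combines the two loops: a forward step at 0.05, then a backward check at 0.10. The toy p-value functions below are invented so that X2 enters first but is thrown out again once the collinear X8 arrives, illustrating exactly the re-checking behavior described above:

```python
ALPHA_ENTER, ALPHA_REMOVE = 0.05, 0.10

def stepwise_select(candidates, entry_p_fn, model_p_fn):
    """Forward step at ALPHA_ENTER, then drop anything whose p-value
    in the refitted model has risen to ALPHA_REMOVE or above."""
    chosen, remaining = [], list(candidates)
    while remaining:
        entry = entry_p_fn(chosen, remaining)
        best = min(remaining, key=entry.get)
        if entry[best] >= ALPHA_ENTER:
            break                               # no candidate qualifies
        chosen.append(best)
        remaining.remove(best)
        pvals = model_p_fn(chosen)              # backward check on the new model
        for v in [v for v in chosen if pvals[v] >= ALPHA_REMOVE]:
            chosen.remove(v)
            remaining.append(v)
    return chosen

# Invented p-values: X2 and X8 stand in for two collinear predictors,
# so once X8 is in the model, X2 becomes redundant.
def toy_entry(chosen, remaining):
    base = {"X2": 0.01, "X8": 0.02, "X4": 0.03}
    p = {v: base[v] for v in remaining}
    if "X8" in chosen and "X2" in remaining:
        p["X2"] = 0.20      # X2 adds nothing once X8 is in the model
    return p

def toy_model_p(chosen):
    base = {"X2": 0.01, "X8": 0.02, "X4": 0.03}
    p = {v: base[v] for v in chosen}
    if "X8" in chosen and "X2" in chosen:
        p["X2"] = 0.20
    return p

print(stepwise_select(["X2", "X8", "X4"], toy_entry, toy_model_p))
# X2 enters first but is removed after X8 enters; final model: X8, X4
```

A pure forward run on the same toy values would have kept X2, which is precisely the difference the article describes.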
The result obtained in this exercise happens to be the same as the forward result.
These are the techniques and steps for eliminating variables in regression; I hope they are useful. Don't forget to share this with colleagues or friends who need it.
In addition to variable elimination, I have also explained how to eliminate respondents based on outlier data and respondent elimination based on the concept of R square.