Dummy variables in regression are handled slightly differently from other variables, both in data processing and when reading regression results. Linear regression, or multiple regression, is a method that models the relationship between independent variables and a dependent variable. One dependent variable (Y) is usually influenced by several independent variables (X). For example, a production variable may be influenced by land area, fertilizer, labor, and capital.
Regression has several requirements that must be met. Because regression belongs to parametric statistics, its variables should of course be measured on an interval or ratio scale. In addition, the data must satisfy the classical assumption tests. However, one or two of the variables we use may be measured on a nominal or ordinal scale. Nominal- or ordinal-scale variables in regression are commonly known as dummy variables.
To make this easier to understand, here are two examples of dummy variables in regression. Say we want to know the effect of gender on income spent at the mall. We create a gender variable with a value of 0 for men and 1 for women. Another example is the effect of farmer group membership on farmers' income: we create a membership variable with a value of 0 for farmers who are not members and 1 for farmers who are members of a farmer group. Coding a dummy this way is similar to how we distinguish the treatment levels of an experiment from the control.
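To make the coding concrete, here is a minimal sketch in Python. The variable name and sample values are illustrative assumptions, not data from this article:

```python
# Illustrative sketch of 0/1 dummy coding; the sample values are assumed.
genders = ["male", "female", "female", "male", "female"]

# Code women as 1 and men as 0, as in the mall-income example.
gender_dummy = [1 if g == "female" else 0 for g in genders]
print(gender_dummy)  # [0, 1, 1, 0, 1]
```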
Dummy variables in regression are different from logistic regression. In logistic regression, the nominal-scale variable is the dependent variable (the Y value), while the dummy discussed here is a nominal- or ordinal-scale variable among the independent variables (the X values). Can dummies also appear in logistic regression?
Assigning the values 0 and 1 also has its own technique. To make the regression output easy to interpret, the value 1 should be given to the respondents who are expected to influence the value of Y. In the farmer-membership example above, my hypothesis is that membership influences farmer income, so I give the value 1 to farmers who are members of farmer groups. The coefficient on this variable will then be the difference in Y between member and non-member farmers. If you reverse the coding, nothing is actually wrong, but the coefficient will likely come out negative. The calculation is still correct; you just need to understand how to explain the negative value.
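The sign-flip behavior is easy to verify with a small least-squares fit. This is a sketch on made-up numbers (not this article's dataset), using NumPy:

```python
import numpy as np

# Made-up incomes: non-members (d = 0) earn less than members (d = 1).
d = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
income = np.array([10.0, 11.0, 12.0, 14.0, 15.0, 16.0])

# OLS with an intercept: income = b0 + b1 * d.
X = np.column_stack([np.ones_like(d), d])
b0, b1 = np.linalg.lstsq(X, income, rcond=None)[0]

# Reverse the coding (members = 0, non-members = 1) and refit.
Xr = np.column_stack([np.ones_like(d), 1.0 - d])
c0, c1 = np.linalg.lstsq(Xr, income, rcond=None)[0]

# The dummy coefficient is the difference in group means; flipping the
# coding flips only its sign, not its size.
print(round(b1, 6), round(c1, 6))  # 4.0 -4.0
```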
Now let's practice using dummy variables in regression in the Minitab application.
I have training data that can be downloaded here:
The data are simulated data that I generated randomly in Excel. There are 5 independent variables, and one of them, X2, is a dummy variable.
Let's open the application; I am using Minitab 17.
We enter the data in the Minitab sheet.
Then click Stat – Regression – Regression – Fit Regression Model.
Under Responses, enter variable Y; under Continuous predictors, enter X1, X3, X4, and X5. Because X2 is a dummy variable, we enter it under Categorical predictors.
The R-sq value of the model is 65.09%, meaning that about 65% of the variation in the data can be explained by the Minitab model. It can be said that this model represents the data reasonably well.
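As a reminder of what R-sq measures, here is a minimal computation on made-up numbers (not this article's dataset):

```python
import numpy as np

# Toy data, assumed for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Fit a simple OLS line and compute the fitted values.
slope, intercept = np.polyfit(x, y, 1)
y_fit = intercept + slope * x

# R-sq = 1 - SS_residual / SS_total: the share of variation explained.
ss_res = np.sum((y - y_fit) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_sq = 1 - ss_res / ss_tot
print(round(r_sq, 2))  # 0.6, i.e. 60% of the variation is explained
```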
Judging from the p-values, among the five variables only X2 has a value below 0.05. This means that only X2 significantly affects the value of Y. Judging from the VIF values, variables X1 and X4 have values above 10, meaning these two variables have a multicollinearity problem (I have discussed this in the classical assumption tests).
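The VIF rule of thumb can also be illustrated. Below is a sketch on simulated data (not this article's dataset) of the formula VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the other predictors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated predictors: x4 is built almost entirely from x1, so the
# two should be strongly collinear.
x1 = rng.normal(size=100)
x4 = 2.0 * x1 + rng.normal(scale=0.1, size=100)

# Regress x1 on x4 (with an intercept) and compute its R-squared.
X = np.column_stack([np.ones_like(x4), x4])
coef = np.linalg.lstsq(X, x1, rcond=None)[0]
resid = x1 - X @ coef
r_sq = 1 - np.sum(resid ** 2) / np.sum((x1 - x1.mean()) ** 2)

# A VIF above 10 flags a multicollinearity problem.
vif = 1 / (1 - r_sq)
print(vif > 10)  # True
```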
For this article, I will assume the output above has already been corrected according to the classical assumption tests, and I will explain only the dummy variable output, in keeping with the theme of this article.
In the coefficient column, level 1 of the X2 variable has a coefficient of 3876. This means that respondents coded 1 have a Y value that is significantly higher, by 3876, than respondents coded 0. The same result can be derived from the regression equations at the bottom of the output, which are as follows:
The regression model when X2 is 0 is Y = 5468 + 2.89X1 + 19.0X3 + 5.74X4 + 1.49X5, while the model when X2 is 1 is Y = 9344 + 2.89X1 + 19.0X3 + 5.74X4 + 1.49X5. The coefficient of 3876 is the difference between the two intercepts (9344 − 5468), assuming X1, X3, X4, and X5 take the same values.
So it can be concluded that observations with X2 = 1 have a Y value 3876 higher than observations with X2 = 0.
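The arithmetic behind this conclusion can be sketched directly from the two equations. The coefficients are copied from the output above; the predictor values plugged in below are arbitrary:

```python
# The two fitted equations from the Minitab output, written as one
# function of X1, X3, X4, X5 and the dummy X2.
def y_hat(x1, x3, x4, x5, x2):
    intercept = 9344 if x2 == 1 else 5468
    return intercept + 2.89 * x1 + 19.0 * x3 + 5.74 * x4 + 1.49 * x5

# Holding X1, X3, X4, X5 fixed at arbitrary values, the gap between
# the two groups is exactly the dummy coefficient.
gap = y_hat(10, 5, 3, 7, x2=1) - y_hat(10, 5, 3, 7, x2=0)
print(round(gap, 6))  # 3876.0
```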
Reading the other regression coefficients is different. For a continuous variable, that is, an interval- or ratio-scale variable, the coefficient is read as follows: each additional unit of the independent variable increases the dependent variable by the coefficient value.
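For contrast, here is the continuous-coefficient reading using X1's coefficient (2.89) from the same output; the predictor values below are arbitrary:

```python
# Equation for the X2 = 0 group, coefficients from the output above.
def y_hat0(x1, x3, x4, x5):
    return 5468 + 2.89 * x1 + 19.0 * x3 + 5.74 * x4 + 1.49 * x5

# Raising X1 by one unit with everything else fixed changes Y by 2.89,
# the coefficient of X1.
delta = y_hat0(11, 5, 3, 7) - y_hat0(10, 5, 3, 7)
print(round(delta, 2))  # 2.89
```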
The steps in SPSS differ from those in Minitab. I also give the steps in SPSS because SPSS is widely used. In SPSS, nominal- and ordinal-scale variables are distinguished when the data are first entered, while the analysis itself proceeds the same as an ordinary multiple or linear regression. SPSS recognizes the dummy variable once you mark it as being on a nominal scale.
Let’s open SPSS, then copy the data to the SPSS sheet.
In the Variable View tab, I marked X2 as a dummy variable, that is, a nominal-scale variable. Look at the Measure column in the image below.
Then click Analyze – Regression – Linear. Enter Y in the Dependent box and all X variables in the Independent(s) box, then click OK.
The result was:
The results are the same as those produced by Minitab. You can see the coefficient value in the Coefficients table, column B, with a value of 3876. However, SPSS does not print separate equations for the two values of X2 as Minitab does.