How to Eliminate Respondent Data to Get a Better Regression Model

This article continues the discussion of regression. This time I suggest a step you can take when all the usual regression steps have been carried out but the results are still not as expected or hypothesized: eliminating the data of respondents that distort the distribution of the rest of the data. Such data are called outliers.

Actually, results that do not match expectations are still results. Researchers are not allowed to manipulate data, because doing so violates their code of ethics and their morals as researchers. So, whatever results or conclusions you obtain from an observation, report them as they are. There is no need to be afraid that your results differ from those of other studies. In fact, differing results add to the variety of research findings, possibly because they were obtained under different conditions.

Let me take an example that I actually experienced. I was once part of a team researching plantation crop-cattle integration. In theory, and as most previous researchers have reported, integrating plantation crops with cattle is more profitable than cultivating plantation crops alone or raising cattle alone (not integrated). In reality, however, my observations and surveys showed that integrated farmers did not differ significantly from non-integrated farmers. Of course, the point to underline is that my study took place in a different situation, and it is this different situation that I need to discuss when explaining why my results differ from those of others. In socio-economic research, the same study conducted in different areas can produce different conclusions.

Outlier Data

An outlier is a data point that deviates greatly from the rest of the data, and it can strongly affect the statistics of the data set. Take an example: a village has 40 families. The members of 38 families work as farmers, teachers, and laborers, while the other two families include council members and owners of well-known factories in Indonesia. If we collect asset data, we will of course see two values that deviate greatly from the rest of the data set. These are outliers. We will find out how such outliers affect the goodness of the resulting regression model.
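To make the village example concrete, here is a minimal Python sketch. The asset figures are invented purely for illustration (38 ordinary families and 2 very wealthy ones) and do not come from any real survey.

```python
from statistics import mean, median

# Hypothetical household assets in millions of rupiah (invented numbers).
ordinary = [50 + i for i in range(38)]   # 38 families: 50, 51, ..., 87
wealthy = [5000, 8000]                   # the 2 outlier families

village = ordinary + wealthy

print("mean without outliers  :", mean(ordinary))
print("mean with outliers     :", mean(village))
print("median without outliers:", median(ordinary))
print("median with outliers   :", median(village))
```

In this made-up data set the two wealthy families pull the mean far away from what a typical family owns, while the median barely moves. Regression coefficients are built from means and sums of squares, so they are sensitive to outliers in the same way.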

To anticipate outliers, lecturers usually recommend collecting more respondents than the minimum number needed to run a regression. Suppose the regression requires 30 respondents; you should then look for more than 30, for example 35 or even 40. This gives you room in case you have to eliminate some respondent data. However, the number of respondents must still follow the sampling technique. Hopefully one day I can share information about determining the sample.

You must be able to distinguish between sample data and population data. You may use this respondent data elimination technique only if your data come from a sample. If your data cover the whole population, you are of course not allowed to eliminate respondent data, not even one person.

Respondent Data Elimination Technique Exercise

Let us practice directly so that we can see how outlier respondents, even just a few of them, can have a large effect on the regression results.

I used SPSS. The raw data can be downloaded here:

I ran a multiple regression on the data, and the results I obtained are as follows:

[Image: outlier data affects regression]

The R Square value is very good, and the F value from the ANOVA test is also very good. However, looking at the coefficients of the independent variables, only X1 is significant, with a p-value of 0.00. X2 is not significant because its p-value is 0.89, which is greater than 0.05.

For this exercise, assume that according to theory and my hypothesis, X2 should significantly affect the value of Y.

As the title of this article suggests, I will examine the data of the 35 respondents. First, I check the distribution of the X2 data.

The method: on the SPSS menu, select Graphs – Legacy Dialogs – Histogram.

[Image: outlier data affects regression]

Enter the variable X2 and press OK.

[Image: respondent data elimination]

A histogram will appear on the SPSS output sheet. To add detail to the chart, double-click it so that the SPSS Chart Editor opens. Then select Elements – Show Data Labels.

[Image: multiple regression]

Select Count in the window that appears, then click Apply and Close. Each bar of the histogram now shows a number giving its frequency.

[Image: selecting respondent data]

The histogram is still incomplete until we add a normal distribution curve by selecting Elements – Show Distribution Curve.

[Image: data elimination in regression]

Then select Normal on the Distribution Curve tab in the window that appears. Click Apply and then Close. Close the Chart Editor. The result is as follows:
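For readers without SPSS, the same idea (bar counts plus a fitted normal curve) can be sketched in plain Python. This is only a rough text version of the chart, not a reproduction of the SPSS output. The `x2` values and the function name `histogram_with_normal` are hypothetical; the values were invented to resemble the pattern discussed in this exercise.

```python
from statistics import NormalDist, mean, stdev

def histogram_with_normal(values, bin_width):
    """Return one (bin_start, observed, expected) tuple per bin.

    expected is the count that a normal curve fitted to the data
    (same mean and standard deviation) would place in that bin.
    """
    fitted = NormalDist(mean(values), stdev(values))
    start = (min(values) // bin_width) * bin_width
    rows = []
    while start <= max(values):
        end = start + bin_width
        observed = sum(1 for v in values if start <= v < end)
        expected = len(values) * (fitted.cdf(end) - fitted.cdf(start))
        rows.append((start, observed, expected))
        start = end
    return rows

# Hypothetical X2 values (invented, not the article's raw data).
x2 = [40, 45, 50, 52, 55, 58, 60, 61, 63, 65,
      66, 68, 70, 72, 75, 78, 80, 150, 210, 240]

bin_width = 25
for start, observed, expected in histogram_with_normal(x2, bin_width):
    label = f"{start}-{start + bin_width}"
    print(f"{label:<9}{'#' * observed:<14} count={observed}  normal={expected:.1f}")
```

Bars that sit far out in the tail, separated from the main group by empty bins, are the candidates to inspect as outliers.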

[Image: histogram, eliminate respondent]

The figure shows three data points that can be categorized as outliers: one at an X2 value of 150 and two at X2 values between 200 and 250. All three also lie outside the normal distribution curve.
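The judgment above is made by eye. If you want a numeric cross-check, one common alternative is the 1.5 × IQR rule, sketched below. Note that this is a different technique from the visual histogram check used in this article, and the `x2` values are hypothetical, invented to mimic one point at 150 and two points between 200 and 250.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)   # first, second and third quartile
    spread = q3 - q1                     # interquartile range (IQR)
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical X2 values (invented, not the article's raw data).
x2 = [40, 45, 50, 52, 55, 58, 60, 61, 63, 65,
      66, 68, 70, 72, 75, 78, 80, 150, 210, 240]

print(iqr_outliers(x2))   # flags 150, 210 and 240 for this made-up data
```

A rule like this does not decide for you whether a respondent should be removed. It only makes the criterion explicit, so that it can be written down in the report.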

So, I eliminated the data of those respondents. Delete the Y, X1, and X2 values together; in SPSS, simply cut the entire row of each respondent to be eliminated. In this case I eliminated three data points, so my data set went from 35 to 32 data points.
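The important detail is that the whole row goes, not just the X2 value. A minimal Python sketch of this row-wise deletion follows; the rows and respondent IDs are invented for illustration.

```python
# Hypothetical data rows: (respondent_id, Y, X1, X2). All values are invented.
rows = [
    (1, 310, 42000, 60),
    (2, 295, 39000, 150),
    (3, 330, 45000, 72),
    (4, 305, 41000, 210),
    (5, 340, 47000, 240),
    (6, 320, 43000, 66),
]

# IDs judged to be outliers on X2 (chosen by eye here, as with the histogram).
outlier_ids = {2, 4, 5}

# Drop the whole row so Y, X1 and X2 stay aligned for the remaining respondents.
kept = [row for row in rows if row[0] not in outlier_ids]
removed = [row for row in rows if row[0] in outlier_ids]

print(f"{len(rows)} rows before, {len(kept)} rows after")
print("removed respondents:", [row[0] for row in removed])
```

Keeping the `removed` list, rather than overwriting the original data, makes it possible to report exactly which respondents were eliminated and why.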

I did not stop there. I did the same for variable X1 and obtained the following histogram:

[Image: histogram in SPSS regression]

This figure also shows two outliers, namely the data points between 0 and 20,000. I eliminated these as well because they lie outside the rest of the data. In total I eliminated 5 data points, leaving 30.

I then regressed the remaining data again. The results are as follows:

[Image: data elimination results]

Now both independent variables have a significant effect on the dependent variable. The p-value for X2 is 0.025, which is less than 0.05.
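The before-and-after comparison can also be sketched outside SPSS. The Python example below fits an ordinary least squares regression twice, once on all rows and once without the flagged rows, and prints the t statistic of each coefficient. Everything in it is an assumption made for illustration: the data are invented (ten rows that follow Y = 10 + 2·X1 + 3·X2 plus small noise, and two outlier rows that do not), and the helper names `solve` and `ols` are my own. It reports t statistics rather than p-values because the Python standard library has no t distribution; compare |t| with the critical value from a t table for n − 3 degrees of freedom.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(y, *xs):
    """Least-squares fit of y on the predictors xs, with an intercept.

    Returns (coefficients, t_statistics), intercept first.
    """
    rows = [[1.0, *vals] for vals in zip(*xs)]          # design matrix
    n, k = len(y), len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    sigma2 = sum(e * e for e in resid) / (n - k)        # residual variance
    # Diagonal of the inverse of X'X, obtained one unit vector at a time.
    inv_diag = [solve(xtx, [1.0 if i == j else 0.0 for i in range(k)])[j]
                for j in range(k)]
    t_stats = [b / math.sqrt(sigma2 * d) for b, d in zip(beta, inv_diag)]
    return beta, t_stats

# Invented data. The last two rows are the outliers on X2.
x1 = [3, 7, 2, 9, 5, 8, 1, 6, 4, 10, 5, 6]
x2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50, 60]
y = [19.5, 29.5, 23.3, 39.7, 35.4, 43.6, 33.2, 45.8, 45.1, 59.9, 20.0, 22.0]

beta_all, t_all = ols(y, x1, x2)
beta_clean, t_clean = ols(y[:10], x1[:10], x2[:10])     # outlier rows removed

print("all rows     : X2 coefficient", round(beta_all[2], 3), " t =", round(t_all[2], 2))
print("outliers gone: X2 coefficient", round(beta_clean[2], 3), " t =", round(t_clean[2], 2))
```

Because the two outlier rows were constructed not to follow the pattern of the other ten, removing them changes the X2 coefficient and its t statistic considerably. Real data will rarely be this clear-cut.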

You can also use Minitab to create the histogram: on the menu, choose Graph – Histogram, then select With Fit. The result for variable X2 is as follows:

[Image: respondent data elimination technique]

This respondent data elimination technique is rarely publicized because it is one of the secrets of data processors. In addition, because it is very close to the practice of manipulating data, lecturers do not present this material in class. The better you understand data patterns and regression, the greater the temptation to change the data. Changing the data is not justified. Researchers are expected to be honest. It reminds me of a slogan: “Researchers can be wrong, but they can’t lie”.

