CHAPTER 06.05: ADEQUACY OF REGRESSION MODELS: Check Four: Assumption of Random Errors

 

In this segment, we're going to talk about how to figure out whether a particular model meets the assumption of random errors. Once we have developed the regression model, there are four things we have to consider to see whether it meets that assumption.

The first thing we have to check is whether the residuals are both negative and positive. The residuals are the differences between the observed and the predicted values. Second, we have to see whether the variation of the residuals as a function of the independent variable is random. Third, we have to see whether the residuals follow a normal distribution. And the fourth part of this check is whether there is any autocorrelation between the data points which have been given to you. So keep in mind that, although we are bundling it all as one check, check number 4, it has four components, and we're going to take an example to go through them.

Now, before I go any further: in the example we have used so far, we had only six data points, but now we are choosing 22 data points. This is done mainly to illustrate the process a little better, because with only six data points we would not be able to illustrate the concept as well. As far as the previous three checks are concerned, you can carry them out with these 22 data points too; the algebra will be a little longer, and that's the only reason I didn't show them with all 22 data points. So we're taking 22 data points only for check number 4, but as a homework exercise you can apply the first three checks to the data given here.

So we are given 22 data points and we are plotting the data. The data are the green dots which you are seeing there, and the red line is the model itself. The model which you are seeing right here is nothing but this red line. From that perspective, the straight line looks like a reasonable approximation to the data, although you can see that a parabola, or a second-order polynomial, might give a better estimate.

Now we calculate the residuals, the differences between the observed and predicted values, and what we find is that, yes, we have negative residuals and we have positive residuals as well. So that takes care of the first check, whether we are getting both positive and negative residuals, which does seem to be the case for this example.

Then we want to see what the histogram of the residuals looks like. What we have done is count, for each interval of residual values, how many times a residual falls between this number and this number, and so on. What you find is that the histogram of the residuals does not follow the normal distribution.
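If you want to try the residual calculations yourself, here is a minimal sketch in Python. Keep in mind that the 22 data points from the slide are not reproduced in this transcript, so the data below (`x_demo`, `y_demo`) are hypothetical values on a parabola, chosen only because they show the same kind of behavior: a straight line fitted to curved data. The function names `fit_line`, `residuals`, and `histogram_counts` are also just illustrative names, not anything from the lecture.

```python
def fit_line(x, y):
    """Least-squares straight line y = a0 + a1*x; returns (a0, a1)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    a1 = sxy / sxx
    a0 = y_mean - a1 * x_mean
    return a0, a1


def residuals(x, y, a0, a1):
    """Residual = observed value minus value predicted by the line."""
    return [yi - (a0 + a1 * xi) for xi, yi in zip(x, y)]


def histogram_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning min..max.

    Assumes the values are not all identical (bin width must be nonzero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would index one past the end, so clamp it.
        k = min(int((v - lo) / width), n_bins - 1)
        counts[k] += 1
    return counts


# HYPOTHETICAL data, not the lecture's: 22 points lying on a parabola.
x_demo = list(range(22))
y_demo = [xi ** 2 for xi in x_demo]

a0_demo, a1_demo = fit_line(x_demo, y_demo)
res_demo = residuals(x_demo, y_demo, a0_demo, a1_demo)

# First component of the check: are there residuals of both signs?
n_pos = sum(1 for r in res_demo if r > 0)
n_neg = sum(1 for r in res_demo if r < 0)

# Third component: bin the residuals and look at the shape of the counts.
hist_demo = histogram_counts(res_demo, 5)

print("positive:", n_pos, "negative:", n_neg)
print("histogram counts:", hist_demo)
```

For this made-up data you get residuals of both signs, but the histogram counts pile up at one end instead of forming a bell shape. The choice of five bins is arbitrary, and looking at bin counts is only a visual check, not a formal test of normality.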
The normal distribution is not followed, so that would tell you that this particular model is not adequate.

Next, let's check whether there is any autocorrelation in this data. To figure that out, we count how many times the sign of the residual changes as you go through the consecutive data points; call that number q. Then we check whether q falls in this particular range, which is based on n, the number of data points. In our case we have 22 data points, so when we substitute n equal to 22 into this formula right here, we find that q needs to be somewhere between 5.9 and 15.1. Since q is a count, that means the residual needs to change sign between 6 and 15 times.

So let's go ahead and see whether that is the case. It is not. Look at this: the residual is not changing sign through these first data points. Only here does the residual change sign. It then does not change sign as you go through all these data points, then it changes sign once more, and again it does not change sign as you go down the rest of the rows. So the residual is changing sign in only two places, so q is equal to 2. The value of q which we are getting is not between 5.9 and 15.1, which tells you that there is autocorrelation in the data.

And that's the end of this segment.
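The sign-change count can be sketched in a few lines of Python. The formula on the slide is not reproduced in this transcript, so the range used here, (n-1)/2 - sqrt(n-1) <= q <= (n-1)/2 + sqrt(n-1), is an assumption; it is a commonly quoted rule, and for n = 22 it gives about 5.92 and 15.08, which agrees with the numbers mentioned in the lecture. The residual list is hypothetical (only the signs matter), and dropping residuals that are exactly zero before counting is also a choice made for this sketch.

```python
import math


def count_sign_changes(residuals):
    """Number of sign changes q between consecutive residuals.

    Residuals that are exactly zero are skipped (an assumption of this sketch).
    """
    signs = [1 if r > 0 else -1 for r in residuals if r != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


def sign_change_bounds(n):
    """Assumed acceptable range for q: (n-1)/2 -/+ sqrt(n-1)."""
    half = (n - 1) / 2
    spread = math.sqrt(n - 1)
    return half - spread, half + spread


def suggests_autocorrelation(q, n):
    """True when q falls outside the assumed range for n data points."""
    lower, upper = sign_change_bounds(n)
    return not (lower <= q <= upper)


# HYPOTHETICAL residuals for 22 points: positive, then negative, then positive.
res_signs_demo = [1.0] * 5 + [-1.0] * 12 + [1.0] * 5

n_demo = len(res_signs_demo)
q_demo = count_sign_changes(res_signs_demo)
lower_demo, upper_demo = sign_change_bounds(n_demo)

print("n =", n_demo, "q =", q_demo)
print("range: %.2f to %.2f" % (lower_demo, upper_demo))
print("autocorrelation suggested:", suggests_autocorrelation(q_demo, n_demo))
```

With two sign changes out of 22 points, q falls well below the lower end of the range, which is the same conclusion reached in the example.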