Assumptions of the Classical Linear Regression Model: Introduction
To forecast the value of a dependent variable from the values of one or more independent variables, statisticians use the Classical Linear Regression Model (CLRM), which rests on a set of assumptions. It is one of the most frequently employed statistical models and is predicated on the idea that the relationship between the dependent and independent variables is linear. In other words, the change in the dependent variable follows the change in the independent variable in a linear fashion.
Additionally, the CLRM assumes that the model's errors are normally distributed and scattered randomly around the regression line. It also presumes that the errors are unrelated to one another, that is, they are independent of each other. Finally, the CLRM assumes that the variance of the errors is constant, meaning that the values of the independent variables have no impact on the variance of the errors. The validity and reliability of the CLRM depend on these assumptions.
Detection, Illness, and Removal of Multicollinearity
A linear regression model is predicated on four assumptions:
Linearity: There is a linear relationship between X and the mean of Y.
Homoscedasticity: The variance of the residual is the same for all values of X.
Independence: Observations are distinct from one another.
Normality: Y is normally distributed for any fixed value of X.
Second assumption of CLRM: None of the independent variables have a linear relationship with any other independent variables.
Above is the second assumption of the Classical Linear Regression Model (CLRM). We detect violations of it by taking each independent variable in turn, treating it as the dependent variable, and regressing it on the remaining independent variables to obtain the coefficient of determination R². If R² for such a model is zero, the variance inflation factor equals 1. The formula is VIF = 1 / (1 - R²). Therefore, any value greater than 1 indicates some degree of multicollinearity; some statisticians say a VIF greater than 5 shows severity, while others say a VIF greater than or equal to 10 should be considered severe. In this report, we will treat a VIF greater than or equal to 10 as severe. The illness of the model is the "inefficiency of the coefficients."
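As a minimal sketch of this detection step (not part of the original report, which does not use Python), statsmodels applies the same VIF = 1 / (1 - R²) formula; the data below are synthetic and only illustrate the mechanics:

```python
# Sketch of VIF detection; the data are synthetic and purely illustrative.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.2, size=n)   # deliberately close to x1
x3 = rng.normal(size=n)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# variance_inflation_factor regresses column j on the remaining columns
# and returns 1 / (1 - R^2), exactly the formula above.
for j, col in enumerate(X.columns):
    if col == "const":
        continue
    vif = variance_inflation_factor(X.values, j)
    flag = "severe (>= 10)" if vif >= 10 else "acceptable"
    print(f"VIF({col}) = {vif:.2f} -> {flag}")
```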
Removals of Multicollinearity
Four methods are used to remove multicollinearity:
1. If the variance inflation factor (VIF) is less than 10, leave the model as it is.
2. If VIF is greater than 10, then use the following methods:
A. Exclude the variable (however, only the control variable can be excluded)
B. Change the measure (e.g., use growth rate FDI instead of FDI in dollar terms)
C. Increase the sample size
Detection, Illness, and Removals of Autocorrelation
Fourth assumption of CLRM: Error term observations are independent of each other OR they are not correlated to each other.
Above is the 4th assumption of the Classical Linear Regression Model (CLRM); if the error term is correlated with its previous values, the problem of autocorrelation exists. An error is a mistake, and mistakes should be random; no one commits mistakes deliberately or intentionally. Suppose Z was an important variable but, due to a lack of literature review or an incomplete understanding of theory, the researcher forgot to include Z in the model; as a result of this omission, Z becomes part of the error term. When the error term shows a consistent run of signs such as +ve, +ve, +ve ... or -ve, -ve, -ve ..., we call it positive autocorrelation, and when the signs alternate, +ve, -ve, +ve, -ve ..., we call it negative autocorrelation. So, we must investigate whether such patterns are present in the error term.
Illness
Issues arise in the significance of the coefficients.
Detection
We use the Durbin-Watson (DW) method to detect the autocorrelation problem. The range of DW values is 0 to 4; if the DW value is 2, there is no autocorrelation.
If DW is less than 2, there is positive autocorrelation.
If DW is greater than 2, there is negative autocorrelation.
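A minimal sketch of this DW check, assuming Python with statsmodels and synthetic data whose errors follow an AR(1) process (so the statistic should fall well below 2):

```python
# Durbin-Watson sketch; the series is synthetic with positively autocorrelated errors.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + rng.normal()   # AR(1) errors -> positive autocorrelation
y = 1.0 + 2.0 * x + e

res = sm.OLS(y, sm.add_constant(x)).fit()
print("Durbin-Watson:", durbin_watson(res.resid))   # close to 2 means no autocorrelation
```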
Severity test
We use the serial correlation LM test to check the severity of the problem.
The hypothesis is Ho: there is no autocorrelation. If the p-value is greater than 0.05, we fail to reject the null hypothesis and conclude there is no autocorrelation.
If the p-value is less than or equal to 0.05, we reject the null hypothesis and conclude there is autocorrelation.
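As a sketch of this severity check (assuming Python with statsmodels rather than the software used in the report), the Breusch-Godfrey LM test is run on an OLS fit with deliberately autocorrelated synthetic errors:

```python
# Serial correlation LM (Breusch-Godfrey) test on synthetic AR(1)-error data.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + e

res = sm.OLS(y, sm.add_constant(x)).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = acorr_breusch_godfrey(res, nlags=2)
if lm_pvalue > 0.05:
    print(f"p = {lm_pvalue:.3f}: fail to reject H0 -> no autocorrelation")
else:
    print(f"p = {lm_pvalue:.3f}: reject H0 -> autocorrelation exists")
```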
Removals
- Addition of relevant variables. Remember that adding one variable will not necessarily solve the problem; you may need to add two, three, or more variables to the model. How do we know which variables are relevant? Firstly, we need to know the theory, and secondly, we need to review the literature.
- Cochrane-Orcutt procedure
- AR(1) correction
- HAC (Newey-West) standard errors (an AR(1)-style correction and HAC standard errors are sketched in the example below)
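The following sketch illustrates the last two remedies in Python with statsmodels on the same kind of synthetic AR(1)-error data; GLSAR's iterative fit plays the role of a Cochrane-Orcutt-style AR(1) correction, and the HAC option gives Newey-West standard errors:

```python
# Two remedies sketched on synthetic data with AR(1) errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + e
X = sm.add_constant(x)

# Cochrane-Orcutt-style AR(1) correction via iterated feasible GLS
glsar_res = sm.GLSAR(y, X, rho=1).iterative_fit(maxiter=10)
print("estimated rho:", glsar_res.model.rho)

# HAC (Newey-West) standard errors keep the OLS coefficients but fix the inference
hac_res = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 2})
print("HAC standard errors:", hac_res.bse)
```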
Detection, Illness, and Removal of Heteroscedasticity
Sixth assumption of CLRM: The error term has a constant variance.
Above is the 6th assumption of the Classical Linear Regression Model (CLRM); according to it, the error term should have constant variance across different groups of observations. In this example we have taken data from 31 countries, consisting of low-income, medium-income, and high-income countries. It is cross-sectional data, i.e., the data pertain to the year 1992 and to various entities (31 countries) with their respective GDP and Consumption figures. The illness of this model is that when the variance is non-constant (i.e., heteroskedasticity exists), the significance becomes doubtful.
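The report does not name a specific detection test in this section; as one common option, the following is a minimal Breusch-Pagan sketch on synthetic cross-sectional data standing in for the 31-country GDP/Consumption figures (which are not reproduced here):

```python
# Breusch-Pagan sketch; the error variance is made to grow with GDP on purpose.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
gdp = rng.uniform(1, 100, size=31)                         # stand-in for 31 countries
consumption = 5 + 0.8 * gdp + rng.normal(scale=0.1 * gdp)  # variance rises with GDP
X = sm.add_constant(gdp)
res = sm.OLS(consumption, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(res.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")   # a small p-value signals heteroskedasticity
```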
Understand the Concept of the Functional Form of Regression
As we know, OLS's first assumption that the "model should be linear in parameters" applies to the parameters, not to the variables. For example, Y = α + β1X + e is linear in both parameters and variables, while Y = α + β1X + β2X² + e is non-linear in the variable X; since it is still linear in the parameters, OLS can estimate it. However, if the functional form of the regression equation is non-linear, then we need to apply a non-linear form like TP = 6L² - 0.4L³, where TP is the total productivity of labor and L is the number of laborers used in production.
Various Non-linear Forms of Regression
U-shaped inverted curve (it has a peak) or U-shaped non-inverted curve (it has a trough), from the quadratic form Y = α + B1X + B2X²:
If B1 > 0 and B2 < 0, it is an inverted U-shaped curve (it has a peak).
If B1 < 0 and B2 > 0, it is a non-inverted U-shaped curve (it has a trough).
In either case, the turning point occurs at X = -B1 / (2B2).
Exponential growth, like Y = B1·X^B2·e^(Ut), can be transformed by taking the log of both sides of the equation: lnY = lnB1 + B2·lnX + Ut.
Since the log-form equation has already been discussed in the previous report on the Log in Regression Model, we will concentrate on the inverted U-shaped curve.
How do we know what is the functional form of the regression equation?
Firstly, theory guides us toward the non-linear functional form; for example, the Laffer curve says that the relationship between tax revenue and the tax rate is non-linear.
Secondly, there is a test known as the Ramsey RESET test: if we add the square of the estimated "y" (the fitted values) as an additional independent variable and it shows a significant value, then we have to apply a non-linear form of the regression model.
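As a rough, hand-rolled sketch of that RESET idea (not necessarily the exact procedure used in the report), the following refits a linear model with the squared fitted values added and checks whether that extra term is significant, using synthetic data built from the TP = 6L² - 0.4L³ example above:

```python
# RESET-style check: add the square of the estimated y and test its significance.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
L = rng.uniform(1, 10, size=120)
TP = 6 * L**2 - 0.4 * L**3 + rng.normal(scale=5, size=120)   # true non-linear relationship

X_lin = sm.add_constant(L)
lin_res = sm.OLS(TP, X_lin).fit()                            # misspecified linear fit

X_reset = np.column_stack([X_lin, lin_res.fittedvalues**2])  # add squared fitted values
reset_res = sm.OLS(TP, X_reset).fit()
print("p-value on fitted^2:", reset_res.pvalues[-1])         # significant -> use a non-linear form
```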
STATIONARY SERIES AND UNIT ROOT TEST
The concepts of stationary and non-stationary series are examined here together with the ADF test (Augmented Dickey-Fuller test). The following are typical characteristics of a stationary series:
Constant mean
Constant variance
Autocovariance that depends only on the lag, not on time
The following three methods are used to detect whether the series is stationary or non-stationary
Graphical method
Autocorrelation Function (ACF)
Dickey-Fuller (DF) test
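A minimal sketch of the Dickey-Fuller idea, assuming Python's statsmodels ADF implementation (the report itself works in EViews); the two synthetic series contrast a stationary AR(1) process with a random walk:

```python
# ADF unit-root check on a stationary series and a random walk.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(4)
shocks = rng.normal(size=300)
stationary = np.zeros(300)
random_walk = np.zeros(300)
for t in range(1, 300):
    stationary[t] = 0.5 * stationary[t - 1] + shocks[t]   # mean-reverting
    random_walk[t] = random_walk[t - 1] + shocks[t]       # unit root

for name, series in [("stationary AR(1)", stationary), ("random walk", random_walk)]:
    stat, pvalue = adfuller(series)[:2]
    print(f"{name}: ADF p-value = {pvalue:.4f}")   # p < 0.05 -> reject the unit root
```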
Autoregressive Distributed lag model (ARDL) and Granger Causality with EViews
There are two techniques through which we understand the relationship between or among the variables namely,
- Interdependence techniques (like Covariance & Correlation).
- Dependence techniques (like regression, in which we know which variable depends on which; causation is also a dependence technique). If we regress a variable on its own lagged values, the effect that arises from those lag values is considered causation, i.e., in such cases the dependent variable is the effect of some cause. If we have a model in which the dependent and independent variables are the same series, but the independent variable enters with lagged values, and they are regressed on each other, we call the model autoregressive. Researchers quite often use the Autoregressive Distributed Lag (ARDL) model to understand causality and its effect on the dependent variable. To determine the causality, we have performed the Granger causality test in EViews (a Python equivalent is sketched below).
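The Granger causality test in the report is run in EViews; the following is a rough Python equivalent with statsmodels, on synthetic series where x leads y by one period (so x should be found to Granger-cause y):

```python
# Granger causality sketch: does the second column help predict the first?
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(5)
n = 200
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.6 * x[t - 1] + rng.normal()   # y depends on lagged x

data = np.column_stack([y, x])             # tests whether x Granger-causes y
results = grangercausalitytests(data, maxlag=2)
```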
Understanding the Dummy Variables
In this example, data are taken from France's daily average salary, area-wise and gender-wise, at different age levels. We have 222 observations, where Salary is the dependent variable and Area and Age are independent variables. According to this data, there are two scenarios:
- How much age and area cause a change in salary irrespective of gender?
- How much do age and area impact salary when gender (male/female) is considered as a dummy variable? (Both scenarios are sketched in the example below.)
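The France salary data are not reproduced here; the following sketch uses made-up values with the column names implied by the description above (Salary, Age, Area, Gender, all illustrative) to show how the gender dummy enters the two scenarios:

```python
# Dummy-variable sketch on synthetic data shaped like the description above.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 222
df = pd.DataFrame({
    "Age": rng.integers(20, 60, size=n),
    "Area": rng.choice(["Paris", "Lyon", "Marseille"], size=n),   # hypothetical areas
    "Gender": rng.choice(["Male", "Female"], size=n),
})
df["Salary"] = 50 + 1.5 * df["Age"] + 10 * (df["Gender"] == "Male") + rng.normal(scale=5, size=n)

# Scenario 1: Age and Area only; Scenario 2: Gender added as a dummy variable.
res1 = smf.ols("Salary ~ Age + C(Area)", data=df).fit()
res2 = smf.ols("Salary ~ Age + C(Area) + C(Gender)", data=df).fit()
print(res2.params)   # the C(Gender)[T.Male] coefficient is the male/female salary gap
```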
Use of Log-in Regression Model
The mathematical function of the log is used for two reasons:
- To normalize data if an outlier exists in the data
- To convert changes in the observations into percentage terms, so that coefficients can be interpreted as percentage changes (a sketch follows this list)
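A minimal sketch of the second use, on synthetic data: in a log-log regression the slope coefficient can be read as an elasticity, i.e., the percentage change in Y for a 1% change in X:

```python
# Log-log regression sketch: the slope is an elasticity (about 0.5 here by construction).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(1, 100, size=80)
y = 3.0 * x**0.5 * np.exp(rng.normal(scale=0.1, size=80))   # multiplicative noise

log_res = sm.OLS(np.log(y), sm.add_constant(np.log(x))).fit()
print("elasticity estimate:", log_res.params[1])   # ~0.5: a 1% rise in x -> ~0.5% rise in y
```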