in machine learning
In machine learning, regression analysis is a key concept, formalized by the
Classical Linear Regression Model (CLRM). It is classified as supervised
learning because the algorithm is trained on both inputs and their output labels.
By estimating how one variable influences another, regression helps establish
the relationship between the variables.
Assume you’re in the market for a car and have decided that
gas mileage will be a deciding factor in your purchase. How would you go about
predicting the miles per gallon (MPG) of some prospective cars? Because you know
each car’s characteristics (weight, horsepower, displacement, and so on),
regression is a viable option. You can use regression techniques to identify
the relationship between MPG and the input data by modeling the average MPG
of a car given its features. The regression function can be written
as $Y = f(X)$, with $Y$ representing the MPG and $X$ the input features such as weight, displacement, and horsepower. The function $f$ is what we want to estimate;
once we have it, we can predict the MPG of a car we are considering and use that
prediction to inform the purchase. This technique is called regression.
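As a minimal sketch of the car example, the built-in mtcars data set can stand in for the car data. The choice of data set, the three predictors, and the values for the hypothetical new car are assumptions made for illustration only.

```r
# Assumption: mtcars (shipped with R) stands in for the car data in the text.
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Estimated f: an intercept plus one slope per feature.
print(coef(fit))

# Predict MPG for a hypothetical car (feature values invented for illustration).
new_car <- data.frame(wt = 2.8, hp = 120, disp = 160)
pred <- predict(fit, newdata = new_car)
print(pred)
```

The fitted object plays the role of $f$: given the features of a prospective car, predict() returns its estimated MPG.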
R is a statistical computing and graphics language and
environment. It is a GNU project that is similar to the S language and
environment established by John Chambers and colleagues at Bell Laboratories
(formerly AT&T, now Lucent Technologies). R can be considered a different
implementation of S. Although there are some important differences, much of
the code written for S runs unaltered under R.
R is highly extensible and offers a wide range of statistical
(linear and nonlinear modeling, classical statistical tests, time-series
analysis, classification, clustering, etc.) and graphical tools. The S
language is frequently used for research in statistical methods, and
R provides an open-source route to taking part in that work.
One of R’s advantages is how simple it is to create
well-designed publication-quality graphs, complete with mathematical symbols
and formulae when needed. The defaults for minor design choices in graphics
have been carefully chosen, but the user retains complete control.
R is accessible in source code form as Free Software under
the provisions of the Free Software Foundation’s GNU General Public License. It
compiles and operates on a wide range of UNIX and related systems (including
FreeBSD and Linux), as well as Windows and macOS.
Building a linear regression model is only half of the job. To be usable in
practice, the model must conform to the assumptions of linear regression.
There are ten assumptions on which a linear regression model is based:
- The regression model is linear in parameters
- The mean of residuals is zero
- Homoscedasticity of residuals or equal variance
- No autocorrelation of residuals
- The X variables and residuals are uncorrelated
- The number of observations must be greater than the number of Xs
- The variability in X values is positive
- The regression model is correctly specified
- No perfect multicollinearity
- Normality of residuals
The regression model is linear in parameters
According to Assumption 1, the
dependent variable must be a linear combination of the explanatory variables and
the error term. Assumption 1 requires that the specified model be linear in its
parameters, but not necessarily in its variables. Equations 1 and 2 show a model
that is linear in both parameters and variables. Note
that Equations 1 and 2 depict the same model in different notation.
(1) $Y = X\beta + \varepsilon$
(2) $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_K x_{iK} + \varepsilon_i$
For OLS to work, the
specified model must be linear in parameters. If the true
relationship between the dependent variable and an explanatory variable is
nonlinear in a parameter, the coefficient cannot be estimated in any meaningful
way. Equation 3 shows an empirical model in which the parameter $\beta_1$ enters quadratically.
(3) $y_i = \beta_0 + (\beta_1)^2 x_{i1} + \varepsilon_i$
The basic assumption of the CLRM is that the
model must be linear in its parameters, so OLS cannot estimate Equation 3 in any
way that is useful. However, Assumption 1 does not require the model to be linear
in its variables. For Equation 4, in which the variable rather than the parameter
is squared, OLS will produce a meaningful estimate of $\beta_1$.
(4) $y_i = \beta_0 + \beta_1 x_{i1}^2 + \varepsilon_i$
The method of ordinary least
squares (OLS) allows us to estimate models that are linear in parameters, even if
they are nonlinear in the variables. Conversely, OLS cannot estimate models that
are nonlinear in parameters, even if they are linear in the variables.
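The point about a model that is linear in parameters but nonlinear in a variable can be sketched with simulated data. The true coefficients (1 and 2), the sample size, and the noise level below are assumptions chosen for illustration, not values from this article.

```r
# Simulated data: y depends on x squared, but the parameters enter linearly.
set.seed(1)
x <- runif(200, min = 0, max = 5)
y <- 1 + 2 * x^2 + rnorm(200)

# I(x^2) tells lm() to use the squared variable as the regressor.
fit_sq <- lm(y ~ I(x^2))
print(coef(fit_sq))  # estimates should be close to the assumed values 1 and 2
```

OLS recovers the coefficients here because the squared term is just another column of the design matrix; the parameters themselves still enter linearly.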
Finally, every OLS model should
include all relevant explanatory variables, and every explanatory variable
included in the model should be relevant. Leaving out a relevant variable causes
omitted-variable bias, which is a very important problem in regression
analysis.
The mean of residuals is zero
When it comes to verifying a
regression model, residual analysis is crucial. The ith residual is the difference
between the observed value of the dependent variable, yi, and the
value predicted by the estimated regression equation, ŷi.
These residuals, computed from the available data, are treated as estimates of
the model error, ε. As such, they are used by statisticians to validate the
assumptions concerning ε. Therefore, let’s check the mean of the residuals. If
it is zero (or very close to zero), the assumption holds for that model. For an
OLS fit that includes an intercept this is true by construction, unless you
explicitly change the model, for example by forcing the intercept
to zero.
We tested this assumption on our data in R and found that the
mean of the residuals is close to zero, so the assumption holds
for our model.
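A minimal sketch of this check follows. The built-in cars data set and the model dist ~ speed are assumptions used for illustration; the article does not show which model was fitted at this step.

```r
# Assumption: a simple model on the built-in cars data stands in for "our data".
mod <- lm(dist ~ speed, data = cars)

# With an intercept in the model, the residual mean is zero up to rounding error.
res_mean <- mean(residuals(mod))
print(res_mean)
```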
Homoscedasticity of residuals or equal variance
According to this assumption, the variance of the error terms is
similar across all values of the independent variables. A plot of standardized
residuals against predicted values shows whether the points are evenly spread
across those values. At least two independent variables, which may be nominal, ordinal, or
interval/ratio level variables, are required in multiple linear regression.
Once the regression model is built, we
arrange the plotting area as a 2 x 2 grid with par(mfrow = c(2, 2))
in R, then plot the model with plot(lm.mod). This produces
four plots. The top-left and bottom-left plots show how
the residuals vary as the fitted values increase.
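The plotting step can be sketched as follows. The model mpg ~ disp on the built-in mtcars data is an assumption for illustration; the article names car models from that data set but does not show the fitted formula.

```r
# Assumption: an illustrative model on mtcars; the article's formula is not shown.
lmMod <- lm(mpg ~ disp, data = mtcars)

# Arrange the device as a 2 x 2 grid so all four diagnostic plots fit on one page.
old_par <- par(mfrow = c(2, 2))
plot(lmMod)   # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(old_par)  # restore the previous layout
```

Saving and restoring the par() settings keeps later plots from inheriting the 2 x 2 layout.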
The first graph shows residuals
against fitted values. This plot can be used
to verify the linearity and homoscedasticity assumptions. Our residuals would
take on a definite shape or a recognizable pattern if the model did not match
the linear model assumption. It’s bad if your plot resembles a parabola, for
example. Your residual scatterplot should resemble the night sky, with no
discernible patterns. If the linearity assumption is met, the red line running
through your scatterplot should be straight and horizontal, not curved. We
check to see if the residuals are evenly distributed around the y = 0 line to
see if the homoscedasticity assumption is met.
What did we find? R automatically flagged three data
points with large residuals (the observations
Toyota Corolla, Fiat, and Pontiac Firebird). Aside from that, the relationship
appears to be non-linear, as the curved red line through our residuals
shows (the residuals follow a quadratic pattern resembling a parabola). Our
data also appear to be heteroscedastic, as the residuals are not evenly spread
around the y = 0 line.
The second plot uses the residuals to test the
normality assumption. It is a QQ-plot that compares the
residuals to “ideal” normal data along the 45-degree line.
What did we find? R automatically flagged the same three
data points with large residuals (the observations
Toyota Corolla, Fiat, and Pontiac Firebird). Even aside from those three
data points, the observations in the QQ-plot do not lie well along the 45-degree
line, which suggests that the residuals are not normally distributed.
The third plot is a scale-location plot
(square root of the absolute standardized residuals vs. fitted values). This is useful
for checking the assumption of homoscedasticity. In this
particular plot we are checking to see if there is a pattern in the residuals.
If the red line you see on your plot is flat and horizontal with equally and
randomly spread data points (like the night sky), you’re good. If your red line
has a positive slope to it, or if your data points are not randomly spread out,
you’ve violated this assumption.
The fourth plot helps us find influential
cases, if any are present in the data. Note: Outliers may or may not be
influential points. Influential outliers are of the greatest concern. Depending
on whether they are included or excluded from the analysis, they may have an
impact on the results. If you have no influential cases,
you’ll hardly notice a dashed red curve, if at all (Cook’s distance is
represented by the red dashed curved line). You’re fine if you don’t see a
red Cook’s distance curve, or if one is just barely visible in the
corner of your plot but none of your data points fall beyond it. If some of
your data points lie beyond the Cook’s distance line, you have
influential data points. For the model above, there is a clear pattern
in the residuals, so heteroscedasticity is present. Now let’s have a look at a
different model.
The points now appear to be random, and
the line appears to be flat, with no upward or downward trend. As a result, the
homoscedasticity requirement can be accepted.
No autocorrelation of residuals
This applies especially to time
series data. Autocorrelation is the correlation of a time series with lags of
itself. When the residuals are autocorrelated, the current value
depends on previous (historic) values, and there is a definite
unexplained pattern in the Y variable that shows up in the disturbances.
Below are three ways to check for
autocorrelation of residuals.
Using acf plot
The X-axis corresponds to the lags of the
residual, increasing in steps of 1. The very first line (to the left) shows the
correlation of the residual with itself (lag 0), so it is always equal to 1.
If the residuals were not autocorrelated,
the correlation (Y-axis) from the next line onwards would drop to a
near-zero value inside the dashed blue lines (the significance bounds). Clearly, this
is not the case here, so we can conclude that the residuals are autocorrelated.
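As a sketch of what such a plot looks like when residuals are autocorrelated, the example below simulates a regression with AR(1) errors. The simulated series, the AR coefficient of 0.8, and the variable names are assumptions for illustration and are not the article's data.

```r
# Simulated regression with AR(1) errors (illustrative data, not the article's).
set.seed(42)
n <- 200
x <- seq_len(n)
e <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
y <- 2 + 0.5 * x + e

fit <- lm(y ~ x)

# acf() draws the plot and returns the autocorrelations; element 1 is lag 0.
acf_res <- acf(residuals(fit), main = "ACF of residuals")
print(acf_res$acf[1:3])
```

The first value is the lag 0 correlation; the following values stay well away from zero because the simulated errors are autocorrelated by design.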
Using runs test
With a p-value < 2.2e-16, we reject
the null hypothesis that the residuals are random. This means there is a definite pattern
in the residuals.
Using the Durbin-Watson test.
The Durbin-Watson test examines whether the
errors are autocorrelated. The null hypothesis states that they are not
autocorrelated (what we want). This test is especially useful when you
fit a multiple (time series) regression. For example, it can tell
you whether the residuals at time point 1 are correlated with the residuals at
time point 2 (they shouldn’t be). In other words, it is useful for verifying
that we haven’t violated the independence assumption.
So, the Durbin-Watson test also confirms our finding.
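The Durbin-Watson statistic itself is simple to compute, so a hand-rolled sketch in base R may help show what the test measures. The simulated data are an assumption for illustration; in practice the dwtest function in the lmtest package reports the statistic together with a p-value, which this sketch does not compute.

```r
# Simulated regression with AR(1) errors (illustrative data, not the article's).
set.seed(42)
n <- 200
x <- seq_len(n)
e <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
y <- 2 + 0.5 * x + e
fit <- lm(y ~ x)

# Durbin-Watson statistic: squared successive differences over squared residuals.
res <- residuals(fit)
dw <- sum(diff(res)^2) / sum(res^2)
print(dw)
```

A value near 2 points to no first-order autocorrelation, while values well below 2 point to positive autocorrelation, which is what the simulated errors produce here.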
To rectify this, add lag1 of the residual as an X variable to
the original model. The slide function in the DataCombine package
can be used to do this.
Let’s see if this strategy is able to
solve the problem of residual autocorrelation.
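A sketch of this remedy is shown below. It builds the lagged residual by hand in base R rather than with the slide function from DataCombine, and it uses simulated data, so the variable names and numbers are assumptions for illustration.

```r
# Simulated regression with AR(1) errors (illustrative data, not the article's).
set.seed(42)
n <- 200
x <- seq_len(n)
e <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
y <- 2 + 0.5 * x + e
fit <- lm(y ~ x)

# Lag the residuals by one step; the first observation has no predecessor.
res <- unname(residuals(fit))
dat <- data.frame(y = y, x = x, res_lag1 = c(NA, head(res, -1)))

# Refit with the lagged residual as an extra X variable (lm drops the NA row).
fit_lag <- lm(y ~ x + res_lag1, data = dat)

# Compare the lag-1 autocorrelation of the residuals before and after.
r1_before <- acf(res, plot = FALSE)$acf[2]
r1_after <- acf(residuals(fit_lag), plot = FALSE)$acf[2]
print(c(before = r1_before, after = r1_after))
```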
Unlike the “acf” plot of lmMod (the original linear model), the correlation
values drop inside the dashed blue lines from lag1 onwards. The plot therefore
shows no evidence of autocorrelation.
The p-value of the runs test is 0.3362, so we cannot
reject the null hypothesis that the residuals are random. This gives us no
evidence that the residuals are autocorrelated.
With a high p-value of 0.667, we cannot reject the null hypothesis
that the true autocorrelation is zero. As a
result, this model meets the condition that residuals should not be autocorrelated.
If the residuals are still autocorrelated
after adding lag1 as an X variable, you might want
to try adding lag2, or be creative in constructing relevant derived explanatory
variables or interaction terms. This is more of an art than a mechanical computation.
The X variables and residuals are uncorrelated
How to check?
Do a correlation test on the X variable
and the residuals.
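A minimal sketch of this correlation test follows. The cars data set and the model dist ~ speed are assumptions for illustration.

```r
# Assumption: a simple model on the built-in cars data, for illustration.
mod <- lm(dist ~ speed, data = cars)

# Pearson correlation test between the X variable and the residuals.
ct <- cor.test(cars$speed, residuals(mod))
print(ct$estimate)  # sample correlation, zero up to rounding for an OLS fit
print(ct$p.value)
```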
Regression captures the linear trend in the data; residuals are the
randomness that is “left over” after fitting a regression model.
Because the regression has removed the linear trend, the correlation
between the explanatory variable(s) and the residuals is 0. When plotting
residuals against explanatory variables, however, you may still notice patterns; such
patterns indicate that there is more going on than a straight line.
Because the p-value is so high, the null hypothesis that the true correlation is 0
cannot be rejected. As a result, the assumption holds for this model.
The number of observations must be
greater than the number of Xs
This can be verified directly by comparing the number of observations
in the data with the number of X variables in the model.
The variability in X values is positive
Variance is a statistical measure of the
dispersion between values in a data set. It
expresses how far each number in the set deviates from the mean, and thus from
every other number in the set. Variance is frequently denoted by the symbol
$\sigma^2$. Analysts and traders use it to gauge market volatility and security.
For regression, this assumption implies that the X values in a sample cannot all be the same (or even
nearly the same).
How to check?
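Both this check and the previous one (more observations than X variables) take one line each in R. The sketch below assumes the built-in cars data set and the model dist ~ speed for illustration.

```r
# Assumption: a simple model on the built-in cars data, for illustration.
mod <- lm(dist ~ speed, data = cars)

# Number of observations versus number of X variables (coefficients minus intercept).
n_obs <- nobs(mod)
n_x <- length(coef(mod)) - 1
print(n_obs > n_x)

# Variance of the X variable; it must be strictly positive.
x_var <- var(cars$speed)
print(x_var)
```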
The variance in the X variable above is much larger than 0, so this assumption is satisfied.
The regression model is correctly specified
The regression model is correctly specified if the regression
equation contains all of the
required predictors, including any necessary transformations and interaction
terms. That is, the model has no
missing, redundant, or superfluous predictors. Of course, this is the best-case
scenario, and it’s what we’re hoping for! For example, if the Y and X variables
have an inverse relationship, the model equation should be specified accordingly.
No perfect multicollinearity
The variance inflation factor (VIF) is a
measure that can be used to determine the degree of multicollinearity in a
dataset. Multiple regression is used to assess the effect of several variables on a
specific outcome. The dependent variable is
the outcome that the independent variables (the model’s inputs) act on.
Multicollinearity exists when two or more independent variables are linearly
related or correlated. This assumption requires that there is no perfect linear relationship
between explanatory variables. If the variance inflation factor of a predictor
variable were 5.27 (√5.27 ≈ 2.3), the standard error
of that predictor’s coefficient would be 2.3 times larger than if the
predictor had 0 correlation with the other predictor variables.
How to check?
Using the variance inflation factor
(VIF). But what is VIF?
VIF is a metric that is
calculated for each X variable in a linear model. If a variable’s VIF is high,
it suggests that the information in that variable has already been explained by
other X variables in the model, which makes that variable more redundant.
As a result, the lower the VIF (ideally below 2), the better. VIF is calculated as
$\mathrm{VIF} = 1 / (1 - R^2)$, where $R^2$ (Rsq) is the R-squared of the model with the given X as
the response and all other Xs as predictors.
How to rectify this?
- Either iteratively remove the X var
with the highest VIF or,
- See the correlation between all variables and
keep only one of all highly correlated pairs.
VIF should not exceed 4 for any of the X variables, according to the convention.
That is, we will not allow any of the RSq of the Xs (the model that was
generated with that X as a response variable and the remaining Xs as
predictors) to exceed 75% => 1/(1-0.75) => 1/0.25 => 4.
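The VIF formula can be applied by hand in base R, which makes the definition concrete. The four mtcars predictors below are an assumption chosen for illustration; in practice the vif function in the car package computes the same quantity from a fitted model.

```r
# Assumption: four predictors from mtcars, chosen only for illustration.
X <- mtcars[, c("disp", "hp", "wt", "drat")]

# For each X, regress it on the remaining Xs and apply VIF = 1 / (1 - R^2).
vif <- sapply(names(X), function(v) {
  f <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})

print(vif)
print(sqrt(vif))  # factor by which each coefficient's standard error is inflated
```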
Normality of residuals
The functions qqnorm and qqplot
in R can be used to make Q-Q plots. The qqnorm function produces a normal Q-Q
plot: when you give it a vector of data, R plots the data in sorted order against
quantiles from a standard normal distribution. Consider the trees data
set that ships with R. It gives the girth, height, and volume
of 31 felled black cherry trees. Height is one of the variables. Can we assume
that our sample of heights is drawn from a normally distributed population? For
a regression model, it is the residuals that should be normally distributed. If the
estimates are computed using the maximum likelihood method (rather than OLS),
this also implies that the Y and the Xs are normally distributed.
This can be checked visually using the
qqnorm() plot (the top-right plot of the diagnostic plots). If all points fall
exactly on the line, the distribution is perfectly normal. Some
departure is to be expected, especially near the ends,
but the deviations should be minor, if not negligible.
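A short sketch of both uses of qqnorm follows: first on the Height variable of the trees data, then on the residuals of a model. The model dist ~ speed on the cars data is an assumption for illustration.

```r
# Normal Q-Q plot of tree heights, with a reference line through the quartiles.
qq_height <- qqnorm(trees$Height, main = "Normal Q-Q plot: tree heights")
qqline(trees$Height)

# The same check applied to regression residuals (illustrative model).
mod <- lm(dist ~ speed, data = cars)
qqnorm(residuals(mod), main = "Normal Q-Q plot: residuals")
qqline(residuals(mod))
```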
Check Assumptions Automatically
gvlma stands for Global Validation of Linear Models Assumptions. Many
assumptions underpin linear regression analysis, and we cannot rely on
the regression results if those assumptions are not met.
Fortunately, R offers a plethora of packages that can do a lot of the
heavy lifting for us. A single function call can test the assumptions of our linear
regression. Fit a basic regression model first; the gvlma() function from the
gvlma package can then be used to verify the key assumptions of the linear model.
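The two steps can be sketched as below. The cars data set is an assumption (it matches the 50 data points mentioned in the text), and the call is guarded so that the script still runs when the gvlma package is not installed.

```r
# Assumption: the built-in cars data (50 rows) with a simple model.
mod <- lm(dist ~ speed, data = cars)

# Run the global validation only if the gvlma package is available.
if (requireNamespace("gvlma", quietly = TRUE)) {
  print(gvlma::gvlma(mod))
} else {
  message("Package 'gvlma' is not installed; run install.packages('gvlma') first.")
}
```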
Three of the
assumptions are not satisfied. This is likely because there
are only 50 data points in the dataset, and even two or three outliers can
degrade the model’s quality. The immediate approach is therefore to
remove the outliers and rebuild the model. To come to your own decision,
look at the diagnostic plots. Data points 23, 35, and 49 are marked
as outliers in the plots. Let’s take them out of the data and rebuild the
model.
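Dropping the three flagged rows and refitting could look like the sketch below. The cars data set and the model formula are assumptions for illustration; only the row numbers 23, 35, and 49 come from the text.

```r
# Remove the three flagged observations by row index.
cars_trimmed <- cars[-c(23, 35, 49), ]

# Refit the same (assumed) model on the remaining data.
mod_trimmed <- lm(dist ~ speed, data = cars_trimmed)
print(nrow(cars_trimmed))
print(coef(mod_trimmed))
```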
Although the changes look minor, the model is closer to conforming
to the assumptions. There is one more thing left to explain: the
plot in the lower-right corner. It shows standardized residuals vs
leverage. Leverage measures how much influence each data point has on the
regression. The plot also shows Cook’s distance values, which
reflect how much the fitted values would change if a point were removed.
The regression can be dramatically distorted if a point far from the centroid has a
large residual. For a reasonable regression model, the red smoothed line should
stay close to the mid-line, and no point should have a large Cook’s
distance (that is, no point should have too much influence on the model).
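Leverage and Cook’s distance can also be inspected numerically. This sketch assumes the cars data set and the model dist ~ speed for illustration.

```r
# Assumption: a simple model on the built-in cars data, for illustration.
mod <- lm(dist ~ speed, data = cars)

# Leverage (hat values) and Cook's distance for every observation.
lev <- hatvalues(mod)
cd <- cooks.distance(mod)

# The observations with the largest Cook's distance are the most influential.
print(head(sort(cd, decreasing = TRUE), 3))

# Residuals vs leverage plot, with Cook's distance contours.
plot(mod, which = 5)
```

The hat values sum to the number of estimated coefficients, so an observation whose leverage is far above the average stands out quickly.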
RPubs is a web-based publishing platform for R Markdown publications. If you come across an interesting article, you might wish to copy the script and attempt duplicating it on your own computer. The RPubs package can assist you in copying and pasting the script (or the result) without having to do so manually. HTML document, depicting all assumptions of Classical Linear Regression Model in machine learning with RPubs.