**Regression in machine learning**

In machine learning, regression analysis is a key concept, built on the Classical Linear Regression Model (CLRM). It is classified as supervised learning because the algorithm is trained on both the inputs and the corresponding output labels. By estimating how one variable influences another, regression helps establish the relationship between the variables.

Assume you’re in the market for a car and have decided that gas mileage will be a deciding factor in your purchase. How would you go about predicting the miles per gallon (MPG) of some prospective rides? Because you know each car’s characteristics (weight, horsepower, displacement, and so on), regression is a viable option. Regression techniques identify the relationship between MPG and the input data by modeling the average MPG of a car given its features. The regression function can be written as $Y = f(X)$, with $Y$ representing the MPG and $X$ the input features such as weight, displacement, and horsepower. The function $f$ is what we want to learn: once fitted, it predicts the MPG of a candidate car and so helps us decide whether buying it makes sense. This technique is called regression.
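
As a minimal sketch of the car example, the snippet below fits $Y = f(X)$ as a linear model. It assumes the `mtcars` data set that ships with R and an arbitrary choice of three predictors; neither the data set nor the hypothetical `new_car` values come from the text.

```r
# Illustrative only: mtcars ships with base R; the dataset and the
# predictors chosen here are assumptions, not taken from the article.
data(mtcars)

# Y = f(X): MPG as a linear function of weight, horsepower, and displacement
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Estimated coefficients of f
print(coef(fit))

# Predicted MPG for a hypothetical car
# (wt is in 1000 lbs, disp in cubic inches, hp in gross horsepower)
new_car <- data.frame(wt = 2.8, hp = 120, disp = 160)
pred_mpg <- predict(fit, newdata = new_car)
print(pred_mpg)
```

The fitted object `fit` plays the role of $f$: `predict()` evaluates it on new feature values.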

**What is R?**

R is a language and environment for statistical computing and graphics. It is a GNU project that is similar to the S language and environment developed by John Chambers and colleagues at Bell Laboratories (formerly AT&T, now Lucent Technologies). R can be thought of as a different implementation of S. Although there are some important differences, much of the code written for S runs in R without modification.

R is highly extensible and offers a wide range of statistical techniques (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, etc.) and graphical tools. The S language is frequently used for research in statistical methodology, and R provides an open-source route to taking part in that work.

One of R’s advantages is how simple it is to create well-designed, publication-quality plots, complete with mathematical symbols and formulae where needed. The defaults for minor design choices in graphics have been carefully chosen, but the user retains complete control.

R is available as Free Software, in source code form, under the terms of the Free Software Foundation’s GNU General Public License. It compiles and runs on a wide range of UNIX and related systems (including FreeBSD and Linux), as well as Windows and macOS.

Building a linear regression model is only half of the job. To be usable in practice, the model must conform to the assumptions of linear regression, and there are ten assumptions on which a linear regression model is based:

- *The regression model is linear in parameters*
- *The mean of residuals is zero*
- *Homoscedasticity of residuals or equal variance*
- *No autocorrelation of residuals*
- *The X variables and residuals are uncorrelated*
- *The number of observations must be greater than the number of Xs*
- *The variability in X values is positive*
- *The regression model is correctly specified*
- *No perfect multicollinearity*
- *Normality of residuals*

**Assumption 1**

*The regression model is linear in
parameters*

According to Assumption 1, the dependent variable must be a linear combination of the explanatory variables and the error term. The assumption requires the specified model to be linear in its parameters, but not necessarily in its variables. Equations 1 and 2 show a model that is linear in both parameters and variables. Note that Equations 1 and 2 depict the same model in slightly different notation: Equation 1 in matrix form and Equation 2 observation by observation.

(1) $Y = XB + E$

(2) $y_i = B_0 + B_1 x_{i1} + B_2 x_{i2} + \dots + B_K x_{iK} + e_i$

For OLS to work, the specified model must be linear in parameters. Note that if the true relationship between $y_i$ and the parameter $B_1$ is nonlinear, it is not possible to estimate the coefficient $B_1$ in any meaningful way. Equation 3 shows an empirical model in which $B_1$ enters quadratically.

(3) $y_i = B_0 + (B_1)^2 x_{i1} + B_2 x_{i2} + \dots + B_K x_{iK} + e_i$

The basic assumption of the CLRM is that the model must be linear in its parameters, so OLS cannot estimate Equation 3 in any useful way. However, Assumption 1 does not require the model to be linear in its variables. In Equation 4, where the variable $x_{i1}$ is squared instead of the parameter, OLS will produce a meaningful estimate of $B_1$.

(4) $y_i = B_0 + B_1 (x_{i1})^2 + B_2 x_{i2} + \dots + B_K x_{iK} + e_i$

Ordinary least squares (OLS) lets us estimate models that are linear in parameters, even if the variables enter nonlinearly. Conversely, models that are nonlinear in parameters cannot be estimated by OLS, even if the variables enter linearly.
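
To make the "linear in parameters, nonlinear in variables" case concrete, here is a small sketch in the style of Equation 4. The `mtcars` data set and the choice of `wt` and `hp` as predictors are assumptions made purely for illustration.

```r
# Illustrative only: dataset and predictors are assumptions.
data(mtcars)

# The variable wt enters squared, but its coefficient still enters linearly,
# so lm() (which uses OLS) can estimate it. I() is needed because ^ has a
# special meaning inside R model formulas.
fit_sq <- lm(mpg ~ I(wt^2) + hp, data = mtcars)

print(coef(fit_sq))
```

A model such as Equation 3, where the coefficient itself is squared, cannot be written as an `lm()` formula in this way.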

Finally, every OLS model should include all relevant explanatory variables, and every explanatory variable included in the model should be relevant. Leaving out relevant variables causes omitted variable bias, which is a very important problem in regression analysis.

**Assumption 2**

*The mean of residuals is zero*

Residual analysis is crucial for verifying a regression model. The $i$th residual is the difference between the observed value of the dependent variable, $y_i$, and the value predicted by the estimated regression equation, $\hat{y}_i$. These residuals, computed from the available data, are treated as estimates of the model error, ε, and statisticians use them to validate the assumptions about ε. So let’s check the mean of the residuals: if it is zero (or very close to zero), the assumption holds for the model. For OLS this is the default outcome unless you explicitly change the model, for example by forcing the intercept term to zero.

We tested this assumption on our data in R and found that the mean of the residuals is close to zero, so the assumption holds for our model.
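
A check along these lines might look as follows. The model and the `mtcars` data are assumptions chosen for illustration; the second fit shows what happens when the intercept is forced to zero.

```r
# Illustrative only: dataset and model are assumptions.
data(mtcars)

# With an intercept, OLS residuals average to zero (up to rounding error)
fit <- lm(mpg ~ wt + hp, data = mtcars)
res_mean <- mean(residuals(fit))
print(res_mean)

# Without an intercept (the "0 +" term), this is no longer guaranteed
fit0 <- lm(mpg ~ 0 + wt + hp, data = mtcars)
res_mean0 <- mean(residuals(fit0))
print(res_mean0)
```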

**Assumption 3**

*Homoscedasticity of residuals or
equal variance*

According to this assumption, the variance of the error terms is similar across all values of the independent variables. A plot of standardized residuals against predicted values shows whether the points are evenly spread across those values. Note that multiple linear regression requires at least two independent variables, which may be nominal, ordinal, or interval/ratio level variables.

Once the regression model is built, we arrange the plotting area as a 2×2 grid using `par(mfrow = c(2, 2))` in R and then plot the model using `plot(lm.mod)`. This produces four plots. The top-left and bottom-left plots show how the residuals vary as the fitted values increase.
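
The two calls described above can be sketched like this. The model `mpg ~ disp` on `mtcars` is an assumption for illustration; the text does not show which model was plotted.

```r
# Illustrative only: the model below is an assumption.
data(mtcars)
fit <- lm(mpg ~ disp, data = mtcars)

# Arrange the plotting area as a 2x2 grid, remembering the old settings
old_par <- par(mfrow = c(2, 2))

# plot() on an lm object draws four diagnostic plots by default:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
plot(fit)

# Restore the previous plotting layout
par(old_par)
```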

The first plot shows residuals against fitted values. It can be used to verify both the linearity and the homoscedasticity assumptions. If the model did not satisfy the linearity assumption, the residuals would take on a definite shape or a recognizable pattern. It’s bad if your plot resembles a parabola, for example. Your residual scatterplot should resemble the night sky, with no discernible patterns. If the linearity assumption is met, the red line running through the scatterplot should be straight and horizontal, not curved. To check the homoscedasticity assumption, we look at whether the residuals are evenly spread around the y = 0 line.

What did we find? R automatically flagged three data points with large residuals (the observations Toyota Corolla, Fiat, and Pontiac Firebird). Beyond that, the relationship appears to be non-linear: the red line through the residuals is curved, roughly quadratic, like a parabola. The residuals also appear heteroscedastic, as they are not evenly spread around the y = 0 line.

The second plot tests the normality assumption using the residuals. It is a Q-Q plot, which compares the residuals to “ideal” normal data along the 45-degree line.

What did we find? R flagged the same three data points with large residuals (Toyota Corolla, Fiat, and Pontiac Firebird). Even apart from those three points, the observations in the Q-Q plot do not lie well along the 45-degree line, which suggests the residuals are not normally distributed.

The third plot is a scale-location plot (square root of the absolute standardized residuals vs. predicted values). It is useful for checking the assumption of **homoscedasticity**. In this plot we are checking whether there is a pattern in the residuals. If the red line on your plot is flat and horizontal, with data points spread equally and randomly around it (like the night sky), you’re good. If the red line has a positive slope, or if the data points are not randomly spread out, the assumption is violated.

The fourth plot helps us find **influential cases**, if any are present in the data. Note that outliers may or may not be influential points. Influential outliers are of the greatest concern, because the results may change depending on whether they are included in or excluded from the analysis. Cook’s distance is drawn as a dashed red curve. If you have no influential cases, you will hardly see that curve at all: either it is absent, or it is barely visible in a corner of the plot with none of your data points falling beyond it. If some of your data points fall beyond the Cook’s distance line, you have influential data points. In the model examined here, the residual plots show a clear pattern, so heteroscedasticity is present. Now let’s have a look at a different model.

The points now appear to be random, and the line appears to be flat, with no upward or downward trend. The homoscedasticity assumption can therefore be accepted.

**Assumption 4**

*No autocorrelation of residuals*

This applies especially to time series data. Autocorrelation is the correlation of a time series with lags of itself. When the residuals are autocorrelated, the current value depends on the previous (historic) values, and there is a definite unexplained pattern in the Y variable that shows up in the disturbances.

Below are three ways to check for autocorrelation of residuals.

**Using the ACF plot**

The X-axis shows the lags of the residuals, increasing in steps of 1. The very first line (on the left) shows the correlation of the residuals with themselves (lag 0), so it is always equal to 1.

If the residuals were not autocorrelated, the correlation (Y-axis) from the next line onwards would drop to a near-zero value below the dashed blue line (the significance level). Clearly, this is not the case here, so we conclude that the residuals are autocorrelated.
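
An ACF plot of this kind can be produced with `acf()` from base R. The sketch below uses simulated data with AR(1) errors, since the text does not show the data it used; the seed, sample size, and coefficients are all assumptions.

```r
# Illustrative only: simulated data with deliberately autocorrelated errors.
set.seed(42)
n <- 200
x <- seq_len(n)
e <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))  # AR(1) errors
y <- 2 + 0.5 * x + e

fit_ts <- lm(y ~ x)

# Draws the ACF plot; the dashed blue lines mark the significance bounds
acf_res <- acf(residuals(fit_ts), main = "ACF of residuals")

# The first stored value is lag 0 (always 1); the second is lag 1
lag1 <- acf_res$acf[2]
print(lag1)
```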

**Using the runs test**

With a p-value < 2.2e-16, we reject the null hypothesis that the residuals are random. This means there is a definite pattern in the residuals.
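
The text does not say which implementation of the runs test it used. As a rough sketch, the hand-written Wald-Wolfowitz runs test below (normal approximation, residuals split at their median) shows the idea; the helper name `runs_test` and the simulated data are assumptions.

```r
# Hypothetical helper: a two-sided Wald-Wolfowitz runs test in base R.
runs_test <- function(res) {
  s <- sign(res - median(res))
  s <- s[s != 0]                      # drop values equal to the median
  n1 <- sum(s > 0)
  n2 <- sum(s < 0)
  runs <- 1 + sum(diff(s) != 0)       # number of sign runs
  mu <- 2 * n1 * n2 / (n1 + n2) + 1   # expected runs under randomness
  sigma2 <- 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) /
    ((n1 + n2)^2 * (n1 + n2 - 1))
  z <- (runs - mu) / sqrt(sigma2)
  list(runs = runs, statistic = z, p.value = 2 * pnorm(-abs(z)))
}

# Illustrative only: simulated regression with AR(1) errors
set.seed(42)
n <- 200
x <- seq_len(n)
y <- 2 + 0.5 * x + as.numeric(arima.sim(model = list(ar = 0.8), n = n))
fit_ts <- lm(y ~ x)

rt <- runs_test(residuals(fit_ts))
print(rt$p.value)
```

A small p-value means there are too few (or too many) runs for the sequence of residuals to be random.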

**Using the Durbin-Watson test**

The Durbin-Watson test examines whether the errors are autocorrelated with themselves. The null hypothesis states that they are not autocorrelated (which is what we want). This test is especially useful when you run a multiple (time series) regression. For example, it can tell you whether the residuals at time point 1 are correlated with the residuals at time point 2 (they shouldn’t be). In other words, the test helps verify that we haven’t violated the **independence** assumption.

So the Durbin-Watson test also confirms our finding.
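
The Durbin-Watson statistic itself is simple to compute by hand, as sketched below on simulated data (the data and the helper name `dw_stat` are assumptions). Values near 2 suggest no first-order autocorrelation, while values well below 2 suggest positive autocorrelation.

```r
# Hypothetical helper: Durbin-Watson statistic from a residual vector
dw_stat <- function(res) sum(diff(res)^2) / sum(res^2)

# Illustrative only: simulated regression with AR(1) errors
set.seed(42)
n <- 200
x <- seq_len(n)
y <- 2 + 0.5 * x + as.numeric(arima.sim(model = list(ar = 0.8), n = n))
fit_ts <- lm(y ~ x)

dw <- dw_stat(residuals(fit_ts))
print(dw)

# For a p-value, lmtest::dwtest(fit_ts) can be used if the lmtest
# package is installed (not run here).
```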

To rectify this, add lag 1 of the residuals as an X variable to the original model. The `slide` function in the DataCombine package can be used to create the lagged variable.

Let’s see whether this strategy solves the problem of residual autocorrelation.
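
A sketch of this fix on simulated data follows. It builds the lagged residual in base R rather than with `DataCombine::slide`, so that it runs without extra packages; the data and variable names are assumptions.

```r
# Illustrative only: simulated regression with AR(1) errors
set.seed(42)
n <- 200
x <- seq_len(n)
y <- 2 + 0.5 * x + as.numeric(arima.sim(model = list(ar = 0.8), n = n))

fit_ts <- lm(y ~ x)
res <- unname(residuals(fit_ts))

# Lag 1 of the residuals: shift down by one row, padding with NA
dat <- data.frame(y = y, x = x, res_lag1 = c(NA, head(res, -1)))
dat <- na.omit(dat)  # the first row has no lagged residual

# Refit with the lagged residual as an extra X variable
fit_lag <- lm(y ~ x + res_lag1, data = dat)

# Compare the lag-1 autocorrelation of the residuals before and after
lag1_before <- acf(res, plot = FALSE)$acf[2]
lag1_after <- acf(residuals(fit_lag), plot = FALSE)$acf[2]
print(c(before = lag1_before, after = lag1_after))
```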

Unlike the ACF plot of `lmMod` (the original linear model object in R), the correlation values now drop below the dashed blue line from lag 1 onwards. There is therefore no evidence of autocorrelation.

For the runs test, the p-value is 0.3362, so we cannot reject the null hypothesis that the residuals are random. This gives us no evidence that the residuals are autocorrelated.

For the Durbin-Watson test, the high p-value of 0.667 means we cannot reject the null hypothesis that the true autocorrelation is zero. This model therefore meets the condition that residuals should not be autocorrelated.

If the assumption of no autocorrelation of residuals is still not satisfied after adding lag 1 as an X variable, you might try adding lag 2, or get creative with relevant derived explanatory variables or interaction terms. This is more of an art than an algorithm.

**Assumption 5**

*The X variables and residuals are
uncorrelated*

**How to check?**

Run a correlation test between the X variable and the residuals.

Regression captures the linear trend in the data; the residuals are the randomness that is “left over” after fitting the model. Because the regression has removed the linear trend, the correlation between each explanatory variable and the residuals is 0. When plotting residuals against explanatory variables, however, you may still notice patterns; these indicate that something beyond a straight line is going on, such as curvature.

Because the p-value is so high, we cannot reject the null hypothesis that the true correlation is 0. The assumption therefore holds for this model.
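
Such a correlation test can be run with `cor.test()` from base R. The model below (`mpg ~ wt` on `mtcars`) is an assumption for illustration.

```r
# Illustrative only: dataset and model are assumptions.
data(mtcars)
fit <- lm(mpg ~ wt, data = mtcars)

# Pearson correlation test between the X variable and the residuals
ct <- cor.test(mtcars$wt, residuals(fit))

print(ct$estimate)  # sample correlation, essentially zero
print(ct$p.value)   # high p-value: cannot reject "true correlation is 0"
```

For an OLS fit with an intercept, this sample correlation is zero by construction, so plotting the residuals against each X is often the more informative check.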

**Assumption 6**

*The number of observations must be
greater than the number of Xs*

This is straightforward to check directly from the data: count the observations (rows) and compare that number with the number of X variables.

**Assumption 7**

*The variability in X values is
positive*

Variance is a statistical measure of the dispersion of the values in a data set. It expresses how far each value in the set lies from the mean, and thus from every other value in the set. Variance is commonly denoted by $\sigma^2$. Analysts and traders, for example, use it to gauge market volatility. For regression, this assumption means that the X values in a sample cannot all be the same (or even nearly the same).

**How to check?**

The variance of the X variable in our data is much larger than 0, so this assumption is satisfied.
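
Both Assumption 6 and Assumption 7 can be checked in a couple of lines. The sketch below assumes the `mtcars` data and three arbitrary predictors.

```r
# Illustrative only: dataset and predictor choice are assumptions.
data(mtcars)
X <- mtcars[, c("wt", "hp", "disp")]

# Assumption 6: more observations than X variables
n_obs <- nrow(X)
n_pred <- ncol(X)
print(n_obs > n_pred)

# Assumption 7: every X variable has positive variance
x_var <- sapply(X, var)
print(x_var)
```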

**Assumption 8**

*The regression model is correctly
specified*

**If the regression equation** contains all of the required predictors, including any necessary transformations and interaction terms, the regression model is correctly specified. That is, the model has no missing, redundant, or superfluous predictors. Of course, this is the best-case scenario, and it’s what we’re hoping for! For example, if the Y and X variables have an inverse relationship, the model equation should be specified accordingly:

$Y = \beta_1 + \beta_2 \cdot (1/X)$
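
As a rough illustration of why the specification matters, the sketch below simulates data with an inverse relationship and compares the correctly specified model with a plain linear one. The true coefficients (3 and 5), noise level, and seed are assumptions.

```r
# Illustrative only: simulated data where Y truly depends on 1/X
set.seed(1)
x <- seq(1, 10, length.out = 100)
y <- 3 + 5 * (1 / x) + rnorm(100, sd = 0.1)

# Correctly specified: Y = beta1 + beta2 * (1/X)
fit_inv <- lm(y ~ I(1 / x))

# Misspecified: Y = beta1 + beta2 * X
fit_lin <- lm(y ~ x)

print(coef(fit_inv))
r2_inv <- summary(fit_inv)$r.squared
r2_lin <- summary(fit_lin)$r.squared
print(c(inverse = r2_inv, linear = r2_lin))
```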

**Assumption 9**

*No perfect multicollinearity*

When you want to assess the effect of several variables on a specific outcome, you use multiple regression. The dependent variable is the outcome that the independent variables (the model’s inputs) act on. Multicollinearity exists when two or more independent variables are linearly related or correlated. The assumption requires that there is no perfect linear relationship between the explanatory variables. The variance inflation factor (VIF) is a measure of the degree of multicollinearity in a dataset. For example, if the VIF of a predictor variable were 5.27 (√5.27 ≈ 2.3), the standard error of that predictor’s coefficient would be 2.3 times larger than if the predictor had zero correlation with the other predictor variables.

**How to check?**

Using the variance inflation factor (VIF). But what is VIF?

VIF is a metric calculated for each X variable in a linear model. A high VIF means that the information in that variable is already explained by the other X variables in the model, making that variable more redundant. So the lower the VIF, the better (ideally below 2). It is calculated as $VIF = 1 / (1 - R^2)$, where $R^2$ is the R-squared of the model with the given X as the response and all the other Xs as predictors.
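
The formula can be applied directly in base R, as sketched below; packaged implementations exist, but this version spells out the calculation. The `mtcars` data and the three predictors are assumptions.

```r
# Illustrative only: dataset and predictor choice are assumptions.
data(mtcars)
predictors <- c("wt", "hp", "disp")

# For each X: regress it on the other Xs, then VIF = 1 / (1 - R^2)
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux_fit <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux_fit)$r.squared)
})

print(vif_manual)
```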

**How to rectify this?**

Two ways:

- Either iteratively remove the X variable with the highest VIF, or
- Look at the correlation between all variables and keep only one of each highly correlated pair.

By convention, the VIF should not exceed 4 for any of the X variables. That is, we do not allow the $R^2$ for any X (from the model with that X as the response and the remaining Xs as predictors) to exceed 75%, since $1 / (1 - 0.75) = 1 / 0.25 = 4$.

**Assumption 10**

*Normality of residuals*

Q-Q plots can be made in R with the functions `qqnorm` and `qqplot`. `qqnorm` creates a Normal Q-Q plot: you give it a vector of data, and R plots the data in sorted order against quantiles from a standard Normal distribution. Consider the `trees` data set that ships with R. It gives the girth, height, and volume of 31 felled black cherry trees. Height is one of the variables: can we assume that our sample of heights is drawn from a normally distributed population? For regression, it is the residuals that should be normally distributed. If the estimates are computed using the maximum likelihood method (rather than OLS), this also implies that the Y and the Xs are normally distributed.
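
For the `trees` example, a Normal Q-Q plot of the heights might be drawn as follows. The reference line from `qqline()` is an addition for readability, not something the text mentions.

```r
# trees ships with base R: girth, height, and volume of 31 black cherry trees
data(trees)

# Sorted heights against standard Normal quantiles
qq <- qqnorm(trees$Height, main = "Normal Q-Q plot: tree heights")

# Reference line through the first and third quartiles
qqline(trees$Height)
```

The same two calls work on `residuals(fit)` for any fitted `lm` object.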

This assumption can be checked visually using the `qqnorm()` plot (the top-right plot of the diagnostic plots). If all points fall exactly on the line, the residuals are perfectly normal. Some departure is to be expected, especially near the ends, but the deviations should be minor, if not negligible.

**Check Assumptions Automatically**

gvlma stands for Global Validation of Linear Models Assumptions. Linear regression analysis rests on many assumptions; if we ignore them and they are not met, we cannot trust the regression results. Fortunately, R offers many packages that do much of the heavy lifting for us, and a single function can test the assumptions of our linear regression. First fit a basic regression model; then use the `gvlma()` function from the gvlma package to check the key assumptions of the linear model.
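
A sketch of those two steps is shown below. It assumes the built-in `cars` data set (50 observations) and that the gvlma package is installed; the call is skipped with a message if it is not.

```r
# Illustrative only: the cars dataset is an assumption.
data(cars)

# Step 1: fit a basic regression model
mod <- lm(dist ~ speed, data = cars)

# Step 2: run the global validation, if the gvlma package is available
if (requireNamespace("gvlma", quietly = TRUE)) {
  gv <- gvlma::gvlma(mod)
  print(gv)
} else {
  message("Package 'gvlma' is not installed; skipping the global test.")
}
```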

Three of the assumptions are not satisfied. This is likely because the dataset has only 50 data points, so even two or three outliers can degrade the model’s quality. The most immediate fix is therefore to remove the outliers and rebuild the model. To judge for yourself, look at the diagnostic plots, in which data points 23, 35, and 49 are marked as outliers. Let’s take them out of the data and rebuild the model from scratch.
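
Removing the three flagged rows and refitting could look like this. The `cars` data set and the model formula are assumptions; only the row numbers 23, 35, and 49 come from the text.

```r
# Illustrative only: dataset and model are assumptions.
data(cars)
mod <- lm(dist ~ speed, data = cars)

# Drop the three observations flagged as outliers and refit
cars_clean <- cars[-c(23, 35, 49), ]
mod_clean <- lm(dist ~ speed, data = cars_clean)

# Compare coefficients before and after
print(rbind(original = coef(mod), cleaned = coef(mod_clean)))
```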

Although the changes appear small, the model is now closer to satisfying the assumptions. There is one more item to explain: the plot in the lower-right corner, which shows standardized residuals vs. leverage. Leverage measures how much influence each data point has on the regression. The plot also shows Cook’s distance values, which indicate how much the fitted values would change if a point were removed.

The regression can be dramatically distorted if a point far from the centroid has a large residual. For a reasonable regression model, the red smoothed line should stay close to the mid-line, and no point should have a large Cook’s distance (that is, no point should have too much influence on the model).
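
Leverage and Cook’s distance can also be inspected numerically with base R functions, as in this sketch on the `cars` data (an assumed data set).

```r
# Illustrative only: dataset and model are assumptions.
data(cars)
mod <- lm(dist ~ speed, data = cars)

lev <- hatvalues(mod)       # leverage of each observation
cd <- cooks.distance(mod)   # Cook's distance of each observation

# The three observations with the largest Cook's distance
print(head(sort(cd, decreasing = TRUE), 3))

# The residuals-vs-leverage diagnostic plot on its own
plot(mod, which = 5)
```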

*RPubs is a web-based publishing platform for R Markdown documents. If you come across an interesting article, you might wish to copy the script and try reproducing it on your own computer. The RPubs package can help you copy the script (or the result) without having to do so manually. An accompanying HTML document on RPubs depicts all the assumptions of the Classical Linear Regression Model in machine learning.*