Measuring Errors and What They Tell Us in Linear Regression

A replica of a self portrait from when I was trying to understand Root Mean Squared Error

I just remember sitting in data science boot camp and having it drilled into our heads to check the Mean Squared Error (MSE), the Root Mean Squared Error (RMSE), and the R squared (R2) whenever we were handling regression modeling. When the subject was first introduced, there was extensive explanation of the material. A couple of months after that, I have to be honest, I would sometimes just calculate these numbers to check accuracy without the deepest level of understanding of what each term truly means. Pun intended.

After spending time creating even a simple regression model, it is important to have a metric that allows one to measure how well that model did. In the case of linear regression, this is where our errors come in. These evaluation metrics apply across regression modeling, including simple linear, multiple, and polynomial regression.

Errors versus residuals is an important distinction to bring up first. Often treated interchangeably (even by notable websites), they are not the same thing.

Errors are the deviation of an observed value from the population mean. The important aspect of this definition is the population mean: it refers to the entirety of all possible members of a group, and is usually considered unobservable or unobtainable. [1]

Residuals are the difference of an observed value from an observable sample mean. A sample is representative of a population and is used to make inferences when data cannot be collected on everyone. I stress that the term error is often used when people actually mean residual. EVEN THOUGH it is called mean squared error and root mean squared error, you are, in fact, computing residuals.

For example, if we say the average weight of all 21 year old females who are 5'5" is 130 lbs [2] and we randomly select a 21 year old female who is 5'5" and 125 lbs, then we have an error of -5 lbs. It is impossible, however, to obtain the height and weight of every single 21 y/o female in the world, so that true average is never really known. Therefore the deviation we just computed is a residual and not an error.
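The weight example above can be sketched in a few lines. The numbers are illustrative, and the "population mean" here is only pretend-known for the sake of the demonstration:

```python
# Illustrative sketch of the error vs. residual distinction.
population_mean = 130  # assumed true average weight in lbs (unobservable in practice)
observed = 125         # the randomly selected individual's weight

# If we truly knew the population mean, this deviation would be an error:
error = observed - population_mean  # -5 lbs

# In practice we only have a sample, so the analogous quantity
# computed from the sample mean is a residual, not an error.
sample = [128, 133, 125, 131]  # a small, observable (made-up) sample
sample_mean = sum(sample) / len(sample)
residual = observed - sample_mean
```

The two quantities differ only in what they are measured against: the unknowable population mean versus an observable sample estimate.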

Mean Squared Error (MSE)

Calculating the Mean Squared Error can tell us how the regression model performed. It is also the basis for how a regression model determines a line of best fit. By its nature, mean squared error applies heavier penalties to larger deviations.

This method measures the distance of the actual points from the regression line (just remember we may be measuring residuals even though the term error is used interchangeably). These measurements are squared to remove any negative signs, and then the average is found.

To calculate the MSE by hand: find the regression line that you believe best fits the data points you are talking about. Plug each of your X values into the equation to find the Y values on the regression line. Subtract the regression model's Y value from the actual Y value at that X point, and square the difference. Once you have done this for every data point, find the average of those squared differences and you have the MSE.

Formula for Mean Squared Error: MSE = (1/n) Σ (yᵢ − ŷᵢ)²
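The by-hand steps above can be sketched directly. The regression line y = 2x + 1 and the data points are hypothetical, chosen just so the arithmetic is easy to follow:

```python
# Minimal sketch of computing MSE by hand with made-up data.
xs = [1, 2, 3, 4]
ys_actual = [3.2, 4.8, 7.1, 8.9]

def predict(x):
    return 2 * x + 1  # hypothetical line of best fit

# Square each residual (actual y minus predicted y), then average.
squared = [(y - predict(x)) ** 2 for x, y in zip(xs, ys_actual)]
mse = sum(squared) / len(squared)
```

Note how the two larger residuals (0.2) contribute four times as much to the sum as the smaller ones (0.1), which is the "heavier penalty for broader deviations" mentioned above.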

Why square as opposed to taking the absolute value? When I was first learning, I used to wonder why we square rather than just take the absolute value. It turns out that someone has thought of this: there is a way to compute and evaluate linear regression using absolute value, known as the L1 norm (least absolute deviations). [4] The simplest way it was once put to me is that when we square the numbers, we are still only dealing with polynomial expressions at the deepest level, which are straightforward to differentiate and minimize. The absolute value and how to manipulate it can be more challenging to work with.

Whereas the L1 norm works with absolute values, the Ordinary Least Squares (OLS) method is the common regression approach for finding the line of best fit using the mean squared error. It works by minimizing the sum of squared residuals from the regression line.
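For a single feature, OLS has a textbook closed-form solution for the slope and intercept that minimize the sum of squared residuals. A minimal sketch, with points deliberately placed on the line y = 2x + 1 so the fit is easy to check:

```python
# Closed-form simple OLS: slope = cov(x, y) / var(x), intercept from the means.
def ols_fit(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points lying exactly on y = 2x + 1 should recover that line.
slope, intercept = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])
```

With noisy data the recovered line will not pass through every point, but it will still be the one with the smallest possible MSE.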

Root Mean Squared Error (RMSE)

To understand the concept of Root Mean Squared Error, it helps to revisit the principles of standard deviation, as RMSE is the standard deviation of the residuals.

Remembering back to beginning statistics, standard deviation underlies the empirical rule, which states that 99.7% of a normally distributed dataset can be found within 3 standard deviations of the mean. Below is a depiction of a normal distribution with purple, green, and yellow representing the 3 deviations.
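The empirical rule is easy to check on a simulated sample. A rough sketch using only the standard library (the 99.7% figure is approximate for any finite sample):

```python
# Simulate a normal sample and count how much of it falls within 3 sigma.
import random
import statistics

random.seed(0)
data = [random.gauss(0, 1) for _ in range(100_000)]

mu = statistics.fmean(data)
sigma = statistics.pstdev(data)

# Fraction of points within 3 standard deviations of the mean;
# the empirical rule says this should be close to 0.997.
within_3 = sum(abs(x - mu) <= 3 * sigma for x in data) / len(data)
```

Running the same count with 1 and 2 standard deviations would give the other two empirical-rule figures, roughly 68% and 95%.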

Formula for Standard Deviation: σ = √( Σ(xᵢ − μ)² / N )
Formula for Root Mean Squared Error: RMSE = √( Σ(yᵢ − ŷᵢ)² / n )
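A short sketch ties the two formulas together: RMSE is just the square root of the MSE, and when the residuals are centered at zero it equals their population standard deviation. The data points are illustrative:

```python
# RMSE as the square root of MSE, compared with the std dev of residuals.
import math
import statistics

ys_actual = [3.2, 4.8, 7.1, 8.9]
ys_pred = [3.0, 5.0, 7.0, 9.0]  # from a hypothetical fitted line

residuals = [a - p for a, p in zip(ys_actual, ys_pred)]
mse = sum(r ** 2 for r in residuals) / len(residuals)
rmse = math.sqrt(mse)

# These residuals average to zero, so the population standard
# deviation of the residuals equals the RMSE exactly.
std_of_residuals = statistics.pstdev(residuals)
```

A handy side effect of taking the square root is that RMSE is back in the original units of y, unlike MSE, whose units are squared.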

Looking at the two formulas above, we can see the similarity between them. The RMSE formula is an adjusted standard deviation formula, a combination of standard deviation and MSE. Since RMSE is the standard deviation of the residuals, the ideal is to wind up with a normal distribution among the residuals. The distribution of our residuals tends to reflect the distribution of our original data. In the plot below, from a project on housing data, the residuals show a strong normal distribution with a bit of a left skew, just like the data had.

Coefficient of Determination / R2

R squared, or the Coefficient of Determination, is another goodness-of-fit measure for a linear regression model that takes into account all of the variables and features and their relationship with the regression line. [5] The R squared value represents a percentage between 0% and 100%, but is usually written as a decimal between 0 and 1, i.e. 85% would be 0.85.

More precisely, R squared is the percentage of the dependent variable's variation that the linear model explains: R² = 1 − (SS_res / SS_tot), where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean of y. The variance explained by the model can also be called the effect size.
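The R² calculation can be sketched with the same kind of made-up data as before:

```python
# R-squared as 1 - SS_res / SS_tot: the fraction of variance in y
# that the model explains. Data points are illustrative.
ys_actual = [3.2, 4.8, 7.1, 8.9]
ys_pred = [3.0, 5.0, 7.0, 9.0]  # from a hypothetical fitted line

mean_y = sum(ys_actual) / len(ys_actual)

# Variance left unexplained by the model (squared residuals)...
ss_res = sum((a - p) ** 2 for a, p in zip(ys_actual, ys_pred))
# ...versus total variance around the mean of y.
ss_tot = sum((a - mean_y) ** 2 for a in ys_actual)

r_squared = 1 - ss_res / ss_tot
```

Because these hypothetical predictions sit very close to the actual values, SS_res is tiny relative to SS_tot and R² lands near 1; a model no better than predicting the mean would have SS_res ≈ SS_tot and R² near 0.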

ANOVA, or Analysis of Variance, testing is another common way to calculate the variance explained by the model.

Conclusion

It is an understatement to say that these three major metrics are vital to understanding the performance of your linear regression model. There are so many small moving parts to all three metrics that it can be difficult to remember how they were derived. Breaking down the formulas and rereading some of the concepts helped me gain a more robust insight into the model I was creating.

References

[1]https://en-academic.com/dic.nsf/enwiki/258028

[2]https://www.healthline.com/health/womens-health/average-weight-for-women#relationship-between-weight-and-height

[4]http://www.johnmyleswhite.com/notebook/2013/03/22/using-norms-to-understand-linear-regression/

[5]https://statisticsbyjim.com/regression/interpret-r-squared-regression/

Just a normally distributed millennial with a left skew. https://github.com/ozbunae