Ordinary Least Squares 2 Ways

Just the phrase ‘Ordinary Least Squares’ used to give me so much anxiety. It sounds so official. If you can’t tell from my other blog posts by now, I enjoy taking words and phrases that seem intimidating and making them accessible. I once heard someone say, “Oh, don’t stress about it, it’s just linear regression.” This isn’t entirely true. Let’s take some time to break it down.

In this article I expect you to know what y = mx + b is, what rise over run is, and what simple linear regression is. If not, brush up on basic algebra and come meet me back here!

What is it?

Take the simple example scatter plot below. We have seven points that clearly move in a positive linear trend. It isn’t a perfectly straight line, however. When x is equal to 5, y is equal to 12.7. Given the pattern so far, we would expect y to be greater than 12.7 when x moves on to 6. Instead, it drops down to 11. Moving on to x equals 7, y spikes to 16, and the points continue an overall positive linear relationship.

This leaves the question: how can we create a regression line that best describes all of these points, especially once you want to start using this information to make predictions about the future?

When would you need these equations?

Solving Using a System of Equations:

Don’t let this formula scare you.
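Written out, the system for a least-squares line usually looks like the pair of equations below. Treat this as a sketch: it assumes the line is y = ax + b, with a as the slope, b as the intercept, and n as the number of points, and it puts the equation containing n on the bottom so that it lines up with the elimination step later in this section.

```latex
\begin{aligned}
\sum_{i=1}^{n} x_i y_i &= a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i \\
\sum_{i=1}^{n} y_i     &= a \sum_{i=1}^{n} x_i   + b\,n
\end{aligned}
```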

In the chart below I’ve listed out all of the x and y values in the scatter plot and created two additional columns that we will need. The xy column is all of the respective x and y values multiplied together for each coordinate. The x² column is each x value from our set squared. When you see the i in the formula above, it stands for each individual point in the set. It acts as a placeholder for each value that corresponds to its respective variable. We then want to find the sum of all the values for each category. So, the sum of all x values, the sum of all y values, the sum of all xy values, and the sum of all x² values.

The bottom row is the sum of each column.
Values we need to solve the algebra.
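If you would rather let Python build that chart, here is a minimal sketch. It assumes x runs from 1 to 7, and only the last three y values (12.7, 11, and 16) come from the scatter plot described above. The first four y values are made-up placeholders for illustration, and `build_table` is just a name I picked.

```python
# Assumption: x runs 1..7, matching the seven points in the scatter plot.
# Only the last three y values (12.7, 11, 16) are stated in the article;
# the first four are made-up PLACEHOLDERS for illustration.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2.0, 5.0, 6.5, 9.0, 12.7, 11.0, 16.0]


def build_table(xs, ys):
    """Return the table rows (x, y, x*y, x**2) and the column totals."""
    rows = [(x, y, x * y, x ** 2) for x, y in zip(xs, ys)]
    # zip(*rows) regroups the rows into columns so each one can be summed.
    totals = tuple(sum(column) for column in zip(*rows))
    return rows, totals


rows, (sum_x, sum_y, sum_xy, sum_x2) = build_table(xs, ys)

print(f"{'x':>6} {'y':>6} {'xy':>8} {'x^2':>6}")
for x, y, xy, x2 in rows:
    print(f"{x:>6} {y:>6.1f} {xy:>8.1f} {x2:>6}")
print(f"{sum_x:>6} {sum_y:>6.1f} {sum_xy:>8.1f} {sum_x2:>6}  <- sums")
```

The last printed row plays the role of the bottom row of the chart: the four sums are the only numbers the equations need.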

Let’s go ahead and plug the numbers into our equation. Looking at the numbers plugged into the equation below, it’s way less scary than what you first saw, right? I love making that chart because it helps keep everything organized. Obviously, if you had a large dataset you would never solve this by hand, but I find it useful to know what is happening in the computer when you run the software.

If you are not familiar with solving systems of equations using the addition method, we are going to add the two equations together. This is kind of useless, though, unless we can cancel out one of the unknown variables (a or b). We do this by multiplying one entire equation by a number that gives us the same coefficient on the same variable in both equations, but with the opposite sign. In this case, 7 * 4 = 28, so we can multiply the ENTIRE bottom equation by -4 to give us a -28 that will cancel out the positive 28 on top. Voilà! With b eliminated, we can solve for a.
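The same addition-method steps can be sketched in Python. This assumes the system is arranged with the equation containing n on the bottom, that a is the slope and b is the intercept, and that x runs from 1 to 7. The function name `solve_by_elimination` is my own, and the first four y values are placeholders, not the real data.

```python
def solve_by_elimination(n, sum_x, sum_y, sum_xy, sum_x2):
    """Solve the two-equation system by the addition method.

    top:    sum_xy = a * sum_x2 + b * sum_x
    bottom: sum_y  = a * sum_x  + b * n
    Returns (a, b), i.e. (slope, intercept).
    """
    # Multiplier for the bottom equation so its b coefficient becomes
    # the negative of the top one. With n = 7 and sum_x = 28 this is -4.
    k = -sum_x / n

    # Add the scaled bottom equation to the top one; the b terms cancel.
    left_side = sum_xy + k * sum_y
    a_coefficient = sum_x2 + k * sum_x
    a = left_side / a_coefficient

    # Substitute a back into the bottom equation of the system to get b.
    b = (sum_y - a * sum_x) / n
    return a, b


# Assumption: x runs 1..7. Only the last three y values come from the
# article; the first four are made-up PLACEHOLDERS for illustration.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2.0, 5.0, 6.5, 9.0, 12.7, 11.0, 16.0]

slope, intercept = solve_by_elimination(
    n=len(xs),
    sum_x=sum(xs),
    sum_y=sum(ys),
    sum_xy=sum(x * y for x, y in zip(xs, ys)),
    sum_x2=sum(x * x for x in xs),
)
print(f"slope a = {slope:.3f}, intercept b = {intercept:.3f}")
```

Note that the back-substitution goes into one of the two summed equations, not into y = mx + b with a single point, so the intercept still reflects every point in the set.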

Solving Using Basic Algebra:

It is also really important to note with this method that once you have found your slope, you cannot just plug it into y = mx + b along with a single point from the set to get the intercept. It doesn’t work because you calculated the slope using ALL of the points in the set, not just one, and the fitted line generally does not pass exactly through any individual point. To complement this, you need a formula for the intercept that also uses ALL of the points in the set. In my mind it is easier to remember a system of equations as opposed to two separate formulas. It is also easier to see the relationship between your variables in a system of equations.
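For reference, the two separate formulas are commonly written as below. This is a sketch that assumes a is the slope, b is the intercept, and n is the number of points; other write-ups arrange the same terms differently.

```latex
\begin{aligned}
a &= \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2} \\[6pt]
b &= \frac{\sum y_i \sum x_i^2 - \sum x_i \sum x_i y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}
\end{aligned}
```

Both formulas are built only from sums over the whole set, which is why every point has a say in the slope and in the intercept. The intercept also works out to the mean of y minus a times the mean of x.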

We are going to use the same calculations from the table in the first example to plug into our formulas.
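Here is a small sketch of plugging those table sums into the two formulas, followed by a check of what happens if you try the single-point shortcut instead. It assumes the common closed-form versions of the slope and intercept formulas, and the name `fit_from_sums` is my own. The first four y values are placeholders, not the real data.

```python
def fit_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Closed-form slope and intercept computed from the table sums."""
    denominator = n * sum_x2 - sum_x ** 2
    slope = (n * sum_xy - sum_x * sum_y) / denominator
    intercept = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    return slope, intercept


# Assumption: x runs 1..7. Only the last three y values come from the
# article; the first four are made-up PLACEHOLDERS for illustration.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2.0, 5.0, 6.5, 9.0, 12.7, 11.0, 16.0]

slope, intercept = fit_from_sums(
    n=len(xs),
    sum_x=sum(xs),
    sum_y=sum(ys),
    sum_xy=sum(x * y for x, y in zip(xs, ys)),
    sum_x2=sum(x * x for x in xs),
)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")

# The shortcut: solve y = slope * x + b for b using one point at a time.
# Each point gives a different answer, so no single point can be trusted.
single_point_intercepts = [y - slope * x for x, y in zip(xs, ys)]
for x, b in zip(xs, single_point_intercepts):
    print(f"intercept implied by the point at x = {x}: {b:.3f}")
```

The spread in those single-point intercepts is the scatter of the points around the line, which is exactly why the intercept needs its own all-points formula.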

Solving OLS in Python Code
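Here is one minimal way to do it in plain Python, offered as a sketch rather than the only approach. It uses the mean-centered form of the same formulas, the function names `ols_fit` and `predict` are my own, and the first four y values are placeholders because only the last three (12.7, 11, and 16) are stated in this article.

```python
def ols_fit(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = s_xy / s_xx
    # The least-squares line always passes through (x_mean, y_mean).
    intercept = y_mean - slope * x_mean
    return slope, intercept


def predict(x, slope, intercept):
    """Return the fitted line's y value at x."""
    return slope * x + intercept


# Assumption: x runs 1..7. Only the last three y values come from the
# article; the first four are made-up PLACEHOLDERS for illustration.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2.0, 5.0, 6.5, 9.0, 12.7, 11.0, 16.0]

slope, intercept = ols_fit(xs, ys)
print(f"y = {slope:.3f}x + {intercept:.3f}")
print(f"prediction at x = 8: {predict(8, slope, intercept):.3f}")
```

If you have NumPy installed, `np.polyfit(xs, ys, 1)` returns the slope and then the intercept for the same straight-line fit, which makes for a handy cross-check.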

Below is the regression line that our formula creates.





Just a normally distributed millennial with a left skew. https://github.com/ozbunae