Ordinary Least Squares 2 Ways

Just the phrase ‘Ordinary Least Squares’ used to give me so much anxiety. It sounds so official. If you can’t tell from my other blog posts by now, I enjoy taking words and phrases that seem intimidating and making them accessible. I once heard someone say “Oh don’t stress about it, it’s just linear regression”. This isn’t entirely true. Let’s take some time to break it down.

In this article I expect you know what y = mx + b is, what rise over run is, and what simple linear regression is. If not, brush up on basic algebra and come meet me back here!

What is it?

OLS, or Ordinary Least Squares, is (traditionally) a system of equations for finding the line of best fit through a series of points in two-dimensional space. It is a way to identify the linear trend that best describes those points. It is called Ordinary Least Squares because it is the original method that was devised for finding a regression line by minimizing the sum of the squared residuals, where a residual is the vertical distance between a point and the line. [1]
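In symbols, and assuming the line is written as y = ax + b with a as the slope and b as the intercept (the naming used later in this post), the quantity being minimized can be sketched as:

```latex
S(a, b) = \sum_{i=1}^{n} \bigl( y_i - (a x_i + b) \bigr)^2
```

Each term inside the big parentheses is one residual. Squaring keeps points above and below the line from cancelling each other out.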

Take the simple example scatter plot below. We have seven points that clearly follow a positive linear trend, but they don’t fall on a perfectly straight line. When x equals 5, y equals 12.7. Given the pattern, we would expect y to be greater than 12.7 when x moves on to 6, but it drops to 11 instead. At x equals 7, y spikes to 16, and the overall positive linear relationship between our points continues.

This leaves the question: how can we create a regression line that best describes all of these points? That matters even more once you want to start using the line to make predictions about the future.

When would you need these equations?

Technically, never. Once you have dozens, hundreds, or thousands of points, it is just silly to solve this by hand. I am just a big proponent of knowing the math that drives the code or software that you are using.

Solving Using a System of Equations:

In this method we are going to use a system of equations to solve for the regression line that best describes our set of points. I am going to use the addition method to solve the system. If you are not familiar with systems of equations, don’t let them frighten you. It really is pretty simple. Let’s start by gathering all of the values we need to plug into the formula. In this formula, let a represent the slope, b represent the intercept, and n be the number of points we have in the set.
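For reference, one standard way to write this system is the pair of least squares normal equations below. The ordering (the xy equation on top) is an assumption, chosen to match the elimination step later in this post.

```latex
\begin{aligned}
\sum_{i=1}^{n} x_i y_i &= a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i \\
\sum_{i=1}^{n} y_i     &= a \sum_{i=1}^{n} x_i   + b\,n
\end{aligned}
```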

Don’t let this formula scare you.

In the chart below I’ve listed out all of the x and y values in the scatter plot and created two additional columns that we will need. The xy column is the x and y values multiplied together for each coordinate. The x² column is each x value from our set squared. When you see the i in the formula above, this is what it represents: it acts as a placeholder for each value that corresponds to its respective variable. We then want to find the sum of all the values for each category. So, the sum of all x values, the sum of all y values, the sum of all xy values, and the sum of all x² values.

The bottom row is the sum of each column.
Values we need to solve the algebra.
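The same chart can be sketched in a few lines of plain Python. The data here is partly an assumption: x is taken to run from 1 to 7 (which matches seven points), the last three y values (12.7, 11, 16) come from the text above, and the first four y values are made-up placeholders, so the sums will not match the original chart exactly.

```python
# Assumed data: x = 1..7; only the last three y values are from the post,
# the first four are hypothetical placeholders for illustration.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2.0, 5.0, 7.0, 10.0, 12.7, 11.0, 16.0]

# The two extra columns, built row by row.
xy = [x * y for x, y in zip(xs, ys)]   # x times y for each point
x_sq = [x ** 2 for x in xs]            # x squared for each point

# The bottom row of the chart: one sum per column.
n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(xy)
sum_x_sq = sum(x_sq)

print(f"n={n}, sum x={sum_x}, sum y={sum_y:.1f}, "
      f"sum xy={sum_xy:.1f}, sum x^2={sum_x_sq}")
```

These five numbers (n and the four sums) are everything the system of equations needs.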

Let’s go ahead and plug the numbers into our equations. Looking at the numbers plugged into the equations below, it’s way less scary than what you first saw, right? I love making that chart because it helps keep everything organized. Obviously if you had a large dataset you would never solve this by hand, but I find it useful to know what is happening in the computer when you run the software.

If you are not familiar with solving systems of equations using the addition method, the idea is that we add the two equations together. This is kind of useless, though, unless we can cancel out one of the unknown variables (a or b). We do this by multiplying one entire equation by a number that gives that variable the same coefficient in both equations but with the opposite sign. In this case, 7 * 4 = 28, so we can multiply the ENTIRE bottom equation by -4 to give us a -28 that will cancel out the positive 28 on top. Adding the two equations then eliminates b. Voilà! We can solve for a.
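The addition method can be sketched as a small function. This is a minimal illustration, not the only way to do it: the function name `solve_by_elimination` is hypothetical, and the sums passed in at the bottom (63.7 and 314.5) come from placeholder y values, not from the original chart.

```python
def solve_by_elimination(n, sum_x, sum_y, sum_xy, sum_x_sq):
    """Solve the system
         sum_xy = a * sum_x_sq + b * sum_x   (top)
         sum_y  = a * sum_x    + b * n       (bottom)
    using the addition method."""
    # Multiplier for the bottom equation so its b term becomes -b * sum_x.
    # With n = 7 and sum_x = 28 this is -4.
    k = -sum_x / n
    # Add the scaled bottom equation to the top one; the b terms cancel.
    a_coeff = sum_x_sq + k * sum_x
    rhs = sum_xy + k * sum_y
    a = rhs / a_coeff
    # Substitute a back into the bottom equation to get b.
    b = (sum_y - a * sum_x) / n
    return a, b


# Assumed sums: n, sum x, and sum x^2 follow from x = 1..7;
# sum y and sum xy are based on hypothetical y values.
a, b = solve_by_elimination(n=7, sum_x=28, sum_y=63.7, sum_xy=314.5, sum_x_sq=140)
print(f"slope a = {a:.4f}, intercept b = {b:.4f}")
```

Once a is known, the intercept falls out of either equation, which is why only one variable has to be eliminated.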

Solving Using Basic Algebra:

If you don’t like using systems of equations, or find them outside of your comfort zone, this is a second method for finding the same answer using basic algebraic expressions. The aspect that I don’t like about this method is that it involves two separate equations to solve as opposed to one system of equations.

It is also really important to note with this method that once you have found your slope, you cannot just plug it into y = mx + b along with a single point from the set to get the intercept. It doesn’t work because you calculated the slope using ALL of the points in the set, not just one. To complement this, you need a formula for the intercept that also uses ALL of the points in the set. In my mind it is easier to remember a system of equations than two separate ones. It is also easier to see the relationship between your variables in a system of equations.

We are going to use the same calculations from the table in the first example to plug into our formulas.
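For reference, a standard closed-form pair for the slope and intercept is shown below, using the same a, b, and n as before. Treat it as one common way to write these formulas; other texts express the same thing using the means of x and y.

```latex
a = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2},
\qquad
b = \frac{\sum y_i - a \sum x_i}{n}
```

Notice that the intercept formula uses the sums over all of the points, not a single coordinate.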

Solving OLS in Python Code

Obviously there are built-in functions in Python to automatically find the slope and intercept. There are even built-in functions to automatically find the regression line and plot it. Let’s do the long code using NumPy just for fun!
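A minimal sketch of the long way in NumPy might look like this. As before, the data is an assumption: x runs 1 to 7, the last three y values are from this post, and the first four are hypothetical placeholders. `np.polyfit` is used only as a cross-check against the hand-built formulas.

```python
import numpy as np

# Assumed data: only the last three y values come from the post.
x = np.arange(1, 8, dtype=float)
y = np.array([2.0, 5.0, 7.0, 10.0, 12.7, 11.0, 16.0])

n = x.size

# Slope and intercept from the sums, the long way.
slope = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (
    n * np.sum(x ** 2) - np.sum(x) ** 2
)
intercept = (np.sum(y) - slope * np.sum(x)) / n

# Cross-check with NumPy's built-in degree-1 polynomial fit,
# which returns coefficients from highest degree to lowest.
check_slope, check_intercept = np.polyfit(x, y, 1)

# Points on the fitted line, ready to plot against x.
y_hat = slope * x + intercept

print(f"slope={slope:.4f}, intercept={intercept:.4f}")
```

Plotting `y_hat` against `x` on top of the scatter plot gives the regression line.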

Below is the regression line that our formula creates.


Although the equations look a little intimidating when you are first learning OLS, it actually all goes back to simple algebra. Having a strong foundation in algebra is the key to understanding statistics. Knowing the commands to automatically run statistical models like linear regression is practical and useful. Not understanding the magic behind them can be a fatal flaw. Try solving the equations on your own without looking at my work and see if you can do it!





Just a normally distributed millennial with a left skew. https://github.com/ozbunae