Just the phrase ‘Ordinary Least Squares’ used to give me so much anxiety. It sounds so official. If you can’t tell from my other blog posts by now, I enjoy taking words and phrases that seem intimidating and making them accessible. I once heard someone say “Oh don’t stress about it, it’s just linear regression”. This isn’t entirely true. Let’s take some time to break it down.

In this article I expect you know what y = mx + b is, what rise over run is, and what simple linear regression is. …

The Who, Where, When, What, Why, & How of Multicollinearity.


Multicollinearity is a phenomenon unique to multiple regression that occurs when two or more variables that are supposed to be independent are, in reality, highly correlated.[1] The key to this argument is that they should be independent. When they are not, the model cannot cleanly separate each variable’s effect, so the individual coefficient estimates become unstable. If the correlation is extremely high, this can cause an analyst to misinterpret results. That is a very quick overview, so let’s break it down a little more.
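To make the idea concrete, here is a minimal sketch in plain Python. It measures how correlated two predictors are and turns that into a variance inflation factor (VIF), which for exactly two predictors is 1 / (1 - r²). The housing numbers and the names `square_feet` and `rooms` are made up for illustration; they are not from any data set in this article. The cutoff of 10 is a common rule of thumb, not a hard rule.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def vif_two_predictors(xs, ys):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2).

    Undefined (division by zero) when the predictors are perfectly correlated.
    """
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical features that we would like to treat as independent.
square_feet = [850, 900, 1200, 1500, 1800, 2400]
rooms = [3, 3, 4, 5, 6, 8]

r = pearson_r(square_feet, rooms)
vif = vif_two_predictors(square_feet, rooms)
print(f"correlation = {r:.3f}, VIF = {vif:.1f}")

# Rule of thumb (an assumption, some analysts use 5 instead of 10).
if vif > 10:
    print("These two features are likely collinear.")
```

With more than two predictors, each VIF comes from regressing that predictor on all the others, which is more than this sketch does.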


Where would this occur in the real world? Collecting similar features allows a company…

In data science we talk a lot about adding a narrative to something, usually in the context of data visualization. This is what I love about time series: our narrative comes from the literal passage of time, just as an actual story would. Housing data is a common practice case and real-life scenario in which we set a time, a place, and so forth. Sometimes the story our data tells can be random, and it is not easy to predict what’s going to happen next, just like in a story. Luckily, checking for a unit root helps determine that randomness.

Covering Ground


This is a replica of a self-portrait from when I was trying to understand Root Mean Squared Error

I just remember sitting in data science boot camp and having it drilled into our heads to check the Mean Squared Error (MSE), the Root Mean Squared Error (RMSE), and the R squared (R²) whenever we were handling regression modeling. When the subject was first introduced there was extensive explanation of the material. A couple of months after that, I have to be honest, I would sometimes just calculate these numbers to check accuracy without the deepest level of understanding about what each term truly means. Pun intended.
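For reference, all three numbers can be computed by hand in a few lines. This is a small sketch in plain Python using the standard definitions; the function name `regression_metrics` and the sample values are mine, chosen only so the arithmetic is easy to follow.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R squared) for paired actual and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    # MSE: the average of the squared errors.
    ss_res = sum(r ** 2 for r in residuals)
    mse = ss_res / n

    # RMSE: the square root of MSE, back in the units of the target.
    rmse = math.sqrt(mse)

    # R squared: 1 minus (unexplained variation / total variation around the mean).
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return mse, rmse, r2

# Made-up values for illustration.
y_true = [3, 5, 7, 9]
y_pred = [2, 5, 8, 9]

mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}, R2 = {r2:.3f}")
# MSE = 0.500, RMSE = 0.707, R2 = 0.900
```

Here two of the four predictions are off by 1, so the squared errors sum to 2 and the MSE is 2 / 4 = 0.5.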

After spending time creating even a simple regression model, it…

Creating visualizations for your data is essential. In another post I take an in-depth look at EDA according to the National Institute of Standards and Technology, which can be found here.

After talking about the importance of EDA, it becomes a syntactical issue. In this article I plan on walking through different techniques and tricks for customizing plots in Python.

As you read, keep referring back to this table I created for myself in an intro to Stats class. Actually my professor created it as a list, but I made a table out of it and keep it…

In a recent project on Broadway Grosses I used machine learning to predict when a Broadway show would close based on features like the previous week’s grosses. The idea is that when we look at this graph we see a visible decline in gross, with the red marking the end of the production’s life. We did this with a data set that had 5 years’ worth of Broadway grosses, marking the last 6 weeks of every show that had closed with a 1 and everything else with a 0, which makes this a binary classification problem.
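The labeling step described above can be sketched roughly like this. This is not the project’s actual code: the function `label_closing_weeks`, the dictionary layout, the show names, and the gross figures are all hypothetical stand-ins, assuming each show’s weekly grosses are stored in chronological order.

```python
def label_closing_weeks(weeks_by_show, closed_shows, window=6):
    """Label each week 1 if it falls in the last `window` weeks of a closed show, else 0.

    weeks_by_show: dict mapping show name -> list of weekly grosses, oldest first.
    closed_shows:  set of show names that have closed.
    """
    labels = {}
    for show, grosses in weeks_by_show.items():
        n = len(grosses)
        if show in closed_shows:
            # Index of the first week inside the closing window.
            first_positive = max(0, n - window)
            labels[show] = [1 if i >= first_positive else 0 for i in range(n)]
        else:
            # Shows that are still running never get a positive label.
            labels[show] = [0] * n
    return labels

# Hypothetical weekly grosses, in thousands of dollars.
weeks_by_show = {
    "Show A": [900, 880, 850, 700, 650, 600, 520, 480],  # closed
    "Show B": [1200, 1250, 1230],                         # still running
}
labels = label_closing_weeks(weeks_by_show, closed_shows={"Show A"})
print(labels["Show A"])  # [0, 0, 1, 1, 1, 1, 1, 1]
print(labels["Show B"])  # [0, 0, 0]
```

A closed show with fewer than 6 weeks of data gets a 1 on every week, which is a choice made for this sketch.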

Support Vector Machine

Support Vector Machine was one…


The confusion matrix is a quintessential part of our work as data scientists. It is our bread and butter: a way of visualizing the performance of our model. Tackling this remains relatively simple for two classes, but as our matrix balloons, calculations can become muddy.

Technical terminology associated with and reviewed in this article includes true positives, true negatives, false positives, and false negatives, which in turn yield the true and false positive rates as well as the true and false negative rates. We can also evaluate metrics like accuracy, precision, recall, and F1 score.
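For the two-class case, every one of those metrics falls out of the four counts. Here is a short sketch using the standard definitions; the function name `binary_metrics` and the counts are invented for illustration, and it assumes none of the denominators are zero.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute common metrics from the four cells of a 2x2 confusion matrix."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called the true positive rate
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "true_positive_rate": recall,
        "false_positive_rate": fp / (fp + tn),
        "true_negative_rate": tn / (tn + fp),
        "false_negative_rate": fn / (fn + tp),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Made-up counts: 100 predictions in total.
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 85 / 100 = 0.85, recall is 40 / 50 = 0.8, and the false positive rate is 5 / 50 = 0.1.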

We are going to hit three main…

Exploratory Data Analysis, or EDA for short, critically establishes the initial relationships between variables and features. After the data has been cleaned thoroughly, it is the first proper insight into what the data will tell us. It is also how we give closure to a project. Many times it provides a small arsenal of visual representations that strip the data of its pretentious jargon and make it friendly to stakeholders. This blog post sets out to:

  1. Cover the history of EDA.
  2. Understand the purpose of EDA.
  3. Give examples of EDA extracted from actual data science…

Andrew Ozbun

Just a normally distributed millennial with a left skew. https://github.com/ozbunae
