How to Clean a Really Dirty Column

My cohort mate and I recently decided to tackle an incredibly dirty data set. While in school together, we agreed that most of the data we worked with was safe: there were usually very few missing values, and any that were present could be dropped or interpolated in some way. Although this is student-friendly, it is highly unrealistic.

Tales of a Learning Experience

Learning anything new is just that, a learning process. You don’t know until you know. It’s always insightful to look back at work from when you were first learning something, laughing at yourself with that internal dialogue of “What the heck was I thinking?? This is terrible!!”

Making the terrifying technical interview less terrifying.

I can’t tell you how often this happened when I was first starting.

I am brand new to coding, data analytics, and data science. I am starting my initial interviews to land that coveted entry-level job. The biggest source of stress is performing a coding exercise while an interviewer watches. It makes me sweat, and I blank out on basic technical skills like how to write a for loop properly. The exercises are usually very formulaic, requiring some sort of function that will execute tasks like sorting through a list of numbers or replacing multiples of a certain number with hokey words and phrases.

Just the phrase ‘Ordinary Least Squares’ used to give me so much anxiety. It sounds so official. If you can’t tell from my other blog posts by now, I enjoy taking words and phrases that seem intimidating and making them accessible. I once heard someone say “Oh don’t stress about it, it’s just linear regression”. This isn’t entirely true. Let’s take some time to break it down.

The Who, Where, When, What, Why, & How of Multicollinearity.


Multicollinearity is a phenomenon unique to multiple regression that occurs when two or more variables that are supposed to be independent are, in reality, highly correlated.[1] The key to this argument is that they should be independent. Having predictors that are not independent makes the coefficient estimates unstable, because the model cannot cleanly separate each variable’s effect. Especially if the correlation is extremely high, it can cause an analyst to misinterpret results. That is a very quick overview, so let’s break it down a little more.
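One quick way to see whether two predictors are moving together is to measure their correlation. Here is a minimal sketch using only the standard library. The `sqft` and `rooms` columns are made-up numbers for illustration, and the variance inflation factor shown is the simple two-predictor case, where it reduces to 1 / (1 - r²).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical predictors that are "supposed" to be independent.
sqft = [800, 950, 1100, 1400, 1800]
rooms = [2, 3, 3, 4, 5]

r = pearson_r(sqft, rooms)

# With exactly two predictors, the variance inflation factor is 1 / (1 - r^2).
vif = 1 / (1 - r ** 2)

print(f"correlation: {r:.3f}")
print(f"VIF (two-predictor case): {vif:.1f}")
```

A common rule of thumb treats a VIF above roughly 5 or 10 as a warning sign, though the exact cutoff is a judgment call.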


Where would this occur in the real world? Collecting similar features allows a company…

In data science we talk a lot about adding a narrative to something, usually in the context of data visualization. This is what I love about time series: our narrative comes from the literal passage of time, just as an actual story would. Housing data is a common example and a real-life scenario in which we set a time, a place, and so forth. Sometimes the story our data tells can be random, and it is not easy to predict what’s going to happen next, just like in a story. Luckily, testing for a unit root helps us determine that randomness.

Covering Ground


This is a replica of a self portrait when I was trying to understand Root Mean Squared Error

I just remember sitting in data science boot camp and having it drilled into our heads to check the Mean Squared Error (MSE), the Root Mean Squared Error (RMSE), and the R-squared (R²) whenever we were handling regression modeling. When the subject was first introduced there was extensive explanation of the material. A couple of months after that, I have to be honest, I would sometimes just calculate these numbers to check accuracy without having the deepest understanding of what each term truly means. Pun intended.
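To make the three numbers a little less mysterious, here is a small sketch that computes them by hand from their standard definitions. The `y_true` and `y_pred` values are invented for illustration, and `regression_metrics` is a hypothetical helper name, not something from a library.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R-squared) for paired actual and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    # MSE: the average of the squared residuals.
    ss_res = sum(e ** 2 for e in residuals)
    mse = ss_res / n

    # RMSE: the square root of MSE, back in the units of the target.
    rmse = math.sqrt(mse)

    # R-squared: 1 minus (residual sum of squares / total sum of squares).
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return mse, rmse, r2


# Made-up actual and predicted values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

For these toy numbers the squared errors sum to 2 across 4 points, so the MSE is 0.5, and the model explains 90% of the variance around the mean.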

Creating visualizations for your data is essential. In another post, I take an in-depth look at EDA according to the National Institute of Standards and Technology.

In a recent project on Broadway grosses, I used machine learning to predict when a Broadway show would close based on features like the previous week’s grosses. The idea is that when we look at this graph we see a visible decline in gross, with the red marking the end of the production’s life. We did this with a data set that had 5 years’ worth of Broadway grosses, marking the last 6 weeks of every show that had closed with a 1 and everything else with a 0, which makes this a binary classification problem.
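The labeling step could look something like the sketch below. This is not the project’s actual code: the show names, the integer week numbers, and the `label_closing_weeks` helper are all hypothetical, and the real data set would carry dates and grosses alongside each week.

```python
def label_closing_weeks(weeks_by_show, closed_shows, window=6):
    """Label each (show, week) pair: 1 for the final `window` weeks of a
    show that has closed, 0 for everything else."""
    labels = {}
    for show, weeks in weeks_by_show.items():
        ordered = sorted(weeks)
        # Only closed shows get positive labels; running shows stay all 0.
        if show in closed_shows:
            cutoff = len(ordered) - window
        else:
            cutoff = len(ordered)
        for i, week in enumerate(ordered):
            labels[(show, week)] = 1 if i >= cutoff else 0
    return labels


# Hypothetical data: week numbers in which each show reported grosses.
weeks_by_show = {
    "Show A": list(range(1, 11)),  # ran 10 weeks, then closed
    "Show B": list(range(1, 9)),   # still running after 8 weeks
}
closed_shows = {"Show A"}

labels = label_closing_weeks(weeks_by_show, closed_shows)
for key in sorted(labels):
    print(key, labels[key])
```

Here weeks 5 through 10 of "Show A" get a 1 and every other row gets a 0, which is the target a binary classifier would then be trained on.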

Support Vector Machine

Support Vector Machine was one…


The confusion matrix is a quintessential part of our work as data scientists. It is our bread and butter, a way of visualizing the performance of our model. Tackling this remains relatively simple for two classes, but as our matrix balloons, calculations can become muddy.
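To show how the bookkeeping works once there are more than two classes, here is a minimal sketch in plain Python. The three animal labels and the predictions are made up, and both helper functions are hypothetical names. Rows are the true class and columns are the predicted class.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Build a matrix where rows are true classes and columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_precision_recall(matrix, labels):
    """Treat each class in turn as 'positive' and compute precision and recall."""
    stats = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted_as_label = sum(row[i] for row in matrix)  # column total
        actually_label = sum(matrix[i])                     # row total
        precision = true_pos / predicted_as_label if predicted_as_label else 0.0
        recall = true_pos / actually_label if actually_label else 0.0
        stats[label] = (precision, recall)
    return stats


# Made-up true and predicted classes.
labels = ["bird", "cat", "dog"]
y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat"]

matrix = confusion_matrix(y_true, y_pred, labels)
stats = per_class_precision_recall(matrix, labels)

for row_label, row in zip(labels, matrix):
    print(row_label, row)
for label, (precision, recall) in stats.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

The trick is that each class gets its own one-vs-rest view: the diagonal cell is its true positives, the rest of its column is false positives, and the rest of its row is false negatives.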

Andrew Ozbun

Just a normally distributed millennial with a left skew.
