In a recent project on Broadway grosses, I used machine learning to predict when a Broadway show would close, based on features like the previous week’s grosses. The idea is that when we look at this graph, we see a visible decline in gross, with red marking the end of the production’s run. The dataset held five years’ worth of Broadway grosses; we labeled the last six weeks of every show that had closed with a 1 and everything else with a 0, making this a binary classification problem.
Support Vector Machine
Support Vector Machine (SVM) was one of the statistical models implemented. SVM is an algorithm that separates classes by fitting what you might call a line of best split. This line is known as a hyperplane, and it forms the decision boundary between the two classes. That being said, SVM can perform binary as well as multi-class classification.
However, when I ran the model the results were puzzling: the machine had high accuracy, but only for one specific class. This is known as the accuracy paradox, and it is why it is so important to look at a full classification report and confusion matrix, not just the accuracy score. Keep in mind that the machine’s goal is to maximize overall accuracy. Therefore, to cut its losses, it simply predicts everything as the majority class and still achieves a high overall score.
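A minimal sketch of the accuracy paradox, using made-up labels rather than the actual Broadway data: a model that predicts the majority class for everything can still post a high accuracy score, while the confusion matrix exposes that it never gets the minority class right.

```python
# Toy illustration of the accuracy paradox (not the project's real data):
# 90% of shows are "open" (0), 10% are "closing" (1).
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0] * 90 + [1] * 10   # imbalanced ground truth
y_pred = [0] * 100             # model lazily predicts "open" every time

print(accuracy_score(y_true, y_pred))    # 0.9 -- looks great on paper...
print(confusion_matrix(y_true, y_pred))  # ...but the "closing" row is all misses
```

The 90% accuracy hides the fact that the model identifies zero closing shows, which is exactly what the confusion matrix reveals.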
Looking into the Open vs. Closing class distribution, it is extremely imbalanced. Class imbalance is a common issue that causes problems for machine learning models, especially SVM. The problem in this case is that our classes will always be imbalanced, because productions stay open far more often than they close.
Realizing that this was likely a class imbalance issue led me to a go-to in my toolkit: SMOTE, the Synthetic Minority Over-sampling Technique. It is similar in spirit to the bootstrapping techniques from stats class, except that instead of resampling existing rows, it creates new synthetic minority-class samples by interpolating between existing ones. By evening out the class imbalance in our target, we hopefully get a less biased model. Unfortunately, on its own it didn’t work: the model behaved the same way, just flipping which class it chose to predict.
Kernel, C, and Gamma
The additional solution to this problem came down to adjusting the hyperparameters kernel, C, and gamma. It is important to know that a “perfect” SVM — one that classifies every training point correctly — is an overfit model; therefore something called a soft margin is used, which allows some misclassification in exchange for a generally better model.
Kernel: This is commonly referred to as the kernel trick. A kernel is a measure of similarity between input points. There are four built-in kernel arguments listed in the sklearn documentation: linear, poly, rbf, and sigmoid.
Linear creates a straight-line (flat) decision boundary and cannot capture nonlinear class structure. Linear is technically the same as poly with degree=1.
Poly (polynomial) learns how features interact by using combinations of the existing variables (i.e. x, y, xy, x**2).
RBF (the default), or Radial Basis Function, was finally the kernel trick that improved the performance of the SVM model. As the name implies, RBF takes into account the relative Euclidean distance between two points: the decision is based on the radial distance from a point to the training examples. RBF is also sometimes referred to as the Gaussian kernel, as it sums Gaussian curves around the points, creating a topography used to form the decision boundary.
Sigmoid comes from the activation function of the same name; it squashes values toward binary decisions like on/off, 1/0, etc. It is not ideal for what we are doing here; from what I have read, it tends to predict everything as one class or the other and needs further adjustment if used.
Gamma: Related to the RBF kernel. A low gamma allows each point’s radius of influence to be larger; a high gamma shrinks the permitted radius.
C: The C hyperparameter penalizes the model each time it misclassifies a training point. A low C assigns a small penalty to misclassification, producing a wider, more forgiving margin and a more robust decision boundary. A high C applies more penalty to misclassification, narrowing the margin and risking overfit.
Euclidean Distance: Since I brought it up, the Euclidean distance is the straight-line distance between two points in Euclidean space. Euclidean space is the familiar flat space of classical geometry, extended to any number of dimensions.
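The straight-line distance above is just the square root of the summed squared differences. A quick check with the classic 3-4-5 right triangle:

```python
# Euclidean distance: sqrt of the sum of squared coordinate differences.
import numpy as np

a = np.array([3.0, 0.0])
b = np.array([0.0, 4.0])

dist = np.sqrt(np.sum((a - b) ** 2))   # classic 3-4-5 right triangle
print(dist)                            # 5.0
assert dist == np.linalg.norm(a - b)   # same as numpy's built-in norm
```

The same formula extends unchanged to vectors with any number of features, which is exactly how the RBF kernel measures how far apart two samples are.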
**Note that gamma is set to “scale”. From the sklearn documentation: if gamma='scale' (default) is passed, then it uses 1 / (n_features * X.var()) as the value of gamma. Essentially, gamma='scale' tailors the coefficient to the dataset and is a sensible default for kernel='rbf'.
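Rather than guessing at C and gamma by hand, the combinations can be searched systematically. A hedged sketch with scikit-learn's grid search on synthetic imbalanced data (the grid values and dataset here are illustrative, not the ones from the project):

```python
# Illustrative C/gamma tuning for an RBF SVM via cross-validated grid search.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in: ~85/15 class split, echoing the open-vs-closing imbalance.
X, y = make_classification(n_samples=300, weights=[0.85], random_state=0)

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
    scoring="f1",   # accuracy alone would hide the minority class, so score on f1
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)   # the C/gamma pair with the best cross-validated f1
```

Scoring on f1 instead of accuracy is the key detail here: it forces the search to reward models that actually find the minority class.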
We can see in the confusion matrix below that the model is now predicting both classes, with some margin of error. Allowing this margin of error makes for a more practical model than one that can only predict one class. It’s important to keep in mind that this confusion matrix came from the combined use of our hyperparameters AND SMOTE.
The final accuracy score of this model was 0.72, or 72%. It outperformed Logistic Regression, which came in at 0.65. However, Random Forest came in at 0.90 and XGBoost at 0.95, making them our ultimate winners. Although SVM didn’t win first place and its accuracy was technically lower, a more pragmatic model was created — not only by implementing SMOTE, but also by tuning our hyperparameters to fit our data. SMOTE handled the class imbalance, and the hyperparameters governed how the decision boundary was drawn. Without the combined use of these two specifications, our SVM would have been a total and utter failure.