Creating an Elegant Plot
Creating visualizations for your data is essential. In another post I take an in-depth look at exploratory data analysis (EDA) according to the National Institute of Standards and Technology, which can be found here.
Once the importance of EDA is established, the rest becomes a question of syntax. In this article I walk through different techniques and tricks for customizing plots in Python.
As you read, keep referring back to this table I created for myself in an intro to stats class. Actually, my professor created it as a list, but I made a table out of it and keep it posted on the bulletin board in my office. It has proved to be invaluable. The only caveat is that the further I go in my data journey, the more I hear and read people say “Oh god, never use a pie chart”. I have to agree on this one detail. I have rarely seen a pie chart used in a professional study, and after creating a ton of my own visualizations, I personally dislike using them as well. They just aren’t the most effective way to convey the story you want to tell.
Basic Charts and Graphs
Matplotlib and Seaborn are the two libraries a data scientist or analyst working in Python will reach for most often.
Matplotlib
Matplotlib is the gold standard for creating any sort of statistical visualization, and its plotting interface is modeled on MATLAB's. Built on NumPy, its object-oriented API can be embedded easily into applications and GUIs. It does have a rudimentary feel to it, so plots made with matplotlib alone can appear less refined than plots made with additional libraries on top of it. Below is a simple bar plot in matplotlib showing class imbalance, more likely to be used for technical analysis than shown to a stakeholder.[1]
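For readers without the original figure, here is a minimal sketch of that kind of class-imbalance bar plot. The class labels and counts are made up for illustration and are not the data from my project.

```python
import matplotlib.pyplot as plt

# Hypothetical class counts for an imbalanced binary target
class_counts = {"0 (declined)": 4520, "1 (accepted)": 480}

# Store the figure and axes so they can be customized afterwards
fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(list(class_counts.keys()), list(class_counts.values()))

ax.set_title("Class balance of the target")
ax.set_xlabel("Class")
ax.set_ylabel("Number of rows")
```

In a notebook, the figure renders when the cell finishes; in a script you would finish with plt.show().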
Seaborn
Seaborn is a statistical graphics library that builds on top of matplotlib and works directly with pandas DataFrames. Even basic seaborn plots can add a little more sophistication to a visualization. It is important to note when working with seaborn that sns calls can be used in tandem with plt calls. Below is a basic seaborn barplot call with plt.show() to clean up the output. Also note the ability to pass a named color palette instead of individual colors.
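A rough sketch of what such a call can look like, assuming a small made-up DataFrame (the column names day, store and total_sales are placeholders, not from a real dataset):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sales data, for illustration only
sales = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
    "store": ["North", "South", "North", "South", "North", "South"],
    "total_sales": [120, 95, 180, 160, 140, 150],
})

# The sns call draws the bars; palette takes a named palette, not a single color
ax = sns.barplot(data=sales, x="day", y="total_sales", hue="store", palette="Set2")

# plt calls act on the same axes that seaborn just drew on
plt.title("Sales by day and store")
plt.show()  # in a notebook this also suppresses the text repr of the axes object
```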
Seaborn Palettes
Seaborn is well known for its wide array of predefined palettes, which tend to look seamless and pleasing. They come in a wide variety of color spectrums, and this is one of the features that makes seaborn highly customizable. If you take a look at the seaborn documentation, there are arguments you can add to the palette call that will customize your graph even further.
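As a small illustration of those extra arguments, the sketch below uses sns.color_palette with n_colors and desat. The palette names are just examples I picked; any named palette should work the same way.

```python
import seaborn as sns

# Ask for a specific number of colors from a named palette
more_blues = sns.color_palette("Blues", n_colors=8)

# Appending "_r" to a matplotlib colormap name reverses its direction
reversed_blues = sns.color_palette("Blues_r", n_colors=8)

# desat tones the colors down (1.0 would leave them unchanged)
soft_husl = sns.color_palette("husl", n_colors=5, desat=0.6)

# Each palette behaves like a list of (r, g, b) tuples
first_color = more_blues[0]
```

Any of these can be passed straight to the palette argument of a seaborn plotting call.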
A Few Odd Graphs
Histogram
When first learning statistics I could not figure out the difference between a histogram and a bar chart. I think the simplest way to put it is that a histogram shows the frequency of a continuous variable, binned into ranges, and a bar chart shows the frequency of a categorical variable. The graph below was used to check for a normal distribution in housing prices.
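A minimal sketch of a histogram like this, using synthetic right-skewed "prices" generated with NumPy instead of the real housing data:

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, right-skewed prices standing in for real housing data
rng = np.random.default_rng(seed=42)
prices = rng.lognormal(mean=12.9, sigma=0.4, size=1000)

fig, ax = plt.subplots(figsize=(8, 4))

# bins controls how the continuous variable is cut into ranges
counts, bin_edges, _ = ax.hist(prices, bins=30, edgecolor="white")

ax.set_title("Distribution of sale prices (synthetic)")
ax.set_xlabel("Price")
ax.set_ylabel("Number of homes")
```

If the bars lean heavily to one side, as they do here, the variable is skewed rather than normally distributed.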
Bubble Plot
A bubble plot is used to describe three quantitative or continuous variables. I find them particularly useful when a visualization is difficult to understand from two variables alone; a third variable, mapped to bubble size and color, often helps. Look at the bottom of the graph and the color and size of the smaller pink bubbles. They start at the bottom, then curve and fizzle out toward the top. In contrast, the large purple bubbles have a high concentration at the top and scoop downward underneath the majority of the visualization.
Why is this? The answer I came up with is that although homes built before 1940 are smaller, they maintain a higher resale value. Newer homes have far more space but don’t seem to hold their value.
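A bubble plot of this kind can be sketched with plain matplotlib by passing a third variable to both the size (s) and color (c) arguments of scatter. The data below is randomly generated, the variable names are my own placeholders, and it is not meant to reproduce the pattern described above.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic housing data: year built, living area and price
rng = np.random.default_rng(seed=0)
n = 200
yr_built = rng.integers(1900, 2015, size=n)
sqft_living = np.clip(rng.normal(2000, 500, size=n), 500, None)
price = 300_000 + 120 * sqft_living + rng.normal(0, 50_000, size=n)

fig, ax = plt.subplots(figsize=(8, 5))

# x and y carry two variables; size and color both carry the third
bubbles = ax.scatter(
    yr_built,
    price,
    s=sqft_living / 20,   # scale areas down so bubbles stay readable
    c=sqft_living,
    cmap="RdPu",
    alpha=0.6,
)
fig.colorbar(bubbles, ax=ax, label="Living area (sq ft)")

ax.set_xlabel("Year built")
ax.set_ylabel("Price")
```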
Labelling and Refining
The example bar chart that I am using for this section is a segmented bar chart, which is usually best for two categorical variables. In our case a third variable is shown instead. The chart is based on foreign and total worldwide grosses for movies. When the foreign amount is drawn over the worldwide gross, the remaining section of the bar becomes the domestic gross. This results in our three continuous variables: foreign gross, domestic gross, and total worldwide gross. Each has a valuable story to tell.
You can achieve this by drawing both bar charts on the same axes instead of assigning them to different ones. It’s important to note, however, that when you do this there are constraints, such as both bar charts needing to have the same features in order to appease matplotlib.
- Storing the plot in f and ax. Remembering that this is an object-oriented programming language, we can store the figure and its axes in objects that can be manipulated later. Calling subplots in this initial step is helpful not only for creating subplots but also, in the case of this graph, for stacking two graphs on top of each other.
- The color options in seaborn are endless. Not only can you choose from a variety of palettes and colors, but you can also adjust the hue or tone of the colors. Using set_color_codes in this example, I set it to options like “muted” and “pastel”.
- Going back to our ax object, it is very multifunctional. Here we are using it to set the labels for the chart as a whole, but it is also how we specify which axes a plot is drawn on. (See Subplots below.)
- This is a picky detail that I add to almost any visualization whose labels are not small integers. Rotating the labels 45 degrees is one quick and easy way to keep your visuals readable.
- A little bonus: this call saves the image as a png in your working directory (the folder your Jupyter notebook lives in) so it can be used in reports like a README.
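Putting the five steps above together, a sketch might look like the following. The movie titles, gross figures and output file name are invented for illustration, and the original chart may have been built differently.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical grosses in millions of dollars
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "worldwide_gross": [900, 650, 400],
    "foreign_gross": [550, 300, 250],
})

# 1. Store the figure and axes so both bar charts share one set of axes
f, ax = plt.subplots(figsize=(8, 5))

# 2. set_color_codes changes what the shorthand color "b" means
sns.set_color_codes("pastel")
sns.barplot(data=movies, x="title", y="worldwide_gross",
            label="Worldwide", color="b", ax=ax)
sns.set_color_codes("muted")
sns.barplot(data=movies, x="title", y="foreign_gross",
            label="Foreign", color="b", ax=ax)

# 3. Use ax to label the chart as a whole
ax.set(title="Foreign vs. worldwide gross", xlabel="",
       ylabel="Gross (millions USD)")
ax.legend()

# 4. Rotate the tick labels so long titles do not overlap
plt.xticks(rotation=45)

# 5. Save the figure as a png in the current working directory
plt.savefig("movie_grosses.png", bbox_inches="tight")
```

Because the foreign bars are drawn in front of the worldwide bars, the pastel section left visible at the top of each bar is the domestic gross.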
Subplots
Subplots are an effective way to visualize what is occurring in multiple features simultaneously. Let’s quickly look over two different ways to achieve subplots; one in seaborn and one in plain matplotlib. The example below was pulled from a project where I was trying to predict if a customer would accept a bank loan or not. Using a side by side boxplot this is a great example of one categorical and one continuous variable.
The trick to subplots in seaborn is specifying which axes each plot is drawn on. There is no need to prefix the sns call with anything, but inside the list of arguments you need to specify ax=. Remember that Python starts indexing at 0, so a call to [0,0] refers to the first row and first column of the axes grid.
In this second example of subplots, seaborn was not used; it is just plain matplotlib. Notice the difference in how subplots are assigned to their respective positions: plt.subplot(number of rows, number of columns, index). Unlike the axes grid above, the index here starts at 1.
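A sketch of the ax= pattern with side by side boxplots on a 2 x 2 grid. The loan data is randomly generated and the column names are placeholders of my own, not the columns from the original project.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for a bank loan dataset
rng = np.random.default_rng(seed=1)
n = 300
loans = pd.DataFrame({
    "personal_loan": rng.choice(["no", "yes"], size=n),
    "income": rng.normal(70, 25, size=n),
    "age": rng.integers(23, 67, size=n),
    "cc_avg": rng.gamma(2.0, 1.0, size=n),
    "mortgage": rng.gamma(1.5, 60.0, size=n),
})

# axes is a 2 x 2 array; index it as axes[row, column], starting at 0
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))

sns.boxplot(data=loans, x="personal_loan", y="income", ax=axes[0, 0])
sns.boxplot(data=loans, x="personal_loan", y="age", ax=axes[0, 1])
sns.boxplot(data=loans, x="personal_loan", y="cc_avg", ax=axes[1, 0])
sns.boxplot(data=loans, x="personal_loan", y="mortgage", ax=axes[1, 1])

fig.tight_layout()
```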
I want to point out in this particular example that the first argument used is 2 even though we have 1 row. Why? The best answer that I can give is that to get my visuals to look the way I wanted it became a game of balancing the figsize and creating rows and columns out of that fig size. If you were to change the figsize or number of rows here the graph would become distorted. You will have to play around with this yourself and eventually you will devise a system.
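One possible reading of that behavior, sketched with made-up data: plt.subplot(2, 2, 1) and plt.subplot(2, 2, 2) place the plots in the top row of a 2 x 2 grid, so declaring two rows while filling only one leaves each plot with half the figure height. This is my guess at the layout, not the exact code from the project.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic values for two features
rng = np.random.default_rng(seed=2)
income = rng.normal(70, 25, size=300)
age = rng.integers(23, 67, size=300)

fig = plt.figure(figsize=(12, 8))

# plt.subplot(rows, columns, index): index counts from 1, left to right
ax_left = plt.subplot(2, 2, 1)
ax_left.hist(income, bins=20)
ax_left.set_title("Income")

ax_right = plt.subplot(2, 2, 2)
ax_right.hist(age, bins=20)
ax_right.set_title("Age")

# Positions 3 and 4 (the bottom row) are simply left empty
plt.tight_layout()
```

Changing the first argument to 1 would stretch both plots to the full height of the figure, which is the kind of distortion to watch for when adjusting figsize.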
A note on variability of imports.
import matplotlib.pyplot as plt vs from matplotlib import pyplot
One thing that drove me nuts when I was first starting out with visualizations is that sometimes (after going through many error messages) Python would want me to use import matplotlib.pyplot as plt and other times from matplotlib import pyplot.
The answer is that on a basic level all of the ways shown below are interchangeable. [2] Matplotlib is the library and pyplot is the interface. How you import and name them does not matter to Python. You have 5 different methods listed below in your tool kit. Just don’t pull a rookie move like I did and try calling plots under different names in the same notebook. If you define it as plt, use plt. If you define it as pyplot, use pyplot.
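To make the point concrete, here is a short sketch with a few common import styles (not necessarily the same five as in my list). They all bind a name to the same pyplot module, which is why mixing names only causes trouble when a name was never defined.

```python
# Each of these gives access to the same pyplot module under a different name
import matplotlib.pyplot as plt
from matplotlib import pyplot
import matplotlib.pyplot          # binds the name "matplotlib"
import matplotlib as mpl          # pyplot is reachable as mpl.pyplot once imported

# All four names point at one and the same module object
same_module = (
    plt is pyplot
    and pyplot is matplotlib.pyplot
    and matplotlib.pyplot is mpl.pyplot
)

# Whichever name you imported is the one you must keep using
fig_a = plt.figure()
fig_b = pyplot.figure()
```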
Plotly Express
Plotly Express is a graphing library whose main feature (other than producing seamless visuals) is that the visualization becomes interactive. It is built on top of its parent library, Plotly, and heavily reduces the amount of code needed to create visualizations.
Below is the Plotly Express graph that was made to show the weekly grosses of the Broadway show Matilda. In the upper right-hand corner there are icons for actions that would otherwise need code, like zooming in or saving a snapshot of the chart. When you hover the mouse over the data, you get a pop-up with the exact week and what the grosses were. It is extremely user friendly and usable in Google Docs.
Plotly is an extremely robust and complex library. For a more in-depth look at how to use Plotly, please visit here.
Conclusion
Regardless of which library or packages you use to visually interpret your data, producing something that is pleasing to the eye is important. You will always have those quick visualizations that are meant for you personally, to understand the data on a deeper level. On the other side of the coin, you will have stakeholders and other non-technical business partners who need a clear and concise visual of the story you are trying to tell.
Resources
[2] https://www.quora.com/What-does-import-matplotlib-pyplot-as-plt-really-mean