Exploratory Data Analysis, or EDA for short, establishes the initial relationships between variables and features. Once the data has been thoroughly cleaned, EDA offers the first real insight into what the data can tell us. It is also often how we close out a project: it supplies a small arsenal of visuals that strip the jargon away from the data and make the results friendly to stakeholders. This blog post will:
- Cover the history of EDA.
- Explain the purpose of EDA.
- Give examples of EDA extracted from actual data science projects.
- Address labelling and tidy graphs.
History of EDA
The concept of Exploratory Data Analysis is credited to John Tukey, whose 1977 book coined the term and used it as its title, Exploratory Data Analysis. According to the NIST Engineering Statistics Handbook, Tukey went on to publish several other technical books on the use and implementation of EDA. NIST, the National Institute of Standards and Technology, is a globally recognized organization.
The Handbook makes it clear that EDA is not the same as statistical graphics and that the terms should not be used interchangeably. Statistical graphics is a collection of individual graphical techniques, each focused on one aspect of characterizing the data, whereas EDA is a broader approach to analyzing data.
Understanding the purpose and benefits of EDA
According to our trusty handbook, EDA has seven main benefits:
- Uncover underlying structure.
- Extract important variables.
- Detect outliers and anomalies.
- Test underlying assumptions.
- Develop parsimonious models.
- Determine optimal factor settings.
- Maximize insight into a data set.
More plainly, a computer can run complicated algorithms over large data sets, performing linear algebra at a scale that is hard for us to picture in our heads. EDA gives us a way to visually study very large amounts of data and understand its structure. Think of EDA as a philosophy for your initial approach to working with data.
Examples of EDA in use.
Uncover underlying structure.
Visualizing the underlying structure of our data helps us understand noise and underlying trends, such as seasonality in a time series model.
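As a minimal sketch of how trend and seasonality can be pulled apart with pandas alone, assume a monthly series. The data here is synthetic, and the names make_monthly_series and split_trend_and_seasonality are illustrative, not taken from the original project.

```python
import numpy as np
import pandas as pd


def make_monthly_series(years: int = 4, seed: int = 0) -> pd.Series:
    """Build a synthetic monthly series: upward trend + yearly cycle + noise."""
    rng = np.random.default_rng(seed)
    index = pd.date_range("2015-01-01", periods=12 * years, freq="MS")
    t = np.arange(len(index))
    values = (
        100
        + 0.5 * t                           # slow upward trend
        + 10 * np.sin(2 * np.pi * t / 12)   # cycle that repeats every 12 months
        + rng.normal(0, 1, len(index))      # random noise
    )
    return pd.Series(values, index=index, name="value")


def split_trend_and_seasonality(series: pd.Series, period: int = 12) -> pd.DataFrame:
    """Separate a rough trend from a rough seasonal pattern."""
    # A centered rolling mean over one full period smooths the yearly cycle away.
    trend = series.rolling(window=period, center=True).mean()
    detrended = series - trend
    # Averaging the detrended values per calendar month estimates the seasonal effect.
    seasonal = detrended.groupby(detrended.index.month).transform("mean")
    return pd.DataFrame({"observed": series, "trend": trend, "seasonal": seasonal})


parts = split_trend_and_seasonality(make_monthly_series())
print(parts.head(12))
```

Calling parts.plot(subplots=True) then draws one panel per component, which makes the repeating cycle easy to spot next to the trend.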
Extract important variables.
Building on the previous concept: now that the elements that cause noise have been visualized, we can decide what to remove from the model in order to get a more accurate prediction. In this case we extract something because we do not want it, but the same idea applies to extracting needed information that surfaced during EDA.
Detect outliers and anomalies.
In data analytics, anomalies are outliers that are not false data but deviate heavily from the normal range of business data.
As circled above, we can look at this very rough set of histograms and immediately see that something is off with the ‘bedrooms’ column. It is generally accepted that most homes have between 2 and 4 bedrooms, give or take. A home with more than 20 bedrooms is an anomaly. When we looked at the data further, we saw that the maximum number of bedrooms was 32. Although this is probably a mansion and a completely real property, it is not representative of the rest of the data, so it does not belong in our analysis.
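One common way to back up what the histogram shows is the interquartile range (IQR) rule. The sketch below uses made-up bedroom counts, not the project's data. The helper name flag_iqr_outliers and the 1.5 multiplier are assumptions chosen for illustration.

```python
import pandas as pd


def flag_iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value falls outside the IQR fences."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Made-up bedroom counts for illustration only; 32 mimics the extreme listing.
df = pd.DataFrame({"bedrooms": [2, 3, 3, 4, 3, 2, 4, 3, 5, 3, 32]})

print(df["bedrooms"].describe())        # the max stands out next to the quartiles
mask = flag_iqr_outliers(df["bedrooms"])
outliers = df[mask]                     # rows to inspect by hand
trimmed = df[~mask]                     # rows kept for the analysis
print(outliers)
```

Whether a flagged row should actually be dropped is still a judgment call. The mask only points at the rows worth a second look.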
Test underlying assumptions.
When asked which features of a person (as recorded by the bank) influence the decision to accept a personal loan offer, I had my assumptions. Testing them with a model helped identify a tailored target audience for the bank.
My original assumption was that mortgage holders would be common among people accepting a personal loan, the idea being that mortgage holders have expenses such as remodeling. After looking at the feature importances, ordered from most to least influential, we can see this is not the case.
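A ranked chart like that can be sketched as follows. For a fitted tree-based scikit-learn model, the values would typically come from its feature_importances_ attribute. Here the feature names and numbers are placeholders, not results from the bank loan project, and plot_feature_importances is a hypothetical helper.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_feature_importances(names, importances, title="Feature importances"):
    """Draw a horizontal bar chart with the most influential feature on top."""
    ranked = pd.Series(importances, index=names).sort_values(ascending=True)
    fig, ax = plt.subplots(figsize=(8, 4))
    # barh draws the first row at the bottom, so ascending order puts the largest on top.
    ax.barh(ranked.index, ranked.values, color="steelblue")
    ax.set_xlabel("Importance")
    ax.set_ylabel("Feature")
    ax.set_title(title)
    fig.tight_layout()
    # Return the ranking from most to least influential for easy reading.
    return ranked.sort_values(ascending=False), ax


# Placeholder names and values, not results from the bank loan project.
ranked, ax = plot_feature_importances(
    ["feature_a", "feature_b", "feature_c"], [0.2, 0.7, 0.1]
)
print(ranked)
```

Sorting before plotting is what lets a reader check an assumption at a glance: the feature you expected to matter is either near the top or it is not.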
Develop parsimonious models.
To be parsimonious literally means to be frugal or stingy. Applied to data science, a parsimonious model is one with concise code and minimal parameter settings but optimal output.
How do we define a path to that with EDA?
Below is an extract from the time series analysis mentioned earlier. The code outlines a way to present the graph based on the result of the Dickey-Fuller test, which checks whether a time series is stationary. Wrapped this way, a single line of code can repeatedly render updated or modified versions of our model, including color coding and labeling.
Determine optimal factor settings.
What do we mean when we say factor? Factors are the variables that scientists control during an experiment in order to determine their effect on the response variable. In this case the factors are the algorithms and models that we as data scientists use on our test and train groups in order to test their accuracy. Visual representations of these choices help explain our decisions to less technical audiences.
Below is an image from a decision tree model that I built for the bank loan project. A literal diagram of a tree provides insight into problems in the model such as overfitting.
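To make the idea of a factor setting concrete, the sketch below varies one setting, the tree's max_depth, and compares train and test accuracy. It assumes scikit-learn is installed and uses synthetic data from make_classification as a stand-in, since the bank loan data is not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y) to invite overfitting.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for depth in (2, 4, 8, None):   # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(
        f"max_depth={depth}: train={scores[depth][0]:.2f}, "
        f"test={scores[depth][1]:.2f}, leaves={tree.get_n_leaves()}"
    )
```

A wide gap between the train and test columns is the numeric counterpart of a sprawling tree diagram. To draw the diagram itself, sklearn.tree.plot_tree(tree, max_depth=2, filled=True) renders the top levels of a fitted tree.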
Maximize insight into a data set.
This benefit sums up everything discussed in the previous components and is the best one-sentence umbrella explanation for why we do EDA.
Many times we are speaking to a nontechnical audience. Even if our audience is versed in coding, they may not be versed in data science. Images provide a concise explanation of more verbose concepts. For stakeholders and nontechnical team members, this is how we express ourselves without going through the nitty-gritty of algorithms and parameter tuning.
Addressing labels, color, and general tidiness.
Putting in the extra effort to make sure that a graph or chart is tidy and pleasing to the eye can sometimes take longer than running the function that produced it. It is meticulous work and requires close attention to detail.
You will notice that all of the graphs and visualizations in this post, aside from the histogram chart under ‘Detect outliers and anomalies’, have labels and colors that are pleasing to the eye. Bar charts flow in a quantitative order. Spacing between labeling and data points is clean and even.
Note that said set of histograms is sloppy. It comes from a single line of code: df.hist(). This is almost always one of my go-to tools, so why am I bashing it? Simply because, as analysts, we can look at that output and immediately have insight and intuition about what to do with the data and how to start visualizing it. It takes seconds to type and run but offers so much. However, it is crude and unpolished. The labeling bleeds, the charts are squished, and the x and y ticks are haphazard and offer little insight.
Proceed with caution is all I will say. Sometimes it is nice to have the quick df.hist() to graze and gain quick perspective. It is also nice to always have a few well-polished graphs that represent your project well and offer insight into your work ethic.
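For contrast, here is one way the same quick look could be tidied up with matplotlib. It is a sketch under assumptions: tidy_hist is a hypothetical helper, the column names and values are randomly generated, and the figure sizes are a matter of taste.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def tidy_hist(df, bins=20, ncols=2, color="steelblue"):
    """Draw one labeled histogram per numeric column on an evenly spaced grid."""
    columns = df.select_dtypes(include="number").columns
    nrows = -(-len(columns) // ncols)     # ceiling division
    fig, axes = plt.subplots(
        nrows, ncols, figsize=(5 * ncols, 3.5 * nrows), squeeze=False
    )
    flat_axes = axes.ravel()
    for ax, col in zip(flat_axes, columns):
        ax.hist(df[col].dropna(), bins=bins, color=color, edgecolor="white")
        ax.set_title(col)
        ax.set_xlabel(col)
        ax.set_ylabel("Count")
    for ax in flat_axes[len(columns):]:   # hide grid cells that have no column
        ax.set_visible(False)
    fig.tight_layout()                    # keeps titles and tick labels from overlapping
    return fig, axes


# Randomly generated demo data; the column names are made up for illustration.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "price": rng.normal(300, 50, 500),
    "sqft": rng.normal(1800, 400, 500),
    "bedrooms": rng.integers(1, 6, 500),
})
fig, axes = tidy_hist(demo)
```

The quick df.hist() call still earns its place for a first pass. A helper like this is for the version of the chart that other people will see.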