The Art of EDA (Exploratory Data Analysis)

The Idea of EDA

Originally, I thought exploratory data analysis (EDA) was the standard way of providing graphs and numerical statistics to back up an original hypothesis. After a project and some further reading, my view has definitely changed. EDA is actually explorative. This is one of the main ways that data scientists fit the statistics to the data rather than fit the data to the statistics. When I was working on my project I realized how open ended the statistics could be for my API data; I was allowed free range in how I wanted to present the data. So I let the data talk for itself. John W. Tukey had a similar mindset as he beileved that there was too much confirmatory data analysis². My overall goal in completing EDA is to figure out all the different relationships my data results have; to uncover all of what is hidden in my data. Having so much free range in EDA can be daunting because as much as you want the data to speak for itself, you want it still to be meaningful. To do so, a great strategy plan is required.

My Strategy

My strategy going into my project was to put the data through some graphs and see any patterns or useful information I could point out. I would have some numerical summary tables and present those in graphical visuals to show off any outliers or patterns. This was a primitive strategy to tackle EDA, which showed me that I have a long way before I could consider myself skilled at EDA. I think the most important strategy is to know your data before the analysis; you should know what the columns mean, what type of data it is, and where your gaps in data are. Some of the things I look for in my own EDA process is NA values, dsitribution patterns, outliers, and relationships between variables. This Shopify walkthrough shows a far more organized strategy that I will be implementing in my next EDA project.

The plan is to get a better understanding of the data before starting EDA, and no this is not simply understanding the statistical items to use. First step would be to examine the columns of the database and know what each one stands for and maybe how it was collected. If I know how the data was collected then I might be able to explain missing or NA values in the dataset. So after understanding the meaning of the columns I would turn my attention to the missing values within the dataset. Even data with a lot of missing values can be important to keep in the EDA¹. After touching base on the missing values, then I would add in some numeric and graphical presentations. This step is when the data starts to speak for itself. Outliers really start to show themselves on graphs. Once my data is graphed or has numerical summaries I can begin to apply more statistics to evaluate relationships and patterns. After that, my EDA process would be complete and I can look forward to presenting my findings!

Useful Links

1. Shopify Technique

2. Wikipedia Breakdown

3. Engineering Standard Handbook Definition

4. IBM Definition

Project 2 A Vignette

Variable Selection For Regression