Using R for Exploratory Data Analysis (EDA) — Analyzing Golf Stats
Whether you’re a Data Architect, Data Engineer, Data Analyst, or Data Scientist, we have to work with unfamiliar datasets when starting a new data project. It’s a little like having a blind date with a new dataset. You’ll want to get to know more about it before feeling comfortable.
So, how do we get there? The answer is Exploratory Data Analysis (EDA).
Exploratory Data Analysis is a term for initial analysis and findings done with data sets, usually early on in an analytical process.
As a data professional, we’ll sleep much better having gone through this process. Much time is wasted in future steps if this step is ignored.. re-work is needed to resolve data issues well after architecture foundations and data processing pipelines have been built.
With all the 4 majors done, and Tiger coming off his historic Master’s victory, why not look at some golf statistics. I found high-level stats for 2019 on the espn.com website http://www.espn.com/golf/statistics. I put the contents into a Googlesheet for easy access. As you will soon see, this is a very basic dataset but will allow us to focus on the EDA process.
Below are sample rows of our dataset.
We’ll load the dataset into R using the “googlesheets” library. (Googlesheet filename is “golf_stats_espn” and the sheetname is “2019_stats”).
library(googlesheets)googlesheet <- gs_title(“golf_stats_espn”)
df_2019 <- googlesheet %>% gs_read(ws = “2019_stats”)
EDA Step 1 : Data Validation and Data Quality
The str() function will do a sanity check on the structure and show sample data for each variable.
If you have experience with R, you are probably familiar with the summary() function. It works well, but a more complete function is the skim() function from the “skimr” package. It breaks down the variables by type with relevant summary information, PLUS a small histogram for each numeric variable.
Looks good except for the “AGE” variable. With R and many other analytics tools, data-types are assigned automatically as contents are read in, referred to as “schema-on-read”. For “AGE”, a character-type was assigned to our variable rather than a numeric-type. However, I’d like to treat AGE as a numeric-type to potentially apply numeric functions downstream. Why did R create the AGE variable as a character-type?
Let’s do more EDA and run a table() function on the AGE variable.
The table() function shows distinct values for the variable on the top line, and the number of occurrences in the line below it. For the AGE variable, we can see 1 occurrence of “–”. R had no choice but to define the variable as a “character” type.
We can fix that by using the “DPLYR” package. DPLYR specializes in “data wrangling”. You can efficiently accomplish a lot with this package — dataframe manipulation, transformations, filtering, aggregations, etc.
The R command below creates a new dataframe after performing the actions described by the bullets.
df_2019_filtered <- df_2019 %>% # create new dataframe
mutate(AGE_numeric = !(is.na(as.numeric(AGE)))) %>%
filter(AGE_numeric == TRUE) %>%
mutate(AGE = as.numeric(AGE))
- mutate → creates a new boolean variable identifying whether it’s value is numeric
- filter → uses the new boolean variable created in the mutate line above it to filter off non-numeric
- mutate → replaces the “AGE” variable; now defined as a numeric variable
Below is our new dataframe.
One more adjustment, I’m going to rename the column “RK” to a better name.
df_2019_filtered <- rename(df_2019_filtered, “RANK_DRV_ACC” = “RK”)
Let’s pause here for a minute.
Handling Dirty Data
- For this article, the missing AGE rows were filtered out. In a real analytics project, we’ll have to look at the best course of action to take (filter the row, replace the character data, replace with NULL, etc).
Exploratory Data Analysis (EDA) — Part 2
With our dataset examined and cleaned…
Part 2 leans more toward Data Analysts and Data Scientists. You may be surprised at the insights that can be derived during this phase, even on this very basic dataset.
We’ll use the “DataExplorer” library to learn more about our dataset. The plot_histogram() function will return a separate bar chart for each of our numeric variables. It shows the frequency (number of occurrences) for each value in the variable.
For example, the “GREENS_REG” variable contains values roughly between 55 and 75. Based on the bar chart, we see most golfers are hitting Greens about 65–70% of the time.
The boxplot (box and whisker diagram) displays the distribution of data for a variable. The box shows us a “five number summary” — minimum, first quartile, median, third quartile, and maximum.
The plot_boxplot() function below created 5 bins/partitions. We’ll focus first on the Yards-per-Drive (YDS_DRIVE) variable.
plot_boxplot(df_2019_filtered, by = “YDS_DRIVE”)
We can see some very interesting correlations look at the “AGE” variable compared with the “YDS per DRIVE”.
- The older guys don’t hit as far.
- There is one outlier. Someone in their mid-50’s is still hitting it quite far compared to the others in that age group.
Next, let’s do another boxplot from the “AGE” perspective.
plot_boxplot(df_2019_filtered, by = “AGE”)
- The oldest group (48–55) in the upper left corner has a very low “Driving Accuracy” (DRIVING_ACC). I would expect the older players to hit the ball shorter, but more accurate… that is not true, the data does not lie.
- The older group (48–55) also struggles with putting. They have the highest average putts per hole (PUTT_AVG).
Exploratory Data Analysis (EDA) — Part 3
Let’s take this analysis to another level.
The functionality in this library gets closer to where a Data Scientist spends time.
The “ggcorrplot” provides us with a “heatmap” showing the significance (or lack of significance) between the relationships. A human cannot possibly stare at a spreadsheet and determine patterns/relationships between columns and rows of data.
Let’s put the “ggcorrplot” library to work. Sometimes it’s all about working smarter, not harder!
library(ggcorrplot)ggcorrplot(corr, type = “lower”, outline.col = “black”,
ggtheme = ggplot2::theme_gray,
colors = c(“#6D9EC1”, “white”, “#E46726”))
1 = highly related ; 0 = no relationship ; -1 = inverse relationship
Running that function was much easier than staring at a spreadsheet, attempting to see relationships between rows and columns!
1 . The most significant relationship between our variables is “Yards per Drive” and “Greens Hit in Regulation”.
2 . On the opposite side, the most significant inverse relationship exists between “Age” and “Yard per Drive”.
If you like circles better than squares, below is the same data using the circle method.
ggcorrplot(corr, type = “lower”, outline.col = “black”,
ggtheme = ggplot2::theme_gray,
colors = c(“#6D9EC1”, “white”, “#E46726”))
Although the EDA process is critically important, it is only the beginning of a typical data analytics project lifecycle. Most likely an organization’s valuable data will not be coming from a googlesheet, but will more likely be buried in disparate databases, or coming from a 3rd party vendor, or possibly even an IoT data stream.
Leveraging R and the EDA process will help pave the way toward a successful analytics project.
Jeff Griesemer is a Sr. Analytics Developer at OCTANE.