Curve Your Enthusiasm: Baseball Analysis

Of all the vast amounts of data available for public use and analysis, I hardly expected a baseball dataset to be the one I’d be excited to work on daily.

So far I’ve built a machine learning model and an algorithm, but what I haven’t done is take an analysis and present findings from the analysis alone. That was my goal with the curveball dataset I’ve been working on.

The initial goal was to show the efficiency of curveballs. It then evolved into showing that efficiency by comparison to fastballs.

It has honestly been a lot of fun digging around, finding other ways to show efficiency and comparisons.

I grew up in St. Louis, which is one of the biggest baseball cities in the US. Despite that, it never really caught on for me: I was never a superfan, and it was just a passing interest. When I saw the sheer volume and variety of data available on baseballsavant.com, though, I was both floored by the enormity of it and excited just to dig in.

One of my favorite analyses to do is time series, and my favorite EDA technique is the heatmap, but I don’t think I’ll ever have a preferred field or industry for data, because what it always comes down to for me is the types of insights I can draw from the data available.

That being said, in the beginning I thought I needed to build my data from scratch every time (i.e., creating my own, a la the grocery project), so it was an eye opener to realize there’s all this readily available data just sitting there, waiting for you to get into it.

The goal, again, is to show the efficiency of curveballs, and now curveballs versus fastballs, which are the more prevalent type of pitch. I’ve approached the “efficiency” question by focusing specifically on the “whiffs” category, where a whiff is a swing and a miss. I’m having a hard time figuring out which kind of visualization to use, but so far it’s interesting to see the findings that come out of this.

One finding is that where the curveball dataset has ~369 entries, the fastball dataset has ~640, roughly 1.7 times as many.

In some of the treemaps I made, you can see, despite the overplotting, that a chunk of pitchers throw curveballs readily and account for a good share of those pitches. I assume the group is limited because of how difficult a curveball is to throw, but I’m not sure. I’d love to dig through and figure that out, too.

In the pairplot I did of both datasets, the fastball whiffs are a lot more varied than the curveball ones.

I think I’m still hung up on which type of pitch IS more effective, at the end of the day, and how to show that. If I just looked at the ratio of whiffs to total pitches, which is what I’m currently doing, I’m not sure I’d get the whole picture.
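As a rough sketch of that whiffs-to-total-pitches comparison, the snippet below pools the counts for each pitch type and divides. The rows, the `pitches` and `whiffs` field names, and every number are made up for illustration; they are assumptions, not values from the real dataset. It uses plain Python so it runs anywhere, though the same idea carries over to a DataFrame.

```python
# Hypothetical rows, one per pitcher, with made-up counts (NOT real data).
curveballs = [
    {"pitcher": "A", "pitches": 400, "whiffs": 60},
    {"pitcher": "B", "pitches": 250, "whiffs": 30},
]
fastballs = [
    {"pitcher": "A", "pitches": 900, "whiffs": 90},
    {"pitcher": "C", "pitches": 1100, "whiffs": 100},
]


def whiff_rate(rows):
    """Pooled whiffs divided by pooled pitches for one pitch type."""
    total_pitches = sum(r["pitches"] for r in rows)
    total_whiffs = sum(r["whiffs"] for r in rows)
    # Guard against an empty dataset so we never divide by zero.
    return total_whiffs / total_pitches if total_pitches else 0.0


curve_rate = whiff_rate(curveballs)
fast_rate = whiff_rate(fastballs)
print(f"curveball whiff rate: {curve_rate:.3f}")
print(f"fastball whiff rate:  {fast_rate:.3f}")
```

Pooling the counts keeps a pitcher with very few pitches from skewing the result the way an average of per-pitcher rates would. Dividing by swings instead of total pitches is another denominator that may be worth trying, since a pitch nobody swings at can never be a whiff.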

There are also hits, which I haven’t really focused on much at all. Hits vs whiffs is a good one to look into, too. I’ll have to poke at it some more, but I think I’m on to something with that. How often do players hit curveballs vs fastballs, versus swinging and missing them?
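One possible way to frame the hits-vs-whiffs question is as a simple ratio: how many whiffs a pitch type gets for every hit it gives up. The sketch below assumes per-pitch-type totals are already available; the `whiffs_per_hit` helper, the field names, and all of the numbers are hypothetical and only there to show the shape of the calculation.

```python
def whiffs_per_hit(whiffs, hits):
    """Number of swings-and-misses for every hit; None when there are no hits."""
    return whiffs / hits if hits else None


# Made-up totals for illustration only (NOT real data).
totals = {
    "curveball": {"whiffs": 90, "hits": 45},
    "fastball": {"whiffs": 190, "hits": 152},
}

ratios = {
    pitch: whiffs_per_hit(t["whiffs"], t["hits"]) for pitch, t in totals.items()
}

for pitch, ratio in ratios.items():
    print(f"{pitch}: {ratio:.2f} whiffs per hit")
```

A higher ratio would suggest batters miss that pitch more often than they connect with it, which is one way to put the two pitch types on the same scale even though the datasets differ in size.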

You can view my analysis so far in Jupyter notebooks, as well as the data I’m using, at this GitHub link! Please don’t hesitate to let me know any thoughts you have on this. I love hearing from other people: getting more insight and different perspectives is such a privilege.

Next week, I think I’ll share an update on my grocery receipt project, since there’s been progress. Until then, thank you for reading, and I hope all of your data cleaning is swift and simple!
