Data Science How-To: Using Apache Spark for Sports Analytics

February 1, 2017 Chris Rawles

Apache Spark has become a common tool in the data scientist’s toolbox, and in this post we show how to use the recently released Spark 2.1 to analyze data from the National Basketball Association (NBA). All code and examples from this blog post are available on GitHub.

Analytics have become a major tool in the sports world, and in the NBA in particular they have shaped how the sport is played. The league has skewed towards taking more 3-point shots because of their high efficiency, as measured by points per field goal attempt. In this post we evaluate and analyze this trend using season statistics going back to 1979 along with geospatial shot chart data. The concepts covered here -- data cleansing, visualization, and modeling in Spark -- are general data science techniques and apply to many tasks beyond sports data. The post concludes with the author’s general impressions about using Spark and with tips and suggestions for new users.

For the analyses, we use Python 3 with the Spark Python API (PySpark) to create and analyze Spark DataFrames. In addition, we utilize both the Spark DataFrame’s domain-specific language (DSL) and Spark SQL to cleanse and visualize the season data, finally building a simple linear regression model using the spark.ml package -- Spark’s now primary machine learning API.

Finally, we note that the analysis in this tutorial can be run either on a distributed Spark setup on a cloud service such as Amazon Web Services (AWS) or on a Spark instance running on a local machine. We have tested both and include resources for getting started on either AWS or a local machine at the end of this post.

The Code

Using data from Basketball Reference, we read the season total stats for every player since the 1979-80 season into a Spark DataFrame using PySpark. DataFrames are designed to ease processing of large amounts of structured tabular data on the Spark infrastructure and, as of Spark 2.0, are simply a type alias for a Dataset of Row in the Scala and Java APIs.
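
A minimal sketch of the load step, assuming the season totals live in a local CSV file with a header row (the file path is illustrative):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession and read the season totals, inferring
# column types from the data. The file path is an assumption.
spark = SparkSession.builder.appName('nba-analytics').getOrCreate()
df = spark.read.csv('data/season_totals.csv', header=True, inferSchema=True)
df.cache()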

We can also view the column names of our DataFrame:

print(df.columns)
['_c0', 'player', 'pos', 'age', 'team_id', 'g', 'gs', 'mp', 'fg', 'fga', 'fg_pct', 'fg3', 'fg3a', 'fg3_pct', 'fg2', 'fg2a', 'fg2_pct', 'efg_pct', 'ft', 'fta', 'ft_pct', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts', 'yr']

Using our DataFrame, we can view the top 10 players, sorted by the number of points in an individual season. Notice that we use the toPandas function to retrieve our results; it displays more cleanly than the output of the take function.

df.orderBy('pts', ascending=False).limit(10).toPandas()[['yr','player','age','pts','fg3']]

yr    player          age  pts   fg3
1987  Jordan,Michael  23   3041  12
1988  Jordan,Michael  24   2868  7
2006  Bryant,Kobe     27   2832  180
1990  Jordan,Michael  26   2753  92
1989  Jordan,Michael  25   2633  27
2014  Durant,Kevin    25   2593  192
1980  Gervin,George   27   2585  32
1991  Jordan,Michael  27   2580  29
1982  Gervin,George   29   2551  10
1993  Jordan,Michael  29   2541  81

Next, using the DataFrame domain-specific language (DSL), we can analyze the average number of 3-point attempts for each season, scaled to the industry-standard rate per 36 minutes (fg3a_p36m). The per-36-minutes metric projects a given player’s stats to 36 minutes of playing time, roughly a full NBA game with adequate rest, and allows comparison across players who play different numbers of minutes.

We compute this metric using the number of 3-point field goal attempts (fg3a) and minutes played (mp).
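
A sketch of the aggregation with the DataFrame DSL. The column aliases fga_pm and fg3a_pm match the preview shown later; the exact weighting (here, total attempts over total minutes, times 36) is an assumption about the original notebook:

from pyspark.sql import functions as F

# League-wide field goal and 3-point attempts per 36 minutes, by season.
fga_py = (df.groupBy('yr')
            .agg((F.sum('fga') / F.sum('mp') * 36).alias('fga_pm'),
                 (F.sum('fg3a') / F.sum('mp') * 36).alias('fg3a_pm'))
            .orderBy('yr'))
fga_py.show(5)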

Alternatively, we can utilize Spark SQL to perform the same query using SQL syntax:
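
The equivalent query in Spark SQL, after registering the DataFrame as a temporary view (the view name is illustrative):

df.createOrReplaceTempView('player_season_totals')

fga_sql = spark.sql("""
    SELECT yr,
           SUM(fga)  / SUM(mp) * 36 AS fga_pm,
           SUM(fg3a) / SUM(mp) * 36 AS fg3a_pm
    FROM player_season_totals
    GROUP BY yr
    ORDER BY yr
""")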

Now that we have aggregated our data and computed the average attempts per 36 minutes for each season, we can query our results into a Pandas DataFrame and plot it using matplotlib.
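
A sketch of collecting the small aggregated result to the driver and plotting it (styling choices are illustrative):

import matplotlib.pyplot as plt

pdf = fga_py.toPandas()   # the per-season aggregate is tiny, so collecting it is cheap

plt.plot(pdf['yr'], pdf['fg3a_pm'], marker='o')
plt.xlabel('Season')
plt.ylabel('3-point attempts per 36 minutes')
plt.show()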

We can see a steady rise in the number of 3-point attempts since the shot's introduction in the 1979-80 season, along with a temporary bump in attempts during the mid-1990s, when the NBA moved the 3-point line in.

We can fit a linear regression model to this curve to project the number of attempts over the next 5 years. Of course, this assumes the rate of increase is linear, which is likely a naive assumption.

First, we must transform our data with the VectorAssembler into a single column in which each row of the DataFrame contains a feature vector; this is a requirement for the linear regression API in MLlib. We build the transformer on our single variable `yr` and use it to transform our aggregated season data.
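
A minimal sketch of the transform, assuming the aggregated fga_py DataFrame built above:

from pyspark.ml.feature import VectorAssembler
from pyspark.sql import functions as F

# Pack the single predictor `yr` into a `features` vector column and use the
# per-36-minute attempt rate as the regression label.
assembler = VectorAssembler(inputCols=['yr'], outputCol='features')
training = (assembler.transform(fga_py)
                     .withColumn('label', F.col('fg3a_pm')))
training.show(5)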

We then build our linear regression model object using our transformed data, a preview of which is shown below.

yr    fga_pm       fg3a_pm       features  label
1980  13.49321407  0.410089262   [1980.0]  0.410089262
1981  13.15346947  0.3093759891  [1981.0]  0.3093759891
1982  13.20229631  0.3415114296  [1982.0]  0.3415114296
1983  13.30541336  0.3314785517  [1983.0]  0.3314785517
1984  13.14301635  0.3571099981  [1984.0]  0.3571099981
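
With the transformed data in place, a minimal fit with the spark.ml LinearRegression estimator looks like this (default hyperparameters; a sketch rather than the exact notebook code):

from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol='features', labelCol='label')
model = lr.fit(training)

print(model.coefficients, model.intercept)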

Next, we want to apply our trained model to our original training set along with 5 years of future data. We build a new DataFrame covering this time period, transform it to include a feature vector, and then apply our model to make predictions.

We can then plot our results:
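
A sketch of the extrapolation and plot, reusing the pdf collected earlier; the year range (assuming the season totals run through 2015-16, plus five future seasons) and the plotting details are illustrative:

# Seasons 1980 through 2021: the training range plus five future seasons.
years = spark.createDataFrame([(y,) for y in range(1980, 2022)], ['yr'])
preds = model.transform(assembler.transform(years)).toPandas()

plt.plot(pdf['yr'], pdf['fg3a_pm'], marker='o', label='observed')
plt.plot(preds['yr'], preds['prediction'], label='linear fit')
plt.xlabel('Season')
plt.ylabel('3-point attempts per 36 minutes')
plt.legend()
plt.show()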

Analyzing Geospatial Shot Chart Data

In addition to season total data, we process and analyze NBA shot charts to view the impact the 3-point revolution has had on shot selection. The shot chart data was originally sourced from NBA.com.

The shot chart data contains the xy coordinates of field goal attempts on the court for individual players, the game date, the time of the shot, shot distance, a shot-made flag, and other fields. We have compiled all individual seasons in which a player attempted at least 1,000 field goals, from the 2010-11 season through the 2015-16 season.

As before, we read the CSV data into a Spark DataFrame.
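
A sketch of the load, assuming the compiled shot charts live in a single CSV file (the path is an assumption):

# Reuse the same SparkSession as before.
df = spark.read.csv('data/shot_charts.csv', header=True, inferSchema=True)
df.cache()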

We preview the data.

df.orderBy('game_date').limit(10).toPandas()[['yr','name','game_date','shot_distance','x','y','shot_made_flag']]

yr    name               game_date   shot_distance  x    y    shot_made_flag
2011  LaMarcus Aldridge  2010-10-26  1              4    11   0
2011  Paul Pierce        2010-10-26  25             67   246  1
2011  Paul Pierce        2010-10-26  18             165  83   0
2011  Paul Pierce        2010-10-26  24             159  186  0
2011  Paul Pierce        2010-10-26  24             198  148  1
2011  Paul Pierce        2010-10-26  23             231  4    1
2011  Paul Pierce        2010-10-26  1              -7   9    0
2011  Paul Pierce        2010-10-26  0              -2   -5   1
2011  LaMarcus Aldridge  2010-10-26  21             39   211  0
2011  LaMarcus Aldridge  2010-10-26  8              -82  23   0

We can query an individual player and season and visualize their shot locations. We built a plotting function, plot_shot_chart (see the GitHub repo), that is based on Savvas Tjortjoglou's example.

As an example, we query and visualize Steph Curry's 2015-16 historic shooting season using a hexbin plot, which is a two-dimensional histogram with hexagonal-shaped bins.
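
A sketch of the query and plot; the player-name spelling in the data and the plot_shot_chart signature are assumptions (the helper itself lives in the repo):

# Pull Curry's 2015-16 attempts to the driver and draw a hexbin shot chart.
curry = df.filter((df.name == 'Stephen Curry') & (df.yr == 2016)).toPandas()
plot_shot_chart(curry.x, curry.y, kind='hex', title='Stephen Curry, 2015-16')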

The shot chart data is rich in information, but it does not specify whether a shot is a 3-point attempt or, more specifically, a corner 3. We solve this by building user-defined functions (UDFs) that identify the shot type from the xy coordinates of the attempt.

Here we define our shot-labeling functions as standard Python functions built on NumPy routines.
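
A sketch of the labeling logic, assuming NBA.com shot-chart conventions: coordinates in tenths of feet, a corner 3-point line 22 feet from the basket, and an above-the-break arc at 23.75 feet. The exact thresholds used in the repo may differ.

import numpy as np

def is_corner_three(x, y):
    """True if the attempt is a corner three: beyond +/- 22 ft, below the arc break."""
    return abs(x) >= 220 and y <= 92.5

def is_three(x, y):
    """True if the (x, y) location is any kind of 3-point attempt."""
    return is_corner_three(x, y) or float(np.hypot(x, y)) >= 237.5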

We then register our UDFs and apply each UDF to the entire dataset to classify each shot type:
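
A sketch of the registration and application step; the flag column names are illustrative:

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

is_three_udf = udf(is_three, BooleanType())
is_corner_three_udf = udf(is_corner_three, BooleanType())

df = (df.withColumn('is_three', is_three_udf('x', 'y'))
        .withColumn('is_corner_three', is_corner_three_udf('x', 'y')))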

We can visualize the change in shot selection over the past six seasons using all of our data from the 2010-11 season through the 2015-16 season. For visualization purposes, we exclude all shot attempts taken inside of 8 feet, as we want to focus on the midrange and 3-point shots.
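
The original figure is a set of shot charts by season; as a simpler stand-in, one way to summarize the same shift is the share of attempts from 8 feet and beyond that are threes, by season:

from pyspark.sql import functions as F

by_yr = (df.filter(F.col('shot_distance') >= 8)
           .groupBy('yr')
           .agg(F.avg(F.col('is_three').cast('double')).alias('pct_three'))
           .orderBy('yr'))
by_yr.show()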

Over the years, there is a notable trend towards more three-pointers and fewer midrange shots.

Finally, we evaluate shot efficiency as a function of shot distance.
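
A sketch of one reasonable efficiency measure: points per attempt at each shot distance, using the made flag and the 3-point label to weight made shots by their point value.

from pyspark.sql import functions as F

efficiency = (df.withColumn('points',
                            F.col('shot_made_flag') *
                            F.when(F.col('is_three'), 3).otherwise(2))
                .groupBy('shot_distance')
                .agg(F.avg('points').alias('points_per_attempt'),
                     F.count(F.lit(1)).alias('attempts'))
                .orderBy('shot_distance'))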

We then plot our results.
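
Continuing the sketch above, we collect the per-distance aggregate and plot it:

import matplotlib.pyplot as plt

eff_pdf = efficiency.toPandas()
plt.plot(eff_pdf['shot_distance'], eff_pdf['points_per_attempt'])
plt.xlabel('Shot distance (ft)')
plt.ylabel('Points per attempt')
plt.show()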

Among the top scorers in the league, close 3-point attempts are among the most efficient shots available, on par with shots taken near the basket. It's no wonder that accurate 3-point shooting is among the most coveted talents in the NBA today!

Conclusion

Lastly, as a seasoned data scientist, SQL user, and Python junkie, here are my two cents on getting started with Spark. The Spark ecosystem and documentation are continually evolving, and it is important to use the newest Spark version. A first-time user will notice there are multiple ways to solve a problem using different languages (Scala, Java, Python, R), different APIs (Resilient Distributed Dataset (RDD), Dataset, DataFrame), and different data manipulation routines (DataFrame DSL, Spark SQL). Many of these choices are up to the user, while others are guided by the documentation. Since Spark 2.0, for example, the DataFrame is the primary Spark API for Python and R users (rather than the original, and still useful, RDD). In addition, the DataFrame-based spark.ml package is now the primary machine learning API in Spark, replacing the RDD-based API. Bottom line: the platform is evolving, and it pays to stay up to date.

In this post, we’ve demonstrated how to use Apache Spark to accomplish key data science tasks including data exploration, visualization, and model building. These principles are applicable to other data science tasks and datasets, and we encourage you to check out the repository and try it on your own!

Additional Resources

About the Author

Chris Rawles

Chris Rawles is a senior data scientist at Pivotal in New York, New York, where he works with customers across a variety of domains, building models to derive insight and business value from their data. He holds an MS and a BA in geophysics from UW-Madison and UC Berkeley, respectively. During his time as a researcher, Chris focused his efforts on using machine learning to enable research in seismology.
