The National Football League (NFL) is increasingly turning to Amazon Web Services' SageMaker machine learning toolkit to create new, more complex statistics to share with fans in real time, and with the teams themselves after the game.
The NFL is the USA's most popular sporting league, with hundreds of millions of people globally tuning in to watch games every season, including the four set to be played in the UK next year.
Statistics are seen as another tool in the league's belt for engaging fans. Next Gen Stats (NGS) was established in 2015, when the league started to install Zebra Technologies RFID chips in the footballs and players' pads, helping it collect 1TB of GPS data per season.
Speaking at AWS re:Invent in Las Vegas this week, Michael Chi, director of engineering for NFL Next Gen Stats, explained: "It's our responsibility to continuously find ways to enhance the game and make it more accessible for fans, and with Next Gen Stats we do that with data."
Now that department is looking to AWS machine learning tools like SageMaker to enable its engineers, who aren't necessarily data scientists, to create more advanced Next Gen Stats. These include coverage, meaning who is covering whom on a given play, and completion probability, its latest statistic, which predicts the likelihood of a pass being caught by the receiver.
This data is shared in real time with broadcasters like ESPN to enrich their coverage with advanced statistics. It can also be shared on social media or with the media. Coaches and staff then get access via a research tool after the game to get performance metrics.
Now, the Next Gen Stats department is looking to develop more complex statistics to share with fans using machine learning techniques.
SageMaker, which was announced at re:Invent last year, is a fully managed AWS platform that helps customers build, train and deploy machine learning models without needing to be experts.
Chi explained that his team predominantly deal in two types of statistic: derived metrics and rule-based stats.
Air yards - how far the ball travels in the air after it leaves the quarterback's hand - is a simple derived metric. If you have tracking data and know where the receiver and the line of scrimmage are, it is straightforward to calculate.
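As a rough illustration of how little is needed for a derived metric like this, here is a minimal Python sketch. It assumes positions are measured in yards downfield in the direction the offence is attacking; the class, field names and example numbers are hypothetical and are not taken from the NFL's own tooling.

```python
from dataclasses import dataclass


@dataclass
class PassPlay:
    # Hypothetical fields: both positions are in yards, measured downfield
    # in the direction the offence is attacking.
    line_of_scrimmage_x: float
    target_x: float  # receiver's downfield position when the ball arrives


def air_yards(play: PassPlay) -> float:
    """Downfield distance from the line of scrimmage to the target point."""
    return play.target_x - play.line_of_scrimmage_x


# Illustrative values only: a ball snapped at the 25 and caught at 40.5
example = PassPlay(line_of_scrimmage_x=25.0, target_x=40.5)
print(air_yards(example))
```

A pass caught behind the line of scrimmage would simply come out negative under this convention.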
Rule-based stats cover more difficult, algorithmic statistics such as 'coverage': which defender is covering a receiver, and how well they are doing it.
Chi outlined exactly why this is more complex to work out. "Start with the receiver and which defender is lined up against them," he explained.
"Then take a step further and over the course of the play define which defender was closest on average, then you might ask which defender was closest at the time of the throw. If all three match you would be pretty confident in the stat."
However, "that doesn't always play out," he added. If a receiver is lined up close to a fellow receiver, or the defence is playing zone rather than man-to-man coverage, these algorithms "could break down".
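The three checks Chi describes can be sketched as a simple vote. This is only an illustration under stated assumptions: the function name and data layout are hypothetical, each track is a list of (x, y) positions with one entry per frame, and "lined up against" is approximated as the nearest defender at the first frame, which is a simplification rather than the NFL's actual rule.

```python
from math import hypot
from statistics import mean


def coverage_call(receiver_track, defender_tracks, throw_frame):
    """Apply three nearest-defender rules and report whether they agree.

    receiver_track: list of (x, y) positions, one per frame.
    defender_tracks: dict of defender id -> list of (x, y), same length.
    throw_frame: index of the frame at which the ball is thrown.
    """
    def dist(defender, frame):
        rx, ry = receiver_track[frame]
        dx, dy = defender_tracks[defender][frame]
        return hypot(dx - rx, dy - ry)

    ids = list(defender_tracks)
    frames = range(len(receiver_track))

    # Rule 1 (assumed proxy): nearest defender at the snap
    at_snap = min(ids, key=lambda d: dist(d, 0))
    # Rule 2: nearest defender on average over the play
    on_average = min(ids, key=lambda d: mean(dist(d, f) for f in frames))
    # Rule 3: nearest defender at the time of the throw
    at_throw = min(ids, key=lambda d: dist(d, throw_frame))

    votes = (at_snap, on_average, at_throw)
    return votes, len(set(votes)) == 1  # confident only if all three match


receiver = [(0, 0), (5, 0), (10, 0)]
defenders = {
    "CB1": [(0, 1), (5, 4), (10, 8)],   # starts tight, then falls away
    "S1": [(0, 10), (5, 3), (10, 1)],   # starts deep, closes by the throw
}
print(coverage_call(receiver, defenders, throw_frame=2))
```

In the made-up play above the rules disagree at the throw, which is the kind of hand-off a zone defence produces and where a purely rule-based stat becomes unreliable.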
By using machine learning techniques with SageMaker, Chi and his team were able to "use data and train a model to learn these nuances in the training data, to approach these problems differently for edge cases," he said.
Now to its latest stat: completion probability.
Taking 10 key features - including receiver separation, time to throw, quarterback separation from a rusher, scramble yards and air yards/distance - the NGS department trained its model on 35,000 pass plays from the previous two seasons.
Using the open source XGBoost gradient boosting algorithm with a 70-20-10 train/validate/test split, the team was able to validate a best-performing model using SageMaker, Chi said.
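To put numbers on that split, here is a minimal Python sketch of a 70-20-10 partition applied to 35,000 plays. It shows only the data split, not the XGBoost training or anything SageMaker-specific; the function name, the fixed random seed and the use of integer play IDs are assumptions made for illustration.

```python
import random


def split_70_20_10(plays, seed=0):
    """Shuffle the plays and cut them into train/validate/test subsets."""
    shuffled = list(plays)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability

    n = len(shuffled)
    n_train = n * 70 // 100
    n_validate = n * 20 // 100

    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_validate]
    test = shuffled[n_train + n_validate:]  # the remaining ~10%
    return train, validate, test


# Stand-in play IDs; the real inputs would be feature rows per pass play
train, validate, test = split_70_20_10(range(35_000))
print(len(train), len(validate), len(test))
```

On 35,000 plays this works out to 24,500 for training, 7,000 for validation and 3,500 held back for the final test.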
In theory this allows the NFL to show the quickly falling likelihood of a pass being completed as the receiver gets further away from the quarterback, becomes better covered by his defender, or as the quarterback gets closed down by a defender, reducing his time to throw.
Take Odell Beckham's famous one-handed grab (below), for example. Although it occurred in 2014, before this Next Gen Stat existed, Computerworld UK would assume it would have been assigned a very low completion probability.