Building Reproducible xG Models from StatsBomb Open Data

Summary

This academic paper (available on ResearchGate) presents a reproducible expected goals (xG) modeling pipeline built on StatsBomb's open data and compares logistic regression with mixed-effects models for xG prediction. The analysis uses 10,709 non-penalty shots from La Liga 2015/2016 and the 2018 FIFA World Cup and shows that xG models built on club data can transfer to international football.

The paper's key contribution is showing that a well-constructed logistic regression xG model with 5-7 features achieves comparable predictive accuracy to complex ML models, and that mixed-effects models can account for player-level variation in shooting ability.
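
For reference, the two model families can be written side by side. This note does not record the paper's exact specification, so the form below is an assumption: a standard random-intercept logistic model, where p_ij is the probability that shot i taken by player j is scored, x_ij is the shot's feature vector (distance, angle, body part, pressure), and u_j is a player-level random intercept.

```latex
\operatorname{logit}\bigl(p_{ij}\bigr)
  = \beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}_{ij} + u_j,
\qquad
u_j \sim \mathcal{N}\!\left(0,\ \sigma_u^{2}\right)
```

Setting u_j = 0 for every player recovers the plain logistic regression; a larger estimated \sigma_u^2 indicates more between-player variation in finishing that shot location alone does not explain.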

Key Concepts

  • Reproducibility via open data: StatsBomb open data is freely available, so both the data and the code behind an xG study can be shared and the results reproduced
  • Logistic regression as baseline: Simple logistic regression with distance, angle, body part, and pressure features achieves an AUC of roughly 0.75-0.80 on shot classification
  • Mixed-effects models: Accounting for player-level random effects improves xG estimates for individual players, which matters for team-level xG when shooting quality varies between squads
  • Cross-dataset validation: Models trained on La Liga data transfer reasonably well to World Cup data, validating the generalizability of xG features
  • Feature importance: Distance and angle are the dominant predictors; body part and defensive pressure add incremental value
  • World Cup 2018 data: The paper includes 2018 World Cup data, directly relevant to the World Cup prediction model
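
The distance and angle features named above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes StatsBomb's 120 x 80 pitch coordinates with the goal centred at (120, 40) and posts at y = 36 and y = 44, and the coefficients in PLACEHOLDER_COEF are made-up values chosen only to show the shape of the calculation. The function names are hypothetical.

```python
import math

# StatsBomb pitch coordinates: 120 x 80 units, attacking goal centred at
# (120, 40), with the posts at y = 36 and y = 44.
GOAL_X = 120.0
GOAL_CENTRE_Y = 40.0
LEFT_POST_Y = 36.0
RIGHT_POST_Y = 44.0

# ASSUMPTION: placeholder coefficients for illustration only.
# They are NOT the paper's fitted values.
PLACEHOLDER_COEF = {
    "intercept": -1.0,
    "distance": -0.10,  # per pitch unit from the goal centre
    "angle": 1.2,       # per radian of visible goal mouth
    "header": -0.8,
    "pressure": -0.4,
}


def shot_distance(x, y):
    """Straight-line distance from the shot location to the goal centre."""
    return math.hypot(GOAL_X - x, GOAL_CENTRE_Y - y)


def shot_angle(x, y):
    """Angle in radians subtended by the two posts at the shot location."""
    to_right_post = math.atan2(RIGHT_POST_Y - y, GOAL_X - x)
    to_left_post = math.atan2(LEFT_POST_Y - y, GOAL_X - x)
    return abs(to_right_post - to_left_post)


def xg(x, y, is_header=False, under_pressure=False, coef=PLACEHOLDER_COEF):
    """Logistic xG: sigmoid of a linear combination of the shot features."""
    z = (
        coef["intercept"]
        + coef["distance"] * shot_distance(x, y)
        + coef["angle"] * shot_angle(x, y)
        + coef["header"] * float(is_header)
        + coef["pressure"] * float(under_pressure)
    )
    return 1.0 / (1.0 + math.exp(-z))
```

In a real pipeline the coefficients would come from fitting the regression to the shot data; only the feature geometry carries over unchanged.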

Key Findings

  • Distance is primary: Shots within 6 meters have xG ~0.30-0.40; shots beyond 20 meters have xG < 0.05
  • Angle matters: Optimal shooting angle (facing goal directly) adds ~0.10 xG vs. narrow angles
  • Headers are undervalued in simple models: Mixed-effects models show headers have player-specific variation that simple models miss
  • Set piece xG: Direct free kicks have ~0.05 xG; corners have ~0.02 xG; penalties, which fall outside the 10,709 non-penalty shots used for modeling, have ~0.76 xG
  • Model transfer: An xG model trained on La Liga data transfers to World Cup data with minimal degradation, which suggests the core features generalize across competitions
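
One way to quantify "minimal degradation" is to compare AUC on the training competition with AUC on the transfer competition. The sketch below computes AUC directly from its definition (the probability that a randomly chosen goal gets a higher predicted xG than a randomly chosen non-goal, with ties counted as half). The function names are hypothetical, and whether the paper measures transfer this way is an assumption.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of goal and non-goal scores.

    labels: iterable of 1 (goal) or 0 (no goal)
    scores: iterable of predicted xG values, same length as labels
    """
    goals = [s for l, s in zip(labels, scores) if l == 1]
    misses = [s for l, s in zip(labels, scores) if l == 0]
    if not goals or not misses:
        raise ValueError("AUC needs at least one goal and one non-goal")
    wins = 0.0
    for g in goals:
        for m in misses:
            if g > m:
                wins += 1.0
            elif g == m:
                wins += 0.5  # ties count as half a win
    return wins / (len(goals) * len(misses))


def transfer_gap(source_labels, source_scores, target_labels, target_scores):
    """AUC on the source competition minus AUC on the target competition."""
    return roc_auc(source_labels, source_scores) - roc_auc(
        target_labels, target_scores
    )
```

A gap near zero would support the transfer claim. The pairwise loop is quadratic in the number of shots, which is acceptable at the scale of roughly ten thousand shots but would be replaced by a rank-based computation for larger data.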

Notes

  • This paper directly addresses the World Cup modeling use case because its analysis includes 2018 World Cup data
  • The mixed-effects model approach is particularly interesting for team-level xG: it accounts for the fact that some teams have better shooters than others, beyond just shot location
  • Key finding for the model: the core xG features appear to hold across competitions, so a model built on club football data should apply to international football
  • The paper confirms that xG is not just a descriptive metric but a valid predictive input for match outcome models
  • The existing expected-goals-xg.md note covers the concept; this source adds the academic validation and StatsBomb-specific implementation details
  • For the World Cup model: the paper suggests using logistic regression xG with distance, angle, body_part, and pressure as the core feature set
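
As a starting point for that feature set, the sketch below turns StatsBomb event records into model rows. It assumes the event layout of the StatsBomb open-data JSON (a "type" name of "Shot", a "location" pair, an optional "under_pressure" flag, and a nested "shot" object carrying "type", "body_part", and "outcome"); the function name shot_rows and the sample events are made up for illustration and should be checked against the actual files.

```python
def shot_rows(events):
    """Extract non-penalty shots as flat feature rows.

    events: list of dicts in the (assumed) StatsBomb open-data event layout.
    """
    rows = []
    for event in events:
        if event.get("type", {}).get("name") != "Shot":
            continue
        shot = event.get("shot", {})
        if shot.get("type", {}).get("name") == "Penalty":
            continue  # keep the sample non-penalty, as in the paper
        x, y = event["location"][:2]
        rows.append(
            {
                "x": x,
                "y": y,
                "body_part": shot.get("body_part", {}).get("name"),
                # the flag is assumed to be absent when there is no pressure
                "under_pressure": bool(event.get("under_pressure", False)),
                "is_goal": shot.get("outcome", {}).get("name") == "Goal",
            }
        )
    return rows


# Hand-written sample events (not real match data).
sample_events = [
    {"type": {"name": "Pass"}, "location": [60.0, 40.0]},
    {
        "type": {"name": "Shot"},
        "location": [108.0, 40.0],
        "shot": {"type": {"name": "Penalty"}, "outcome": {"name": "Goal"}},
    },
    {
        "type": {"name": "Shot"},
        "location": [112.0, 38.0],
        "under_pressure": True,
        "shot": {
            "type": {"name": "Open Play"},
            "body_part": {"name": "Head"},
            "outcome": {"name": "Saved"},
        },
    },
]

rows = shot_rows(sample_events)
```

Distance and angle can then be derived from each row's x and y, and body_part can be encoded as an indicator (for example, header versus foot) before fitting the regression.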