Building Reproducible xG Models from StatsBomb Open Data
Summary
This academic paper (available on ResearchGate) presents a reproducible xG modeling pipeline using StatsBomb's open data, comparing logistic regression and mixed-effects models for xG prediction. The analysis uses 10,709 non-penalty shots from La Liga 2015/2016 and the 2018 FIFA World Cup, demonstrating that xG models built on club data can transfer to international football.
The paper's key contribution is showing that a well-constructed logistic regression xG model with 5-7 features achieves comparable predictive accuracy to complex ML models, and that mixed-effects models can account for player-level variation in shooting ability.
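For orientation, the two model families being compared can be written as follows. This is a generic sketch, assuming distance $d_i$, angle $\theta_i$, a header indicator $h_i$, and a pressure indicator $q_i$ as predictors for shot $i$; the paper's exact specification may differ.

$$
\operatorname{logit}(p_i) = \log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1 d_i + \beta_2 \theta_i + \beta_3 h_i + \beta_4 q_i
$$

$$
\operatorname{logit}(p_i) = \beta_0 + \beta_1 d_i + \beta_2 \theta_i + \beta_3 h_i + \beta_4 q_i + u_{j[i]}, \qquad u_j \sim \mathcal{N}(0, \sigma_u^2)
$$

Here $p_i$ is the probability that shot $i$ is a goal, and $u_{j[i]}$ is a random intercept for the player $j$ who took shot $i$. The first line is the plain logistic regression; the second adds the player-level random effect.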
Key Concepts
- Reproducibility via open data: Using freely available StatsBomb open data makes xG research reproducible, because both the data and the code can be shared
- Logistic regression as baseline: Simple logistic regression with distance, angle, body part, and pressure features achieves an AUC of roughly 0.75-0.80 on shot classification
- Mixed-effects models: Accounting for player-level random effects improves xG estimates for individual players — important for team-level xG where player quality variation matters
- Cross-dataset validation: Models trained on La Liga data transfer reasonably well to World Cup data, validating the generalizability of xG features
- Feature importance: Distance and angle are the dominant predictors; body part and defensive pressure add incremental value
- World Cup 2018 data: The paper includes 2018 World Cup data, directly relevant to the World Cup prediction model
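The distance and angle features named above can be derived from shot coordinates. The following is a minimal sketch, assuming StatsBomb's 120 x 80 pitch coordinates with the attacked goal centred at (120, 40) and posts at y = 36 and y = 44. The function names are illustrative and not taken from the paper.

```python
import math

# Assumed StatsBomb pitch geometry (units are roughly yards).
GOAL_X = 120.0
POST_LOW_Y = 36.0
POST_HIGH_Y = 44.0
GOAL_CENTRE_Y = (POST_LOW_Y + POST_HIGH_Y) / 2


def shot_distance(x: float, y: float) -> float:
    """Straight-line distance from the shot location to the goal centre."""
    return math.hypot(GOAL_X - x, GOAL_CENTRE_Y - y)


def shot_angle(x: float, y: float) -> float:
    """Angle in radians subtended by the two goalposts at the shot location."""
    to_high_post = math.atan2(POST_HIGH_Y - y, GOAL_X - x)
    to_low_post = math.atan2(POST_LOW_Y - y, GOAL_X - x)
    return abs(to_high_post - to_low_post)
```

Under these assumptions, a shot from the penalty spot at (108, 40) has a distance of 12 and an angle of about 0.64 radians, while a shot from the same x position near the touchline sees a much narrower angle.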
Key Findings
- Distance is primary: Shots within 6 meters have xG ~0.30-0.40; shots beyond 20 meters have xG < 0.05
- Angle matters: Optimal shooting angle (facing goal directly) adds ~0.10 xG vs. narrow angles
- Headers are undervalued in simple models: Mixed-effects models show headers have player-specific variation that simple models miss
- Set piece xG: Direct free kicks have ~0.05 xG; corners have ~0.02 xG; penalties have ~0.76 xG
- Model transfer: xG model trained on La Liga transfers to World Cup with minimal degradation — the core features are universal
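One way to check the transfer claim is to fit on one competition, score the shots of the other, and compare discrimination. The following is a minimal sketch of AUC as a pairwise ranking statistic in plain Python. The `auc` helper and the toy inputs are illustrative assumptions, not the paper's code or data.

```python
def auc(labels, scores):
    """Probability that a random goal is scored higher than a random non-goal.

    Ties count as half a win. Runs in O(n_goals * n_misses), which is fine
    for a sketch but slow for very large shot samples.
    """
    goals = [s for y, s in zip(labels, scores) if y == 1]
    misses = [s for y, s in zip(labels, scores) if y == 0]
    if not goals or not misses:
        raise ValueError("AUC needs at least one goal and one non-goal")
    wins = 0.0
    for g in goals:
        for m in misses:
            if g > m:
                wins += 1.0
            elif g == m:
                wins += 0.5
    return wins / (len(goals) * len(misses))


# Toy, made-up values purely to show the call; not results from the paper.
toy_outcomes = [0, 0, 1, 1]
toy_xg = [0.10, 0.40, 0.35, 0.80]
print(auc(toy_outcomes, toy_xg))  # 0.75
```

Computing this once on held-out shots from the training competition and once on shots from the other competition gives a direct measure of how much discrimination is lost in transfer.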
Notes
- This paper directly addresses the World Cup modeling use case — it includes 2018 World Cup data in its analysis
- The mixed-effects model approach is particularly interesting for team-level xG: it accounts for the fact that some teams have better shooters than others, beyond just shot location
- Key finding for the model: xG features are universal across competitions — a model built on club football data should apply to international football
- The paper confirms that xG is not just a descriptive metric but a valid predictive input for match outcome models
- The existing `expected-goals-xg.md` note covers the concept; this source adds the academic validation and StatsBomb-specific implementation details
- For the World Cup model: the paper suggests using logistic regression xG with distance, angle, body_part, and pressure as the core feature set
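
The raw inputs for that core feature set can be pulled from StatsBomb event records. The following is a minimal sketch, assuming the open-data event layout in which a shot event carries `type.name`, `location`, an optional `under_pressure` flag, and a nested `shot` object with `body_part`, `type`, and `outcome`. The `shot_features` helper and the sample event are hypothetical.

```python
def shot_features(event: dict):
    """Return raw model inputs for a non-penalty shot event, else None."""
    if event.get("type", {}).get("name") != "Shot":
        return None
    shot = event.get("shot", {})
    if shot.get("type", {}).get("name") == "Penalty":
        return None  # the analysis described here uses non-penalty shots
    x, y = event["location"][:2]
    return {
        "x": x,
        "y": y,
        "body_part": shot.get("body_part", {}).get("name"),
        # Assumption: the flag is only present when the shooter is pressed.
        "under_pressure": bool(event.get("under_pressure", False)),
        "is_goal": shot.get("outcome", {}).get("name") == "Goal",
    }


# Hand-written sample event for illustration; not taken from the dataset.
sample_event = {
    "type": {"name": "Shot"},
    "location": [108.0, 40.0],
    "under_pressure": True,
    "shot": {
        "type": {"name": "Open Play"},
        "body_part": {"name": "Head"},
        "outcome": {"name": "Saved"},
    },
}
features = shot_features(sample_event)
```

Distance and angle would then be computed from `x` and `y`, and `body_part` encoded as a categorical predictor before fitting the logistic regression.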