ML-KULeuven soccer_xg: Open-Source xG Implementation
Summary
The soccer_xg GitHub repository by KU Leuven's ML group (ML-KULeuven) is an open-source Python package for training and analyzing expected goals (xG) models on event stream data in the SPADL format (Soccer Player Action Description Language). It provides complete ML pipelines for building xG models from StatsBomb open data, covering feature engineering, logistic regression, gradient boosting, and evaluation.
It is among the most academically rigorous open-source xG implementations available, with detailed documentation of features and model choices and an emphasis on reproducibility. The repository demonstrates how xG models are built from first principles using shot location, body part, assist type, and defensive pressure features.
Key Concepts
- SPADL format: Standardized action format (Soccer Player Action Description Language) that normalizes event stream data from different providers (StatsBomb, Wyscout, Opta) into a common schema
- Feature set for xG: Distance from goal, angle to goal, body part, assist type, shot type (open play vs. set piece), defensive pressure (nearest defender distance), big chance flag
- Model choices: Logistic regression (baseline) and gradient boosting (LightGBM) for predicting goal probability from shot features
- Reproducibility: Built using StatsBomb open data with clear data preprocessing steps
- socceraction library: The parent library (socceraction) provides the SPADL conversion and feature extraction pipeline
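To make the normalization idea concrete, here is a minimal sketch in plain Python of mapping a provider event onto a common, SPADL-style schema and then selecting shots. It is not socceraction's converter: the `Action` class, `from_provider_event`, and the input event keys are hypothetical, and a real conversion also handles direction of play, axis orientation, and many more columns. The 105 x 68 m pitch and the shot type names (`shot`, `shot_penalty`, `shot_freekick`) follow SPADL; the 120 x 80 input range matches StatsBomb coordinates.

```python
from dataclasses import dataclass
from typing import List

# SPADL pitch dimensions in metres (length x width).
PITCH_LENGTH, PITCH_WIDTH = 105.0, 68.0

# Action types that SPADL uses for shots.
SHOT_TYPES = {"shot", "shot_penalty", "shot_freekick"}


@dataclass
class Action:
    """A reduced SPADL-style action (hypothetical; real rows carry more columns)."""
    game_id: int
    team_id: int
    type_name: str
    bodypart_name: str
    result_name: str
    start_x: float
    start_y: float


def from_provider_event(event: dict, x_max: float, y_max: float) -> Action:
    """Map a hypothetical provider event onto the common schema.

    x_max and y_max give the provider's coordinate range, which is rescaled
    to the 105 x 68 pitch.
    """
    return Action(
        game_id=event["game_id"],
        team_id=event["team_id"],
        type_name=event["type"],
        bodypart_name=event["bodypart"],
        result_name=event["result"],
        start_x=event["x"] / x_max * PITCH_LENGTH,
        start_y=event["y"] / y_max * PITCH_WIDTH,
    )


def select_shots(actions: List[Action]) -> List[Action]:
    """Keep only the actions an xG model is trained on."""
    return [a for a in actions if a.type_name in SHOT_TYPES]


# Toy events for illustration only, not real match data.
events = [
    {"game_id": 1, "team_id": 10, "type": "pass", "bodypart": "foot",
     "result": "success", "x": 60.0, "y": 40.0},
    {"game_id": 1, "team_id": 10, "type": "shot", "bodypart": "foot",
     "result": "fail", "x": 108.0, "y": 40.0},
]
actions = [from_provider_event(e, x_max=120.0, y_max=80.0) for e in events]
shots = select_shots(actions)
```

Once every provider's events land in the same schema, the downstream feature and model code only has to be written once.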
Key Features Used in xG Models
- Distance from goal: Primary predictor — exponential decay in xG with distance
- Angle to goal: A wider angle subtended by the goalposts (central positions facing the goal directly) means higher xG
- Body part: Foot shots ~0.35 xG, headers ~0.11 xG, other ~0.15 xG at similar positions
- Assist type: Through-ball > cross > pass from behind
- Shot type: Open play vs. penalty (~0.76 xG) vs. free kick vs. corner
- Defender proximity: Nearest defender distance significantly reduces xG
- Phase of play: Fast break vs. established attack vs. set piece
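The two geometric features at the top of this list can be computed directly from the shot location. The sketch below assumes SPADL-style coordinates in metres on a 105 x 68 pitch, with the attacked goal centred at (105, 34) and a standard 7.32 m goal width. The function names are illustrative and not taken from the repository.

```python
import math

# Assumptions: coordinates in metres on a 105 x 68 pitch, attacking towards
# the goal centred at (105, 34); standard goal width of 7.32 m.
GOAL_X, GOAL_Y = 105.0, 34.0
GOAL_WIDTH = 7.32


def shot_distance(x: float, y: float) -> float:
    """Euclidean distance from the shot location to the centre of the goal."""
    return math.hypot(GOAL_X - x, GOAL_Y - y)


def shot_angle(x: float, y: float) -> float:
    """Angle in radians subtended by the two goalposts at the shot location."""
    dx = GOAL_X - x
    dy = GOAL_Y - y
    # Cross product and dot product of the vectors to the two posts.
    # atan2 keeps the result in [0, pi] even when the dot product is
    # negative (very close shots from between the posts).
    cross = GOAL_WIDTH * dx
    dot = dx * dx + dy * dy - (GOAL_WIDTH / 2) ** 2
    return math.atan2(cross, dot)


# Penalty spot: 11 m from the goal line, central.
penalty_distance = shot_distance(94.0, 34.0)
penalty_angle_deg = math.degrees(shot_angle(94.0, 34.0))
```

At the same distance from the goal line, a central location yields a wider angle than a wide one, which is why the two features carry complementary information.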
Code Example
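```python
# Illustrative sketch using scikit-learn; soccer_xg's own API may differ.
# Assumes shots_df is a pandas DataFrame with one row per shot, numerically
# encoded feature columns, and a binary 'goal' label (hypothetical names).
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score

features = ['distance', 'angle', 'body_part', 'defender_near', 'is_big_chance']

# Build a baseline xG model (a gradient-boosting classifier such as
# LightGBM's LGBMClassifier can be swapped in here)
model = LogisticRegression(max_iter=1000)
model.fit(shots_df[features], shots_df['goal'])

# Predict xG for shots: column 1 of predict_proba is P(goal)
shots_df['xg'] = model.predict_proba(shots_df[features])[:, 1]

# Evaluate model (use a held-out set in practice, not the training shots)
metrics = {
    'brier': brier_score_loss(shots_df['goal'], shots_df['xg']),
    'auc': roc_auc_score(shots_df['goal'], shots_df['xg']),
}
```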
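Because an xG model outputs probabilities, evaluation usually relies on probabilistic scores rather than plain accuracy. As a minimal sketch of what such an evaluation step might compute, the plain-Python functions below implement the Brier score and log loss. Which metrics the repository actually reports is not stated here, so treat the choice as an assumption; the sample values are toy numbers for illustration only.

```python
import math
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted xG and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def log_loss(y_true: Sequence[int], y_prob: Sequence[float],
             eps: float = 1e-15) -> float:
    """Mean negative log-likelihood of the observed outcomes."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities away from 0 and 1 so the logarithm stays finite.
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


# Toy values for illustration only, not real match data.
goals = [0, 0, 1, 0, 1]
xg = [0.05, 0.10, 0.76, 0.30, 0.40]
print(brier_score(goals, xg), log_loss(goals, xg))
```

Lower is better for both scores. Comparing them for a logistic regression and a gradient-boosted model on the same held-out shots is one way to quantify how much the more complex model actually adds.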
Notes
- This is the best open-source xG implementation reference: the existing expected-goals-xg.md note covers the concept, but this source provides the actual implementation framework
- Key insight: even the best ML xG models rely heavily on distance and angle as primary features; simpler logistic regression gets ~80% of the predictive power of complex models
- The SPADL normalization is important for production systems: it allows training on StatsBomb data and applying to Opta data with minimal adaptation
- The repository uses LightGBM for the best xG models — gradient boosting consistently outperforms logistic regression for xG, though the improvement is modest
- For World Cup modeling: the challenge is that national team xG data is less available than club football; this repo's approach suggests using shot-level features even with limited data
- The feature importance ranking from this work confirms: distance > angle > body_part > defender_near > is_big_chance