Expected Goals (xG)
Overview
Expected goals (xG) is a statistical metric in association football that assigns a probability to each shot resulting in a goal. By summing these probabilities across a match, season, or set of shots, xG estimates how many goals a team or player would be expected to score given the chances created, independent of whether those chances were actually converted.
xG values are produced by statistical or machine-learning models trained on historical shot data. Models typically include features such as shot location (distance and angle), body part used, type of assist, phase of play, and defensive pressure. An xG value of 0.3 means that shots with similar characteristics are scored about 30% of the time; it is not a prediction about any single shot.
The concept was formalized by Sam Green (Opta) in 2012, though earlier work by Ensum, Pollard, and Taylor (2004) identified distance, angle, defender proximity, and whether the shot followed a cross as significant factors.
Why It Matters
xG underpins much of modern football analytics and many sports betting models because:
1. Better λ for Poisson models: xG gives a more stable estimate of a team's scoring rate (λ) than simple historical goal averages
2. Team quality assessment: xG differential (xG for minus xG against) measures offensive and defensive quality more reliably than actual goals
3. Overperformance detection: Teams scoring more than their xG are likely to regress toward it; teams scoring less than their xG are candidates for positive regression
4. Market inefficiency: Bookmakers may not fully price xG information, so a model that uses xG can sometimes find mispriced odds
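Points 2 and 3 reduce to simple arithmetic on season totals. The sketch below is illustrative only: the team names, the figures, and the `TeamSeason` field names are invented for the example and do not follow any provider's schema.

```python
from dataclasses import dataclass


@dataclass
class TeamSeason:
    """Season totals for one team (all values below are invented)."""
    name: str
    goals_for: int
    goals_against: int
    xg_for: float
    xg_against: float

    @property
    def xg_differential(self) -> float:
        # xG for minus xG against: a proxy for underlying team quality
        return self.xg_for - self.xg_against

    @property
    def overperformance(self) -> float:
        # Goals scored minus xG created: positive means the team
        # has scored more than its chances would suggest
        return self.goals_for - self.xg_for


teams = [
    TeamSeason("Team A", goals_for=60, goals_against=40, xg_for=52.5, xg_against=44.0),
    TeamSeason("Team B", goals_for=45, goals_against=50, xg_for=55.0, xg_against=42.5),
]

for t in teams:
    print(f"{t.name}: xGD = {t.xg_differential:+.1f}, goals - xG = {t.overperformance:+.1f}")
```

In this made-up data, Team B has the better xG differential despite the worse goal record, which is the kind of gap the regression argument in point 3 is about.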
Different providers (Opta, StatsBomb, Stats2, Understat) use different models, so xG figures are not directly comparable across sources.
Key Formula
Simple xG from distance (logistic model):
$$xG = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{distance})}}$$
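As a minimal sketch, the logistic formula can be evaluated directly. The coefficient values `beta0 = 0.5` and `beta1 = -0.15` are placeholders chosen for illustration, not values fitted to real shot data; the only substantive assumption is that `beta1` is negative, so xG falls as distance grows.

```python
import math


def xg_from_distance(distance_m: float, beta0: float = 0.5, beta1: float = -0.15) -> float:
    """Logistic xG from distance alone.

    beta0 and beta1 are placeholder values for illustration,
    not coefficients fitted to real shot data.
    """
    z = beta0 + beta1 * distance_m
    return 1.0 / (1.0 + math.exp(-z))


# xG shrinks as the shot moves further from goal
for d in (6, 11, 18, 25):
    print(f"{d:>2} m -> xG = {xg_from_distance(d):.3f}")
```

In practice the coefficients would come from fitting a logistic regression to historical shots.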
xG for a team over a match:
$$xG_{\text{team}} = \sum_{i=1}^{n} xG_i$$
where $n$ is the number of shots taken by the team and $xG_i$ is the xG of shot $i$.
Key factors in xG models:
| Factor | Effect |
|---|---|
| Distance from goal | Primary predictor: probability falls steeply as distance increases |
| Angle to goal | A wider view of the goal mouth (larger angle between the posts) = higher probability |
| Defender proximity | A closer nearest defender reduces probability |
| Body part | Headers are scored less often than foot shots from comparable positions |
| Shot type | Open play vs. set piece vs. penalty (~0.76 xG) |
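The first two factors in the table are usually derived from shot coordinates. The sketch below shows one common way to do this, taking the angle as the one subtended by the two posts at the shot location. The coordinate system (goal centre at the origin, `x` measured perpendicular to the goal line, `y` lateral) and the function name are assumptions for this example; providers use their own conventions.

```python
import math

GOAL_WIDTH_M = 7.32  # distance between the posts


def shot_distance_and_angle(x: float, y: float) -> tuple[float, float]:
    """Return (distance to goal centre in metres, goal angle in radians).

    Assumed coordinates: goal centre at the origin, x is the perpendicular
    distance from the goal line, y is the lateral offset from the centre.
    """
    distance = math.hypot(x, y)
    # Angle between the lines from the shot location to each post
    angle = math.atan2(GOAL_WIDTH_M * x, x * x + y * y - (GOAL_WIDTH_M / 2) ** 2)
    return distance, angle


central = shot_distance_and_angle(11.0, 0.0)   # roughly the penalty spot
wide = shot_distance_and_angle(11.0, 10.0)     # same depth, 10 m off-centre
print(f"central: {central[0]:.1f} m, {math.degrees(central[1]):.1f} deg")
print(f"wide:    {wide[0]:.1f} m, {math.degrees(wide[1]):.1f} deg")
```

The central shot sees a larger angle than the wide one, which is why angle carries information beyond distance alone.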
Worked Example
A team takes 15 shots in a match:
- 5 shots from outside the box (0.05 xG each, 0.25 total)
- 6 shots from inside the box (0.15 xG each, 0.90 total)
- 3 shots from close range (0.30 xG each, 0.90 total)
- 1 penalty (0.76 xG)
Total xG = 0.25 + 0.90 + 0.90 + 0.76 = 2.81
Actual goals scored: 3 (one long-range shot went in)
Overperformance: 3 - 2.81 = +0.19 (the long-range goal was unlikely given its xG of 0.05)
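The arithmetic in this example can be checked with a few lines of code. The dictionary layout is just a convenient way to hold the shot groups listed above.

```python
# (number of shots, xG per shot) for each group in the worked example
shot_groups = {
    "outside the box": (5, 0.05),
    "inside the box": (6, 0.15),
    "close range": (3, 0.30),
    "penalty": (1, 0.76),
}

total_shots = sum(n for n, _ in shot_groups.values())
total_xg = sum(n * xg for n, xg in shot_groups.values())
actual_goals = 3

print(f"shots = {total_shots}, total xG = {total_xg:.2f}")
print(f"overperformance = {actual_goals - total_xg:+.2f}")
```

This reproduces the figures in the example: 15 shots, 2.81 total xG, and an overperformance of +0.19.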
Code Snippet

    import numpy as np
    from sklearn.linear_model import LogisticRegression


    def build_xg_model(shots_df):
        """
        Build a simple xG model from historical shot data.

        shots_df needs: distance, angle, defender_near, is_big_chance, goal (0/1)
        """
        features = ['distance', 'angle', 'defender_near', 'is_big_chance']
        X = shots_df[features].values
        y = shots_df['goal'].values
        model = LogisticRegression()
        model.fit(X, y)
        return model


    def predict_xg(model, shot_features):
        """Predict xG for a single shot (features in the order used for training)."""
        # predict_proba returns [[P(no goal), P(goal)]]; take P(goal)
        return model.predict_proba([shot_features])[0, 1]


    def team_xg(shots_df, team_filter):
        """Sum xG for all shots by a team; shots_df needs 'team' and 'xg' columns."""
        team_shots = shots_df[shots_df['team'] == team_filter]
        return team_shots['xg'].sum()


    # Example: estimate lambda for Poisson from xG
    xg_total_last_5_matches = [2.5, 1.8, 3.1, 2.2, 2.9]
    lambda_estimate = np.mean(xg_total_last_5_matches)  # about 2.5

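The λ estimate at the end of the snippet above can feed a Poisson model directly. The following is a minimal sketch, assuming goals per match follow a Poisson distribution with rate λ equal to the average xG; `poisson_pmf` is a helper written for this example, not part of any library.

```python
import math
from statistics import mean


def poisson_pmf(k: int, lam: float) -> float:
    """P(exactly k goals) when goals follow a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


xg_last_5_matches = [2.5, 1.8, 3.1, 2.2, 2.9]
lam = mean(xg_last_5_matches)  # average xG per match, used as the Poisson rate

for k in range(5):
    print(f"P({k} goals) = {poisson_pmf(k, lam):.3f}")
```

A fuller model would adjust λ for opponent strength and home advantage; this sketch only shows the step from average xG to a goal distribution.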
Pitfalls
- xG data can be expensive: Detailed shot-level feeds from Opta or StatsBomb require a subscription, and free results sources such as football-data.co.uk don't include shot-level data.
- Model heterogeneity: Different providers use different models — xG from Understat isn't comparable to xG from Opta.
- xG is descriptive, not predictive: It describes past shot quality. For prediction, use historical xG averages to estimate future λ.
- Non-penalty xG (npxG): Penalties (~0.76 xG) distort team strength estimates. Use npxG for better modeling.
- Sample size: xG from a single match is noisy. Use rolling averages over 5–10 matches for team strength estimation.
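The last two pitfalls can be handled together: strip penalty xG first, then smooth over recent matches. The sketch below uses invented per-match figures and a hypothetical `rolling_mean` helper; the 5-match window is one choice within the 5–10 range suggested above.

```python
from collections import deque
from typing import Iterable, List


def rolling_mean(values: Iterable[float], window: int = 5) -> List[float]:
    """Trailing mean over up to `window` most recent values (shorter at the start)."""
    recent = deque(maxlen=window)
    out = []
    for v in values:
        recent.append(v)
        out.append(sum(recent) / len(recent))
    return out


# Hypothetical per-match figures: total xG and the share that came from penalties
match_xg = [2.5, 1.8, 3.1, 2.2, 2.9, 1.4]
penalty_xg = [0.76, 0.0, 0.76, 0.0, 0.0, 0.0]

# Non-penalty xG per match, then a 5-match trailing average
npxg = [total - pen for total, pen in zip(match_xg, penalty_xg)]
npxg_form = rolling_mean(npxg, window=5)

for match_no, value in enumerate(npxg_form, start=1):
    print(f"after match {match_no}: rolling npxG = {value:.2f}")
```

Early values average fewer than five matches, so they are noisier than the later ones and are best treated with caution.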
See Also
- poisson-distribution — xG provides λ estimates for Poisson prediction models
- dixon-coles-correction — DC can use xG-based λ for better predictions
- api-football — provides match-level xG stats at the API level
- sportradar — provides detailed shot event data for building custom xG models