Expected Goals (xG)

Overview

Expected goals (xG) is a statistical metric in association football that assigns a probability to each shot resulting in a goal. By summing these probabilities across a match, season, or set of shots, xG estimates how many goals a team or player would be expected to score given the chances created, independent of whether those chances were actually converted.

xG values are produced by statistical or machine-learning models trained on historical shot data. Models typically include features such as shot location (distance and angle), body part used, type of assist, phase of play, and defensive pressure. An xG value of 0.3 means that shots with similar characteristics are scored about 30% of the time; it is not a prediction about any single shot.

The concept was formalized by Sam Green (Opta) in 2012, though earlier work by Ensum, Pollard, and Taylor (2004) identified distance, angle, defender proximity, and whether the shot followed a cross as significant factors.

Why It Matters

xG is the foundation of modern football analytics and sports betting models because:
1. Better λ for Poisson models: xG-based averages give a more stable estimate of a team's scoring rate λ than simple historical scoring averages
2. Team quality assessment: xG differential (xG for minus xG against) measures offensive and defensive quality more reliably than actual goals, especially over small samples
3. Overperformance detection: Teams scoring well above their xG tend to regress toward it; teams scoring below their xG are candidates for positive regression
4. Market inefficiency: Bookmakers may not fully price xG information, so a model that uses xG can sometimes find mispriced odds
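
As a rough illustration of points 1 and 4, the sketch below turns two xG-based scoring rates into match outcome probabilities and a margin-free decimal price. It assumes both teams score independently according to a Poisson distribution. The function names and the λ values are hypothetical and chosen only for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k goals when goals follow Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def match_outcome_probs(lam_home, lam_away, max_goals=10):
    """Home win / draw / away win probabilities under independent Poisson scoring."""
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    total = home + draw + away  # slightly below 1 because the score grid is truncated
    return home / total, draw / total, away / total

# Hypothetical xG-based scoring rates (illustrative values, not real data)
lam_home, lam_away = 1.6, 1.1
p_home, p_draw, p_away = match_outcome_probs(lam_home, lam_away)
fair_odds_home = 1 / p_home  # decimal odds with no bookmaker margin
print(round(p_home, 3), round(p_draw, 3), round(p_away, 3), round(fair_odds_home, 2))
```

Comparing `fair_odds_home` with a quoted price is one simple way to look for value. A quoted price also includes the bookmaker's margin, which this sketch ignores.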

Different providers (Opta, StatsBomb, Stats2, Understat) use different models, so xG figures are not directly comparable across sources.

Key Formula

Simple xG from distance (logistic model):

$$xG = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{distance})}}$$
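
A minimal sketch of this formula in Python follows. The coefficient values and the use of metres are assumptions made for illustration; in practice β₀ and β₁ come from fitting the model to historical shots, and β₁ is expected to be negative so that xG falls as distance grows.

```python
import math

# Hypothetical coefficients for illustration only; real values come from
# fitting the model to historical shot data.
BETA_0 = 0.5    # intercept
BETA_1 = -0.15  # per-metre effect of distance (negative: farther = less likely)

def xg_from_distance(distance_m, beta_0=BETA_0, beta_1=BETA_1):
    """Logistic xG: 1 / (1 + exp(-(beta_0 + beta_1 * distance)))."""
    return 1.0 / (1.0 + math.exp(-(beta_0 + beta_1 * distance_m)))

for d in (6, 11, 18, 25):
    print(f"{d:>2} m -> xG = {xg_from_distance(d):.3f}")
```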

xG for a team over a match:

$$xG_{team} = \sum_{i=1}^{n} xG_i$$

Where n = number of shots taken by the team.

Key factors in xG models:

| Factor | Effect |
|---|---|
| Distance from goal | Primary predictor; probability falls steeply as distance increases |
| Angle to goal | Wider angle between the posts as seen from the shot = higher probability |
| Defender proximity | A closer nearest defender reduces probability |
| Body part | Headers are scored less often than foot shots from comparable positions |
| Shot type | Open play vs. set piece vs. penalty (~0.76 xG) |
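
The two location features can be derived from shot coordinates. The sketch below assumes a hypothetical coordinate convention: `x` is the perpendicular distance from the goal line in metres and `y` is the lateral offset from the centre of the goal. Provider coordinate systems differ, so raw coordinates would need converting first.

```python
import math

GOAL_WIDTH_M = 7.32  # distance between the posts under the Laws of the Game

def shot_distance(x, y):
    """Distance in metres from the shot to the centre of the goal line."""
    return math.hypot(x, y)

def shot_angle(x, y):
    """Angle in radians subtended by the two posts as seen from the shot (x > 0)."""
    half = GOAL_WIDTH_M / 2
    # atan2(|cross|, dot) of the vectors from the shot to each post
    return math.atan2(GOAL_WIDTH_M * x, x * x + y * y - half * half)

# A shot from the penalty spot versus one from the same depth but 15 m wide
central = shot_angle(11.0, 0.0)
wide = shot_angle(11.0, 15.0)
print(f"central: {math.degrees(central):.1f} deg, wide: {math.degrees(wide):.1f} deg")
```

The central shot sees a wider opening between the posts than the wide one, which is the effect summarized in the "Angle to goal" row.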

Worked Example

A team takes 15 shots in a match:
- 5 shots from outside the box (xG=0.05 each = 0.25 total)
- 6 shots from inside the box (xG=0.15 each = 0.90 total)
- 3 shots from close range (xG=0.30 each = 0.90 total)
- 1 penalty (xG=0.76)

Total xG = 2.81

Actual goals scored: 3 (one long-range shot went in)

Overperformance: 3 - 2.81 = +0.19 (the long-range goal was unlikely given its xG of 0.05)
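
The arithmetic above can be reproduced in a few lines. The sketch also estimates how likely three or more goals were from these chances, under the added assumption (not part of the example itself) that the shots are independent of one another. The helper `goal_distribution` is a hypothetical name.

```python
# Shot xG values taken from the worked example
shots = [0.05] * 5 + [0.15] * 6 + [0.30] * 3 + [0.76]

total_xg = sum(shots)
actual_goals = 3
overperformance = actual_goals - total_xg

def goal_distribution(shot_xgs):
    """Probability of scoring 0..n goals, treating shots as independent."""
    dist = [1.0]  # with no shots, 0 goals is certain
    for p in shot_xgs:
        new = [0.0] * (len(dist) + 1)
        for goals, prob in enumerate(dist):
            new[goals] += prob * (1 - p)   # this shot is missed
            new[goals + 1] += prob * p     # this shot is scored
        dist = new
    return dist

dist = goal_distribution(shots)
p_three_or_more = sum(dist[3:])
print(f"total xG = {total_xg:.2f}, overperformance = {overperformance:+.2f}")
print(f"P(3+ goals) = {p_three_or_more:.2f}")
```

The mean of this distribution equals the total xG, but the spread around it shows why a single-match difference between goals and xG carries little information.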

Code Snippet

import numpy as np
from sklearn.linear_model import LogisticRegression

def build_xg_model(shots_df):
    """
    Build a simple xG model from historical shot data.
    shots_df needs: distance, angle, defender_near, is_big_chance, goal (0/1)
    """
    features = ['distance', 'angle', 'defender_near', 'is_big_chance']
    X = shots_df[features].values
    y = shots_df['goal'].values
    model = LogisticRegression()
    model.fit(X, y)
    return model

def predict_xg(model, shot_features):
    """Predict xG for a single shot."""
    return model.predict_proba([shot_features])[0, 1]

def team_xg(shots_df, team_filter):
    """Sum xG for all shots by a team (shots_df needs 'team' and 'xg' columns)."""
    team_shots = shots_df[shots_df['team'] == team_filter]
    return team_shots['xg'].sum()

# Example: estimate lambda for a Poisson model from recent match xG totals
xg_last_5_matches = [2.5, 1.8, 3.1, 2.2, 2.9]
lambda_estimate = np.mean(xg_last_5_matches)  # 2.5

Pitfalls

  • xG data is expensive: Commercial shot-level feeds (Opta, StatsBomb) require a subscription. Some free data sources (e.g. football-data.co.uk) don't include shot-level data.
  • Model heterogeneity: Different providers use different models — xG from Understat isn't comparable to xG from Opta.
  • xG is descriptive, not predictive: It describes past shot quality. For prediction, use historical xG averages to estimate future λ.
  • Non-penalty xG (npxG): Penalties (~0.76 xG) distort team strength estimates. Use npxG for better modeling.
  • Sample size: xG from a single match is noisy. Use rolling averages over 5–10 matches for team strength estimation.
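
The last two pitfalls can be handled together, as in the minimal sketch below. The match xG totals reuse the figures from the code snippet, the penalty counts are hypothetical, and `rolling_mean` and `non_penalty_xg` are illustrative helper names. Subtracting a flat 0.76 per penalty is an approximation; with shot-level data, summing the non-penalty shots directly would be more accurate.

```python
PENALTY_XG = 0.76  # approximate penalty xG quoted in this article

def non_penalty_xg(match_xg, penalties_taken):
    """Approximate npxG by removing a fixed xG value per penalty taken."""
    return match_xg - PENALTY_XG * penalties_taken

def rolling_mean(values, window):
    """Mean of the last `window` values at each position (shorter at the start)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Per-match records: (total xG, penalties taken); penalty counts are hypothetical
matches = [(2.5, 1), (1.8, 0), (3.1, 1), (2.2, 0), (2.9, 0)]
npxg = [non_penalty_xg(xg, pens) for xg, pens in matches]
lambda_estimates = rolling_mean(npxg, window=5)
print([round(v, 3) for v in lambda_estimates])
```

The final entry of `lambda_estimates` is the five-match npxG average, which can serve as a λ estimate for the next match.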

See Also