Brier Score

Overview

The Brier score is a strictly proper scoring rule for measuring the accuracy of probabilistic predictions. For binary outcomes, it is the mean squared error between predicted probabilities and actual outcomes. It was proposed by Glenn W. Brier in 1950.

A strictly proper scoring rule is one whose expected score is optimized only when the reported probability equals the true probability. If an event occurs 60% of the time, a forecaster minimizes their expected Brier score by reporting exactly 60%. Reporting any other probability makes the expected score worse, although on a single outcome a more extreme forecast can still score better by luck.
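The propriety claim can be checked numerically. The sketch below is illustrative only: `expected_brier` is a hypothetical helper, and the 60% true frequency is assumed from the example in the text. It scans a grid of reported probabilities and finds the one with the lowest expected score.

```python
import numpy as np

def expected_brier(reported, true_prob):
    """Expected Brier score when the event occurs with probability true_prob."""
    # With probability true_prob the outcome is 1, otherwise it is 0.
    return true_prob * (reported - 1.0) ** 2 + (1.0 - true_prob) * reported ** 2

true_prob = 0.60  # assumed true event frequency, matching the 60% example
candidates = np.linspace(0.0, 1.0, 101)  # reports 0.00, 0.01, ..., 1.00
scores = expected_brier(candidates, true_prob)
best_report = candidates[np.argmin(scores)]

print(f"Best report: {best_report:.2f}")
print(f"Expected score at 0.60: {expected_brier(0.60, true_prob):.4f}")
print(f"Expected score at 0.80: {expected_brier(0.80, true_prob):.4f}")
```

The expected score is a quadratic in the reported probability with its minimum at the true probability, which is why the grid search lands on 0.60.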

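The Brier score ranges from 0 (perfect) to 1 (worst); lower is better. It decomposes into reliability (calibration), resolution, and uncertainty terms. The two-part calibration + refinement form groups the last two, with refinement = uncertainty − resolution.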

Why It Matters

Brier score is the primary metric for evaluating the probabilities produced by sports betting models because:
1. It rewards accurate probabilities: A model that says 70% and is right 70% of the time scores better than one that is right 70% of the time but says 90%.
2. It's a proper scoring rule: No hedging or manipulation can improve the expected Brier score; only reporting probabilities closer to the true ones helps.
3. It decomposes: Calibration error and refinement can be analyzed separately to diagnose model problems.
4. It's a standard: Brier score is among the most widely used scoring rules in sports prediction model evaluation.
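Point 1 can be made concrete with a small calculation. This is a minimal sketch on a hypothetical sample of 10 events, 7 of which occur; `brier_score_constant` is an illustrative helper, not part of any library.

```python
import numpy as np

# Hypothetical sample: 10 events of which 7 occur (a 70% hit rate)
outcomes = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0], dtype=float)

def brier_score_constant(prob, outcomes):
    """Brier score when the same probability is reported for every event."""
    return float(np.mean((prob - outcomes) ** 2))

bs_calibrated = brier_score_constant(0.70, outcomes)     # says 70%, right 70%
bs_overconfident = brier_score_constant(0.90, outcomes)  # says 90%, right 70%

print(f"Says 70%: {bs_calibrated:.2f}")     # 0.21
print(f"Says 90%: {bs_overconfident:.2f}")  # 0.25
```

Both forecasters are right equally often; the overconfident one scores worse only because its stated probability is further from the observed frequency.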

Key Formula

Brier score (binary):

$$BS = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2$$

Where p_i = predicted probability for prediction i, o_i = actual outcome (1 if the event occurred, 0 otherwise), and N = number of predictions.

Decomposition:

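$$BS = \underbrace{\frac{1}{N} \sum_{k} n_k (p_k - \bar{o}_k)^2}_{\text{reliability (calibration)}} - \underbrace{\frac{1}{N} \sum_{k} n_k (\bar{o}_k - \bar{o})^2}_{\text{resolution}} + \underbrace{\bar{o}(1 - \bar{o})}_{\text{uncertainty}}$$

Where predictions are grouped into bins k, n_k = number of predictions in bin k, p_k = predicted probability in bin k, \bar{o}_k = observed outcome rate in bin k, and \bar{o} = overall outcome rate. The identity is exact when every prediction in a bin shares the same p_k.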
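The decomposition can be verified numerically. The sketch below assumes one bin per distinct forecast value, which keeps the identity exact; `brier_decomposition` and the sample data are hypothetical and chosen only for illustration.

```python
import numpy as np

def brier_decomposition(predicted_probs, actual_outcomes):
    """Return (reliability, resolution, uncertainty), binning by distinct forecast value."""
    p = np.asarray(predicted_probs, dtype=float)
    o = np.asarray(actual_outcomes, dtype=float)
    n = len(p)
    base_rate = o.mean()  # overall outcome rate
    reliability = 0.0
    resolution = 0.0
    for p_k in np.unique(p):
        in_bin = p == p_k          # predictions that share this forecast value
        n_k = in_bin.sum()         # bin size
        o_k = o[in_bin].mean()     # observed outcome rate in the bin
        reliability += n_k * (p_k - o_k) ** 2
        resolution += n_k * (o_k - base_rate) ** 2
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability / n, resolution / n, uncertainty

# Hypothetical forecasts using only two distinct probabilities
preds = np.array([0.8, 0.8, 0.8, 0.8, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3])
obs = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0])

rel, res, unc = brier_decomposition(preds, obs)
bs = np.mean((preds - obs) ** 2)
print(f"reliability={rel:.4f} resolution={res:.4f} uncertainty={unc:.4f}")
print(f"rel - res + unc = {rel - res + unc:.4f}, direct BS = {bs:.4f}")
```

If forecasts are continuous and bins merge different values of p, the three terms no longer sum exactly to the Brier score, so any binned decomposition should be read as an approximation.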

Brier Skill Score (BSS):

$$BSS = 1 - \frac{BS_{model}}{BS_{climatology}}$$

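Where climatology = predicting the mean outcome rate for all matches. BSS > 0 means the model beats climatology, BSS = 0 means it matches it, and BSS = 1 is a perfect score.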

Worked Example

Predictions: [0.60, 0.70, 0.30, 0.80, 0.50]
Outcomes: [1, 1, 0, 1, 0]

$$BS = \frac{(0.6-1)^2 + (0.7-1)^2 + (0.3-0)^2 + (0.8-1)^2 + (0.5-0)^2}{5} = \frac{0.16+0.09+0.09+0.04+0.25}{5} = 0.126$$

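Interpretation: A Brier score of 0.126 is reasonable for binary predictions; for reference, always predicting 50% scores exactly 0.25. As rough rules of thumb, <0.20 is typical for sports prediction, <0.15 is good, and <0.10 is excellent, but these thresholds depend on the base rate and on how predictable the events are, so the raw score is best read against a baseline.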
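The same five predictions can be carried through to the Brier Skill Score. As a worked continuation, assume the climatology forecast is the sample outcome rate, $\bar{o} = 3/5 = 0.6$, reported for every prediction:

$$BS_{climatology} = \frac{3(0.6-1)^2 + 2(0.6-0)^2}{5} = \frac{0.48 + 0.72}{5} = 0.24$$

$$BSS = 1 - \frac{0.126}{0.24} = 0.475$$

A BSS of 0.475 means the predictions remove a little under half of the squared error of the climatology baseline. With only five predictions, and a baseline computed from the same outcomes, this is an illustration of the arithmetic rather than evidence of skill.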

Code Snippet

import numpy as np

def brier_score(predicted_probs, actual_outcomes):
    """Calculate Brier score (mean squared error) for binary predictions."""
    # Convert to float arrays so plain Python lists work as well as NumPy arrays
    predicted_probs = np.asarray(predicted_probs, dtype=float)
    actual_outcomes = np.asarray(actual_outcomes, dtype=float)
    return np.mean((predicted_probs - actual_outcomes) ** 2)

def brier_skill_score(predicted_probs, actual_outcomes):
    """BSS vs. climatology baseline (always predicting the mean outcome rate)."""
    actual_outcomes = np.asarray(actual_outcomes, dtype=float)
    bs_model = brier_score(predicted_probs, actual_outcomes)
    climatology = np.full_like(actual_outcomes, np.mean(actual_outcomes))
    bs_baseline = brier_score(climatology, actual_outcomes)
    # The baseline is 0 only when all outcomes are identical; BSS is undefined there, so return 0
    return 1 - (bs_model / bs_baseline) if bs_baseline > 0 else 0.0

# Example
predictions = np.array([0.60, 0.70, 0.30, 0.80, 0.50])
outcomes = np.array([1, 1, 0, 1, 0])
bs = brier_score(predictions, outcomes)
bss = brier_skill_score(predictions, outcomes)
print(f"Brier Score: {bs:.4f}")  # 0.1260
print(f"Brier Skill Score: {bss:.4f}")  # 0.4750; positive = better than climatology

Pitfalls

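Less discriminating for extreme probabilities: The Brier penalty for a single prediction is capped at 1, while log-loss grows without bound, so Brier score punishes confident wrong predictions (for example, 99% on an event that does not happen) far less severely than log-loss.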
Small samples are noisy: With only 64 World Cup matches, Brier score per tournament has wide confidence intervals. Aggregate across tournaments.
Does not account for odds: A well-calibrated model betting at poor odds still loses money. Brier score should be supplemented with closing line value (CLV) and expected value (EV) metrics.
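Calibration vs. refinement tradeoff: A model can be well-calibrated but have no predictive value, for example one that predicts the base rate (such as 50%) for everything. Its Brier score then equals the uncertainty term and its BSS is 0.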
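The first pitfall, the contrast with log-loss, can be illustrated with per-prediction penalties. This is a minimal sketch: `brier_penalty` and `log_loss_penalty` are hypothetical helpers, and log-loss is taken here as the negative natural-log likelihood of the observed outcome.

```python
import math

def brier_penalty(prob, outcome):
    """Squared error for a single binary prediction; never exceeds 1."""
    return (prob - outcome) ** 2

def log_loss_penalty(prob, outcome):
    """Negative log-likelihood for a single binary prediction (natural log)."""
    return -math.log(prob if outcome == 1 else 1.0 - prob)

# Increasingly confident predictions for an event that does NOT occur (outcome = 0)
for prob in (0.90, 0.99, 0.999):
    print(
        f"p={prob}: Brier={brier_penalty(prob, 0):.4f}, "
        f"log-loss={log_loss_penalty(prob, 0):.4f}"
    )
```

The Brier penalty saturates just below 1 as the wrong forecast becomes more extreme, while the log-loss penalty keeps growing, which is why the two metrics can rank overconfident models differently.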