Calibration Plots

Overview

Calibration plots (also called reliability diagrams or calibration curves) are visual tools for checking whether a model's predicted probabilities match actual outcome frequencies. If a perfectly calibrated model predicts a 60% win probability, close to 60% of those predictions should win over a large sample.

The plot shows bins of predicted probability on the x-axis and actual outcome frequency on the y-axis, with a diagonal reference line marking perfect calibration. Points above the diagonal mean the model is underestimating the probability (conservative); points below mean it is overestimating the probability (overconfident).

Calibration is fundamental to sports betting because the model outputs probabilities that are compared to de-vigged bookmaker odds to find +EV bets. If the model is poorly calibrated, EV calculations will be unreliable.
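
To make the link between calibration and EV concrete, here is a minimal sketch of the comparison described above. It assumes a two-outcome market quoted in decimal odds and removes the vig by proportional normalization, which is only one of several de-vigging methods. The function names and the example odds are illustrative, not taken from a real market.

```python
def devig_two_way(odds_a, odds_b):
    """Return vig-free probabilities for a two-outcome market (decimal odds).

    Assumption: the bookmaker margin is removed by proportional normalization.
    """
    raw_a, raw_b = 1.0 / odds_a, 1.0 / odds_b
    total = raw_a + raw_b  # exceeds 1.0 by the bookmaker margin
    return raw_a / total, raw_b / total


def expected_value(model_prob, decimal_odds, stake=1.0):
    """Expected profit of a bet at the offered odds, given the model probability."""
    win_profit = stake * (decimal_odds - 1.0)
    return model_prob * win_profit - (1.0 - model_prob) * stake


# Hypothetical market: both sides priced at 1.91 (an even matchup plus margin).
fair_a, fair_b = devig_two_way(1.91, 1.91)
ev_if_model_says_55 = expected_value(0.55, 1.91)
ev_if_truth_is_50 = expected_value(0.50, 1.91)
print(f"fair probability: {fair_a:.3f}")
print(f"EV at model p=0.55: {ev_if_model_says_55:+.4f}")
print(f"EV if the true rate is 0.50: {ev_if_truth_is_50:+.4f}")
```

If the 0.55 is an overconfident estimate of an event that actually occurs half the time, the apparent edge turns into a loss, which is the sense in which poor calibration makes EV estimates unreliable.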

Why It Matters

Calibration matters because:
1. EV calculations require calibrated probabilities: Uncalibrated probabilities produce misleading EV estimates.
2. Identifies systematic bias: A model that's consistently overconfident on favorites can be corrected.
3. Complements Brier score: A low (good) Brier score can come from refinement (good at ranking) and still hide poor calibration.
4. Post-hoc correction available: Platt scaling and isotonic regression can fix calibration, although they do not improve discrimination.
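
As a sketch of the post-hoc correction in point 4, the example below fits isotonic regression on one split and applies it to another, using scikit-learn's `IsotonicRegression`. The synthetic data, the distortion used to mimic overconfidence, and the helper name `ece_score` are all assumptions made for illustration. The helper computes the ECE defined under Key Formula.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic setup (assumption): true win probabilities, with the model's
# reported probabilities pushed toward 0 and 1 to mimic overconfidence.
true_p = rng.uniform(0.05, 0.95, size=4000)
outcomes = (rng.uniform(size=true_p.size) < true_p).astype(float)
logit = np.log(true_p / (1 - true_p))
raw_probs = 1 / (1 + np.exp(-2.0 * logit))  # doubled logit = overconfident

# Fit the calibrator on one half, evaluate on the other half.
fit_idx, eval_idx = np.arange(0, 2000), np.arange(2000, 4000)
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_probs[fit_idx], outcomes[fit_idx])
calibrated = iso.predict(raw_probs[eval_idx])


def ece_score(probs, y, n_bins=10):
    """Equal-width-bin ECE: count-weighted mean |win rate - mean prediction|."""
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            total += mask.mean() * abs(y[mask].mean() - probs[mask].mean())
    return total


ece_before = ece_score(raw_probs[eval_idx], outcomes[eval_idx])
ece_after = ece_score(calibrated, outcomes[eval_idx])
print(f"ECE before: {ece_before:.4f}, after: {ece_after:.4f}")
```

Isotonic regression only assumes the mapping is monotonic, so it is more flexible than Platt scaling but tends to need more data; on samples as small as a single tournament it can overfit.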

Key Formula

Expected Calibration Error (ECE):

$$ECE = \sum_{b=1}^{B} \frac{n_b}{N} \cdot |acc_b - conf_b|$$

Where N = total number of predictions, B = number of bins, n_b = number of predictions in bin b, acc_b = actual win rate in bin b, conf_b = average predicted probability in bin b.

Maximum Calibration Error (MCE): Maximum |acc_b − conf_b| across bins.

Worked Example

Predictions bucketed into 5 bins:

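| Bin | Range   | Count | Avg Pred | Actual Win Rate | Gap (Actual − Pred) |
|-----|---------|-------|----------|-----------------|---------------------|
| 1   | 0.0–0.2 | 20    | 0.15     | 0.18            | +0.03               |
| 2   | 0.2–0.4 | 35    | 0.30     | 0.29            | −0.01               |
| 3   | 0.4–0.6 | 50    | 0.50     | 0.48            | −0.02               |
| 4   | 0.6–0.8 | 40    | 0.70     | 0.65            | −0.05               |
| 5   | 0.8–1.0 | 15    | 0.88     | 0.80            | −0.08               |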

With N = 20 + 35 + 50 + 40 + 15 = 160:

ECE = (20/160)×0.03 + (35/160)×0.01 + (50/160)×0.02 + (40/160)×0.05 + (15/160)×0.08 = 5.15/160 ≈ 0.032

MCE = 0.08 (bin 5).

The model is overconfident at high probabilities (bins 4 and 5).
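
The arithmetic above can be checked in a few lines. This sketch takes the table's bin summaries as given (the variable names are illustrative) and applies the ECE and MCE definitions from the Key Formula section directly.

```python
# Bin summaries copied from the worked example: (count, avg_pred, actual_win_rate)
bins = [
    (20, 0.15, 0.18),
    (35, 0.30, 0.29),
    (50, 0.50, 0.48),
    (40, 0.70, 0.65),
    (15, 0.88, 0.80),
]

total = sum(count for count, _, _ in bins)
gaps = [actual - pred for _, pred, actual in bins]

# ECE: count-weighted mean absolute gap. MCE: largest absolute gap.
ece = sum(count / total * abs(gap) for (count, _, _), gap in zip(bins, gaps))
mce = max(abs(gap) for gap in gaps)

print(f"N = {total}")
print(f"ECE = {ece:.4f}")
print(f"MCE = {mce:.2f}")
```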

Code Snippet

import numpy as np
import matplotlib.pyplot as plt

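def calibration_plot(predicted_probs, actual_outcomes, n_bins=10):
    """Generate calibration plot and ECE."""
    # Accept lists as well as arrays; boolean masking below needs NumPy arrays.
    predicted_probs = np.asarray(predicted_probs, dtype=float)
    actual_outcomes = np.asarray(actual_outcomes, dtype=float)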
    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    bin_counts = np.zeros(n_bins)
    bin_accuracy = np.zeros(n_bins)
    bin_confidence = np.zeros(n_bins)

    for i in range(n_bins):
        # Bins are half-open [lo, hi); the last bin is closed so p == 1.0 is counted.
        if i == n_bins - 1:
            mask = (predicted_probs >= bin_edges[i]) & (predicted_probs <= bin_edges[i+1])
        else:
            mask = (predicted_probs >= bin_edges[i]) & (predicted_probs < bin_edges[i+1])
        bin_counts[i] = mask.sum()
        if bin_counts[i] > 0:
            bin_accuracy[i] = actual_outcomes[mask].mean()
            bin_confidence[i] = predicted_probs[mask].mean()

    ece = np.sum((bin_counts / len(predicted_probs)) * np.abs(bin_accuracy - bin_confidence))

    fig, ax = plt.subplots(figsize=(8, 6))
    ax.plot([0, 1], [0, 1], 'k--', label='Perfect calibration', linewidth=2)
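    # Plot only non-empty bins, at their mean predicted probability (matches the x-axis label).
    nonempty = bin_counts > 0
    ax.scatter(bin_confidence[nonempty], bin_accuracy[nonempty],
               s=bin_counts[nonempty] * 5, alpha=0.6, c='steelblue', label='Model')
    ax.set_xlabel('Mean Predicted Probability')
    ax.set_ylabel('Fraction of Positives (Actual Win Rate)')
    ax.set_title(f'Calibration Plot (ECE = {ece:.4f})')
    ax.legend()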
    return {'ece': ece, 'figure': fig}

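def platt_calibration(probs, outcomes):
    """Platt scaling: fit logistic regression to calibrate probabilities.

    Note: this fits and predicts on the same data. For an honest estimate,
    fit the calibrator on a held-out set and apply it to new predictions.
    """
    from sklearn.linear_model import LogisticRegression
    # Clip away exact 0 and 1 so the logit stays finite.
    p = np.clip(np.asarray(probs, dtype=float), 1e-5, 1 - 1e-5)
    logit = np.log(p / (1 - p)).reshape(-1, 1)
    # The default LogisticRegression applies L2 regularization (C=1.0).
    calibrator = LogisticRegression()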
    calibrator.fit(logit, outcomes)
    return calibrator.predict_proba(logit)[:, 1]
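
For a cross-check on hand-rolled binning, scikit-learn provides `calibration_curve`, which returns the per-bin observed frequency and mean prediction for non-empty bins. The sketch below uses a tiny made-up dataset. Note that scikit-learn's handling of bin edges may differ from `calibration_plot` above for predictions that fall exactly on an edge.

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Made-up predictions and outcomes, chosen to avoid bin edges.
y_prob = np.array([0.1, 0.1, 0.3, 0.3, 0.7, 0.7, 0.9, 0.9])
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 1])

# strategy="uniform" gives equal-width bins on [0, 1]; empty bins are dropped.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=5, strategy="uniform")

for pred, obs in zip(prob_pred, prob_true):
    print(f"mean predicted {pred:.2f} -> observed {obs:.2f}")
```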

Pitfalls

  • Small samples distort calibration: With only ~64 World Cup matches, use fewer, wider bins (5–10 bins) and expect wide confidence intervals on each bin's win rate.
  • Calibration is necessary but not sufficient: A model can be perfectly calibrated but have no predictive value (for example, one that predicts 50% for everything when the base rate is 50%).
  • Overconfidence is the most common issue: Neural networks and boosted trees tend to be overconfident; Platt scaling or isotonic regression can often reduce this.
  • Combine with Brier score: Calibration plots are visual; Brier score is quantitative. Use both.
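
To illustrate the small-sample pitfall, the sketch below computes a Wilson score interval for the observed win rate in a single bin. The choice of the Wilson interval and the example counts (a bin holding 12 predictions, 8 of them wins) are assumptions for illustration.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        raise ValueError("bin is empty")
    p_hat = wins / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical bin: 12 predictions, 8 wins.
low, high = wilson_interval(8, 12)
print(f"observed {8 / 12:.2f}, interval [{low:.2f}, {high:.2f}]")
```

With 12 predictions the interval spans nearly half the probability scale, so a bin sitting a few points off the diagonal is not evidence of miscalibration by itself.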

See Also