Calibration Plots
Overview
Calibration plots (also called reliability diagrams or calibration curves) are visual tools for checking whether a model's predicted probabilities match actual outcome frequencies. If a perfectly calibrated model assigns a 60% win probability to a set of matches, about 60% of those matches should end in a win over a large sample.
The plot shows the mean predicted probability of each bin on the x-axis and the actual outcome frequency on the y-axis, with a diagonal reference line marking perfect calibration. Points above the diagonal mean the model underestimates the probability (it is underconfident); points below mean it overestimates the probability (it is overconfident).
Calibration is fundamental to sports betting because the model outputs probabilities that are compared to de-vigged bookmaker odds to find +EV bets. If the model is poorly calibrated, EV calculations will be unreliable.
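To make the link between calibration and EV concrete, here is a minimal sketch. It assumes proportional de-vigging (normalizing implied probabilities to sum to 1), which is only one of several de-vigging methods, and the odds and model probabilities are invented for illustration. The function names `devig_proportional` and `expected_value` are hypothetical.

```python
def devig_proportional(decimal_odds):
    """Strip the bookmaker margin by normalizing implied probabilities to sum to 1."""
    implied = [1.0 / o for o in decimal_odds]
    overround = sum(implied)  # greater than 1 when the book carries a margin
    return [p / overround for p in implied]


def expected_value(win_prob, decimal_odds, stake=1.0):
    """Expected profit of a bet if win_prob is the true win probability."""
    win_profit = stake * (decimal_odds - 1.0)
    return win_prob * win_profit - (1.0 - win_prob) * stake


# Hypothetical two-way market priced at decimal odds 1.80 / 2.10.
odds = [1.80, 2.10]
fair = devig_proportional(odds)

# Hypothetical model output for the first outcome.
model_prob = 0.60
ev = expected_value(model_prob, odds[0])
edge = model_prob - fair[0]

# Same bet, but assuming the model is overconfident and the true rate is 0.55.
ev_if_overconfident = expected_value(0.55, odds[0])
```

With these made-up numbers the bet looks like +0.08 per unit staked at the model's 0.60, but is about -0.01 if the true win rate is 0.55: a five-point calibration error is enough to flip the sign of the EV.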
Why It Matters
Calibration matters because:
1. EV calculations require calibrated probabilities: Uncalibrated probabilities produce misleading EV estimates.
2. Identifies systematic bias: A model that's consistently overconfident on favorites can be corrected.
3. Complements Brier score: The Brier score mixes calibration and refinement (how well the model separates wins from losses), so a low score driven by good refinement can hide poor calibration.
4. Post-hoc correction available: Platt scaling and isotonic regression can improve calibration without changing how the model ranks outcomes (its discrimination).
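Point 3 can be illustrated with the standard decomposition of the Brier score into reliability (calibration error), resolution and uncertainty, where Brier = reliability − resolution + uncertainty. The sketch below groups predictions by their distinct forecast value, which is the case where the identity holds exactly. The data and the function name `brier_decomposition` are made up for illustration.

```python
from collections import defaultdict


def brier_decomposition(probs, outcomes):
    """Return (brier, reliability, resolution, uncertainty), grouping by forecast value."""
    n = len(probs)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for p, y in zip(probs, outcomes):
        groups[p].append(y)

    # Reliability: weighted squared gap between forecast and observed frequency.
    reliability = sum(len(ys) * (p - sum(ys) / len(ys)) ** 2 for p, ys in groups.items()) / n
    # Resolution: how far the observed frequencies move away from the base rate.
    resolution = sum(len(ys) * (sum(ys) / len(ys) - base_rate) ** 2 for ys in groups.values()) / n
    uncertainty = base_rate * (1 - base_rate)
    brier = sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / n
    return brier, reliability, resolution, uncertainty


# Made-up data: forecasts of 0.9 win 8 of 10 times, forecasts of 0.1 win 2 of 10 times.
sharp_probs = [0.9] * 10 + [0.1] * 10
outcomes = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
# A forecaster that always predicts the base rate: calibrated but uninformative.
flat_probs = [0.5] * 20

sharp = brier_decomposition(sharp_probs, outcomes)
flat = brier_decomposition(flat_probs, outcomes)
```

Here the sharp forecaster has the lower Brier score (0.17 against 0.25) even though its reliability term is nonzero, while the flat forecaster is perfectly calibrated with zero resolution. The Brier score alone does not reveal which situation applies, which is why the calibration plot is worth checking separately.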
Key Formula
Expected Calibration Error (ECE):
$$ECE = \sum_{b=1}^{B} \frac{n_b}{N} \cdot |acc_b - conf_b|$$
Where B = number of bins, N = total number of predictions, n_b = number of predictions in bin b, acc_b = actual win rate in bin b, conf_b = average predicted probability in bin b.
Maximum Calibration Error (MCE): Maximum |acc_b − conf_b| across bins.
Worked Example
Predictions bucketed into 5 bins:
| Bin | Range | Count | Avg Pred | Actual Win Rate | Gap |
|---|---|---|---|---|---|
| 1 | 0.0–0.2 | 20 | 0.15 | 0.18 | +0.03 |
| 2 | 0.2–0.4 | 35 | 0.30 | 0.29 | −0.01 |
| 3 | 0.4–0.6 | 50 | 0.50 | 0.48 | −0.02 |
| 4 | 0.6–0.8 | 40 | 0.70 | 0.65 | −0.05 |
| 5 | 0.8–1.0 | 15 | 0.88 | 0.80 | −0.08 |
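With N = 20 + 35 + 50 + 40 + 15 = 160 predictions:
ECE = (20/160)×0.03 + (35/160)×0.01 + (50/160)×0.02 + (40/160)×0.05 + (15/160)×0.08 = 5.15/160 ≈ 0.032
MCE = 0.08 (bin 5).
The model is overconfident at high probabilities (bins 4 and 5), where the actual win rate falls 5 to 8 points below the average prediction.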
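Recomputing the summary statistics directly from the table rows is a quick sanity check on the arithmetic. The short sketch below hard-codes the five rows and uses only the ECE and MCE definitions from the Key Formula section.

```python
# (count, average predicted probability, actual win rate) for each row of the table above.
bins = [
    (20, 0.15, 0.18),
    (35, 0.30, 0.29),
    (50, 0.50, 0.48),
    (40, 0.70, 0.65),
    (15, 0.88, 0.80),
]

total = sum(count for count, _, _ in bins)
gaps = [abs(acc - conf) for _, conf, acc in bins]

# ECE weights each bin's gap by its share of predictions; MCE takes the worst bin.
ece = sum(count / total * gap for (count, _, _), gap in zip(bins, gaps))
mce = max(gaps)

print(f"N = {total}, ECE = {ece:.4f}, MCE = {mce:.2f}")
# prints: N = 160, ECE = 0.0322, MCE = 0.08
```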
Code Snippet
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression


def calibration_plot(predicted_probs, actual_outcomes, n_bins=10):
    """Generate a calibration plot and compute ECE."""
    predicted_probs = np.asarray(predicted_probs, dtype=float)
    actual_outcomes = np.asarray(actual_outcomes, dtype=float)

    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_counts = np.zeros(n_bins)
    bin_accuracy = np.zeros(n_bins)
    bin_confidence = np.zeros(n_bins)

    for i in range(n_bins):
        lower, upper = bin_edges[i], bin_edges[i + 1]
        if i == n_bins - 1:
            # Close the last bin on the right so predictions of exactly 1.0 are counted.
            mask = (predicted_probs >= lower) & (predicted_probs <= upper)
        else:
            mask = (predicted_probs >= lower) & (predicted_probs < upper)
        bin_counts[i] = mask.sum()
        if bin_counts[i] > 0:
            bin_accuracy[i] = actual_outcomes[mask].mean()
            bin_confidence[i] = predicted_probs[mask].mean()

    # Empty bins have zero weight, so they do not contribute to ECE.
    ece = np.sum((bin_counts / len(predicted_probs)) * np.abs(bin_accuracy - bin_confidence))

    # Plot only non-empty bins, at their mean predicted probability (matches the x-axis label).
    nonempty = bin_counts > 0
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.plot([0, 1], [0, 1], 'k--', label='Perfect calibration', linewidth=2)
    ax.scatter(bin_confidence[nonempty], bin_accuracy[nonempty],
               s=bin_counts[nonempty] * 5, alpha=0.6, c='steelblue', label='Model')
    ax.set_xlabel('Mean Predicted Probability')
    ax.set_ylabel('Fraction of Positives (Actual Win Rate)')
    ax.set_title(f'Calibration Plot (ECE = {ece:.4f})')
    ax.legend()
    return {'ece': ece, 'figure': fig}


def platt_calibration(probs, outcomes):
    """Platt scaling: fit a logistic regression on the logit of the raw probabilities."""
    # Clip to avoid infinite logits at exactly 0 or 1.
    p = np.clip(probs, 1e-5, 1 - 1e-5)
    logit = np.log(p / (1 - p)).reshape(-1, 1)
    # Note: scikit-learn's default applies L2 regularization (C=1.0).
    calibrator = LogisticRegression()
    calibrator.fit(logit, outcomes)
    # These are in-sample calibrated probabilities; fit on held-out data
    # when the calibrator will be applied to new predictions.
    return calibrator.predict_proba(logit)[:, 1]
```
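Isotonic regression, the other post-hoc method mentioned on this page, can be sketched without dependencies using the pool-adjacent-violators algorithm. This is an illustrative sketch, not a replacement for a library implementation such as scikit-learn's `IsotonicRegression`; the function names `isotonic_fit` and `isotonic_predict` and the sample data are hypothetical.

```python
from itertools import groupby


def isotonic_fit(probs, outcomes):
    """Pool-adjacent-violators: return (thresholds, values) of a non-decreasing step function."""
    pairs = sorted(zip(probs, outcomes))
    # Each block holds [sum of outcomes, count, largest raw probability in the block].
    blocks = []
    for p, group in groupby(pairs, key=lambda pair: pair[0]):
        ys = [y for _, y in group]  # tied raw probabilities are pooled up front
        blocks.append([float(sum(ys)), len(ys), p])
        # Merge backwards while adjacent block means violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    thresholds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return thresholds, values


def isotonic_predict(thresholds, values, p):
    """Map a raw probability to the calibrated value of the block it falls in."""
    for upper, value in zip(thresholds, values):
        if p <= upper:
            return value
    return values[-1]


# Made-up calibration set: raw model probabilities and observed outcomes.
raw = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
won = [0, 0, 1, 0, 0, 1, 1, 1]
thresholds, values = isotonic_fit(raw, won)
calibrated_mid = isotonic_predict(thresholds, values, 0.5)
```

On this toy data the predictions between 0.3 and 0.6 are pooled into one block with a calibrated value of 1/3. The fitted values of exactly 0 and 1 at the extremes show why isotonic regression needs a reasonably large calibration set: with few samples it overfits to the observed outcomes.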
Pitfalls
- Small samples distort calibration: With only ~64 World Cup matches, use fewer, wider bins (5–10) and expect wide confidence intervals around each bin's win rate.
- Calibration is necessary but not sufficient: A model can be perfectly calibrated yet have no predictive value, for example by predicting the overall base rate for every match.
- Overconfidence is the most common issue: Neural networks and boosted trees are often overconfident; Platt scaling or isotonic regression, fitted on held-out data, can correct this.
- Combine with Brier score: Calibration plots are visual; Brier score is quantitative. Use both.
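To put a number on the small-sample pitfall, a confidence interval can be attached to each bin's observed win rate. The sketch below uses the Wilson score interval, which is one common choice for binomial proportions and is an assumption here rather than something this page prescribes. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95% coverage)."""
    if n == 0:
        return 0.0, 1.0
    p_hat = wins / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Bin 5 of the worked example: 15 predictions with a 0.80 actual win rate, i.e. 12 wins.
low, high = wilson_interval(12, 15)
avg_pred = 0.88
inside = low <= avg_pred <= high
```

For bin 5 the interval runs from roughly 0.55 to 0.93, which contains the average prediction of 0.88. With only 15 predictions, the 0.08 gap in that bin is suggestive but not strong evidence of overconfidence on its own.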
See Also
- brier-score: quantitative metric that combines calibration and refinement
- log-loss-cross-entropy: log-loss training objective for calibration
- bayesian-inference-sports: Bayesian models yield full predictive probabilities, though their calibration still needs checking
- walk-forward-validation: calibration should be checked across walk-forward periods