scikit-learn Probability Calibration Documentation

Summary

scikit-learn's official documentation on probability calibration is the authoritative reference for implementing calibration plots and post-hoc calibration methods in Python. It covers: (1) the calibration_curve function for computing calibration data, (2) the CalibrationDisplay class for plotting, and (3) CalibratedClassifierCV for post-hoc calibration via Platt scaling (sigmoid) and isotonic regression.

The documentation demonstrates that many classifiers (SVM, naive Bayes, boosted trees) produce poorly calibrated probabilities out of the box, and that post-hoc calibration can largely correct this without materially changing the model's ranking performance. This is directly relevant to sports prediction models, where neural networks and gradient boosting often produce miscalibrated predictions.

Key Functions

  • calibration_curve(y_true, y_prob, n_bins=10): Bins the predictions and returns (prob_true, prob_pred): for each bin, the observed fraction of positives and the mean predicted probability. n_bins defaults to 5 if not passed, and empty bins are dropped, so the arrays can be shorter than n_bins.
  • CalibrationDisplay.from_estimator(model, X, y): Plots a calibration curve directly from a fitted model.
  • CalibratedClassifierCV(model, method='sigmoid'): Post-hoc calibration using Platt scaling (sigmoid, the default) or isotonic regression.
  • method='sigmoid' (Platt scaling): Fits a two-parameter logistic curve to the model's outputs. Suited to small calibration sets and to miscalibration that is roughly sigmoid-shaped.
  • method='isotonic': Non-parametric calibration. Better for large calibration sets but can overfit with small samples.
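
The functions above can be combined as in the following minimal sketch. The synthetic dataset, the GaussianNB base model, and the cv=5 setting are assumptions chosen for illustration; they are not taken from the source. The sketch prints the Brier score of the raw model and of each calibrated variant, which may or may not improve on other data.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data (an assumption for illustration). The redundant features
# violate naive Bayes' independence assumption.
X, y = make_classification(
    n_samples=4000, n_features=20, n_informative=2, n_redundant=10,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

brier = {}

# Baseline: the uncalibrated model's predicted probabilities.
raw = GaussianNB().fit(X_train, y_train)
brier["uncalibrated"] = brier_score_loss(y_test, raw.predict_proba(X_test)[:, 1])

# Post-hoc calibration with each method, using 5-fold internal cross-validation.
for method in ("sigmoid", "isotonic"):
    calibrated = CalibratedClassifierCV(GaussianNB(), method=method, cv=5)
    calibrated.fit(X_train, y_train)
    brier[method] = brier_score_loss(
        y_test, calibrated.predict_proba(X_test)[:, 1]
    )

for name, score in brier.items():
    print(f"{name:>12}: Brier = {score:.4f}")
```

A lower Brier score is better. Comparing the three numbers on a held-out set is one simple way to decide whether calibration is worth applying at all.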

Key Concepts

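  • Why classifiers are miscalibrated: SVMs and boosted trees are trained to separate the classes well, not to output well-calibrated probabilities. Their scores are typically distorted in a sigmoid-shaped way (predictions pulled away from 0 and 1, i.e. underconfident), whereas naive Bayes tends to push probabilities toward 0 or 1 (overconfident).
  • Platt scaling (sigmoid): Fits Pr(y=1 | f(x)) = 1 / (1 + exp(-(a*f(x) + b))), where f(x) is the model's raw score and a, b are learned on calibration data. Works well when miscalibration is approximately sigmoid-shaped.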
  • Isotonic regression: Non-parametric monotone calibration. More flexible but requires more data and can overfit.
  • Cross-validation for calibration: CalibratedClassifierCV uses internal cross-validation so that each calibrator is fitted on data its base model was not trained on, which avoids overfitting the calibration.
  • Calibration is not discrimination improvement: Post-hoc calibration improves calibration without improving discrimination (ranking). A model that can't discriminate will still have a poor Brier score after calibration.
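
The Platt scaling formula and the "calibration does not change ranking" point can be checked by hand. The sketch below uses only the standard library; the raw scores and the parameters a and b are hypothetical values picked for illustration, not fitted to any data.

```python
import math


def platt(score, a, b):
    """Map a raw score f(x) to Pr(y=1 | f(x)) = 1 / (1 + exp(-(a*f(x) + b)))."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


# Hypothetical raw scores and parameters, chosen only for illustration.
scores = [-2.0, -0.5, 0.0, 0.7, 3.0]
a, b = 1.5, -0.3

calibrated = [platt(s, a, b) for s in scores]

# With a > 0 the map is strictly increasing, so the ordering of the examples
# (and therefore any ranking metric such as AUC) is unchanged.
rank_before = sorted(range(len(scores)), key=scores.__getitem__)
rank_after = sorted(range(len(calibrated)), key=calibrated.__getitem__)
print(rank_before == rank_after)
```

Only the probability values move; the order of the examples does not. That is why a model with poor discrimination cannot be rescued by this step.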

Code Example

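import matplotlib.pyplot as plt
from sklearn.calibration import CalibrationDisplay, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; substitute your own features and labels)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the model whose calibration is being checked
gradient_boosting_model = GradientBoostingClassifier(random_state=0)
gradient_boosting_model.fit(X_train, y_train)
y_prob = gradient_boosting_model.predict_proba(X_test)[:, 1]

# Compute calibration curve on held-out data
prob_true, prob_pred = calibration_curve(
    y_test, y_prob, n_bins=10, strategy='uniform'
)

# Plot manually
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
ax.plot(prob_pred, prob_true, marker='o', label='Model')
ax.set_xlabel('Mean Predicted Probability')
ax.set_ylabel('Fraction of Positives')
ax.legend()

# Or use CalibrationDisplay for cleaner plotting (draws on a new figure)
display = CalibrationDisplay.from_estimator(
    gradient_boosting_model, X_test, y_test, n_bins=10
)
display.ax_.set_xlabel('Mean Predicted Probability')
display.ax_.set_ylabel('Fraction of Positives')
plt.show()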

Notes

  • This is the official scikit-learn reference for calibration. The existing calibration-plots.md note covers the concept and provides custom Python code; this source adds the sklearn API and post-hoc calibration methods
  • Key insight: Platt scaling and isotonic regression are standard post-hoc calibration methods that don't change the model's discrimination ability; they only fix calibration
  • For sports prediction: if the model is well-ranked but poorly calibrated, post-hoc calibration is a valid fix before computing EV
  • The cross-validation within CalibratedClassifierCV is important: it prevents the calibration itself from overfitting
  • For a World Cup with only 64 matches: isotonic regression is risky (can overfit with small samples); Platt scaling (sigmoid) is more appropriate
  • The sklearn documentation also covers when calibration helps vs. hurts: calibration helps most when the model is well-ranked but miscalibrated, which is exactly the case for many sports prediction models
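
The small-sample concern can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the 64 "matches" are simulated rather than real, and LogisticRegression (which applies L2 regularization by default) stands in as an approximation of Platt scaling, not as sklearn's internal sigmoid calibrator.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in for 64 matches: raw model scores and binary outcomes,
# simulated so that higher scores win more often. Not real data.
n_matches = 64
scores = rng.normal(size=n_matches)
outcomes = (rng.random(n_matches) < 1 / (1 + np.exp(-2 * scores))).astype(int)

# Isotonic regression: a monotone step function fitted to the 64 points.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso_probs = iso.fit_transform(scores, outcomes)

# Approximate Platt scaling: a two-parameter logistic fit on the same scores.
platt = LogisticRegression().fit(scores.reshape(-1, 1), outcomes)
platt_probs = platt.predict_proba(scores.reshape(-1, 1))[:, 1]

# Isotonic output collapses into a handful of plateaus whose heights are set
# by only a few samples each; the sigmoid fit stays smooth.
print("distinct isotonic levels:", np.unique(iso_probs).size)
print("distinct sigmoid levels: ", np.unique(platt_probs).size)
```

With so few points, each isotonic plateau is an average over a small group of matches, so the fitted step heights depend heavily on which matches happen to be in the sample. The two-parameter sigmoid has far less freedom to chase that noise.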