Brier (1950) — Original Paper Reference

Summary

Glenn W. Brier's 1950 paper "Verification of a Forecast Expressed in Terms of Probability" (Monthly Weather Review, Vol. 78, No. 1, pp. 1–3) introduced the Brier score as a proper scoring rule for evaluating probabilistic forecasts. Brier proposed scoring forecasts by the squared difference between the predicted probability and the actual outcome, so lower scores are better. He showed that the expected score is minimized only when the predicted probability equals the true probability, which is what makes it a "proper" scoring rule.

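The original Brier score ranged from 0 to 2 (double the modern binary range of 0 to 1). Brier's formula already averaged over the N forecast occasions, but within each occasion it summed the squared error (predicted - outcome)² over every outcome class. For a yes/no event that counts both the "event" class and the "no event" class, and the two squared errors are equal, so the original score is exactly twice the modern one, which scores only the event probability.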
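
The factor of 2 can be checked numerically. The sketch below assumes the original score sums squared errors over every outcome class before averaging over forecast occasions, and compares it with the modern binary form. The function names and the sample forecasts are illustrative and do not come from the paper.

```python
from typing import Sequence


def brier_original(forecasts: Sequence[Sequence[float]], outcomes: Sequence[int]) -> float:
    """Multi-class form: squared error summed over all classes, averaged over occasions.

    forecasts[i][j] is the probability given to class j on occasion i;
    outcomes[i] is the index of the class that actually occurred.
    """
    total = 0.0
    for probs, observed in zip(forecasts, outcomes):
        for j, p in enumerate(probs):
            e = 1.0 if j == observed else 0.0  # indicator: did class j occur?
            total += (p - e) ** 2
    return total / len(outcomes)


def brier_binary(p_event: Sequence[float], occurred: Sequence[int]) -> float:
    """Modern binary form: mean squared error on the event probability only."""
    return sum((p - o) ** 2 for p, o in zip(p_event, occurred)) / len(occurred)


# Made-up rain forecasts (probability of rain) and what happened (1 = rain).
p_rain = [0.9, 0.2, 0.7, 0.0]
rained = [1, 0, 0, 1]

# Same forecasts written as two-class distributions: class 0 = no rain, class 1 = rain.
two_class = [[1.0 - p, p] for p in p_rain]

original = brier_original(two_class, rained)
modern = brier_binary(p_rain, rained)
print(original, modern)
```

On this sample the original form comes out at about 0.77 and the modern form at about 0.385. A forecast of certainty on the wrong class scores 2 in the original form and 1 in the modern one.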

Key Concepts

  • Proper scoring rule: A scoring rule where the expected score is minimized only when the predicted probability exactly matches the true probability. No hedging or over/under-confidence can improve the expected score.
  • Verification domain: Brier originally applied the score to weather forecasting. The formulation assumes the R classes (outcome categories) are mutually exclusive and exhaustive, so exactly one class occurs on each occasion
  • 0-2 original range: Brier's original formulation had range 0-2 (not 0-1). For a binary event, the modern score is the original score divided by 2.
  • Decomposition: Brier's original paper didn't include the decomposition, which was added later by Murphy (1972) and others
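  • Relation to information theory: The Brier score is related to cross-entropy (log loss). Both are strictly proper scoring rules that penalize the gap between the predicted distribution and the observed outcome. The Brier score uses squared error and is bounded; cross-entropy uses the logarithm of the predicted probability and is unbounded when an outcome that was given probability 0 occurs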
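
The properness claim in the list above can be illustrated with a small grid search. This is a sketch for a single binary event whose true probability q is assumed known (q = 0.7 is an arbitrary choice); it computes the expected Brier score and the expected log loss for each candidate forecast p and finds the minimizer. The function names are illustrative.

```python
import math


def expected_brier(p: float, q: float) -> float:
    """Expected binary Brier score when forecasting p and the event has probability q."""
    return q * (1.0 - p) ** 2 + (1.0 - q) * p ** 2


def expected_log_loss(p: float, q: float) -> float:
    """Expected cross-entropy (log loss) for the same forecast; needs 0 < p < 1."""
    return -(q * math.log(p) + (1.0 - q) * math.log(1.0 - p))


q = 0.7  # assumed true probability of the event
grid = [i / 100 for i in range(1, 100)]  # candidate forecasts 0.01 .. 0.99 (avoids log(0))

best_brier = min(grid, key=lambda p: expected_brier(p, q))
best_log = min(grid, key=lambda p: expected_log_loss(p, q))
print(best_brier, best_log)

# Hedging toward 0.5 or overclaiming certainty both cost expected score.
print(expected_brier(0.7, q), expected_brier(0.5, q), expected_brier(1.0, q))
```

Both searches land on p = q. Algebraically the expected Brier score equals (p - q)² + q(1 - q), so the forecaster controls only the first term, and it vanishes exactly at p = q.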

Historical Note

Brier's 1950 paper was groundbreaking because it provided the first formal framework for evaluating probabilistic weather forecasts rather than just yes/no forecasts. Before Brier, forecasts were evaluated on whether they were right or wrong, not on whether they were well-calibrated.

The paper is widely cited in work on probabilistic forecasting, calibration, and sports prediction. It helped establish the standard that prediction systems should be evaluated on both calibration (are 70% predictions right 70% of the time?) and discrimination (can the model tell different outcomes apart?).
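
The two questions in parentheses can be made concrete with a small check. The sketch below is one simple way to look at them, not a method from Brier's paper: a reliability table that compares the mean forecast in each probability bin with the observed frequency, and a crude discrimination gap between forecasts issued before events and before non-events. The function names, the bin count, and the data are all illustrative.

```python
from collections import defaultdict


def reliability_table(probs, outcomes, n_bins=10):
    """Return {bin_index: (mean forecast, observed frequency, count)}."""
    bins = defaultdict(list)
    for p, o in zip(probs, outcomes):
        # Equal-width bins; p == 1.0 is placed in the top bin.
        # Forecasts sitting exactly on a bin edge may fall either side
        # because of floating-point rounding.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))
    table = {}
    for idx, pairs in sorted(bins.items()):
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(o for _, o in pairs) / len(pairs)
        table[idx] = (mean_p, freq, len(pairs))
    return table


def mean_forecast_gap(probs, outcomes):
    """Mean forecast on occasions where the event occurred minus mean where it did not."""
    hit = [p for p, o in zip(probs, outcomes) if o == 1]
    miss = [p for p, o in zip(probs, outcomes) if o == 0]
    return sum(hit) / len(hit) - sum(miss) / len(miss)


# Made-up data: eight 75% forecasts (6 events) and four 25% forecasts (1 event).
probs = [0.75] * 8 + [0.25] * 4
outcomes = [1, 1, 1, 1, 1, 1, 0, 0] + [1, 0, 0, 0]

table = reliability_table(probs, outcomes)
gap = mean_forecast_gap(probs, outcomes)
for idx, (mean_p, freq, count) in table.items():
    print(idx, mean_p, freq, count)
print(gap)
```

In this toy sample each bin's observed frequency matches its mean forecast, so the forecasts are calibrated, and the positive gap shows that higher probabilities were issued before events than before non-events.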

Notes

  • The existing brier-score.md note covers the modern formulation and Python implementation; this source note provides the original paper reference and historical context
  • Key detail: the original Brier score range was 0-2 (not 0-1). When reading older literature on binary events, divide the reported score by 2 to compare it with the modern formulation.
  • Brier's paper was about weather forecasting but the scoring rule is universal — it applies to any binary or multi-category probabilistic prediction
  • The "proper scoring rule" concept is fundamental: only forecasting the true probability minimizes the expected Brier score, so no other strategy can do better in expectation
  • For the World Cup model: Brier score is the primary calibration metric, and this source provides the academic foundation for why it's the right metric
  • The original paper is short (3 pages). The Wikipedia article on the Brier score summarizes it accurately and works well as a secondary reference