Walk-Forward Validation

Overview

Walk-forward validation (also called walk-forward optimization or rolling forward) is the gold standard for validating sports betting models. Unlike k-fold cross-validation, which in its usual shuffled form splits the data at random, walk-forward validation respects temporal order: the model is trained on historical data, then tested on later data that was not available at training time.

The process: train on period 1 → test on period 2 → expand training window to include period 2 → test on period 3 → repeat. This mimics real deployment where today's model is built on all historical data and used to predict tomorrow's games.
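
The train/test/expand loop described above can be sketched as a small generator. This is a minimal illustration, not the project's implementation: `expanding_folds` is a hypothetical helper, and each period is assumed to be a list of records that is already in chronological order.

```python
from typing import Iterator, List, Sequence, Tuple


def expanding_folds(periods: Sequence[list]) -> Iterator[Tuple[list, list]]:
    """Yield (train, test) pairs: train on periods[0..i-1], test on periods[i].

    Hypothetical helper; `periods` must already be in chronological order.
    """
    train: List = []
    for i, period in enumerate(periods):
        if i > 0:
            # Everything seen so far is training data; the current period is unseen.
            yield list(train), list(period)
        # After testing, fold the period into the training window.
        train.extend(period)


periods = [["a1", "a2"], ["b1"], ["c1", "c2"]]
folds = list(expanding_folds(periods))
```

With three periods this yields two folds. The first period is never a test set because nothing precedes it to train on.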

Walk-forward validation prevents look-ahead bias (using future information to make predictions) and gives an honest estimate of how the model would have performed in real time.

Why It Matters

Walk-forward validation is critical because:
1. Mimics real deployment: The model only sees past data when predicting future matches, exactly as it would in production.
2. Prevents look-ahead bias: Randomly shuffled cross-validation puts later matches in the training set for earlier ones, giving misleadingly good results.
3. Detects overfitting: If performance degrades in later walk-forward periods, the model may be overfitting to historical patterns, or the underlying relationships may have shifted.
4. Required by the spec: The client's spec explicitly requires walk-forward validation across 2010, 2014, 2018, and 2022 World Cups.

Key Concepts

Expanding window: Training set grows over time (all history to T1, all history to T2, etc.). Most common for betting models.
Rolling window: Training set stays a fixed size (last N games) and slides forward. Better when team characteristics change significantly.
Purged cross-validation: Drops the observations in a buffer zone between the training and test sets to prevent information leakage.
Look-ahead bias: Using information that wouldn't have been available at prediction time. Walk-forward splitting rules this out at the split level, provided the features themselves are computed only from past data.
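
The three windowing ideas above differ only in which historical dates are kept for training. The sketch below is a rough illustration under stated assumptions: `training_dates` is a hypothetical helper, and the rolling window is measured in days rather than in games to keep the example short.

```python
from datetime import date, timedelta


def training_dates(dates, test_start, window_days=None, purge_days=0):
    """Select training dates for one fold (hypothetical helper).

    window_days=None -> expanding window (all history before the cutoff).
    window_days=N    -> rolling window (only the last N days before the cutoff).
    purge_days       -> size of the gap dropped immediately before test_start.
    """
    cutoff = test_start - timedelta(days=purge_days)
    # Strictly before the cutoff, so nothing from the test period leaks in.
    selected = [d for d in dates if d < cutoff]
    if window_days is not None:
        earliest = cutoff - timedelta(days=window_days)
        selected = [d for d in selected if d >= earliest]
    return selected


history = [date(2010, 6, 1) + timedelta(days=i) for i in range(10)]  # 1-10 June
test_start = date(2010, 6, 11)

expanding = training_dates(history, test_start)
rolling = training_dates(history, test_start, window_days=4)
purged = training_dates(history, test_start, purge_days=3)
```

Here the expanding window keeps all ten days, the rolling window keeps only 7-10 June, and the purged variant stops at 7 June, leaving a three-day gap before the test period.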

Process

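Period 1 (2010 WC) → train on 2006 through the day before the 2010 WC → test on 2010 WC
Period 2 (2014 WC) → train on 2006 through the day before the 2014 WC → test on 2014 WC
Period 3 (2018 WC) → train on 2006 through the day before the 2018 WC → test on 2018 WC
Period 4 (2022 WC) → train on 2006 through the day before the 2022 WC → test on 2022 WC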

For each period:
  1. Train model on all data up to tournament start
  2. Generate predictions for all matches
  3. Compare to actual outcomes
  4. Record metrics: ROI, CLV, Brier score, hit rate
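
The four metrics in step 4 can be computed per fold roughly as follows. This is a sketch under explicit assumptions, none of which come from the spec: flat 1-unit stakes, decimal odds, CLV defined as the relative improvement of the odds taken over the closing odds (one common convention), and hypothetical dictionary keys.

```python
from statistics import mean


def fold_metrics(bets):
    """Summarise one fold (hypothetical helper; `bets` must be non-empty).

    Each bet is a dict with illustrative keys: pred_prob (model probability),
    won (0/1), odds (decimal odds taken), closing_odds (decimal closing odds).
    Flat 1-unit stakes are assumed.
    """
    profits = [(b["odds"] - 1.0) if b["won"] else -1.0 for b in bets]
    return {
        "roi": sum(profits) / len(bets),            # profit per unit staked
        "hit_rate": mean(b["won"] for b in bets),   # share of winning bets
        # Brier score: mean squared gap between probability and 0/1 outcome.
        "brier": mean((b["pred_prob"] - b["won"]) ** 2 for b in bets),
        # CLV as relative price improvement over the close (assumed convention).
        "clv": mean(b["odds"] / b["closing_odds"] - 1.0 for b in bets),
    }


bets = [
    {"pred_prob": 0.5, "won": 1, "odds": 2.0, "closing_odds": 1.6},
    {"pred_prob": 0.5, "won": 0, "odds": 2.0, "closing_odds": 2.0},
]
metrics = fold_metrics(bets)
```

Lower is better for the Brier score; higher is better for ROI, hit rate, and CLV.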

Code Snippet

import pandas as pd
import numpy as np

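def walk_forward_validate(df, train_end, test_start, test_end, features, target, model_class, params=None):
    """Single walk-forward fold: fit on data up to train_end, predict test_start..test_end."""
    # Train only on matches dated on or before train_end (no look-ahead).
    train_df = df[df['date'] <= train_end]
    # Test on matches inside the tournament window (inclusive on both ends).
    test_df = df[(df['date'] >= test_start) & (df['date'] <= test_end)]
    # Skip folds that are too small to fit or evaluate.
    if len(train_df) < 50 or len(test_df) < 5:
        return None
    X_train, y_train = train_df[features], train_df[target]
    X_test = test_df[features]
    # Avoid a mutable default argument: treat None as "no extra parameters".
    model = model_class(**(params or {}))
    model.fit(X_train, y_train)
    predictions = test_df.copy()
    # Probability of the positive class (assumes a binary target).
    predictions['pred_prob'] = model.predict_proba(X_test)[:, 1]
    return predictions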

def run_walk_forward(df, tournaments, features, target, model_class, params=None):
    """Run walk-forward across multiple tournaments and pool the metrics."""
    params = params or {}
    all_results = []
    for train_end, test_start, test_end in tournaments:
        result = walk_forward_validate(df, train_end, test_start, test_end, features, target, model_class, params)
        if result is not None:
            all_results.append(result)
    # pd.concat raises on an empty list, so bail out if every fold was skipped.
    if not all_results:
        return None
    combined = pd.concat(all_results)
    return {
        # One row per predicted match, pooled over all folds.
        'total_bets': len(combined),
        # Brier score: mean squared gap between predicted probability and the 0/1 target.
        'brier_score': ((combined['pred_prob'] - combined[target])**2).mean(),
        'clv_mean': combined['clv'].mean() if 'clv' in combined.columns else None,
    }

# (train_end, test_start, test_end) per tournament; test_start is the opening match day.
world_cup_tournaments = [
    ('2010-06-10', '2010-06-11', '2010-07-11'),
    ('2014-06-11', '2014-06-12', '2014-07-13'),
    ('2018-06-13', '2018-06-14', '2018-07-15'),
    ('2022-11-19', '2022-11-20', '2022-12-18'),
]

Pitfalls

Small per-fold samples: Each World Cup has only 64 matches, which is a small test set. Pool all 4 tournaments (256 matches) for more statistical power.
Expanding window may accumulate stale data: Team strength signals from 2006 may not apply to 2022. Consider decay weighting.
Non-stationarity: Football team strength changes over time. A model that works in 2010 may not work in 2022.
The key metric is CLV: Does the model beat the closing line consistently across all four World Cups?
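
One possible form of the decay weighting mentioned above is an exponential half-life on match age. The sketch below is illustrative only: `decay_weight` is a hypothetical helper, and the two-year half-life is an arbitrary placeholder rather than a tuned value.

```python
from datetime import date


def decay_weight(match_date, as_of, half_life_days=730.0):
    """Exponential time-decay weight: 1.0 on `as_of`, 0.5 one half-life earlier.

    Hypothetical helper; the default half-life is a placeholder, not a tuned value.
    """
    age_days = (as_of - match_date).days
    if age_days < 0:
        # A match dated after the training cutoff would be look-ahead.
        raise ValueError("match_date is after as_of")
    return 0.5 ** (age_days / half_life_days)


as_of = date(2022, 11, 19)  # day before the 2022 World Cup opener
w_recent = decay_weight(date(2022, 11, 19), as_of)
w_old = decay_weight(date(2020, 11, 19), as_of)  # exactly 730 days earlier
```

Weights like these could be passed as per-sample weights when fitting, for estimators whose `fit` method accepts them. Whether that helps should itself be checked inside the walk-forward loop rather than assumed.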