Kaggle Walk-Forward Validation Notebook

Summary

This Kaggle notebook implements walk-forward validation for time series from scratch in Python with pandas, demonstrating both expanding and rolling windows on financial data. It includes clear visualizations of in-sample vs. out-of-sample performance.

The notebook is particularly useful for understanding the computational implementation: how to slice temporal data, how to compute metrics per fold, and how to aggregate across folds. It also demonstrates the look-ahead bias problem with a side-by-side comparison of k-fold vs. walk-forward results.
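The side-by-side comparison can be illustrated with a small sketch. This is not the notebook's code: the data is a synthetic random walk, and the "nearest point in time" predictor is a hypothetical stand-in for any flexible model. Shuffled k-fold lets the predictor interpolate between neighbouring days, whereas walk-forward forces it to extrapolate from the last training day.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
t = np.arange(n)
y = np.cumsum(rng.normal(size=n))  # synthetic random walk, not real prices


def nearest_in_time_mae(train_idx, test_idx):
    """MAE when each test point is predicted by the nearest training time."""
    train_idx = np.sort(train_idx)
    pos = np.clip(np.searchsorted(train_idx, test_idx), 1, len(train_idx) - 1)
    left, right = train_idx[pos - 1], train_idx[pos]
    nearest = np.where(test_idx - left <= right - test_idx, left, right)
    return np.mean(np.abs(y[test_idx] - y[nearest]))


# Shuffled 5-fold: test days are surrounded by training days (look-ahead)
kfold_scores = [
    nearest_in_time_mae(np.setdiff1d(t, fold), fold)
    for fold in np.array_split(rng.permutation(n), 5)
]

# Walk-forward: train on everything before the cutoff, test on the next 50 days
wf_scores = [
    nearest_in_time_mae(t[:cut], t[cut:cut + 50])
    for cut in range(250, 500, 50)
]

kfold_mae = float(np.mean(kfold_scores))
wf_mae = float(np.mean(wf_scores))
print(f"k-fold MAE: {kfold_mae:.2f}, walk-forward MAE: {wf_mae:.2f}")
```

The k-fold error comes out lower only because future values leak into training; the walk-forward figure is the one that reflects how the model would be used.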

Key Concepts

Expanding window in Python: Using pandas date slicing to create expanding training windows
Rolling window in Python: Using shift() and rolling() for fixed-size windows
Fold aggregation: Computing mean and standard deviation of metrics across all folds
Look-ahead bias demonstration: Shows how k-fold CV gives artificially good results on temporal data
Visualization: Plots of in-sample vs. out-of-sample performance across folds
Metric tracking: Per-fold metrics stored in a list and aggregated at the end
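The implementation pattern in these notes covers only the expanding window. A minimal sketch of the rolling (fixed-size) variant follows; it uses plain date slicing rather than shift() and rolling(), since the notes do not show how the notebook combines those calls. The function name and its parameters are hypothetical.

```python
import pandas as pd


def rolling_window_folds(df, train_days, test_days, step_days=None):
    """Yield (train, test) pairs with a fixed-length training window.

    Assumes df has a sorted DatetimeIndex. The window slides forward by
    step_days (defaults to test_days, so test periods do not overlap).
    """
    train_len = pd.Timedelta(days=train_days)
    test_len = pd.Timedelta(days=test_days)
    step = pd.Timedelta(days=step_days or test_days)
    last_day = df.index.max() + pd.Timedelta(days=1)

    test_start = df.index.min() + train_len
    while test_start + test_len <= last_day:
        # Training data: the train_days immediately before the test period
        train = df[(df.index >= test_start - train_len) & (df.index < test_start)]
        test = df[(df.index >= test_start) & (df.index < test_start + test_len)]
        yield train, test
        test_start += step


# Usage on a toy daily series
idx = pd.date_range("2020-01-01", periods=100, freq="D")
toy = pd.DataFrame({"x": range(100)}, index=idx)
folds = list(rolling_window_folds(toy, train_days=60, test_days=10))
```

Unlike the expanding window, the oldest observations drop out as the window moves, which can help when older data is less representative of current conditions.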

Python Implementation Pattern

import pandas as pd
import numpy as np

def walk_forward_validate(df, target_col, feature_cols, model_class,
                          train_end_dates, test_dates):
    """
    Walk-forward validation with expanding window.

    Args:
        df: DataFrame with a sorted DatetimeIndex
        target_col: column to predict
        feature_cols: list of feature column names
        model_class: estimator class whose instances expose fit() and predict()
        train_end_dates: list of train cutoff dates (inclusive)
        test_dates: list of test period start dates (exclusive), one per cutoff

    Returns:
        DataFrame of test rows with added 'pred' and 'fold' columns
    """
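    all_predictions = []

    for train_end, test_start in zip(train_end_dates, test_dates):
        # Accept strings or datetimes; Timestamp arithmetic is used below
        train_end = pd.Timestamp(train_end)
        test_start = pd.Timestamp(test_start)

        # Expanding window: all history up to and including train_end
        train = df[df.index <= train_end]
        # Test window: the 30 days after test_start; .copy() so that adding
        # columns below does not write to a view of df
        test_end = test_start + pd.Timedelta(days=30)
        test = df[(df.index > test_start) & (df.index <= test_end)].copy()

        # Skip folds with too little data to fit or evaluate
        if len(train) < 50 or len(test) < 5:
            continue

        X_train = train[feature_cols]
        y_train = train[target_col]
        X_test = test[feature_cols]

        model = model_class()
        model.fit(X_train, y_train)

        test['pred'] = model.predict(X_test)
        test['fold'] = str(train_end.date())
        all_predictions.append(test)

    if not all_predictions:
        raise ValueError("No fold had enough data; check the date lists")
    return pd.concat(all_predictions)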

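# Aggregate metrics across folds
def aggregate_fold_metrics(predictions_df, y_true, y_pred):
    """Compute mean and std of metrics across folds.

    y_true and y_pred are column names in predictions_df. compute_metrics is
    not defined in these notes; it must be supplied and return a scalar or a
    Series of metrics for one fold.
    """
    fold_metrics = predictions_df.groupby('fold')[[y_true, y_pred]].apply(
        lambda fold: compute_metrics(fold[y_true], fold[y_pred])
    )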
    return {
        'mean_metric': fold_metrics.mean(),
        'std_metric': fold_metrics.std(),
        'per_fold': fold_metrics
    }
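The aggregation function above calls compute_metrics, which these notes do not define. A minimal sketch of one possible version follows, assuming a regression target scored by MAE and RMSE. The metric choice, the column name "y", and the toy numbers are illustrative assumptions, not taken from the notebook.

```python
import numpy as np
import pandas as pd


def compute_metrics(y_true, y_pred):
    """Hypothetical helper: MAE and RMSE for one fold, returned as a Series."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return pd.Series({
        "mae": np.abs(err).mean(),
        "rmse": np.sqrt((err ** 2).mean()),
    })


# Toy predictions frame in the shape walk_forward_validate returns
predictions = pd.DataFrame({
    "fold": ["2020-01-31"] * 3 + ["2020-02-29"] * 3,
    "y": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
    "pred": [1.5, 2.0, 2.5, 3.0, 4.0, 5.0],
})

# One row of metrics per fold, then mean and std across folds
per_fold = predictions.groupby("fold")[["y", "pred"]].apply(
    lambda fold: compute_metrics(fold["y"], fold["pred"])
)
summary = pd.DataFrame({"mean": per_fold.mean(), "std": per_fold.std()})
print(per_fold)
print(summary)
```

Returning a Series from the helper means several metrics can be tracked per fold at once, and the mean and standard deviation are then computed column by column.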

Notes

This Kaggle notebook is the best practical Python implementation reference for walk-forward validation, and it complements the theoretical foundation from Forecasting: Principles and Practice
The look-ahead bias demonstration is particularly valuable: it shows quantitatively why k-fold fails for temporal data
Key implementation detail: using a DatetimeIndex and pandas date slicing is the cleanest way to implement expanding windows in Python
The per-fold metric tracking pattern is exactly what the World Cup model needs: record Brier score, CLV, and ROI for each of the 4 tournaments, then aggregate
For the World Cup model: the test period should be the ~1 month of each World Cup tournament, with training ending at the tournament start date
The standard deviation across folds is as important as the mean: a model with high variance across tournaments is unreliable even if the mean is good
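The tournament folds described above can be sketched as follows. The notes do not say which four tournaments are meant; the 2010, 2014, 2018 and 2022 editions are assumed here purely for illustration, using their opening-match dates. The function name and the 31-day window are also assumptions. Note that this sketch trains strictly before the start date and includes the start date in the test set, which differs slightly from the inequality convention in the implementation pattern.

```python
import pandas as pd

# Assumed folds (not named in the notes): World Cup opening-match dates
tournament_starts = pd.to_datetime(
    ["2010-06-11", "2014-06-12", "2018-06-14", "2022-11-20"]
)


def tournament_folds(df, starts, test_days=31):
    """Yield (label, train, test), one fold per tournament.

    Assumes df has a sorted DatetimeIndex. Training uses everything strictly
    before the start date; the test set covers the following test_days days.
    """
    for start in starts:
        end = start + pd.Timedelta(days=test_days)
        train = df[df.index < start]
        test = df[(df.index >= start) & (df.index <= end)]
        yield str(start.year), train, test


# Usage on a placeholder daily frame (real match data would go here)
idx = pd.date_range("2009-01-01", "2023-01-01", freq="D")
matches = pd.DataFrame({"placeholder": 0}, index=idx)
wc_folds = list(tournament_folds(matches, tournament_starts))
```

Each fold's label can then serve as the 'fold' key when recording Brier score, CLV, and ROI per tournament before aggregating.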