Kaggle Walk-Forward Validation Notebook
Summary
This Kaggle notebook implements walk-forward validation for time series from scratch in Python using pandas. It demonstrates the technique on financial data with both expanding and rolling windows, and includes clear visualizations of in-sample vs. out-of-sample performance.
The notebook is particularly useful for understanding the computational implementation: how to slice temporal data, how to compute metrics per fold, and how to aggregate across folds. It also demonstrates the look-ahead bias problem with a side-by-side comparison of k-fold vs. walk-forward results.
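The look-ahead problem can be illustrated without fitting any model. The following is a minimal sketch, not the notebook's own code: the helper names (`kfold_splits`, `walk_forward_splits`, `leaks_future`) are hypothetical, and rows are assumed to be sorted by time. It counts how many folds contain training rows that come after the earliest test row.

```python
import numpy as np

def kfold_splits(n_samples, n_folds, seed=0):
    """Shuffled k-fold: each fold's test set is a random subset of all rows."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    for test_idx in np.array_split(idx, n_folds):
        train_idx = np.setdiff1d(idx, test_idx)
        yield train_idx, np.sort(test_idx)

def walk_forward_splits(n_samples, n_folds):
    """Expanding window: train on rows [0, start), test on rows [start, stop)."""
    bounds = np.linspace(0, n_samples, n_folds + 2, dtype=int)[1:]
    for start, stop in zip(bounds[:-1], bounds[1:]):
        yield np.arange(start), np.arange(start, stop)

def leaks_future(train_idx, test_idx):
    """True if any training row comes after the earliest test row."""
    return train_idx.max() > test_idx.min()

n = 100  # number of time-ordered rows (assumption for illustration)
kfold_leaks = sum(leaks_future(tr, te) for tr, te in kfold_splits(n, 5))
wf_leaks = sum(leaks_future(tr, te) for tr, te in walk_forward_splits(n, 5))
print("k-fold folds with future data in train:", kfold_leaks)
print("walk-forward folds with future data in train:", wf_leaks)
```

In shuffled k-fold, every fold trains on rows that postdate part of its test set, which is the mechanism behind the optimistic scores. The walk-forward splits never do, by construction.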
Key Concepts
- Expanding window in Python: Using pandas date slicing to create expanding training windows
- Rolling window in Python: Using shift() and rolling() for fixed-size windows
- Fold aggregation: Computing mean and standard deviation of metrics across all folds
- Look-ahead bias demonstration: Shows how k-fold CV gives artificially good results on temporal data
- Visualization: Plots of in-sample vs. out-of-sample performance across folds
- Metric tracking: Per-fold metrics stored in a list and aggregated at the end
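The difference between the expanding and rolling windows listed above can be sketched with plain DatetimeIndex slicing. This is an illustrative sketch under stated assumptions rather than the notebook's code: `window_slices`, `lookback_days`, and the toy daily series are hypothetical. The lag feature shows one common use of `shift()` with `rolling()`: shifting by one row first means each feature value is built only from earlier rows.

```python
import numpy as np
import pandas as pd

def window_slices(df, train_end, test_days=30, lookback_days=None):
    """Return (train, test) frames for one fold.

    lookback_days=None gives an expanding window (all history up to train_end);
    an integer gives a rolling window of that many days ending at train_end.
    """
    train = df[df.index <= train_end]
    if lookback_days is not None:
        train = train[train.index > train_end - pd.Timedelta(days=lookback_days)]
    test_stop = train_end + pd.Timedelta(days=test_days)
    test = df[(df.index > train_end) & (df.index <= test_stop)]
    return train, test

# Toy daily series (synthetic, for illustration only)
idx = pd.date_range("2020-01-01", periods=400, freq="D")
df = pd.DataFrame({"y": np.arange(400.0)}, index=idx)

# Lag feature: shift(1) before rolling() so each row only sees past values
df["y_lag_mean_7"] = df["y"].shift(1).rolling(7).mean()

cutoff = pd.Timestamp("2020-12-31")
exp_train, exp_test = window_slices(df, cutoff)
roll_train, roll_test = window_slices(df, cutoff, lookback_days=90)
print(len(exp_train), len(roll_train), len(exp_test))
```

The expanding window keeps growing with each later cutoff, while the rolling window always holds the most recent `lookback_days` of history. Both use the same test slice.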
Python Implementation Pattern
import pandas as pd
import numpy as np

def walk_forward_validate(df, target_col, feature_cols, model_class,
                          train_end_dates, test_dates):
    """
    Walk-forward validation with expanding window.

    Args:
        df: DataFrame with datetime index
        target_col: column to predict
        feature_cols: list of feature column names
        model_class: estimator class with fit/predict, instantiated once per fold
        train_end_dates: list of train cutoff dates
        test_dates: list of test period start dates
    Returns:
        DataFrame with per-fold predictions
    """
    all_predictions = []
    for train_end, test_start in zip(train_end_dates, test_dates):
        # Expanding window: all history up to train_end
        train = df[df.index <= train_end]
        # Test window: the 30 days after test_start. The strict ">" keeps it
        # disjoint from train when test_start == train_end.
        test_stop = test_start + pd.Timedelta(days=30)
        # .copy() so the new columns below are not written to a view of df
        test = df[(df.index > test_start) & (df.index <= test_stop)].copy()
        # Skip folds with too little data to fit or evaluate
        if len(train) < 50 or len(test) < 5:
            continue
        X_train = train[feature_cols]
        y_train = train[target_col]
        X_test = test[feature_cols]
        model = model_class()
        model.fit(X_train, y_train)
        test['pred'] = model.predict(X_test)
        test['fold'] = str(train_end.date())
        all_predictions.append(test)
    # pd.concat raises on an empty list, so handle the case where every fold was skipped
    if not all_predictions:
        return pd.DataFrame()
    return pd.concat(all_predictions)
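# Aggregate metrics across folds
def compute_metrics(y_true, y_pred):
    """Placeholder metrics (an assumption: the source does not define compute_metrics)."""
    err = np.asarray(y_true) - np.asarray(y_pred)
    return pd.Series({'mae': np.abs(err).mean(), 'rmse': np.sqrt((err ** 2).mean())})

def aggregate_fold_metrics(predictions_df, y_true, y_pred):
    """Compute mean and std of metrics across folds.

    y_true and y_pred are column names in predictions_df.
    """
    fold_metrics = predictions_df.groupby('fold').apply(
        lambda fold: compute_metrics(fold[y_true], fold[y_pred])
    )
    return {
        'mean_metric': fold_metrics.mean(),
        'std_metric': fold_metrics.std(),
        'per_fold': fold_metrics
    }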
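The pattern above takes `train_end_dates` and `test_dates` as given. One way to build them is sketched below; `make_fold_dates` is a hypothetical helper, and the sketch assumes each test period starts at the cutoff where its training window ends, with a 30-day step chosen to match the 30-day test window.

```python
import pandas as pd

def make_fold_dates(first_cutoff, last_cutoff, step="30D"):
    """Build matching train_end_dates / test_dates lists at a fixed step.

    Assumption: each test period starts at the same cutoff where its
    training window ends, so both lists hold the same timestamps.
    """
    cutoffs = pd.date_range(first_cutoff, last_cutoff, freq=step)
    train_end_dates = list(cutoffs)
    test_dates = list(cutoffs)
    return train_end_dates, test_dates

train_end_dates, test_dates = make_fold_dates("2021-01-01", "2021-06-30")
for train_end, test_start in zip(train_end_dates, test_dates):
    print(train_end.date(), "->", test_start.date())
```

Because the step equals the test window length, consecutive test windows tile the evaluation period without overlapping. A shorter step would produce overlapping test windows and correlated fold metrics.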
Notes
- This Kaggle notebook is the best practical Python implementation reference for walk-forward validation; it complements the theoretical foundation from Forecasting: Principles and Practice
- The look-ahead bias demonstration is particularly valuable: it shows quantitatively why k-fold fails for temporal data
- Key implementation detail: using a DatetimeIndex and pandas date slicing is the cleanest way to implement expanding windows in Python
- The per-fold metric tracking pattern is exactly what the World Cup model needs: record Brier score, CLV, and ROI for each of the 4 tournaments, then aggregate
- For the World Cup model: the test period should be the ~1 month of each World Cup tournament, with training ending at the tournament start date
- The standard deviation across folds is as important as the mean — a model with high variance across tournaments is unreliable even if the mean is good
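The per-tournament tracking described in these notes can be sketched for the Brier score alone. This is an illustration under stated assumptions, not the World Cup model's code: the numbers are made up, the fold labels `T1` to `T4` are placeholders for the four tournaments, and outcomes are treated as binary (a three-way win/draw/loss market would need the multi-class form). CLV and ROI are left out because they depend on odds data not described here.

```python
import numpy as np
import pandas as pd

def brier_score(outcomes, probs):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    outcomes = np.asarray(outcomes, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

# Toy predictions (made-up numbers); 'fold' labels stand in for tournaments
preds = pd.DataFrame({
    "fold": ["T1", "T1", "T2", "T2", "T3", "T3", "T4", "T4"],
    "outcome": [1, 0, 1, 0, 1, 0, 1, 0],
    "prob": [0.8, 0.2, 0.6, 0.4, 0.9, 0.1, 0.5, 0.5],
})

# One Brier score per fold, then mean and sample standard deviation across folds
per_fold = pd.Series({
    fold: brier_score(group["outcome"], group["prob"])
    for fold, group in preds.groupby("fold")
})
summary = {"mean": per_fold.mean(), "std": per_fold.std(), "per_fold": per_fold}
print(per_fold)
print("mean:", summary["mean"], "std:", summary["std"])
```

With only four folds the standard deviation is itself a noisy estimate, so it is best read alongside the individual per-fold values rather than in place of them.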