Overfitting in Sports Models

Overview

Overfitting occurs when a sports prediction model learns the noise and idiosyncrasies of its training data so closely that its performance on new, unseen data suffers. In sports betting, overfitting is particularly dangerous because: (a) historical data has a low signal-to-noise ratio (goals are rare and largely random events), (b) sample sizes are small (64 matches per World Cup), and (c) the market is adversarial: any identified pattern tends to be arbitraged away.

Common overfitting patterns: too many features relative to training samples, fitting to the conditions of a specific tournament, look-ahead bias (leaking future information into training features), excessive model complexity, and hyperparameter tuning on the test set.

Why It Matters

Overfitting is the primary reason most sports betting models fail in practice:
1. Football has high noise: Goals are rare events with significant randomness. A model that fits noise will fail on new data.
2. Small tournaments: A World Cup sample (64 matches) is tiny for statistical purposes, so complex models will overfit.
3. Adversarial market: If a pattern exists in historical data, bookmakers have probably already priced it in. An edge that appears in a backtest but vanishes in production is a classic symptom of overfitting.
4. Look-ahead bias is subtle: It's easy to accidentally use future information in training features.
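
One way to guard against look-ahead bias in form features is to read each team's history before the current result is recorded. The sketch below is a minimal illustration, not a reference implementation: the function name rolling_goals_for and the (date, home, away, home_goals, away_goals) tuple layout are assumptions made for this example, and it assumes a team plays at most once per date.

```python
from collections import defaultdict, deque


def rolling_goals_for(matches, window=5):
    """Mean goals scored by each side over its previous `window` matches.

    `matches` is a list of (date, home, away, home_goals, away_goals) tuples
    (a hypothetical layout). Only matches played strictly before the current
    one contribute; teams with no history get None.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    features = []
    for date, home, away, hg, ag in sorted(matches, key=lambda m: m[0]):
        # Read the feature BEFORE recording this match's result.
        h_hist, a_hist = history[home], history[away]
        h_form = sum(h_hist) / len(h_hist) if h_hist else None
        a_form = sum(a_hist) / len(a_hist) if a_hist else None
        features.append((date, home, away, h_form, a_form))
        # Only now does the current result become visible to later matches.
        h_hist.append(hg)
        a_hist.append(ag)
    return features


# Toy fixtures with made-up scores, purely for illustration.
matches = [
    ("2022-11-20", "A", "B", 2, 0),
    ("2022-11-25", "A", "C", 1, 1),
    ("2022-11-29", "B", "A", 0, 3),
]
feats = rolling_goals_for(matches)
```

The first row has no form values because neither team has played yet. Computing the average over the whole table first and joining it back onto each match would silently include the match's own result.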

Prevention Strategies

Sample-to-parameter ratio:
- Rule of thumb: the number of free parameters should be less than N/20, where N is the training sample size
- For ~2000 international matches: < 100 parameters. A Poisson model with attack and defense parameters for 50 teams has ≈ 100 parameters (right at the limit)
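
The arithmetic behind this rule of thumb can be checked in a few lines. This is a sketch only: the helper names parameter_budget and poisson_team_parameters are hypothetical, and the parameter count ignores identifiability constraints that would remove a parameter or two in a real fit.

```python
def parameter_budget(n_samples, ratio=20):
    """Largest parameter count allowed by the N/ratio rule of thumb."""
    return n_samples // ratio


def poisson_team_parameters(n_teams, home_advantage=True):
    """One attack and one defense strength per team, plus an optional
    home-advantage term (a common but assumed model layout)."""
    return 2 * n_teams + (1 if home_advantage else 0)


budget = parameter_budget(2000)                                # 2000 / 20
n_params = poisson_team_parameters(50, home_advantage=False)   # 2 * 50
# The rule asks for strictly fewer parameters than the budget.
within_budget = n_params < budget
```

With 2000 matches and 50 teams the count equals the budget rather than falling under it, so adding even a home-advantage term pushes the model past the rule of thumb.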

Regularization:

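import numpy as np
from sklearn.linear_model import Ridge

def regularized_poisson_regression(X, y, alpha=1.0):
    # Log-linear approximation to Poisson regression: ridge (L2-penalized)
    # least squares on log goals. The 0.1 offset avoids log(0) for goalless rows.
    # sklearn.linear_model.PoissonRegressor fits a true L2-penalized Poisson GLM.
    log_y = np.log(np.asarray(y) + 0.1)
    model = Ridge(alpha=alpha)  # larger alpha = stronger shrinkage
    model.fit(X, log_y)
    return model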

Feature importance stability:

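import numpy as np

def feature_stability(model, X, y, n_bootstrap=100):
    # Refit on bootstrap resamples and measure how much each importance varies.
    # Assumes X and y are NumPy arrays and the model exposes
    # feature_importances_ (e.g. tree ensembles).
    importances = []
    for _ in range(n_bootstrap):
        idx = np.random.choice(len(X), len(X), replace=True)
        model.fit(X[idx], y[idx])
        importances.append(model.feature_importances_)
    importances = np.array(importances)
    cv = importances.std(axis=0) / (importances.mean(axis=0) + 1e-10)
    return cv  # high CV = unstable/overfit feature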

Common Overfitting Patterns

  1. Too many Elo K-factors: Fitting individual K-factors per team from limited data
  2. Dixon-Coles rho overfitting: Estimating the correlation parameter from small samples
  3. xG model with too many features: Using 50+ shot features when 5 would suffice
  4. Rolling window too short: Adapting too quickly to recent form
  5. Cross-validation on temporal data: Random k-fold splitting introduces look-ahead bias
  6. Hyperparameter tuning on test set: Selecting the model that performs best on the test period
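
Pattern 5 can be avoided by splitting on time instead of at random. The sketch below is a minimal expanding-window splitter, assuming the rows are already sorted from oldest to newest; the function name expanding_window_splits is hypothetical. scikit-learn's TimeSeriesSplit offers a similar scheme, though its fold sizes are computed differently.

```python
def expanding_window_splits(n_samples, n_splits=4):
    """Yield (train_idx, test_idx) index lists for time-ordered rows.

    Every test block lies strictly after all of its training rows, so no
    future match can leak into training.
    """
    fold = n_samples // (n_splits + 1)
    if fold == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for k in range(1, n_splits + 1):
        train_end = k * fold
        # The last test block absorbs any leftover rows.
        test_end = n_samples if k == n_splits else train_end + fold
        yield list(range(train_end)), list(range(train_end, test_end))


splits = list(expanding_window_splits(10, n_splits=4))
first_train, first_test = splits[0]
last_train, last_test = splits[-1]
```

To avoid pattern 6 as well, hyperparameters would be chosen using only these splits, with the final test period kept out of the loop entirely.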

Pitfalls

  • In-sample performance is meaningless: Always check out-of-sample performance via walk-forward validation.
  • The client's walk-forward validation across 4 World Cups is specifically designed to detect overfitting: if performance degrades in later tournaments, the model is likely overfitting.
  • Simplicity wins: A well-calibrated simple Elo model often outperforms a complex ML model in out-of-sample testing.
  • Minimum viable complexity: The simplest model that captures the signal is almost always better than a complex one.
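
A walk-forward check across tournaments can be sketched as follows. This is an illustration under stated assumptions, not the client's actual procedure: the model is a deliberately simple outcome-frequency baseline, the names fit_frequency_model, mean_log_loss and walk_forward are hypothetical, and the tournament data is made up.

```python
import math
from collections import Counter

OUTCOMES = ("H", "D", "A")  # home win, draw, away win


def fit_frequency_model(outcomes, smoothing=1.0):
    """Laplace-smoothed outcome frequencies (a stand-in for a real model)."""
    counts = Counter(outcomes)
    total = len(outcomes) + smoothing * len(OUTCOMES)
    return {o: (counts[o] + smoothing) / total for o in OUTCOMES}


def mean_log_loss(probs, outcomes):
    """Average negative log-probability assigned to the observed outcomes."""
    return -sum(math.log(probs[o]) for o in outcomes) / len(outcomes)


def walk_forward(tournaments):
    """Train on all earlier tournaments, then score the next one.

    `tournaments` is a list of (name, outcomes) pairs, oldest first.
    Returns (name, in_sample_loss, out_of_sample_loss) per held-out tournament.
    """
    report = []
    for i in range(1, len(tournaments)):
        train = [o for _, outs in tournaments[:i] for o in outs]
        name, test = tournaments[i]
        probs = fit_frequency_model(train)
        report.append((name, mean_log_loss(probs, train), mean_log_loss(probs, test)))
    return report


# Made-up outcomes for four placeholder tournaments, oldest first.
tournaments = [
    ("T1", ["H", "H", "D", "A"]),
    ("T2", ["H", "D", "D", "A"]),
    ("T3", ["A", "A", "H", "D"]),
    ("T4", ["H", "H", "H", "D"]),
]
report = walk_forward(tournaments)
```

A large, persistent gap between the in-sample and out-of-sample columns is the warning sign; comparing a complex model against a baseline like this one on the same report shows whether the extra complexity earns its keep.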

See Also