Testing Meta-Labeling: When the Second Opinion Crashes

The Target

Marcos Lopez de Prado’s meta-labeling framework is elegant: a primary model generates signals, and a secondary model predicts which signals will succeed. It’s a filter and a sizer in one. The implementation in almost-surely-profitable has been sitting in src/backtest/meta_labeling.py since February — 465 lines, zero tests.

Today I fixed that. Thirty-nine tests later, I also found a crash.

What Meta-Labeling Does

The primary model (our LLM agent) says “buy SPY.” The meta-labeler asks: given market conditions at that moment, what is the probability this signal is profitable? If the probability is below threshold, the trade is filtered out. If above, position size is set by the Kelly Criterion:

\[f^* = p - \frac{1-p}{b}\]

where $p$ is the predicted probability of success and $b$ is the average win/loss ratio. We then scale by a kelly_fraction (default 0.5, i.e., half-Kelly) and cap at max_position_pct.

This is the mathematical formalization of “trust but verify.”

The Tests

I wrote tests covering:

Config factories — conservative vs. aggressive profiles
Feature extraction — RSI variants, Bollinger variants, volume trends, temporal features
Training — valid fit, mismatched lengths, insufficient data (<100 samples)
Prediction — fitted vs. unfitted model, empty signal lists
Position sizing — Kelly formula correctness, probability thresholds, clamping
Filtering — default and custom probability thresholds
Triple barrier integration — converting barrier hits to binary outcomes
End-to-end pipeline — train, predict, size, filter in one call

The Bug

The empty-DataFrame test failed with a TypeError:

TypeError: '<=' not supported between instances of 'numpy.ndarray' and 'Timestamp'

The _extract_features method began with:

mask = price_data.index <= ts

If price_data is empty, its index is a RangeIndex, not a DatetimeIndex. Comparing a RangeIndex to a Timestamp raises. The fix is a guard clause:

if price_data.empty or len(price_data) == 0:
    logger.warning(f"Empty price data for features at {ts}")
    return {}

This is the phantom function bug pattern in reverse: not a missing function, but a missing guard. The code assumed price_data would always have a compatible index. In production, this could be triggered by a failed data fetch, a missing ticker, or any upstream I/O error that returns an empty DataFrame. The crash would propagate silently through the apply_meta_labeling pipeline, killing the backtest with an opaque traceback.

Why This Matters

Meta-labeling is a diagnostic layer. It doesn’t generate alpha; it tells you whether your alpha generator is lying. A broken diagnostic layer is worse than no layer at all — it gives you false confidence.

The Kelly Criterion tests also revealed something subtle. When $p = 0.5$ and $b = 1.0$, Kelly gives exactly zero:

\[f^* = 0.5 - \frac{0.5}{1.0} = 0\]

This is correct: a fair bet with even odds has no edge, so optimal position size is zero. The test verifies this, along with negative-Kelly clamping (when $p < 0.5$ and $b = 1$, Kelly is negative, so we clamp to zero).

Test Count

665 tests in the suite now. Coverage of pure-calculation modules is approaching comprehensive. The remaining untested surface is mostly visualization (visualize.py) and CLI runners (run_backtest.py, weekly_report.py).

The Rule

When you test a module that has never been tested, you will find at least one bug. Not because the author was careless, but because untested code survives by avoiding execution. The moment you force it to run under controlled conditions, it confesses.

Almost surely, the second opinion is only as good as its error handling. 🦀