Testing Risk Metrics: When Epsilon Attacks

The Sortino ratio is supposed to measure risk-adjusted return using only downside volatility. It is elegant, theoretically sound, and — as I discovered today — capable of producing values on the order of 10¹⁶ when confronted with a near-constant return series.

Today’s session added 47 tests to risk/metrics.py, a 360-line module that had zero test coverage. The module implements Value at Risk (VaR), Conditional VaR (CVaR), drawdown analysis, downside volatility, Sortino ratio, Calmar ratio, correlation matrices, and portfolio-level risk aggregation. These are not cosmetic calculations. They feed directly into the LLM’s risk context prompt, informing decisions about position sizing, stop-losses, and emergency protocols.

The Floating-Point Assassin

The most dangerous bug was hiding in calculate_sortino_ratio. Consider this seemingly innocuous guard:

if downside_vol == 0:
    return float('inf') if mean_return > risk_free_rate else 0.0

When all returns are identical — say, a constant -0.001 per day — the downside volatility should be exactly zero. There are no deviations. The standard deviation of a constant vector is zero.

Except it isn’t.

Due to floating-point arithmetic, pd.Series([-0.001] * 100).std() returns 2.179 × 10⁻¹⁹, not 0.0. The guard downside_vol == 0 fails. The code proceeds to divide:

mean_return = -0.001 × 252 = -0.252
excess return = mean_return - risk_free_rate = -0.252 (with risk_free_rate = 0.0)
downside_vol = 2.179 × 10⁻¹⁹ × √252 ≈ 3.460 × 10⁻¹⁸

Sortino = -0.252 / (3.460 × 10⁻¹⁸) ≈ -7.28 × 10¹⁶

A perfectly reasonable portfolio with constant negative returns now reports a Sortino ratio of roughly negative seventy-three quadrillion. The LLM, reading this context, might reasonably conclude that the portfolio has experienced some kind of financial singularity.
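The arithmetic can be replayed in a few lines. This is a minimal sketch that copies the standard deviation quoted above rather than recomputing it (the exact noise value may vary with the pandas version and platform), and the variable names are mine, not the module's:

```python
import math

TRADING_DAYS = 252          # annualization factor used in the arithmetic above
daily_return = -0.001       # the constant daily return
reported_std = 2.179e-19    # std quoted in the text; copied here, not recomputed
risk_free_rate = 0.0

mean_return = daily_return * TRADING_DAYS
downside_vol = reported_std * math.sqrt(TRADING_DAYS)

# The exact-equality guard does not fire, so the division goes ahead.
guard_fires = (downside_vol == 0)
sortino = (mean_return - risk_free_rate) / downside_vol

print(f"guard fires:  {guard_fires}")
print(f"downside_vol: {downside_vol:.3e}")
print(f"sortino:      {sortino:.3e}")
```

The guard reports False, and the ratio lands in the 10¹⁶ range purely because the denominator is rounding noise.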

This is the exact same bug class that was discovered and fixed in performance_metrics.py on May 3rd, where near-zero volatility caused Sharpe ratios of ~10¹⁷. The fix is identical: replace the exact-equality guard with a tolerance-based guard:

if downside_vol < 1e-15 or np.isnan(downside_vol):
    return float('inf') if mean_return > risk_free_rate else 0.0
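To see the guard in context, here is a minimal, self-contained sketch. The function name guarded_sortino and its signature are hypothetical; the real calculate_sortino_ratio computes mean_return and downside_vol itself, whereas this sketch takes them as inputs so that it runs without pandas:

```python
import math


def guarded_sortino(mean_return, downside_vol, risk_free_rate=0.0, tol=1e-15):
    """Sortino ratio with a tolerance-based guard on the denominator.

    Hypothetical helper: inputs are assumed to be annualized already.
    """
    # Treat NaN and anything below the tolerance as "no downside volatility".
    if math.isnan(downside_vol) or downside_vol < tol:
        return float("inf") if mean_return > risk_free_rate else 0.0
    return (mean_return - risk_free_rate) / downside_vol


# The constant-negative-return case from above no longer explodes.
print(guarded_sortino(-0.252, 3.460e-18))
# A series with real downside volatility is unaffected by the guard.
print(guarded_sortino(0.10, 0.20))
```

Checking for NaN first keeps the intent explicit, since any comparison against NaN evaluates to False and would otherwise fall through to the division.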

Why Tolerance and Not Exact Zero?

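In IEEE 754 floating-point arithmetic, the computed standard deviation of a constant vector is not guaranteed to be exactly zero. As far as I can tell, the non-rolling Series.std() computes the mean first and then sums squared deviations from it (Welford-style online updates belong to the rolling variants), so whenever the computed mean differs from the constant by a rounding error, every deviation is a tiny nonzero number. For most practical purposes, 10⁻¹⁹ is zero. As a divisor, it is a catastrophe.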

The threshold 1e-15 was chosen because:

It is well below any economically meaningful volatility (an annualized volatility below 1e-15 corresponds to relative price moves on the order of a femto-euro per euro)
It sits above double-precision machine epsilon (~2.2 × 10⁻¹⁶) and far above the noise actually observed here (~10⁻¹⁸ after annualization)
It matches the guard already proven in performance_metrics.py
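The ordering of these magnitudes can be checked directly. The sketch below uses only the standard library; typical_daily_return and smallest_plausible_vol are illustrative values I picked, not numbers from the module:

```python
import math
import sys

TOL = 1e-15                                    # the guard threshold
eps = sys.float_info.epsilon                   # relative spacing of doubles near 1.0

# Rounding error scales with the magnitude of the data, so for returns
# around 1e-3 the absolute noise is roughly eps * 1e-3.
typical_daily_return = 1e-3                    # illustrative magnitude
absolute_noise = eps * typical_daily_return

observed_annualized = 2.179e-19 * math.sqrt(252)   # noise quoted in the text
smallest_plausible_vol = 1e-6                  # illustrative; real vols are far larger

print(f"eps                 = {eps:.3e}")
print(f"absolute noise      = {absolute_noise:.3e}")
print(f"observed annualized = {observed_annualized:.3e}")
print(observed_annualized < TOL < smallest_plausible_vol)
```

The estimated absolute noise for returns of this size comes out near 2 × 10⁻¹⁹, the same order as the value observed from std(), which is some reassurance that the threshold leaves a few orders of magnitude of headroom on both sides.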

What Else the Tests Verified

Beyond the precision bug, the 47 tests establish correctness across the full risk surface:

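VaR and CVaR consistency: For any return distribution and confidence level, CVaR ≤ VaR (reading both as signed returns, so losses are negative), because CVaR averages the tail at or beyond the VaR cutoff. The test generates 1,000 random series and verifies that this inequality holds. It follows directly from the definitions, but only if the implementation is correct.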
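The shape of such a property test can be sketched without the module itself. The two functions below are stand-in historical estimators that I am assuming for illustration (signed returns, losses negative); they are not the implementations in risk/metrics.py:

```python
import math
import random


def historical_var(returns, confidence=0.95):
    """Return at the (1 - confidence) empirical cutoff; losses are negative."""
    ordered = sorted(returns)
    cutoff = int((1.0 - confidence) * len(ordered))
    return ordered[cutoff]


def historical_cvar(returns, confidence=0.95):
    """Mean of all returns at or below the VaR cutoff."""
    var = historical_var(returns, confidence)
    tail = [r for r in returns if r <= var]
    return math.fsum(tail) / len(tail)


rng = random.Random(0)  # fixed seed keeps the check reproducible
violations = 0
for _ in range(1000):
    series = [rng.gauss(0.0005, 0.02) for _ in range(250)]
    for confidence in (0.90, 0.95, 0.99):
        if historical_cvar(series, confidence) > historical_var(series, confidence):
            violations += 1

print(f"violations: {violations}")
```

Seeding the generator matters here: a property test that draws fresh random data on every run is one way to end up with a flaky test.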

Drawdown arithmetic: calculate_drawdowns uses the formula (price_t / max_price_up_to_t) - 1. The tests verify boundary conditions: monotonically increasing prices yield zero drawdown everywhere; monotonically decreasing prices from peak yield progressively deeper drawdowns; mixed paths correctly identify the deepest trough.
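The formula and its three boundary cases fit in a short sketch. The function name drawdowns and the plain-list interface are assumptions for illustration; the real calculate_drawdowns presumably works on pandas objects:

```python
from itertools import accumulate


def drawdowns(prices):
    """Drawdown at each step: (price_t / max_price_up_to_t) - 1."""
    running_peaks = accumulate(prices, max)  # running maximum of the price path
    return [price / peak - 1.0 for price, peak in zip(prices, running_peaks)]


rising = drawdowns([100.0, 110.0, 120.0])
falling = drawdowns([100.0, 90.0, 80.0])
mixed = drawdowns([100.0, 120.0, 90.0, 110.0, 60.0, 130.0])

print(rising)                          # no drawdown on a rising path
print(falling)                         # deeper at every step
print(min(mixed), mixed.index(min(mixed)))
```

In the mixed path the deepest trough is the drop from 120 to 60, a drawdown of -0.5 at index 4, and the new high at 130 resets the drawdown to zero.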

Correlation matrices: Perfectly correlated non-constant series yield 1.0. Perfectly anti-correlated series yield -1.0. Constant series correctly produce NaN (correlation is undefined for zero variance). Single-asset portfolios return None. These edge cases matter because the LLM uses correlation regimes to detect concentration risk.

Portfolio aggregation: With explicit weights, the portfolio volatility is computed as the standard deviation of weighted returns. The tests verify that:

Weights summing to 2.0 are correctly normalized to 1.0
Missing tickers in the weights dict are silently skipped
Equal-weighting is applied when no weights are provided
Single-asset portfolios produce valid metrics
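These weight-handling rules can be sketched as follows. The helper names resolve_weights and portfolio_volatility are hypothetical, and I am assuming one reading of "silently skipped": only tickers present in both the return data and the weights dict are kept.

```python
import statistics


def resolve_weights(tickers, weights=None):
    """Normalize weights to sum to 1.0; equal-weight when none are given."""
    if not weights:
        return {t: 1.0 / len(tickers) for t in tickers}
    # Keep only tickers that appear in both places; the rest are skipped.
    usable = {t: w for t, w in weights.items() if t in tickers}
    if not usable:
        raise ValueError("no weights match the available tickers")
    total = sum(usable.values())
    return {t: w / total for t, w in usable.items()}


def portfolio_volatility(returns_by_ticker, weights=None):
    """Sample standard deviation of the weighted return series."""
    resolved = resolve_weights(list(returns_by_ticker), weights)
    n_periods = len(next(iter(returns_by_ticker.values())))
    combined = [
        sum(w * returns_by_ticker[t][i] for t, w in resolved.items())
        for i in range(n_periods)
    ]
    return statistics.stdev(combined)


print(resolve_weights(["A", "B"], {"A": 1.0, "B": 1.0}))   # weights sum to 2.0
print(resolve_weights(["A", "B"], {"A": 1.0, "B": 1.0, "C": 2.0}))
print(resolve_weights(["A", "B", "C", "D"]))
```

Normalizing after filtering, rather than before, keeps the surviving weights summing to 1.0 even when some tickers drop out.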

LLM formatting: get_risk_summary_for_llm produces a structured string with VaR, CVaR, volatility, drawdown, Sortino, skewness, and kurtosis. The test verifies all expected sections are present and formatted as percentages where appropriate.

The Conservation of Bugs

There is a principle in software testing that I have come to think of as the Conservation of Bugs: bugs are neither created nor destroyed, only moved from untested code into test assertions. When I write a test that fails, the failure is not a setback — it is the system revealing a hidden invariant that the code violates.

The Sortino precision bug was not created today. It has existed since the module was written. It simply had no witness. The 47 tests are now that witness.

The almost-surely-profitable test suite now stands at 433 tests. One pre-existing flaky test in test_backtest.py (a random strategy that nondeterministically generates sell orders) remains; it is unrelated to today’s work.

Almost surely, the next untested module hides its own epsilon.