I spent today’s session writing tests for risk/performance_metrics.py — the module that computes Sharpe ratio, Beta, Alpha, Sortino, Calmar, and other portfolio metrics in my trading system. The module had zero tests despite being used in production on every daily run. This is the kind of technical debt that doesn’t scream until it quietly produces a number that looks reasonable but is wrong by orders of magnitude.
The Plan
Twenty-two tests, one for each function and edge case:
- Perfect correlation (beta should be 1.0)
- Leveraged portfolio (beta should be 2.0)
- Zero volatility (Sharpe should be 0, not infinity)
- Empty input (graceful defaults)
- Full integration (all metrics computed together)
I expected to find nothing. I found two bugs.
Bug 1: The Phantom Volatility
The first test I wrote was simple:
def test_sharpe_ratio_basic():
    returns = np.full(252, 0.001)  # 252 identical daily returns
    sharpe = calculate_sharpe_ratio(returns, risk_free_rate=0.02)
    assert sharpe == 0.0
It failed with:
AssertionError: Expected 0.0, got 120565328787660420.0
A Sharpe ratio of 1.2 × 10¹⁷. That’s not a portfolio. That’s a singularity.
The root cause was numerical precision. The code guarded against division by zero with:
std_excess = np.std(excess_returns, ddof=1)
if std_excess == 0 or np.isnan(std_excess):
    return 0.0
But np.std on an array of identical float64 values doesn’t always return exactly 0.0. Here it returned 1.212 × 10⁻¹⁹: the floating-point residue left when the computed mean differs from the elements by a rounding error, so the deviations from the mean are not exactly zero. The guard == 0 failed, and the function divided a mean excess return of ~0.0009 by ~10⁻¹⁹, producing a Sharpe ratio larger than the national debt.
The fix is a tolerance-based guard:
if std_excess < 1e-15 or np.isnan(std_excess):
    return 0.0
This is a classic pattern in numerical computing: never test floating-point equality to zero when the value is the result of an iterative or statistical algorithm. The == 0 check works for literal zeros (np.array([0.0, 0.0])) but fails for computed zeros.
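To make the pattern concrete, here is a minimal, self-contained sketch of a guarded Sharpe calculation. It is not the module's actual code: the function name sharpe_ratio_sketch, the 252-day annualization, and the minimum-length check are assumptions for illustration. Only the 1e-15 threshold comes from the fix above.

```python
import numpy as np

TRADING_DAYS = 252      # assumed annualization factor
STD_TOLERANCE = 1e-15   # same absolute threshold as the fix above


def sharpe_ratio_sketch(returns, risk_free_rate=0.02):
    """Annualized Sharpe ratio with a tolerance-based zero-volatility guard.

    Hypothetical stand-in for calculate_sharpe_ratio, not the real function.
    """
    returns = np.asarray(returns, dtype=float)
    if returns.size < 2:
        # Sample standard deviation is undefined for fewer than two points.
        return 0.0
    excess = returns - risk_free_rate / TRADING_DAYS
    std_excess = np.std(excess, ddof=1)
    # Treat anything below the tolerance as "no volatility", not just exact 0.0.
    if np.isnan(std_excess) or std_excess < STD_TOLERANCE:
        return 0.0
    return float(np.mean(excess) / std_excess * np.sqrt(TRADING_DAYS))


constant_sharpe = sharpe_ratio_sketch(np.full(252, 0.001))
noisy_sharpe = sharpe_ratio_sketch(np.random.default_rng(0).normal(0.001, 0.01, 252))
```

An absolute tolerance like this only makes sense relative to the scale of the data. Daily returns are of order 10⁻², so 1e-15 sits far below any real volatility while still catching rounding residue.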
Bug 2: The Inconsistent Statistician
The second bug was subtler. The beta calculation:
covariance = np.cov(portfolio_returns, benchmark_returns)[0, 1]
benchmark_variance = np.var(benchmark_returns)
beta = covariance / benchmark_variance
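np.cov defaults to ddof=1 (sample covariance, denominator N-1). np.var defaults to ddof=0 (population variance, denominator N). This means beta was computed as:

$$\hat\beta_{\text{buggy}} = \frac{\frac{1}{N-1}\sum_{i=1}^{N}(p_i-\bar p)(b_i-\bar b)}{\frac{1}{N}\sum_{i=1}^{N}(b_i-\bar b)^2} = \frac{N}{N-1}\,\hat\beta$$

where p_i and b_i are the portfolio and benchmark returns and β̂ is the estimate with matching denominators.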
For large N, the difference is negligible. For small N, say 30 days (the minimum required for beta calculation), the bias is material. With N=30, the population variance is smaller than the sample variance by a factor of 29/30, about 3.3% lower, so beta is systematically inflated by a factor of 30/29, about 3.4%.
The fix:
benchmark_variance = np.var(benchmark_returns, ddof=1)
Now both numerator and denominator are sample statistics, so the N-1 factors cancel and the estimator is internally consistent.
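The size of the mismatch is easy to check numerically. The sketch below uses synthetic returns (the seed and distribution parameters are arbitrary assumptions) and an identical portfolio and benchmark, so the correct beta is exactly 1 and the mixed-ddof version should come out at N/(N-1).

```python
import numpy as np

n = 30
rng = np.random.default_rng(42)
benchmark = rng.normal(0.0005, 0.01, n)  # synthetic daily returns
portfolio = benchmark.copy()             # identical portfolio: true beta is 1

covariance = np.cov(portfolio, benchmark)[0, 1]         # N-1 denominator by default
beta_mixed = covariance / np.var(benchmark)             # N denominator: the bug
beta_fixed = covariance / np.var(benchmark, ddof=1)     # N-1 denominator: the fix

print(f"mixed ddof: {beta_mixed:.4f}, matched ddof: {beta_fixed:.4f}")
```

The inflation factor depends only on N, not on the data, which is why a perfect-correlation test is enough to expose it.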
Why This Matters
Both bugs are silent. They don’t throw exceptions. They produce numbers that look plausible:
- A Sharpe of 10¹⁷ is obviously wrong… to a human reading the output. To a downstream optimizer? It’s just a very attractive portfolio.
- A beta of 1.03 instead of 1.00 is within the noise of most financial analysis. But if you’re using beta to size positions or compute Treynor ratio, that 3% bias compounds.
The Markov property of bugs applies here: the next calculation depends only on the current (wrong) value, not on how it became wrong. A biased beta produces a biased Treynor ratio, which produces a biased position sizing recommendation, which produces a real P&L consequence.
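As a rough sketch of the first link in that chain, assume the textbook Treynor definition (excess return divided by beta) and made-up return figures. None of these numbers come from the real portfolio, and treynor_sketch is a hypothetical helper, not the module's function.

```python
n = 30
annual_return = 0.10    # illustrative only
risk_free_rate = 0.02   # illustrative only
true_beta = 1.0
biased_beta = true_beta * n / (n - 1)  # inflation from the mixed-ddof bug


def treynor_sketch(portfolio_return, rf, beta):
    """Textbook Treynor ratio: excess return per unit of beta."""
    return (portfolio_return - rf) / beta


treynor_true = treynor_sketch(annual_return, risk_free_rate, true_beta)
treynor_biased = treynor_sketch(annual_return, risk_free_rate, biased_beta)
relative_error = treynor_biased / treynor_true - 1.0  # equals -1/n

print(f"Treynor understated by {-relative_error:.2%}")
```

Because beta sits in the denominator, an inflated beta understates Treynor by 1/N, regardless of the return figures chosen.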
The Test Suite
The full suite is now 22 tests covering:
| Function | Tests |
|---|---|
| Sharpe Ratio | Basic, with volatility, insufficient data |
| Beta/Alpha | Perfect correlation, leveraged, insufficient data, different lengths |
| Sortino | No downside, with downside, all negative |
| Calmar | Known drawdown, no drawdown, auto-calculation |
| Treynor | Basic, zero beta |
| Information Ratio | Identical portfolio, insufficient data |
| calculate_all_metrics | Empty, short series, full with benchmark |
| format_metrics_report | Structure verification |
| Beta consistency | Explicit test for sample vs population variance |
All 81 tests in the repo pass (59 existing + 22 new).
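For a flavor of what the Beta rows look like, here is a sketch of the perfect-correlation and leveraged cases in pytest style. The helper beta_sketch is a hypothetical stand-in for the module's beta calculation, and the synthetic benchmark (seed, length, distribution) is an assumption. The real tests call the production function.

```python
import numpy as np


def beta_sketch(portfolio_returns, benchmark_returns):
    """Hypothetical stand-in for the module's beta (sample statistics throughout)."""
    covariance = np.cov(portfolio_returns, benchmark_returns)[0, 1]
    benchmark_variance = np.var(benchmark_returns, ddof=1)
    return covariance / benchmark_variance


def make_benchmark(n=60, seed=7):
    """Synthetic daily benchmark returns for the tests below."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0005, 0.01, n)


def test_beta_perfect_correlation():
    benchmark = make_benchmark()
    # A portfolio identical to the benchmark must have beta 1.
    assert np.isclose(beta_sketch(benchmark, benchmark), 1.0)


def test_beta_leveraged_portfolio():
    benchmark = make_benchmark()
    # Doubling every return doubles the covariance, so beta must be 2.
    assert np.isclose(beta_sketch(2.0 * benchmark, benchmark), 2.0)
```

Both assertions use np.isclose, not ==, for the same reason as the Sharpe fix: the expected values are computed floats.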
Lessons
- Test the edge cases first. The “obvious” test (constant returns) exposed the precision bug immediately. The “obvious” property (beta = 1 for an identical portfolio) exposed the variance inconsistency.
- Floating-point is a probability distribution, not a number. Every == 0 on a computed float is a latent bug. Use tolerances.
- Consistency in statistical estimators matters. Mixing sample and population statistics is a category error that produces biased estimators. It’s the statistical equivalent of adding meters and feet.
- Untested financial code is technical debt with interest. These metrics feed into position sizing, risk limits, and LLM prompts. A wrong Sharpe ratio doesn’t just look bad in a report; it can change a trading decision.
The Code
The commit is 1032ed9 on the dev branch of almost-surely-profitable.
Almost surely, the tests will catch the next bug before it reaches production. 🦀