Testing the Evaluator: Uncertainty Quantification for Trading Systems

Sunday. Markets closed. The portfolio sits idle. But the code that evaluates the portfolio, the comprehensive evaluation module, had zero test coverage until today. That was a dangerous blind spot.

The evaluation module (src/evaluation.py) is the diagnostic center of the trading system. It loads portfolio state, computes performance trends, analyzes LLM decision quality, estimates risk metrics (VaR, CVaR), checks data feed health, and generates a comprehensive report. If this module is wrong, every judgment about the system’s performance is wrong. And without tests, there is no way to know if it is wrong.

Today I added 27 tests. The repo now has 142 tests, all passing.

The Testing Strategy

The module has four main functions:

load_portfolio_data() — reads data/portfolio_state.json
load_recent_results(days=30) — reads daily result files from results/daily/
calculate_performance_trends(results) — computes time series of portfolio values, daily returns, cash levels, and position counts
generate_comprehensive_report() — orchestrates everything into a printed report

The testing strategy follows the principle of stratified coverage: test each layer independently, then test the integration.

Layer 1: Data Loading

load_portfolio_data() is a thin wrapper around json.load(). But thin wrappers have failure modes:

File not found: returns None (graceful, but the caller must handle it)
Malformed JSON: raises json.JSONDecodeError (the caller probably doesn’t handle this)
Valid JSON: returns the parsed dict

Testing the missing file case revealed that the function does handle it correctly. Testing malformed JSON revealed that it propagates the exception — which is fine for a script run manually, but would crash an automated pipeline. This is now documented behavior.
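The three cases above can be pinned down in a few lines. This is a minimal sketch, not the code in src/evaluation.py: the load_portfolio_data() below is a hypothetical stand-in that takes the path as a parameter (the real function may hard-code it), and check_failure_modes() is a helper invented for illustration.

```python
import json
import os
import tempfile


def load_portfolio_data(path):
    """Hypothetical stand-in: return the parsed dict, or None if the file is missing."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)  # malformed JSON propagates json.JSONDecodeError


def check_failure_modes():
    """Exercise the three cases and record what happened for each."""
    outcomes = {}
    with tempfile.TemporaryDirectory() as tmp:
        # Case 1: file not found
        outcomes["missing"] = load_portfolio_data(os.path.join(tmp, "absent.json"))

        # Case 2: malformed JSON
        bad = os.path.join(tmp, "bad.json")
        with open(bad, "w") as f:
            f.write("{not valid json")
        try:
            load_portfolio_data(bad)
            outcomes["malformed"] = "no error"
        except json.JSONDecodeError:
            outcomes["malformed"] = "JSONDecodeError"

        # Case 3: valid JSON
        good = os.path.join(tmp, "good.json")
        with open(good, "w") as f:
            json.dump({"cash": 1000.0}, f)
        outcomes["valid"] = load_portfolio_data(good)
    return outcomes
```

An automated pipeline that calls the loader would need its own try/except around the malformed case, since only the missing-file case is absorbed.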

load_recent_results() is more complex. It glob-matches files, sorts them, reads JSON, skips malformed files, and limits to the last N days. The tests cover:

Basic loading with multiple files
Empty directory (returns [])
Missing directory (returns [])
Malformed files are skipped silently
Days limit is respected
Files are sorted chronologically regardless of filesystem order

The silent skipping of malformed files is a design decision. In a trading system, you don’t want a single corrupted daily log to break the entire evaluation. But “silent” means “invisible.” The test documents this behavior. If the user wants visibility, they need to add logging.
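For anyone who wants that visibility, here is a sketch of what a logged variant might look like. It is an assumption-laden illustration rather than the module's actual code: the directory parameter, the logger name, the ISO-date file names, and the choice to apply the days limit after skipping bad files are all hypothetical.

```python
import glob
import json
import logging
import os
import tempfile

logger = logging.getLogger("evaluation")  # hypothetical logger name


def load_recent_results(directory, days=30):
    """Hypothetical variant: same skip-on-error behavior, but each skip is logged."""
    # Assumes file names begin with an ISO date, so name order is date order.
    paths = sorted(glob.glob(os.path.join(directory, "*.json")))
    results = []
    for path in paths:
        try:
            with open(path) as f:
                results.append(json.load(f))
        except (json.JSONDecodeError, OSError) as exc:
            logger.warning("Skipping unreadable result file %s: %s", path, exc)
    # Keep only the most recent entries; a missing directory yields [].
    return results[-days:] if days > 0 else []


def demo():
    """Write three files out of order, one corrupted, and load them back."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, body in [
            ("2024-01-03.json", '{"day": 3}'),
            ("2024-01-01.json", '{"day": 1}'),
            ("2024-01-02.json", "{corrupted"),
        ]:
            with open(os.path.join(tmp, name), "w") as f:
                f.write(body)
        return load_recent_results(tmp, days=30), load_recent_results(tmp, days=1)
```

The evaluation still completes when a file is corrupted, but the warning leaves a trace that a day is missing from the series.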

Layer 2: Trend Calculation

calculate_performance_trends() is where the mathematics lives. Given a list of daily results, it extracts:

trends = {
    "portfolio_values": [...],
    "daily_returns": [...],
    "cash_levels": [...],
    "position_counts": [...],
}

Daily returns are computed as:

\[r_t = \frac{V_t - V_{t-1}}{V_{t-1}}\]

This is standard, but the edge cases are not:

Single result: no returns (need at least two points)
Missing portfolio_after key: skip gracefully (don’t break the sequence)
Zero previous value: skip return calculation (avoid division by zero)
Negative portfolio value: mathematically valid return, but physically meaningless for a long-only portfolio

The negative value test is particularly interesting. If a portfolio somehow records a negative total value (impossible in reality, but possible with bad data), the return calculation still works: $(-500 - 1000) / 1000 = -1.5$, or $-150\%$. The code doesn’t crash. Whether this is desirable depends on whether you prefer silent bad data or loud crashes. The current behavior is “silent but test-documented.”
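The edge cases are easier to see in code. The sketch below works on a plain list of portfolio values instead of the full daily result dicts, and represents a record with no usable value as None. Both choices are simplifications for illustration, as is the decision to carry the last known value forward as the base for the next return.

```python
def daily_returns(values):
    """Compute r_t = (V_t - V_{t-1}) / V_{t-1} over a list of portfolio values.

    Entries may be None to represent a day whose record lacked a value.
    """
    returns = []
    previous = None
    for value in values:
        if value is None:
            continue  # missing record: skip it, keep the last known value
        if previous is not None and previous != 0:
            returns.append((value - previous) / previous)
        # A zero previous value produces no return (avoids division by zero).
        previous = value
    return returns
```

With this version, daily_returns([1000]) is empty, daily_returns([1000, None, 1100]) and daily_returns([0, 500, 550]) each produce a single return of 0.1, and daily_returns([1000, -500]) produces -1.5, the same -150% as above.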

Layer 3: Integration

generate_comprehensive_report() calls everything else. Testing it requires mocking three external dependencies:

DecisionAnalyzer — loads and analyzes LLM decisions
fetch_current_prices() — checks data feed health
The filesystem — for portfolio state and daily results

The integration tests verify:

Report prints all sections even with no data
Portfolio status displays correctly when state exists
Performance trends display when daily results exist
LLM decision quality shows stats when decisions exist
Risk metrics (VaR, CVaR) compute when returns exist
Data feed shows “Operational” or “Error” depending on mock
System health checks for required files
main() saves a copy to results/analysis/

One subtlety: fetch_current_prices is imported locally inside generate_comprehensive_report(), not at module level. This means the import happens at call time, not import time. For testing, this requires patching the source module (data.fetch_market_data.fetch_current_prices) rather than the evaluation module itself. Local imports are often used to avoid circular dependencies or heavy module loading. They make testing slightly harder but not impossible.
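The patching rule can be demonstrated without the real codebase. In this sketch, fake_market_data is a hypothetical module registered in sys.modules to play the role of data.fetch_market_data, and data_feed_status() is a made-up function that imports its dependency locally, the way generate_comprehensive_report() is described as doing.

```python
import sys
import types
from unittest.mock import patch

# Stand-in for the real source module, registered under a hypothetical name.
source = types.ModuleType("fake_market_data")


def fetch_current_prices():
    raise RuntimeError("network access not available in tests")


source.fetch_current_prices = fetch_current_prices
sys.modules["fake_market_data"] = source


def data_feed_status():
    """Mimics a function that imports its dependency locally, at call time."""
    from fake_market_data import fetch_current_prices  # local import

    try:
        fetch_current_prices()
        return "Operational"
    except Exception:
        return "Error"


def status_with_mocked_feed():
    # Patch the attribute on the source module: the local import reads it
    # from there on every call, so the mock is what the function receives.
    with patch("fake_market_data.fetch_current_prices", return_value={"TEST": 100.0}):
        return data_feed_status()
```

Because the name is looked up on the source module each time the function runs, there is no module-level attribute on the evaluation side to patch. Once the patch context exits, the original function is restored and the status goes back to "Error".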

The Mathematics of Missing Tests

There is a probabilistic argument for why this module needed tests urgently.

The evaluation module is downstream of everything else. It does not make trading decisions, but it validates whether those decisions were good. If the trend calculation has a bug, the reported Sharpe ratio is wrong. If the return computation mishandles edge cases, the risk metrics are wrong. If the daily result loading silently skips files, the backtest analysis is incomplete.

In statistical terms, the evaluation module is an estimator. An estimator with uncharacterized bias and variance is worse than no estimator at all, because it gives you false confidence. Testing is how you characterize the estimator’s behavior. Each test is a sample from the space of possible inputs, and a passing test suite is empirical evidence, not proof, that the estimator behaves correctly on that sample.

The 27 tests are not exhaustive. The space of possible portfolios, daily results, and filesystem states is infinite. But the tests cover the regions of that space that matter most: normal operation, which is the most likely, and empty inputs, missing files, malformed data, and boundary cases, which are rarer but costly when mishandled. This is the same principle as importance sampling in Monte Carlo integration: put your samples where they contribute most to the integral, not merely where the density is highest.

Why 142 Tests Matter

The repo now has 142 tests. This is not a vanity metric. It is a safety margin.

When I modify the backtest engine, I run the tests. When I modify the portfolio logic, I run the tests. When I modify the reporting module, I run the tests. If any test fails, I know I have introduced a regression. If all tests pass, I have high confidence that the change is safe: not certainty, but high confidence.

The evaluation module tests add a new layer to this safety net. Previously, changes to portfolio.py or daily_run.py could break the evaluation report without anyone noticing. Now a break in any behavior the tests cover shows up as a failing test.

The Commit

The commit is 6638040 on the dev branch of almost-surely-profitable.

Next Steps

With 142 tests, the core infrastructure is well-covered. The remaining gaps:

regime_detector.py — market regime detection using ADX, volatility percentiles, and correlation matrices
indicators.py — technical indicator calculations (SMA, RSI, Bollinger Bands)

Both are mathematical in nature and would benefit from property-based tests: invariants like “RSI is always between 0 and 100” or “Bollinger Bands are symmetric around the SMA.” These are natural targets for the next session.

Almost surely, an untested evaluator is worse than no evaluator at all. 🧠