Testing Decision Memory: Learning from the Markov Property

Andrei Markov taught us that the future depends only on the present, not on the path taken to arrive there. Traders, unfortunately, are not Markov processes. We anchor to entry prices, chase losses, and repeat the same setup three times in a row because “this time it’s different.” The decision_memory.py module exists to break this cycle by making the system explicitly remember and learn from its own history.

Today I added 54 tests to a module that had zero. The exercise revealed subtle statistical assumptions that, left untested, would have corrupted the LLM’s context window with nonsense.

The Architecture of Memory

DecisionMemory is a persistence layer wrapped around a list of DecisionRecord dataclasses. Each record captures:

The decision itself (buy/sell/hold, ticker, price, quantity)
Market context at decision time (RSI, Bollinger position, SMAs, volatility)
Outcome metrics (P&L %, holding period, max drawdown)
The LLM’s own reasoning string
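To make the shape concrete, here is a minimal sketch of what such a record could look like. The field names, types, and defaults are my illustrative assumptions, not the module's actual schema; only the four groups of information come from the list above.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class DecisionRecord:
    # All field names below are hypothetical; the real schema may differ.
    # The decision itself
    action: str                                  # "buy", "sell", or "hold"
    ticker: str
    price: float
    quantity: float
    decision_date: str                           # ISO date string
    # Market context at decision time
    rsi: Optional[float] = None                  # 0-100
    bollinger_position: Optional[float] = None   # assumed 0 = lower band, 1 = upper band
    sma_20: Optional[float] = None
    sma_50: Optional[float] = None
    volatility: Optional[float] = None
    # Outcome metrics, unknown until the trade closes
    pnl_pct: Optional[float] = None
    holding_period_days: Optional[int] = None
    max_drawdown_pct: Optional[float] = None
    # The LLM's own reasoning
    reasoning: str = ""


record = DecisionRecord(
    action="buy", ticker="XYZ", price=100.0, quantity=10,
    decision_date="2024-01-15", rsi=28.5, bollinger_position=0.1,
    reasoning="Oversold bounce setup",
)

# Round-trip through a plain dict, the kind of path a persistence layer relies on.
restored = DecisionRecord(**asdict(record))
```

Keeping the outcome fields optional lets one record type cover both open decisions and completed trades, which matters for every count that follows.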

This data feeds three analytical paths:

Summary statistics (get_decision_summary): win rate, average P&L, action breakdown over a rolling window
Pattern analysis (get_pattern_analysis): correlations between indicators and outcomes, holding-period bucket performance, behavioral flags
Lesson generation (generate_lessons_learned): human-readable insights injected back into the LLM prompt
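The first of these paths can be sketched in a few lines. This is an assumption-laden illustration, not the real get_decision_summary: I treat the rolling window as "the last N decisions" (it could equally be a number of days), represent records as plain dicts, and invent the output keys.

```python
from collections import Counter
from statistics import mean


def summarize(decisions, window=20):
    """Summarize the most recent decisions (hypothetical helper).

    decisions: list of dicts, oldest first, each with an 'action' key and
    an optional 'pnl_pct' key that is absent or None until the trade closes.
    """
    recent = decisions[-window:]
    completed = [d["pnl_pct"] for d in recent if d.get("pnl_pct") is not None]
    wins = sum(1 for p in completed if p > 0)
    return {
        "total_decisions": len(recent),
        "completed_trades": len(completed),
        # None rather than 0.0 when nothing has closed yet.
        "win_rate": wins / len(completed) if completed else None,
        "avg_pnl_pct": mean(completed) if completed else None,
        "actions": dict(Counter(d["action"] for d in recent)),
    }


summary = summarize([
    {"action": "buy", "pnl_pct": 2.0},
    {"action": "sell", "pnl_pct": -1.0},
    {"action": "hold"},
    {"action": "buy", "pnl_pct": 0.0},
])
```

Returning None for an empty window is a deliberate choice in this sketch: a win rate of 0.0 would read as "everything lost" rather than "nothing to report."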

What the Tests Caught

The 10-Trade Threshold

Pattern analysis requires at least 10 completed trades before it computes correlations. Below that threshold, it returns {"status": "insufficient_data"}. This is a sensible guardrail, but it means that every downstream consumer – especially generate_lessons_learned – must handle the insufficient-data case gracefully.

The tests verify that:

9 completed trades produce no correlation metrics
10 completed trades with perfect linear RSI correlation yield rsi_correlation ≈ 1.0
10 completed trades with an inverse RSI-to-P&L relationship yield rsi_correlation < -0.3, triggering the mean-reversion lesson
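The three cases above can be reproduced with a small stand-in. This sketch assumes the correlation is a plain Pearson coefficient over (RSI, P&L) pairs; the function names and the "ok" status are mine, and only the 10-trade threshold and the insufficient_data status come from the module.

```python
import math

MIN_COMPLETED_TRADES = 10


def pearson(xs, ys):
    """Pearson correlation; returns None if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return None
    return cov / (spread_x * spread_y)


def rsi_pattern_analysis(trades):
    """trades: list of (rsi, pnl_pct) pairs for completed trades only."""
    if len(trades) < MIN_COMPLETED_TRADES:
        return {"status": "insufficient_data"}
    rsi_values = [rsi for rsi, _ in trades]
    pnl_values = [pnl for _, pnl in trades]
    return {"status": "ok", "rsi_correlation": pearson(rsi_values, pnl_values)}


# P&L rises in lockstep with RSI, then falls in lockstep with it.
rising = [(30 + 5 * i, 0.5 * i) for i in range(10)]
falling = [(30 + 5 * i, -0.5 * i) for i in range(10)]

nine_trades = rsi_pattern_analysis(rising[:9])
perfect = rsi_pattern_analysis(rising)
inverse = rsi_pattern_analysis(falling)
```

The constant-series guard is worth having in any version of this: ten trades that all entered at the same RSI would otherwise divide by zero.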

Zero P&L Is a Loser

The code classifies pnl_pct <= 0 as a losing trade. This is deliberately conservative: a trade that merely breaks even has still tied up capital, and once transaction costs, slippage, and opportunity cost are counted it is a net loss. The test codifies this:

# One zero-P&L trade + nine winners = 9 winners, 1 loser
assert analysis["winners"] == 9
assert analysis["losers"] == 1
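For context, the rule behind those two assertions fits in one comparison. The helper below is a hypothetical stand-in for the relevant part of the pattern analysis, written only to show where the boundary sits.

```python
def split_winners_losers(pnl_values):
    """Count winners and losers; exactly 0.0 counts as a loser."""
    winners = sum(1 for pnl in pnl_values if pnl > 0)
    losers = sum(1 for pnl in pnl_values if pnl <= 0)
    return {"winners": winners, "losers": losers}


# One zero-P&L trade plus nine winners.
analysis = split_winners_losers([0.0] + [1.5] * 9)
```

Because the two conditions are exact complements, every completed trade lands in one bucket and the counts always sum to the number of trades.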

The Overtrading Boundary

Behavioral analysis computes avg_trades_per_day over the last 20 unique decision dates. If this exceeds 3.0, an overtrading flag fires. The test suite verifies both sides of the boundary:

10 trades across 4 days = 2.5/day → no flag
10 trades plus 10 decisions without P&L, all on a single day = 20/day → flag fires
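Both sides of that boundary can be checked with a short sketch. It assumes every decision counts toward the average whether or not it has a P&L yet, which is what the second case implies; the function name and return keys are illustrative.

```python
from collections import Counter
from datetime import date

OVERTRADING_THRESHOLD = 3.0   # flag fires only when the average exceeds this
LOOKBACK_DATES = 20           # last 20 unique decision dates


def overtrading_check(decision_dates):
    """decision_dates: one date per decision, in any order."""
    per_day = Counter(decision_dates)
    recent_days = sorted(per_day)[-LOOKBACK_DATES:]
    if not recent_days:
        return {"avg_trades_per_day": 0.0, "overtrading": False}
    avg = sum(per_day[day] for day in recent_days) / len(recent_days)
    return {"avg_trades_per_day": avg, "overtrading": avg > OVERTRADING_THRESHOLD}


# 10 decisions spread across 4 days: 3 + 3 + 2 + 2.
spread = (
    [date(2024, 1, 2)] * 3 + [date(2024, 1, 3)] * 3
    + [date(2024, 1, 4)] * 2 + [date(2024, 1, 5)] * 2
)
# 20 decisions crammed into one day.
burst = [date(2024, 1, 2)] * 20

calm = overtrading_check(spread)
frantic = overtrading_check(burst)
```

Dividing by unique decision dates rather than calendar days means quiet days do not dilute the average, so a burst of activity stays visible.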

This is not just a test of arithmetic. It is a test of whether the system can recognize when it is talking itself into action too frequently.

Similarity Scoring

get_similar_decisions finds historical decisions that resemble current market conditions using a normalized distance metric:

similarity = |RSI_history - RSI_current| / 100 + |Bollinger_history - Bollinger_current|

Despite the name, the score is a distance: lower means closer. The tests verify that an exact match (same RSI, same Bollinger position) returns similarity 0.0 and ranks first, while a distant record ranks last. They also verify that records missing RSI or Bollinger data are silently skipped rather than crashing with a TypeError from arithmetic on None.
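Here is a minimal sketch of that ranking, using the formula above. Records are plain dicts with hypothetical keys, and the limit parameter is my addition; the real get_similar_decisions signature may look quite different.

```python
def rank_similar(history, rsi_now, bollinger_now, limit=5):
    """Return (score, record) pairs, closest first. Lower score = more similar."""
    scored = []
    for record in history:
        rsi = record.get("rsi")
        bollinger = record.get("bollinger_position")
        # Skip incomplete records instead of doing arithmetic on None.
        if rsi is None or bollinger is None:
            continue
        score = abs(rsi - rsi_now) / 100 + abs(bollinger - bollinger_now)
        scored.append((score, record))
    scored.sort(key=lambda pair: pair[0])
    return scored[:limit]


history = [
    {"id": "far", "rsi": 80.0, "bollinger_position": 0.9},
    {"id": "missing", "rsi": None, "bollinger_position": 0.2},
    {"id": "exact", "rsi": 30.0, "bollinger_position": 0.2},
]
ranked = rank_similar(history, rsi_now=30.0, bollinger_now=0.2)
```

Dividing the RSI gap by 100 puts it on roughly the same 0-to-1 scale as the Bollinger gap, assuming Bollinger position is itself expressed as a fraction of the band width.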

Holding-Period Bucketing

The code bins trades into short (≤5 days), medium (more than 5 and up to 20 days), and long (>20 days). Both boundaries are inclusive on the lower bucket, which makes for a subtle boundary test:

# 5 days is short_term, 20 days is medium_term
make_record(holding_period_days=5)  # short
make_record(holding_period_days=20) # medium

The test suite confirms these inclusions and verifies that None holding periods are excluded from bucket averages rather than being treated as zero.
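The boundary logic and the None handling can be sketched as follows. The bucket names short_term and medium_term come from the test comment above; long_term, the function names, and the (days, pnl) tuple input are my assumptions.

```python
from statistics import mean


def bucket_for(holding_period_days):
    """Map a holding period to a bucket name; None means 'not bucketed'."""
    if holding_period_days is None:
        return None
    if holding_period_days <= 5:
        return "short_term"
    if holding_period_days <= 20:
        return "medium_term"
    return "long_term"


def bucket_averages(trades):
    """trades: list of (holding_period_days, pnl_pct) pairs."""
    buckets = {"short_term": [], "medium_term": [], "long_term": []}
    for days, pnl in trades:
        name = bucket_for(days)
        if name is not None:          # a missing period is excluded, not treated as 0 days
            buckets[name].append(pnl)
    return {name: (mean(values) if values else None) for name, values in buckets.items()}


averages = bucket_averages([(5, 2.0), (None, -10.0), (20, 1.0), (3, 4.0)])
```

If None were coerced to zero, the -10.0 trade would land in the short-term bucket and drag its average from 3.0 down to about -1.3, which is exactly the kind of quiet corruption the test guards against.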

Lessons as a Feedback Loop

The most interesting part of the module is generate_lessons_learned. It transforms raw statistics into LLM-ready prompts like:

“Recent win rate is 62.5% — strategy showing edge. Maintain discipline.” “Lower RSI entries tend to perform better (mean reversion working).” “Short-term holds (≤5d) outperform longer holds. Consider quicker profit-taking.”

These are not decorative. They are appended to the system prompt before each trading session, creating a closed feedback loop: the LLM acts, the system measures, the memory distills, the LLM learns.

The tests exercise every lesson branch:

Win rate below 40% triggers a warning
Win rate above 55% triggers confirmation
Average loss per trade below -1% triggers risk-management advice
Average gain above 1% triggers praise
RSI correlation < -0.3 suggests mean reversion
RSI correlation > 0.3 suggests momentum
Bollinger correlation < -0.3 suggests oversold bounces
Short-term outperformance suggests quicker profit-taking
Long-term outperformance suggests letting winners run
Overtrading flag suggests cooling-off periods
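The branch structure above can be sketched as a function that reports which lessons fire. To avoid inventing lesson wording, it returns short identifiers instead of prose. Everything here is an assumption layered on the list: the input keys, reading "average loss" and "average gain" as the sign of average P&L per trade, and comparing short-term against long-term averages directly.

```python
def lesson_branches(summary, patterns):
    """Return identifiers (hypothetical names) of the lesson branches that fire."""
    fired = []

    win_rate = summary.get("win_rate")            # percent, e.g. 62.5
    if win_rate is not None:
        if win_rate < 40:
            fired.append("win_rate_warning")
        elif win_rate > 55:
            fired.append("win_rate_confirmation")

    avg_pnl = summary.get("avg_pnl_pct")
    if avg_pnl is not None:
        if avg_pnl < -1:
            fired.append("risk_management")
        elif avg_pnl > 1:
            fired.append("avg_gain_praise")

    # Below the 10-trade threshold there are no pattern metrics to read.
    if patterns.get("status") == "insufficient_data":
        return fired

    rsi_corr = patterns.get("rsi_correlation")
    if rsi_corr is not None:
        if rsi_corr < -0.3:
            fired.append("mean_reversion")
        elif rsi_corr > 0.3:
            fired.append("momentum")

    bollinger_corr = patterns.get("bollinger_correlation")
    if bollinger_corr is not None and bollinger_corr < -0.3:
        fired.append("oversold_bounce")

    short_avg = patterns.get("short_term_avg")
    long_avg = patterns.get("long_term_avg")
    if short_avg is not None and long_avg is not None:
        if short_avg > long_avg:
            fired.append("quicker_profit_taking")
        elif long_avg > short_avg:
            fired.append("let_winners_run")

    if patterns.get("overtrading"):
        fired.append("cooling_off")
    return fired


early = lesson_branches(
    {"win_rate": 62.5, "avg_pnl_pct": 1.4},
    {"status": "insufficient_data"},
)
struggling = lesson_branches(
    {"win_rate": 35.0, "avg_pnl_pct": -1.5},
    {"rsi_correlation": -0.6, "bollinger_correlation": -0.4,
     "short_term_avg": 2.0, "long_term_avg": 0.5, "overtrading": True},
)
```

Every threshold is a strict inequality in this sketch, so a win rate of exactly 40% or a correlation of exactly -0.3 fires nothing. Whether the real module agrees is precisely what a boundary test should pin down.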

Why This Matters

A trading system without memory is a Markov chain with bad drift. It makes the same mistakes repeatedly because it has no state variable for “we tried this before and it didn’t work.”

The 54 tests turn decision_memory.py from an aspirational idea into a verified contract. Every statistical claim now has a boundary condition. Every lesson branch has an assertion. Every serialization path has a round-trip test.

The test suite for almost-surely-profitable now stands at 348 tests. Each one is a theorem about the system’s behavior under a specific input distribution. Taken together, they form a proof that the codebase does what it claims to do – or at least, that it fails in known ways.

The Markov property is elegant in theory. In practice, memory is what separates a strategy from a slot machine.