Simulating Discipline: Backtesting Position Cooldown Guardrails

The live trading system has guardrails. The backtest did not. This asymmetry meant every backtest was an optimistic lie — a simulation of a trader with infinite patience and no emotional impulses, while the live system enforced patience mechanically. Today I closed that gap.

The Problem

The live trading pipeline has three guardrails designed to curb overtrading:

Minimum holding period (min_hold_days=5): cannot sell within 5 days of entry
Flip cooldown (flip_cooldown_days=10): cannot re-enter a ticker within 10 days of exiting
Weekly trade cap (max_trades_per_week=2): maximum 2 non-hold actions per week

These were added after observing a 4.5% round-trip win rate and 318 trades per year — a clear overtrading pathology. But the backtest engine, which is supposed to validate strategy ideas before deploying capital, had no concept of these constraints. A backtest would simulate 600 trades in a year; live trading would permit maybe 100. The backtest was not backtesting the actual strategy. It was backtesting a fantasy.

This is not a minor discrepancy. In stochastic control, a policy evaluated under different constraints than those it will face in deployment is not the same policy. The backtest optimization landscape is distorted. Decisions that look good in simulation may be impossible in reality, and constraints that would have prevented catastrophic trades in simulation are invisible.

The Solution

I built BacktestCooldownManager, a simulated-date variant of the live PositionCooldownManager. The key difference: instead of datetime.now(), it accepts an explicit current_date parameter for all time calculations. This makes it deterministic and reproducible — essential for backtesting.

The module enforces the same three rules:

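from dataclasses import dataclass


@dataclass
class CooldownConfig:
    min_hold_days: int = 5                 # no selling within this many days of entry
    flip_cooldown_days: int = 10           # no re-entry within this many days of an exit
    max_trades_per_week: int = 2           # cap on non-hold actions per week
    allow_stop_loss_override: bool = True  # let a stop-loss waive the minimum hold
    stop_loss_threshold_pct: float = 5.0   # loss below entry (in %) that triggers the override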

The stop-loss override is critical: if a position drops 5% below its entry price, the minimum holding period is waived. This preserves the original intent, which is to prevent overtrading, not to prevent risk management.
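A minimal sketch of how a simulated-date manager might apply the minimum hold, the stop-loss override, and the flip cooldown. This is not the code from the commit: the class name SketchCooldownManager, the method signatures, and the entry_price/current_price arguments are assumptions made for illustration. It also assumes calendar days and treats a position held exactly min_hold_days as sellable; the real module may count differently.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict


@dataclass
class CooldownConfig:
    min_hold_days: int = 5
    flip_cooldown_days: int = 10
    allow_stop_loss_override: bool = True
    stop_loss_threshold_pct: float = 5.0


class SketchCooldownManager:
    """Hypothetical, simplified stand-in for BacktestCooldownManager."""

    def __init__(self, config: CooldownConfig) -> None:
        self.config = config
        self.entry_dates: Dict[str, date] = {}  # ticker -> date of current entry
        self.exit_dates: Dict[str, date] = {}   # ticker -> date of last exit

    def record_buy(self, ticker: str, current_date: date) -> None:
        self.entry_dates[ticker] = current_date

    def record_sell(self, ticker: str, current_date: date) -> None:
        self.entry_dates.pop(ticker, None)
        self.exit_dates[ticker] = current_date

    def can_buy(self, ticker: str, current_date: date) -> bool:
        # Flip cooldown: block re-entry until enough days have passed since exit.
        last_exit = self.exit_dates.get(ticker)
        if last_exit is None:
            return True
        return (current_date - last_exit).days >= self.config.flip_cooldown_days

    def can_sell(self, ticker: str, current_date: date,
                 entry_price: float, current_price: float) -> bool:
        entry = self.entry_dates.get(ticker)
        if entry is None:
            return False  # no open position to sell
        if (current_date - entry).days >= self.config.min_hold_days:
            return True   # minimum hold satisfied
        # Stop-loss override: a large enough loss waives the minimum hold.
        loss_pct = (entry_price - current_price) / entry_price * 100.0
        return (self.config.allow_stop_loss_override
                and loss_pct >= self.config.stop_loss_threshold_pct)


mgr = SketchCooldownManager(CooldownConfig())
mgr.record_buy("SPY", date(2024, 1, 2))
print(mgr.can_sell("SPY", date(2024, 1, 4), 100.0, 98.0))  # False: held 2 days, 2% loss
print(mgr.can_sell("SPY", date(2024, 1, 4), 100.0, 94.0))  # True: 6% loss triggers override
mgr.record_sell("SPY", date(2024, 1, 4))
print(mgr.can_buy("SPY", date(2024, 1, 10)))  # False: 6 days since exit
print(mgr.can_buy("SPY", date(2024, 1, 14)))  # True: 10 days since exit
```

Because every method takes current_date as an argument, the same sequence of calls always produces the same answers, which is the property the backtest needs.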

Integration into BacktestEngine is opt-in via two new parameters:

engine = BacktestEngine(
    start_date="2024-01-01",
    end_date="2024-12-31",
    enable_cooldowns=True,
    cooldown_config=CooldownConfig(min_hold_days=5)
)

All four strategies (buy_and_hold, equal_weight, random, llm) now respect the cooldowns when enabled. The CLI tool run_backtest.py exposes --cooldowns, --min-hold-days, --flip-cooldown-days, and --max-trades-per-week flags.
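For orientation, here is one way the four flags could be wired up with argparse. Only the flag names come from the tool; the types, the defaults (copied from CooldownConfig), and the help text are assumptions, and the actual run_backtest.py may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the cooldown-related flags."""
    parser = argparse.ArgumentParser(description="Cooldown flags (illustrative sketch)")
    # Assumed to be a plain on/off switch.
    parser.add_argument("--cooldowns", action="store_true",
                        help="enable cooldown guardrails in the backtest")
    # Defaults below mirror CooldownConfig; that mirroring is an assumption.
    parser.add_argument("--min-hold-days", type=int, default=5)
    parser.add_argument("--flip-cooldown-days", type=int, default=10)
    parser.add_argument("--max-trades-per-week", type=int, default=2)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is reproducible.
args = build_parser().parse_args(["--cooldowns", "--min-hold-days", "3"])
print(args.cooldowns, args.min_hold_days, args.flip_cooldown_days, args.max_trades_per_week)
# prints: True 3 10 2
```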

The Benchmark

The interesting question is not whether the integration works — the tests prove that — but whether the guardrails actually improve outcomes. I ran a controlled experiment on 2024 data (SPY, QQQ, GLD, IWM, TLT) using the random strategy with seed=42, comparing identical decision sequences with and without cooldown enforcement.

Metric	Baseline (no cooldowns)	+Cooldowns	Delta
Final Value	€11,354.66	€11,758.61	+€403.95
Total Return	13.55%	17.59%	+4.04 pp
Sharpe Ratio	2.28	2.66	+0.38
Max Drawdown	5.11%	6.15%	+1.04 pp
Volatility	8.02%	9.22%	+1.20 pp
Num Trades	609	99	-510
Win Rate	55.20%	59.20%	+4.00 pp

The cooldown guardrails blocked 83.5% of attempted trades (500 out of 599). Yet the portfolio with fewer trades ended with €404 more, a gain of 4.04 percentage points in total return.
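The arithmetic behind these figures is worth spelling out, because the same gap reads very differently depending on the base. The sketch below uses only the numbers reported above; the variable names are mine.

```python
# Figures copied from the benchmark table and the block-rate sentence.
baseline_final_eur = 11_354.66
cooldown_final_eur = 11_758.61
baseline_return_pct = 13.55
cooldown_return_pct = 17.59
blocked, attempted = 500, 599

delta_eur = cooldown_final_eur - baseline_final_eur
delta_pp = cooldown_return_pct - baseline_return_pct    # gap in percentage points
relative_final = delta_eur / baseline_final_eur * 100   # gap relative to final value
relative_return = delta_pp / baseline_return_pct * 100  # gap relative to baseline return
block_rate = blocked / attempted * 100

print(f"extra value:        EUR {delta_eur:.2f}")        # EUR 403.95
print(f"return gap:         {delta_pp:.2f} pp")          # 4.04 pp
print(f"vs. final value:    {relative_final:.2f}%")      # 3.56%
print(f"vs. baseline return: {relative_return:.1f}%")    # 29.8%
print(f"block rate:         {block_rate:.1f}%")          # 83.5%
```

So the gap is 4.04 percentage points of total return, about 3.6% more money at the end of the year, and roughly 30% more return than the baseline earned.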

Analysis: Why Fewer Trades = More Money

This result is counterintuitive if you believe that “more information = more trades = more alpha.” But it makes perfect sense through the lens of transaction costs and noise trading.

The random strategy has no edge. Its decisions are noise. Each trade incurs:

Implicit costs: bid-ask spread, market impact (small in simulation, but non-zero in reality)
Opportunity cost: capital tied up in suboptimal positions
Path dependency: a bad entry locks capital that could have been deployed better later

With 609 trades, the random strategy churns capital constantly. With 99 trades, it makes fewer commitments, holds them longer, and avoids much of the noise. The Sharpe ratio improves from 2.28 to 2.66 even though volatility rises from 8.02% to 9.22%: the equity curve is not smoother, but return grows proportionally more than risk.

The Markov property of this system is instructive. Without cooldowns, the portfolio state tomorrow depends heavily on today’s random decision — high transition entropy. With cooldowns, the state is more persistent — low transition entropy, more deterministic drift. In information theory, high entropy is not always desirable. Sometimes it is just noise.
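To make the entropy comparison concrete, the sketch below computes the Shannon entropy of a single transition distribution: the probabilities of moving from "holding a position" to either "still holding" or "flipped" on the next step. The probabilities are invented for illustration and are not measured from the backtest.

```python
import math
from typing import Sequence


def shannon_entropy_bits(probs: Sequence[float]) -> float:
    """Shannon entropy in bits of a discrete distribution (zero terms skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Illustrative transition rows: [P(stay in position), P(flip position)].
no_cooldown = [0.50, 0.50]    # a coin flip every day
with_cooldown = [0.95, 0.05]  # position almost always persists

print(shannon_entropy_bits(no_cooldown))               # 1.0
print(round(shannon_entropy_bits(with_cooldown), 3))   # 0.286
```

A coin-flip transition carries a full bit of uncertainty per step, while the persistent one carries under a third of a bit; with a random strategy, that removed uncertainty was never information in the first place.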

The higher max drawdown (6.15% vs 5.11%) is the price of patience. When a position goes against you, you cannot exit immediately. The stop-loss override mitigates this at -5%, but between 0% and -5%, you ride the drawdown. This is by design. The guardrails trade short-term comfort for long-term discipline.

What This Means for the LLM Strategy

The random strategy has no predictive power. The LLM strategy claims to have some, derived from technical indicators and market context. But if an unguided random strategy improves by 4 percentage points when constrained, how much more might a strategy with some edge improve?

The working hypothesis: the LLM’s edge, if it exists, is being diluted by overtrading. The guardrails don’t improve the entry signal, but they prevent the strategy from destroying its own alpha through excessive turnover. This is consistent with a long-standing theme in the literature on transaction costs and portfolio rebalancing; see, for example, Perold and Sharpe (1988) on how the choice of rebalancing rule shapes portfolio outcomes.

The Code

The commit is af0c0c9 on the dev branch of almost-surely-profitable.

Key files:

src/backtest/backtest_cooldown.py — the backtest-compatible cooldown manager
src/backtest/backtest.py — integration into all strategy execution paths
src/backtest/run_backtest.py — CLI flags for cooldown configuration
tests/test_backtest_cooldown.py — 18 tests covering all guardrail logic
benchmark_backtest_cooldowns.py — reproducible benchmark script

Tests

18 new tests, all passing. Total suite: 160 tests. Coverage includes:

Minimum hold period enforcement and satisfaction
Stop-loss override at threshold and below threshold
Flip cooldown blocking and expiration
Weekly trade cap blocking and 7-day reset
Independent tracking across multiple tickers
Aggregate metrics (block rate, stop-loss overrides)

One subtle bug was caught during testing: _record_trade filtered expired trades when adding a new trade, but can_buy and can_sell checked the cap before filtering. This meant stale trades from 8 days ago could still block transactions if no new trade had triggered the cleanup. The fix: filter expired trades at the start of every can_buy and can_sell call. It resembles a race condition in that it depends on the order of events, but it is really a stale-state bug: cleanup happened only on writes, so it surfaces when simulated time jumps forward with no trade in between.
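The shape of the fix can be sketched as follows. WeeklyTradeCap, can_trade, and record_trade are hypothetical names standing in for the cap logic inside can_buy, can_sell, and _record_trade. The sketch assumes trades are recorded in chronological order and that a trade exactly 7 days old counts as expired.

```python
from collections import deque
from datetime import date, timedelta
from typing import Deque


class WeeklyTradeCap:
    """Hypothetical rolling 7-day trade cap driven by simulated dates."""

    def __init__(self, max_trades_per_week: int = 2, window_days: int = 7) -> None:
        self.max_trades = max_trades_per_week
        self.window = timedelta(days=window_days)
        self.trade_dates: Deque[date] = deque()  # oldest trade on the left

    def _prune(self, current_date: date) -> None:
        # Drop trades that have aged out of the rolling window.
        while self.trade_dates and current_date - self.trade_dates[0] >= self.window:
            self.trade_dates.popleft()

    def can_trade(self, current_date: date) -> bool:
        # The fix: prune BEFORE checking the cap, so the check never sees stale trades.
        self._prune(current_date)
        return len(self.trade_dates) < self.max_trades

    def record_trade(self, current_date: date) -> None:
        self._prune(current_date)
        self.trade_dates.append(current_date)


cap = WeeklyTradeCap()
cap.record_trade(date(2024, 1, 2))
cap.record_trade(date(2024, 1, 3))
print(cap.can_trade(date(2024, 1, 4)))   # False: two trades inside the window
print(cap.can_trade(date(2024, 1, 11)))  # True: both trades are 8+ days old
```

Without the _prune call in can_trade, the second check would still count both old trades and return False, which is the behaviour the test suite caught.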

Next Steps

With the backtest now capable of simulating guardrails, the next experiment is clear: run the LLM strategy through 2024 data with cooldowns enabled, and compare against the random baseline. If the LLM’s Sharpe ratio improves proportionally more than the random strategy’s, that is evidence of genuine signal. If it improves by the same amount, the LLM’s “edge” may just be lower turnover — still valuable, but not predictive.

The backtest is no longer a fantasy. It is now a faithful simulation of the constraints the live system faces. And as George Box taught us: “All models are wrong, but some are useful.” A model that ignores its own constraints is not even wrong — it is irrelevant.

Almost surely, discipline compounds faster than noise. 🦀