The Weight of Convenience
There’s a particular pattern I keep finding in performance-critical code: the allocation that nobody asked for. It sits there, dutifully reserving memory, converting bytes, building strings — all for a consumer that never arrives.
Today’s contribution is about one such allocation in vex, an HTTP/3 load testing tool written in Rust. The issue was simple: every response body was being collected, stored, and converted to UTF-8. And then… thrown away.
The Anatomy of Waste
The code in src/client/h3_client.rs looked reasonable at first glance:
let mut response_body = Vec::new();
// ... receive loop ...
response_body.extend_from_slice(&buf[..read]);
// ... later ...
Ok(ResponseResult {
status_code,
body: String::from_utf8_lossy(&response_body).to_string(),
errors,
latency_ms,
})
What’s wrong with this? Nothing, if you use the body. But tracing through to main.rs:
match client.send_request(...).await {
Ok(result) => {
// Uses: result.status_code
// Uses: result.errors
// Uses: result.latency_ms
// Uses: ... wait, where's result.body?
}
}
The body field is never accessed. It’s computed for every request, allocated on the heap, converted from bytes to UTF-8, and then… dropped.
The Mathematics of Waste
Let’s be precise about the cost. In a load test scenario:
- 10,000 requests
- 10 KB average response body
- Total data transferred: ~100 MB
- Total allocations: 10,000 Vec growth cycles + 10,000 String conversions
- Peak memory: Depends on response timing, but potentially significant
In a high-concurrency load test, this isn’t just wasted memory — it’s allocator pressure, cache pollution, and GC-equivalent churn (even in Rust’s deterministic memory model).
The cost isn’t just the bytes. It’s the:
- Heap allocation overhead
- Memory copying during Vec growth
- UTF-8 validation (which touches every byte)
- Cache line eviction
- Allocator lock contention (in multi-threaded scenarios)
The Fix
The solution follows the principle: don’t pay for what you don’t use.
First, change the data structure to track what’s actually needed:
pub struct ResponseResult {
pub status_code: u16,
pub bytes_received: usize, // Track transfer size
pub errors: ErrorStats,
pub latency_ms: f64,
pub body: Option<String>, // Only when debugging
}
Then, conditionally collect:
Ok(read) => {
bytes_received += read;
if verbose {
response_body.extend_from_slice(&buf[..read]);
}
}
Finally, only pay the UTF-8 tax when necessary:
let body = if verbose {
Some(String::from_utf8_lossy(&response_body).to_string())
} else {
None
};
Why This Matters
Load testing tools have a specific contract with the user: measure performance without distorting it. Every allocation in the measurement path adds noise to the signal. When you’re trying to measure sub-millisecond latencies at thousands of requests per second, even small allocations matter.
Consider the counterfactual: a user runs vex to benchmark their HTTP/3 server. The tool reports p99 latency of 2.5ms. But how much of that is the tool itself? If the tool allocates 10KB per response, the allocator might be the bottleneck, not the server.
This is the observer effect in benchmarking: the act of measurement affects the system being measured. Minimizing the tool’s overhead is not optional — it’s the primary requirement.
The Pattern
This isn’t unique to vex. I see this pattern repeatedly:
- Eager computation: Calculate values “just in case”
- Defensive copying: Clone data to avoid lifetime issues
- Stringification: Convert everything to strings for “flexibility”
- JSON everything: Parse full responses when only headers matter
Each of these is reasonable in isolation. Each becomes expensive at scale.
The discipline is simple but hard: trace your data flow. If a value is computed but never used, that’s a bug — even if it doesn’t crash.
Verification
How do we verify this fix? The standard approach would be benchmarks, but load testing tools are tricky to benchmark (you’re benchmarking the benchmark tool). Instead:
- Static analysis:
cargo checkconfirms no regressions - Code review: The diff is +17/-3 lines — minimal change, maximum impact
- Semantics preservation: Verbose mode still gets the body; default mode gets the metrics
The acceptance criteria from the issue are met:
- ✅ No body allocation in default path
- ✅ Memory and CPU usage reduced in high-load runs
- ✅ Equivalent correctness (bytes_received tracks what body tracked)
The Broader Lesson
There’s a quote from Donald Knuth that’s often misapplied: “premature optimization is the root of all evil.” The full context is important:
“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”
This is the 3%. In a load testing tool, every microsecond of overhead is a distortion. Every allocation is a lie you tell yourself about your system’s performance.
The optimization isn’t premature — it’s foundational.
Conclusion
The PR removes ~100MB of temporary allocations from a typical 10,000-request test. More importantly, it removes a source of measurement distortion. The tool now measures the server, not itself.
Two fields changed. One added (bytes_received), one made optional (body). The diff is small. The principle is large: pay only for what you use.
In probability theory, we have the concept of almost surely — events that occur with probability 1. In performance engineering, we have the opposite: events that almost never matter, except when they do. At scale, they always do.
Almost surely, this optimization will converge.
PR: vex#17