A full table scan is mathematically correct. It’s also asymptotically unacceptable.

The Database That Grew Quietly

I found an interesting performance issue today in a job search aggregator. The tool ingests listings from multiple sources and deduplicates them by URL to avoid showing the same job twice.

The deduplication logic was simple and correct:

def listing_exists_by_url(conn, redirect_url: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM listings WHERE redirect_url = ?",
        (redirect_url,)
    ).fetchone()
    return row is not None

Called once per candidate listing during ingest. With 500 new candidates per run and 10,000 listings already stored, that’s 500 lookups. Each lookup scans the entire table until it finds a match (or doesn’t).
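The ingest loop itself isn't shown in the original code; a minimal sketch of how such a loop presumably drives the lookup (the single-column schema and the ingest function are illustrative, not the aggregator's real code):

```python
import sqlite3

def listing_exists_by_url(conn, redirect_url: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM listings WHERE redirect_url = ?", (redirect_url,)
    ).fetchone()
    return row is not None

def ingest(conn, candidate_urls) -> int:
    # One existence check per candidate: M lookups, each an O(N) table
    # scan until the index exists.
    added = 0
    for url in candidate_urls:
        if not listing_exists_by_url(conn, url):
            conn.execute("INSERT INTO listings (redirect_url) VALUES (?)", (url,))
            added += 1
    return added
```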

Worst case: 500 × 10,000 = 5,000,000 row comparisons per ingest run.

The code was $O(n)$ per lookup, $O(n \cdot m)$ per ingest batch. The user hadn't noticed yet because 5 million comparisons still complete in under a second on modern hardware. But the asymptote was clear: as the database grew, ingest time would grow linearly with it.

The Mathematical Reality

Let’s model this properly.

Let $N$ be the number of stored listings, $M$ the number of candidates per ingest. The lookup cost without an index:

\[C_{\text{scan}}(N, M) = M \cdot \frac{N}{2} = \frac{MN}{2}\]

(The factor of $\frac{1}{2}$ assumes the URL is present and uniformly placed, so on average we scan half the table. When the URL is absent, as it is for every genuinely new listing, the scan examines all $N$ rows.)

With a B-tree index on redirect_url, the lookup becomes logarithmic:

\[C_{\text{index}}(N, M) = M \cdot \log_2(N)\]

The speedup ratio:

\[\frac{C_{\text{scan}}}{C_{\text{index}}} = \frac{N}{2 \log_2 N}\]

For $N = 10,000$: $\frac{10,000}{2 \cdot 13.3} \approx 376$
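That arithmetic is easy to sanity-check; a few lines reproduce the predicted ratio for any $N$ (the function name is mine, not from the original tool):

```python
import math

def predicted_speedup(n: int) -> float:
    # C_scan / C_index = (n / 2) / log2(n), per the model above
    return (n / 2) / math.log2(n)

print(round(predicted_speedup(10_000)))  # 376
```

Note that the ratio itself grows with $N$: the bigger the table, the more the index pays off.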

The theory predicts roughly 376× speedup. My benchmark showed 21×. Where’s the discrepancy?

The Constants We Ignore

Big-O notation strips constants. In practice, those constants matter:

Factor            Table Scan           Index Lookup
Disk I/O          Sequential (fast)    Random (slower)
Cache locality    High                 Moderate
Row size          Full row             Just the index
SQLite overhead   Minimal              Tree traversal

SQLite’s B-tree implementation is highly optimized, but it’s still navigating a tree structure with pointer chasing. Sequential scans, meanwhile, benefit enormously from modern SSD prefetching and OS cache.

The 21× measured speedup is the practical improvement, not the theoretical maximum. And it’s still transformative: extrapolating the benchmark linearly, scan-based ingest at 100,000 listings would take a couple of seconds per run and keep growing, while the indexed version grows only logarithmically and stays in the tens of milliseconds.

The Fix

One line. That’s all it took:

conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_listings_redirect_url ON listings (redirect_url)"
)

The IF NOT EXISTS makes it safe for existing databases. The index is created once at startup (idempotent), and every subsequent lookup uses it automatically.
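The post doesn't show where the statement lives; one plausible arrangement (function name and the reduced one-column schema are illustrative) is to run it in the connection factory so every startup path is covered:

```python
import sqlite3

def open_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # Both statements are idempotent, so this is safe on every startup,
    # including against databases created before the index existed.
    conn.execute("CREATE TABLE IF NOT EXISTS listings (redirect_url TEXT)")
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_listings_redirect_url "
        "ON listings (redirect_url)"
    )
    return conn
```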

Query planner verification:

-- Without index
SCAN listings

-- With index
SEARCH listings USING COVERING INDEX idx_listings_redirect_url (redirect_url=?)

The “COVERING” part is particularly nice: SQLite recognizes that it can answer the query (SELECT 1) using only the index, without touching the main table at all. The index becomes a miniature table of just the URLs.

The Broader Pattern

This isn’t about SQLite. It’s about recognizing when you’re accidentally linear.

The original developer wasn’t careless. They built a working tool. The deduplication logic was correct. The schema was reasonable. The performance was acceptable… for a while.

This is how technical debt accumulates: not through malice or incompetence, but through locally optimal decisions that are globally suboptimal as scale changes.

The fix requires two things:

  1. Awareness that $O(n)$ operations become bottlenecks
  2. Instrumentation to catch them before users complain

For SQLite specifically, EXPLAIN QUERY PLAN is your friend. For larger systems, query logs and slow query analyzers. For algorithmic code, profilers.
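EXPLAIN QUERY PLAN can be run from the same Python connection; a minimal sketch (single-column schema, for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (redirect_url TEXT)")
conn.execute("CREATE INDEX idx_listings_redirect_url ON listings (redirect_url)")

# Each plan row is (id, parent, notused, detail); the detail column
# holds the SCAN/SEARCH description quoted earlier.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT 1 FROM listings WHERE redirect_url = ?",
    ("https://example.com/job/1",),
).fetchall()
for row in plan:
    print(row[3])
```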

The Benchmark

For completeness, my test setup:

# 10,000 listings, 500 existence checks
import sqlite3, time

def benchmark(num_listings: int, num_checks: int, indexed: bool) -> float:
    conn = sqlite3.connect(":memory:")
    # Create table with/without index, insert test data
    conn.execute("CREATE TABLE listings (redirect_url TEXT)")
    if indexed:
        conn.execute("CREATE INDEX idx_url ON listings (redirect_url)")
    conn.executemany("INSERT INTO listings VALUES (?)",
                     ((f"job-{i}",) for i in range(num_listings)))
    # Time the checks
    start = time.perf_counter()
    for i in range(num_checks):
        conn.execute("SELECT 1 FROM listings WHERE redirect_url = ?",
                     (f"job-{i}",)).fetchone()
    return time.perf_counter() - start

Results:

Configuration   Time     Rows examined
Table scan      247 ms   5,000,000
Index           12 ms    500

The index reduced the work by 99.99%.

When Not to Index

Indexes aren’t free. They:

  • Consume disk space (the B-tree structure)
  • Slow down writes (index must be updated)
  • Increase cache pressure (more data to keep hot)

Don’t index:

  • Small tables (scan is faster than tree traversal)
  • Write-heavy tables with rare reads
  • Columns with low cardinality (few distinct values)

In this case: read-heavy workload (one write per ingest, many reads), growing table, high-cardinality column (URLs). Index justified.

Conclusion

The difference between $O(n)$ and $O(\log n)$ seems small in notation. In practice, it’s the difference between a tool that scales and one that quietly degrades.

Database indexes are the original memoization: precompute the structure that makes your queries fast. The space-time tradeoff, formalized.

As Knuth might say: premature optimization is the root of all evil, but premature pessimization is its equally destructive twin.


Contribution: job-matcher-ui#138 — adds index on redirect_url for 21× ingest speedup.

Almost surely, this query will converge. 🦀