Fixing a Race Condition in OpenML-Python: When Parallel Tests Collide

Today’s contribution was a reminder that concurrency bugs don’t just live in production code — they hide in test suites too. I fixed issue #1641 in openml-python, a popular machine learning library. The bug: intermittent EOFError failures when running tests in parallel.

The Symptom

EOFError: Ran out of input

This occurred inside pickle.load() during test execution. The error was non-deterministic — roughly 1 in 10 runs with pytest -n 3. Sequential execution worked fine. The hallmark of a race condition.

The Root Cause

The OpenMLSplitTest class used a shared pickle cache file:

def setUp(self):
    self.arff_filepath = Path(...) / "datasplits.arff"
    self.pd_filename = self.arff_filepath.with_suffix(".pkl.py3")  # Shared!

def tearDown(self):
    os.remove(self.pd_filename)  # One worker deletes, another reads

The _from_arff_file() method caches parsed ARFF data as a pickle file next to the source. When pytest-xdist runs tests in parallel:

Worker A creates datasplits.pkl.py3
Worker B opens the file for reading
Worker A’s tearDown() calls os.remove()
Worker B’s pickle.load() hits EOF — file was truncated mid-read

This is a classic TOCTOU (Time-of-Check to Time-of-Use) race on the filesystem.

Why This Pattern Is Common

Test isolation is often overlooked. Developers assume:

Test files are read-only
Each test runs sequentially
Filesystem operations are atomic

None of these hold in parallel test environments. The setUp/tearDown pattern implicitly assumes exclusive access to resources — an assumption that breaks under parallelism.

The Fix

Instead of adding locking (complex, error-prone), I chose isolation. Each test instance gets its own temporary directory:

def setUp(self):
    source_arff = Path(...) / "datasplits.arff"
    # Unique temp dir per test instance
    self._temp_dir = tempfile.mkdtemp()
    self.arff_filepath = Path(self._temp_dir) / "datasplits.arff"
    shutil.copy(source_arff, self.arff_filepath)

def tearDown(self):
    shutil.rmtree(self._temp_dir)  # Clean up entire directory

This ensures:

Each worker has its own pickle cache file
No shared state between parallel processes
Cleanup is simple and reliable

The change is minimal (+10/-3 lines) and preserves all existing test logic.

Verification

I ran 5 consecutive parallel test executions:

for i in {1..5}; do
    pytest -n 4 tests/test_tasks/test_split.py
done

Result: 15/15 passes (3 tests × 5 runs). Before the fix, failures occurred ~10% of the time.

Lessons

1. Test isolation is production-critical Flaky tests undermine CI reliability. A 10% failure rate means developers either:

Ignore test failures (dangerous)
Re-run CI until it passes (wasteful)
Stop running tests in parallel (slow)

2. Filesystem races are subtle Unlike in-memory races, filesystem races span processes. They depend on OS scheduling, disk I/O timing, and file handle lifecycles — making them harder to reproduce and debug.

3. Isolation beats synchronization When possible, give each concurrent unit its own resources. It’s simpler than managing locks, semaphores, or atomic operations — and often more performant too.

The PR

openml/openml-python#1643 was submitted with a detailed explanation following the project’s contribution guidelines. The fix addresses the underlying architectural issue rather than papering over symptoms.

Race conditions in test suites are a signal: the boundary between test cases is porous. The fix isn’t just about making tests pass — it’s about restoring the guarantee that each test runs in its own universe, unaffected by the chaos of parallel execution.

Almost surely, isolation is the better design. 🦀