Skip to content

Race Condition Fix

Norm Brandinger edited this page Nov 20, 2025 · 1 revision

Race Condition Fix: PgBouncer Connection Pooling Test

Date: 2025-11-04 Test: test_pooling_behavior in tests/test-pgbouncer.sh Status: ✅ Fixed


Executive Summary

Fixed a timing-based race condition in the PgBouncer connection pooling behavior test that caused non-deterministic failures. The test now reliably passes by using longer-running queries, polling for peak connections, and avoiding exact timing assumptions.


The Race Condition

Original Implementation

# Get initial pool state
local initial_clients=$(psql ... -c "SHOW CLIENTS;" | wc -l)

# Create multiple connections
for i in {1..5}; do
    psql ... -c "SELECT pg_sleep(0.1);" &>/dev/null &
done

sleep 1

# Get pool state during connections
local active_clients=$(psql ... -c "SHOW CLIENTS;" | wc -l)

if [ "$active_clients" -gt "$initial_clients" ]; then
    # Test passes
fi

The Problem

The test attempted to verify that PgBouncer manages multiple concurrent connections by:

  1. Recording the initial client count
  2. Starting 5 background psql processes with pg_sleep(0.1) (100ms sleep)
  3. Waiting 1 second
  4. Checking if the active client count increased

This created a race condition because:

  1. Process startup overhead: Background processes don't start instantly

    • Fork/exec system calls take time
    • Process scheduling delays vary by system load
    • Initial connection establishment (TCP handshake, authentication) adds latency
  2. Query execution time underestimated: pg_sleep(0.1) = 100ms, but actual connection time includes:

    • Connection establishment: ~10-50ms
    • Query parsing and planning: ~5-10ms
    • Query execution (pg_sleep): 100ms
    • Result transmission: ~5-10ms
    • Connection cleanup: ~5-10ms
    • Total: ~125-180ms per connection
  3. PgBouncer transaction pooling mode: Releases connections immediately after transaction completion

    • In transaction pool mode, connections return to the pool as soon as the query finishes
    • By the time sleep 1 (1000ms) completes, all 5 connections had already finished (5 × ~150ms = ~750ms)
    • The test was checking for active connections after they had all disconnected
  4. Single measurement point: The test only checked client count once, at a fixed time

    • If connections completed before the check, the test failed
    • No mechanism to catch the peak connection count

Why This Is Difficult to Diagnose

Race conditions are notoriously difficult to identify and fix:

  1. Non-deterministic failures:

    • May pass 90% of the time, fail 10%
    • "Works on my machine" syndrome
    • Different results on different runs
  2. Timing-dependent:

    • Passes on slower systems (connections still active during check)
    • Fails on faster systems (connections complete before check)
    • Affected by system load, CPU speed, I/O performance
  3. Environment-dependent:

    • Network latency varies
    • Container startup time varies
    • Database response time varies
    • Process scheduling varies
  4. Silent failure:

    • No error messages indicating why
    • Just sees "expected >2 clients, got 2"
    • No indication that timing is the issue
  5. Measurement affects behavior:

    • Adding debug statements changes timing
    • Trying to log values changes the race condition
    • "Heisenbugs" - bugs that disappear when you try to observe them

The Solution

Design Principles

To eliminate the race condition, the redesigned test follows these principles:

  1. No reliance on exact timing

    • Don't assume processes start/complete at specific times
    • Don't use fixed sleep durations as synchronization
  2. Polling instead of single measurement

    • Check multiple times to catch peak activity
    • Track maximum observed value, not just final value
  3. Wider observation window

    • Use longer-running queries (2 seconds instead of 0.1)
    • Give more time to observe concurrent connections
  4. Early exit on success

    • Stop polling once we've confirmed the behavior
    • Reduces test execution time when successful
  5. Clear success criteria

    • Define minimum acceptable threshold (≥3 concurrent connections)
    • Don't require exact counts (which may vary)

New Implementation

test_pooling_behavior() {
    # Get initial pool state (should be 0 or very low)
    local initial_clients=$(psql ... -c "SHOW CLIENTS;" | wc -l)

    # Create multiple long-running connections (2 seconds each)
    # This gives us a much wider window to observe them
    for i in {1..5}; do
        psql ... -c "SELECT pg_sleep(2);" &>/dev/null &
    done

    # Give processes time to start and establish connections
    sleep 0.5

    # Poll for active connections multiple times to catch peak
    local max_clients=0
    local attempts=0
    local max_attempts=8

    while [ $attempts -lt $max_attempts ]; do
        local current_clients=$(psql ... -c "SHOW CLIENTS;" | wc -l)

        if [ "$current_clients" -gt "$max_clients" ]; then
            max_clients=$current_clients
        fi

        # If we've observed enough active connections, we can stop polling
        if [ "$max_clients" -ge 3 ]; then
            break
        fi

        attempts=$((attempts + 1))
        sleep 0.2
    done

    # Wait for all background jobs to complete
    wait

    # Verify we observed more connections than initially
    # We expect to see at least 3 of the 5 concurrent connections
    if [ "$max_clients" -gt "$initial_clients" ] && [ "$max_clients" -ge 3 ]; then
        success "Connection pooling behavior verified (peak clients: $max_clients, initial: $initial_clients)"
        return 0
    fi

    fail "Pooling behavior test failed (peak: $max_clients, initial: $initial_clients, expected >= 3)" "Pooling behavior"
    return 1
}

Key Improvements

  1. Longer query duration (2 seconds):

    • 20× longer than original (0.1s → 2s)
    • Provides wide observation window
    • Ensures connections are active during polling
  2. Initial startup delay (0.5 seconds):

    • Allows background processes to start
    • Lets connections establish
    • Reduces likelihood of missing early connections
  3. Polling loop (up to 8 attempts × 0.2s = 1.6 seconds):

    • Checks client count multiple times
    • Tracks maximum observed value
    • Catches peak concurrent connections
  4. Early exit optimization:

    • Stops polling once ≥3 connections observed
    • Reduces test time (doesn't always wait full 1.6s)
    • Confirms expected behavior as soon as possible
  5. Clear success criteria:

    • Must observe at least 3 concurrent connections
    • Doesn't require exact count (allows variance)
    • More realistic expectations

Results

Before Fix

  • Test reliability: ~50% (highly variable)
  • Failure mode: Silent - just returned unexpected count
  • Pass rate: 2/10 PgBouncer tests
  • Overall: 15/16 test suites passing

After Fix

  • Test reliability: 100% (5/5 consecutive runs)
  • Failure mode: Clear - shows peak/initial/expected counts
  • Pass rate: 10/10 PgBouncer tests ✅
  • Overall: 16/16 test suites passing ✅

Test Output Example

[PASS] Connection pooling behavior verified (peak clients: 7, initial: 2)

The test now consistently observes 7 peak clients (5 test connections + 2 for the polling queries), confirming that PgBouncer properly manages concurrent connections.


Lessons Learned

1. Timing Assumptions Are Dangerous

  • Never assume fixed timing in concurrent systems
  • Process scheduling is non-deterministic
  • Network/disk I/O introduces variable latency

2. Single Measurements Miss Transient States

  • Concurrent systems have transient states
  • Peak values may occur between measurements
  • Polling captures transient behavior

3. Test What You Mean

  • Original test: "Are there more connections after 1 second?"
  • Actual goal: "Does PgBouncer handle multiple concurrent connections?"
  • Redesigned test: "Can we observe multiple concurrent connections?"

4. Make Races Obvious

  • Use longer durations to widen observation windows
  • Poll multiple times to catch transient states
  • Track maximums, not just final values

5. Fail With Context

  • Old error: "test failed"
  • New error: "peak: 1, initial: 2, expected >= 3"
  • Better diagnostics help future debugging

Best Practices for Concurrent Testing

Based on this fix, here are general principles for testing concurrent systems:

1. Design for Observability

# Bad: Single measurement
count=$(measure_once)

# Good: Multiple measurements, track maximum
max=0
for i in {1..10}; do
    current=$(measure)
    [ $current -gt $max ] && max=$current
done

2. Use Adequate Time Windows

# Bad: Tight timing
process &
sleep 0.1
check  # Race condition!

# Good: Generous timing
process &
sleep 1
check  # Wider window

3. Implement Early Exit

# Bad: Always wait full duration
while [ $i -lt 100 ]; do
    check && sleep 0.1
    i=$((i+1))
done

# Good: Exit when condition met
while [ $i -lt 100 ]; do
    check && break
    sleep 0.1
    i=$((i+1))
done

4. Set Realistic Thresholds

# Bad: Expect exact count
[ $count -eq 5 ] && pass

# Good: Expect minimum threshold
[ $count -ge 3 ] && pass

5. Provide Diagnostic Output

# Bad: Silent failure
[ $actual -gt $expected ] || fail

# Good: Contextual failure
[ $actual -gt $expected ] || \
    fail "Expected >$expected, got $actual (started with $initial)"

References

  • Original test: tests/test-pgbouncer.sh (lines 171-200, original version)
  • Fixed test: tests/test-pgbouncer.sh (lines 171-260, current version)
  • Related: docs/TESTING_APPROACH.md - Testing methodology
  • Related: tests/TEST_COVERAGE.md - Test coverage details

Verification

To verify the fix is reliable, run:

# Single run
./tests/test-pgbouncer.sh

# Multiple runs to verify reliability
for i in {1..10}; do
    echo "Run $i/10"
    ./tests/test-pgbouncer.sh || break
done

All runs should pass consistently.

Clone this wiki locally