RabbitMQ connection reset due to missed heartbeats during high-load E2E tests #151

@vredchenko

Problem

During E2E test runs with high-speed playback (45.6x compression), the RabbitMQ connection is reset after ~3 minutes due to missed heartbeats.

Root Cause

The RabbitMQ server closes the connection after receiving no heartbeats from the client for 60 seconds. The API server's main thread is too busy handling the rapid stream of requests to service pika's heartbeat mechanism, so no heartbeat frames are sent while it is saturated.
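
As a rough illustration of this failure mode, the sketch below treats the connection as dead once the gap between two moments of client traffic exceeds the timeout. The function name and the sample timeline are hypothetical, and the model is a simplification, not RabbitMQ's actual heartbeat implementation.

```python
from typing import Optional, Sequence


def first_heartbeat_timeout(
    traffic_times: Sequence[float], timeout: float, observed_until: float
) -> Optional[float]:
    """Return the time at which the server would give up on the client.

    traffic_times: sorted moments (seconds since connect) at which the client
        sent any frame, heartbeat or otherwise.
    timeout: server-side heartbeat timeout in seconds.
    observed_until: end of the observation window in seconds.
    Returns None if no gap exceeds the timeout within the window.
    """
    last_seen = 0.0  # the connection handshake counts as traffic
    for t in traffic_times:
        if t - last_seen > timeout:
            return last_seen + timeout
        last_seen = t
    if observed_until - last_seen > timeout:
        return last_seen + timeout
    return None


# Hypothetical timeline: the client is serviced every 30 s for two minutes,
# then the main thread is saturated and sends nothing more.
healthy = [30.0, 60.0, 90.0, 120.0]
closed_at = first_heartbeat_timeout(healthy, timeout=60.0, observed_until=300.0)
print(closed_at)  # 180.0
```

With these made-up inputs the connection is dropped 180 seconds after connect. The real timeline behind the logged duration may differ.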

Evidence

From RabbitMQ logs:

2026-01-29 19:00:01.095545+00:00 [error] closing AMQP connection (duration: '3M, 0s'):
missed heartbeats from client, timeout: 60s

From API logs:

pika.adapters.blocking_connection - ERROR - Unexpected connection close detected: 
StreamLostError: ("Stream connection lost: ConnectionResetError(104, 'Connection reset by peer')",)

Impact

  • Failed to publish grid.created events
  • Cascading HTTP 500 errors in agent (~791 errors)
  • Events lost during the reconnection window
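
The last point can be made concrete: events are lost when a failed publish is dropped instead of held. Below is a minimal sketch of holding events in a bounded in-memory buffer until the transport recovers. `BufferedPublisher`, `fake_send`, and the `ConnectionError` convention are assumptions for illustration. A real wrapper around pika would have to translate pika's own exceptions and re-open the connection, which this sketch leaves out.

```python
from collections import deque
from typing import Callable, Deque, Tuple

Event = Tuple[str, bytes]  # (routing key, payload)


class BufferedPublisher:
    """Hold events in memory while the broker is unreachable."""

    def __init__(self, send: Callable[[str, bytes], None], max_buffered: int = 1000):
        # `send` is a hypothetical transport callable that raises
        # ConnectionError when the broker cannot be reached.
        self._send = send
        # Bounded so a long outage cannot exhaust memory; oldest events drop first.
        self._pending: Deque[Event] = deque(maxlen=max_buffered)

    def publish(self, routing_key: str, payload: bytes) -> bool:
        """Queue the event, then try to flush. True means nothing is left pending."""
        self._pending.append((routing_key, payload))
        return self.flush()

    def flush(self) -> bool:
        """Send pending events in order; stop at the first failure."""
        while self._pending:
            routing_key, payload = self._pending[0]
            try:
                self._send(routing_key, payload)
            except ConnectionError:
                return False  # keep the event for the next attempt
            self._pending.popleft()
        return True

    @property
    def pending(self) -> int:
        return len(self._pending)


# Demo with a stand-in transport that fails while the "broker" is down.
sent = []
state = {"broker_up": False}


def fake_send(routing_key: str, payload: bytes) -> None:
    if not state["broker_up"]:
        raise ConnectionError("broker unreachable")
    sent.append((routing_key, payload))


publisher = BufferedPublisher(fake_send)
publisher.publish("grid.created", b"grid-1")  # buffered, not lost
state["broker_up"] = True
publisher.publish("grid.created", b"grid-2")  # flushes both, in order
```

The buffer only survives as long as the process does, so this narrows the loss window rather than guaranteeing delivery.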

Proposed Fix Options

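  1. Service heartbeats from a dedicated thread in pika - Run the BlockingConnection's process_data_events loop on its own thread and hand publishes to it via add_callback_threadsafe
  2. Switch to an asyncio client such as aio-pika - Better heartbeat handling with async I/O
  3. Implement connection recovery/retry logic - Graceful reconnection with message buffering
  4. Increase the heartbeat timeout, or disable heartbeats for local dev - Quick fix for dev/test environments (pika's ConnectionParameters takes a heartbeat value in seconds; 0 disables heartbeats)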
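
A minimal sketch of option 1, assuming `connection` and `channel` behave like pika's `BlockingConnection` and its channel. `process_data_events(time_limit=...)`, `add_callback_threadsafe(callback)`, and `basic_publish` are existing pika methods. The class name and `poll_seconds` are hypothetical, and none of this is taken from the smartem_backend code.

```python
import functools
import threading


class HeartbeatServicingPublisher:
    """Own the connection on one thread so heartbeats are serviced even when
    request-handling threads are busy.

    `connection` and `channel` are assumed to behave like pika's
    BlockingConnection and its channel. Other threads only ever call
    add_callback_threadsafe on the connection.
    """

    def __init__(self, connection, channel, poll_seconds: float = 1.0):
        self._connection = connection
        self._channel = channel
        self._poll_seconds = poll_seconds
        self._stopping = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _run(self) -> None:
        # process_data_events drives the connection's I/O, which includes
        # sending heartbeats and running callbacks queued from other threads.
        while not self._stopping.is_set():
            self._connection.process_data_events(time_limit=self._poll_seconds)

    def publish(self, exchange: str, routing_key: str, body: bytes) -> None:
        """Safe to call from any thread; the publish runs on the worker thread."""
        self._connection.add_callback_threadsafe(
            functools.partial(
                self._channel.basic_publish,
                exchange=exchange,
                routing_key=routing_key,
                body=body,
            )
        )

    def stop(self) -> None:
        # Callbacks queued after the final poll are not executed.
        self._stopping.set()
        self._thread.join()
```

Request handlers would call `publish(...)` and return immediately, so a saturated request path no longer delays heartbeats. Errors raised by `basic_publish` surface on the worker thread, so this would still need the retry logic from option 3.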

Affected Component

smartem_backend - RabbitMQ event publisher

Reproduction

Run E2E test with compressed playback:

./repos/DiamondLightSource/smartem-devtools/tests/e2e/run-e2e-test.sh

The issue manifests after ~3 minutes of sustained high-throughput ingestion.

Metadata


Labels

  • bugfixing: Fixing defects or unexpected behavior in existing code
  • smartem-backend: Core backend services, messaging, and persistence layer
  • smartem-backend:api: REST API endpoints and HTTP interface changes
  • smartem-devtools:e2e-test: End-to-end testing infrastructure and scenarios
  • testing: Writing, updating, or fixing automated tests
