@cawthorne cawthorne commented Jan 20, 2026

Summary

Adds observability for the WebSocket failover mechanism to help diagnose connection issues.

Problem

During a Tiingo incident (2026-01-13 03:19-03:32 UTC), we could not determine whether failover triggered, because:

  • streamHandlerInvocationsWithNoConnection counter not exposed as metric
  • Counter increments logged at TRACE level (not visible with LOG_LEVEL=info)
  • URL changes logged at DEBUG level and suppressed with CENSOR_SENSITIVE_LOGS=true

This made it impossible to answer:

  • Did failover trigger during the incident?
  • What was the counter value at any given time?
  • When did URL switches occur?

Changes

1. New Prometheus Metric

  • Added ws_connection_failover_count gauge metric
  • Exposes streamHandlerInvocationsWithNoConnection value in real-time
  • Labeled by transport_name for per-transport tracking
  • Updated when unresponsive connections are detected
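
A minimal sketch of how the counter could be mirrored into a labeled gauge each time an unresponsive connection is detected. `FailoverCounter`, `InMemoryGauge`, and the transport name `tiingo-ws` are hypothetical names used only for illustration; the real transport class is not shown in this excerpt. `GaugeLike` assumes only the `set(labels, value)` shape that a prom-client `Gauge` also provides, so the in-memory stand-in keeps the example free of dependencies.

```typescript
// Minimal gauge shape assumed by this sketch; prom-client's Gauge offers a
// compatible set(labels, value) method.
interface GaugeLike {
  set(labels: Record<string, string>, value: number): void
}

// Hypothetical holder for the counter (not the framework's actual class).
class FailoverCounter {
  private streamHandlerInvocationsWithNoConnection = 0
  private readonly transportName: string
  private readonly gauge: GaugeLike

  constructor(transportName: string, gauge: GaugeLike) {
    this.transportName = transportName
    this.gauge = gauge
  }

  // Called when an unresponsive connection is detected: bump the counter and
  // publish its current value under the transport_name label.
  recordUnresponsiveConnection(): number {
    this.streamHandlerInvocationsWithNoConnection += 1
    this.gauge.set(
      { transport_name: this.transportName },
      this.streamHandlerInvocationsWithNoConnection,
    )
    return this.streamHandlerInvocationsWithNoConnection
  }
}

// In-memory stand-in for a Prometheus gauge, keyed by transport_name.
class InMemoryGauge implements GaugeLike {
  readonly values = new Map<string, number>()

  set(labels: Record<string, string>, value: number): void {
    this.values.set(labels['transport_name'] ?? 'unknown', value)
  }
}

const gauge = new InMemoryGauge()
const tiingo = new FailoverCounter('tiingo-ws', gauge)
tiingo.recordUnresponsiveConnection()
tiingo.recordUnresponsiveConnection()
```

Because the gauge is set to the counter's current value rather than incremented separately, the metric cannot drift from the variable it exposes.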

@github-actions
Contributor

NPM Publishing labels 🏷️

🛑 This PR needs labels to indicate how to increase the current package version in the automated workflows. Please add one of the following labels: none, patch, minor, or major.

@cawthorne cawthorne changed the title from "Add WebSocket failover counter metric, abnormal closure tracking, and URL change logging" to "Add WebSocket failover counter metric and URL change logging" Jan 20, 2026

logger.info('Websocket URL has changed, closing connection to reconnect...')
censorLogs(() =>
  logger.debug(
    `Websocket url has changed from ${this.currentUrl} to ${urlFromConfig}, closing connection...`,

Contributor

Does this no longer close the connection?

Contributor Author

It does. That message moved to the info-level log line, so it is visible whenever LOG_LEVEL is info or more verbose.
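
A rough sketch of the logging split described in this thread: the URL-free message goes out at info level unconditionally, while the message containing the URLs stays at debug level behind the censoring wrapper. `makeCensorLogs`, `logUrlChange`, and the example URLs are hypothetical stand-ins; the framework's real `censorLogs` may be implemented differently, and the only behaviour assumed here is the one stated in the PR description (the wrapped call is suppressed when CENSOR_SENSITIVE_LOGS=true).

```typescript
type LogFn = (message: string) => void

interface Logger {
  info: LogFn
  debug: LogFn
}

// Hypothetical stand-in for censorLogs: skip the wrapped log call when
// sensitive logs are censored.
function makeCensorLogs(censorSensitiveLogs: boolean): (logCall: () => void) => void {
  return (logCall: () => void): void => {
    if (!censorSensitiveLogs) {
      logCall()
    }
  }
}

// Emits the always-visible info line, then the URL-bearing debug line under
// the censoring wrapper.
function logUrlChange(
  logger: Logger,
  censorLogs: (logCall: () => void) => void,
  currentUrl: string,
  urlFromConfig: string,
): void {
  logger.info('Websocket URL has changed, closing connection to reconnect...')
  censorLogs(() =>
    logger.debug(
      `Websocket url has changed from ${currentUrl} to ${urlFromConfig}, closing connection...`,
    ),
  )
}

// Capture log lines in memory so the behaviour can be inspected.
const captured: string[] = []
const captureLogger: Logger = {
  info: (message) => captured.push(`info: ${message}`),
  debug: (message) => captured.push(`debug: ${message}`),
}

logUrlChange(captureLogger, makeCensorLogs(true), 'wss://a.example', 'wss://b.example')
```

With censoring on, only the info line is captured, so the URL switch is still recorded without exposing either URL.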

}),
wsConnectionFailoverCount: new client.Gauge({
  name: 'ws_connection_failover_count',
  help: 'The number of consecutive connection issues (unresponsive/no data, abnormal closures), used to trigger URL failover. Resets to 0 when data flows successfully.',

Contributor

I don't see where this is reset to 0.

Contributor Author

It doesn't. The underlying variable it is meant to expose, streamHandlerInvocationsWithNoConnection, also never resets. It just increments forever, and Tiingo applies modulo arithmetic to it. That usage is in this PR (and predates my changes):
https://github.com/smartcontractkit/external-adapters-js/pull/4543/files

Open to resetting it if there is a good reason.
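
To illustrate why a counter that only ever increments can still drive failover, here is a sketch of modulo-based URL selection. It is not Tiingo's actual code: `selectUrl`, `PRIMARY_ATTEMPTS`, `SECONDARY_ATTEMPTS`, and the example URLs are hypothetical, and the real thresholds live in the linked PR. The point is only that the choice depends on the counter's position within a repeating cycle, so no reset is needed.

```typescript
// Hypothetical thresholds: how many consecutive attempts each URL gets per
// cycle. The real values are not shown in this conversation.
const PRIMARY_ATTEMPTS = 5
const SECONDARY_ATTEMPTS = 1

// Picks a URL from the ever-increasing counter. Only the counter's position
// within the repeating cycle matters, not its absolute value.
function selectUrl(counter: number, primaryUrl: string, secondaryUrl: string): string {
  const cycleLength = PRIMARY_ATTEMPTS + SECONDARY_ATTEMPTS
  const positionInCycle = counter % cycleLength
  return positionInCycle < PRIMARY_ATTEMPTS ? primaryUrl : secondaryUrl
}

const primary = 'wss://primary.example'
const secondary = 'wss://secondary.example'

// Counter values 0..11 cover two full cycles under the thresholds above.
const selections: string[] = []
for (let counter = 0; counter < 12; counter += 1) {
  selections.push(selectUrl(counter, primary, secondary))
}
```

Under this scheme the gauge's absolute value matters less than how it changes: a value that keeps climbing means unresponsive connections keep being detected, and the value modulo the cycle length indicates which URL is in use.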

Contributor

I don't know whether it should reset, but the metric's help text says "Resets to 0 when data flows successfully."
