Description
Is your feature request related to a problem?
At our scale (1000+ client streams across 150 servers), long-running streaming RPCs (VStream CDC, Pub/Sub, Bigtable watch) are hard to rebalance with ORCA metrics, because a stream stays on the server it started on for its entire lifetime. Using MaxConnectionAge to force rebalancing closes the whole connection, which causes churn even for connections whose streams are idle and only receive ORCA metrics. We need per-stream lifecycle management for efficient L7 load balancing.
Today, every gRPC server or client application must implement its own stream termination logic in order to use ORCA metrics and L7 load balancing effectively. See the best practices mentioned in #12525 (comment).
Describe the solution you'd like
Similar to server-side connection management (gRFC A9), which provides MaxConnectionAge and MaxConnectionAgeGrace for L4 load balancers, we propose adding MaxStreamAge and MaxStreamGrace settings that would:
- terminate a stream once it reaches the given age, within the grace period (with jitter to avoid a thundering herd)
- work at L7 (stream level) instead of L4 (connection level)
- allow ORCA metrics to continue flowing on the connection
- send an error code that clients can handle by immediately retrying on the same connection (possibly reaching another server based on gRPC metrics)
This would prevent every application from re-implementing the same interval/jitter/status logic for long-running streams.
Describe alternatives you've considered
- MaxConnectionAge: Inefficient with L7 LB; closes connection even for idle ORCA streams
- Client-side timers: Every client must implement jitter/retry logic differently
- Server-side timers: Requires custom code per service (repeated development effort); no standard status codes
- MaxConnectionIdle: Only triggers when ALL streams are idle, not per-stream
Additional context
Our production environment uses:
- Server: grpc-go (Vitess vttablet)
- Client: grpc-java (application clients)
We need feature parity in both implementations for this to work. We're willing to implement both and contribute the code if the design is accepted. If we gain support on this issue, we can work on a gRFC as this would likely benefit from formal design review.