[DO NOT MERGE]: Leverage x-ms-cosmos-hub-region-processing-only for 404 Read Session Not Available cross-region retry scenarios.
#47631
base: main
…into AzCosmos_AddHubRegionProcessingOnlyHeader
Mono<Void> refreshLocationCompletable = this.refreshLocation(isReadRequest, forceRefresh, usePreferredLocations);
// if PPAF is enabled, mark pk-range as unavailable and force a retry
Why is this getting removed?
@FabianMeiswinkel Ignore this change - I forgot for a moment why I added the per-partition-set failover in both places: immediately after the 403:3 is detected, and also in shouldRetryOnEndpointFailureAsync (it can be relevant in Gateway Endpoint Unavailability).
FabianMeiswinkel left a comment
LGTM except for one question whether one change is intended - and if so why?
…ion-processing-only` header is set.
… regions for new hub.
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
Motivation
This pull request is the first iteration of integrating the `x-ms-cosmos-hub-region-processing-only` header. Setting this header to `true` allows a Cosmos DB backend node to return a 403:3 when the node belongs to a non-hub physical partition.
Using this setup, the `CosmosClient` instance can determine the hub region at partition-set level. In this first iteration, that helps with hub region detection for 404 `Read Session Not Available` cross-region retries on Single-Writer accounts. This is needed in particular when failover happens in a rolling manner, partition-set by partition-set, and in Per-Partition Automatic Failover cases, where the hub is a partition-set granular notion. Simply relying on `LocationCache` to provide an account-level hub region is incorrect in those cases.
Scope
In this pull request, the focus is on how 404 `Read Session Not Available` cross-region retry handling happens for Single-Writer accounts.
Critical Changes
The approach taken here is to pin the `x-ms-cosmos-hub-region-processing-only` header once a request hits a 404 `Read Session Not Available`. This ensures that an operation (a construct which encapsulates several I/O calls) is sticky to the hub region.
The other change follows from keeping the header set: a non-write operation can now see a 403 `Write Forbidden`. As the goal is to determine the hub, 403 `Write Forbidden` handling when the header is set cycles through the available read regions as maintained by `LocationCache`.
Testing
Per-Partition Automatic Failover
The approach was to set a naming configuration (`simulateRevokeLocalWriteStatusOfPartition`) consumed by the service fabric process mapped to the original hub region (say `North Central US`) for a particular physical partition.
After that, a 404 `Read Session Not Available` is injected into the same partition for which the write privilege was revoked (`North Central US`).
Using a "pure-read" workload, the goal is to assert that the read (a `readItem` operation) gets a 200 status code from the partition-set specific hub region.
As "reads" can get a 403 `Write Forbidden` status code, these "reads" can update the partition-set level hub, which future reads and writes can use.
Pending item: extend this test to `Query` and `ChangeFeed` operations.
Single-Writer accounts with no PPAF enabled
Pending item: The expected test setup is to execute a write region change on an account with a physical partition-set count on the order of 2000 (typical in our DR drills), subject the account to a "read-only" workload, and see how hub-region stickiness holds up.
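For reviewers who want a compact mental model of the behavior described under Critical Changes, here is a minimal, self-contained sketch. It is not the SDK's retry policy: `HubRegionRetrySketch`, `OperationState`, and `onResponse` are hypothetical names, the region list is a stand-in for what `LocationCache` maintains, and it assumes the retry after a 404 `Read Session Not Available` first goes to the current region with the header set. The sub-status values 1002 (Read Session Not Available) and 3 (Write Forbidden) are the standard Cosmos DB sub-status codes.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; every name except the header is hypothetical. */
final class HubRegionRetrySketch {
    static final String HUB_REGION_PROCESSING_ONLY = "x-ms-cosmos-hub-region-processing-only";

    static final int NOT_FOUND = 404;
    static final int READ_SESSION_NOT_AVAILABLE = 1002; // sub-status of 404
    static final int FORBIDDEN = 403;
    static final int WRITE_FORBIDDEN = 3;               // sub-status of 403

    /** State shared by all I/O calls that belong to one operation. */
    static final class OperationState {
        final List<String> readRegions; // stand-in for the read regions kept by LocationCache
        final Map<String, String> headers = new LinkedHashMap<>();
        int regionIndex = 0;
        boolean hubHeaderPinned = false;

        OperationState(List<String> readRegions) {
            this.readRegions = readRegions;
        }

        String currentRegion() {
            return readRegions.get(regionIndex);
        }
    }

    /** Returns true if the operation should be retried after this response. */
    static boolean onResponse(OperationState op, int status, int subStatus) {
        if (status == NOT_FOUND && subStatus == READ_SESSION_NOT_AVAILABLE) {
            // Pin the header for the rest of the operation so that later
            // attempts are only served by the partition set's hub region.
            op.hubHeaderPinned = true;
            op.headers.put(HUB_REGION_PROCESSING_ONLY, "true");
            return true;
        }
        if (status == FORBIDDEN && subStatus == WRITE_FORBIDDEN && op.hubHeaderPinned) {
            // With the header pinned, 403:3 means "this region is not the hub":
            // move on to the next read region, if any is left.
            if (op.regionIndex + 1 < op.readRegions.size()) {
                op.regionIndex++;
                return true;
            }
            return false; // all read regions tried
        }
        return false; // anything else is out of scope for this sketch
    }
}
```

In this sketch a read that receives 403:3 without the header pinned is not retried, which mirrors the point that non-write operations only start seeing 403 `Write Forbidden` once the header is kept set.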