
Conversation

@DomGarguilo
Member

Fixes #5951

When a scan contained multiple duplicate keys and the scan broke or was cut before returning all of those duplicates, the remaining duplicate keys were dropped and never returned when the scan resumed. This happened because, when the scan was cut, we did not store the MemKey, which contains an additional field, kvCount, that distinguishes otherwise identical keys. When the scan resumed without these MemKey details, it picked back up after the logical key it left off on, which incorrectly skipped the remaining duplicate keys. We now store the full MemKey, and when the scan resumes on the same key it skips only the one entry already returned, so the remaining duplicate keys are now returned correctly.
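
For illustration only, a minimal toy sketch of the difference (the `LogicalKey`, `Entry`, and `resume*` names are hypothetical, not the actual Accumulo classes, and the kvCount semantics are simplified): resuming strictly after the last logical key drops the remaining duplicates, while resuming with the stored per-key count skips only what was already returned.

```java
import java.util.List;

public class ResumeSketch {
    // Hypothetical stand-ins: a logical key and an entry that also carries the
    // insertion count that distinguishes otherwise identical keys.
    record LogicalKey(String key) {}
    record Entry(LogicalKey key, int kvCount, String value) {}

    // Resuming strictly after the logical key skips every remaining duplicate.
    static List<Entry> resumeAfterLogicalKey(List<Entry> data, LogicalKey lastReturned) {
        return data.stream()
            .filter(e -> e.key().key().compareTo(lastReturned.key()) > 0)
            .toList();
    }

    // Resuming with the stored count skips only the entries already returned.
    static List<Entry> resumeAfterMemKey(List<Entry> data, LogicalKey lastReturned, int lastKvCount) {
        return data.stream()
            .filter(e -> {
                int cmp = e.key().key().compareTo(lastReturned.key());
                return cmp > 0 || (cmp == 0 && e.kvCount() > lastKvCount);
            })
            .toList();
    }

    public static void main(String[] args) {
        var k = new LogicalKey("row1");
        var data = List.of(new Entry(k, 1, "a"), new Entry(k, 2, "b"), new Entry(k, 3, "c"));
        // Suppose the first batch returned only ("row1", kvCount=1, "a").
        System.out.println(resumeAfterLogicalKey(data, k)); // empty: "b" and "c" are lost
        System.out.println(resumeAfterMemKey(data, k, 1));  // "b" and "c" are preserved
    }
}
```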

The new test case fails without the rest of the changes and passes with them.

@DomGarguilo DomGarguilo added this to the 2.1.5 milestone Oct 10, 2025
@DomGarguilo DomGarguilo self-assigned this Oct 10, 2025
@DomGarguilo DomGarguilo added blocker This issue blocks any release version labeled on it. bug This issue has been verified to be a bug. labels Oct 10, 2025
@DomGarguilo DomGarguilo linked an issue Oct 20, 2025 that may be closed by this pull request
Contributor

@keith-turner keith-turner left a comment


There may be two scenarios where this will not work as expected.

First, when reading duplicate keys from multiple files, they may not always come back in the same order. If 100 duplicate keys are spread across 3 files and 50 are read, then coming back and skipping 50 may not skip the same 50 values. The keys from multiple files are placed in a priority queue, which has no guaranteed order for items with the same priority. If files are added or removed between scan batches, that would probably greatly increase the chance of the duplicate keys having a different scan order.
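
A toy illustration of that ordering concern (hypothetical `Entry` record, not the tserver merge code): a `PriorityQueue` makes no ordering promise for elements that compare as equal, so "skip the first N duplicates of this key" is not well defined across separate merge passes.

```java
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class EqualKeyMergeOrder {
    record Entry(String key, String sourceFile) {}

    public static void main(String[] args) {
        // The comparator only looks at the key, so all three entries are "equal"
        // and the queue may emit them in any order.
        var cmp = Comparator.comparing(Entry::key);

        var pass1 = new PriorityQueue<>(cmp);
        pass1.addAll(List.of(new Entry("row1", "f1"), new Entry("row1", "f2"), new Entry("row1", "f3")));

        var pass2 = new PriorityQueue<>(cmp);
        pass2.addAll(List.of(new Entry("row1", "f3"), new Entry("row1", "f1"), new Entry("row1", "f2")));

        // Draining each queue may yield the duplicates in different orders, so
        // "resume after the 2nd duplicate" can skip different values each time.
        while (!pass1.isEmpty()) System.out.print(pass1.poll().sourceFile() + " ");
        System.out.println();
        while (!pass2.isEmpty()) System.out.print(pass2.poll().sourceFile() + " ");
    }
}
```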

Second, this change stores state on the server side. If a scan reads 50 of 100 duplicate keys from tserver A and tserver A then fails, the scan will skip the other 50 when it resumes on another tserver.

One possible solution is to change server-side scan batching so that it never ends a batch on a duplicate key, or the scan fails if it would have to. For example, we could let the batch grow up to 2x or 3x the batch size when duplicate keys are present (configurable via a new setting), and if it still fills up the scan fails with an error. This way the scan either gets all of the duplicate keys or it fails, instead of sometimes returning incorrect data for this case.
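
A rough sketch of that first idea, under made-up names (`nextBatch`, `maxGrowthFactor`, `BatchOverflowException`) and a list-backed source for simplicity; this is not the actual server-side batching code:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DuplicateAwareBatcher {

    static class BatchOverflowException extends RuntimeException {
        BatchOverflowException(String msg) { super(msg); }
    }

    /**
     * Returns the next batch starting at 'start'. The batch normally holds batchSize
     * entries, but it keeps growing, up to batchSize * maxGrowthFactor, rather than
     * cut in the middle of a run of entries that share one key. If a run of
     * duplicates still does not fit, fail instead of silently dropping data later.
     */
    static List<Map.Entry<String, String>> nextBatch(List<Map.Entry<String, String>> sorted,
            int start, int batchSize, int maxGrowthFactor) {
        int hardLimit = batchSize * maxGrowthFactor;
        List<Map.Entry<String, String>> batch = new ArrayList<>();
        for (int i = start; i < sorted.size(); i++) {
            Map.Entry<String, String> e = sorted.get(i);
            boolean continuesRun = !batch.isEmpty()
                && batch.get(batch.size() - 1).getKey().equals(e.getKey());
            if (batch.size() >= batchSize && !continuesRun) {
                return batch; // safe cut point: the next entry starts a new key
            }
            if (batch.size() >= hardLimit) {
                throw new BatchOverflowException(
                    "run of duplicate keys exceeds " + hardLimit + " entries");
            }
            batch.add(e);
        }
        return batch;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> data = new ArrayList<>();
        for (int i = 0; i < 5; i++) data.add(new SimpleEntry<>("row1", "v" + i));
        data.add(new SimpleEntry<>("row2", "v5"));
        // batchSize=2 would normally cut inside the "row1" run; the batch grows to 5.
        System.out.println(nextBatch(data, 0, 2, 3).size());
        // With a hard limit of 4 the duplicate run does not fit, so the scan fails.
        try {
            nextBatch(data, 0, 2, 2);
        } catch (BatchOverflowException ex) {
            System.out.println("failed as intended: " + ex.getMessage());
        }
    }
}
```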

Another possible solution is to sort data on the value when everything in the keys is equal. Then scans could continue based on key+value instead of just key. However, persisted data is not currently sorted this way, so this would be a really big change to code, data, and user creation of rfiles. It offers the advantage of not having to buffer all duplicate keys on the server side.
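
For illustration of the key+value idea (toy types, not persisted RFile ordering): if entries were ordered by key and then value, a resume point could be expressed as a key+value pair, so skipping past the last returned entry would not discard other duplicates of that key.

```java
import java.util.Comparator;
import java.util.List;

public class KeyValueOrdering {
    record Entry(String key, String value) {}

    // Order by key, then by value, so duplicates of a key have a deterministic order.
    static final Comparator<Entry> ORDER =
        Comparator.comparing(Entry::key).thenComparing(Entry::value);

    // Resume strictly after the last returned (key, value) pair.
    static List<Entry> resumeAfter(List<Entry> sorted, Entry lastReturned) {
        return sorted.stream().filter(e -> ORDER.compare(e, lastReturned) > 0).toList();
    }

    public static void main(String[] args) {
        var sorted = List.of(new Entry("row1", "a"), new Entry("row1", "b"), new Entry("row1", "c"));
        // Resuming after ("row1", "a") keeps the remaining duplicates of "row1".
        System.out.println(resumeAfter(sorted, new Entry("row1", "a")));
    }
}
```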

@keith-turner
Contributor

Another thing to consider is that not all keys on the server side are MemKeys. Persisted data is not made of MemKeys; only recently written data stored in the in-memory maps is. The purpose of the count on MemKeys is to avoid reading data written after a scan started.
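
A toy sketch of that purpose (hypothetical names, not the real in-memory map or MemKey classes): each write gets a monotonically increasing count, and a scan only sees entries whose count is at or below the count captured when the scan started.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class KvCountSketch {
    // A key plus the count assigned when it was written (roughly the role kvCount plays).
    record CountedKey(String key, String value, int kvCount) {}

    static final AtomicInteger nextCount = new AtomicInteger();
    static final List<CountedKey> map = new ArrayList<>();

    static void write(String key, String value) {
        map.add(new CountedKey(key, value, nextCount.incrementAndGet()));
    }

    // A scan captures the current count when it starts and ignores later writes.
    static List<CountedKey> scan(int startCount) {
        return map.stream().filter(e -> e.kvCount() <= startCount).toList();
    }

    public static void main(String[] args) {
        write("row1", "a");
        write("row1", "b");          // duplicate key, distinguishable only by its count
        int scanStart = nextCount.get();
        write("row1", "c");          // written after the scan started
        System.out.println(scan(scanStart)); // sees a and b, not c
    }
}
```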

@DomGarguilo DomGarguilo changed the title Preserve duplicate keys for scan resumptions WIP - Preserve duplicate keys for scan resumptions Dec 10, 2025


Development

Successfully merging this pull request may close these issues.

Scanning a table without the versioning iterator can drop keys
