-
Notifications
You must be signed in to change notification settings - Fork 415
Upsert: Don't produce empty snapshots #1810
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Yikes! This makes sure to only produce a snapshot when there is anything to update or add.
sungwy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🚀
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
| tx.overwrite(rows_to_update, overwrite_filter=overwrite_mask_predicate) | ||
| tx.overwrite(rows_to_update, overwrite_filter=overwrite_mask_predicate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we enforce not producing empty snapshots lower in the stack, in the write functions themselves (overwrite/append)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's a great suggestion. I was also looking into that, but there are two caveats:
- Before computing all the manifests, we don't know if they are empty. The delete manifests are lazily computed, so making sure that we short-circuit earlier in the process, makes sure that we don't do all the work.
- There is also a behavioral aspect around it, for example, when you do
tbl.append(df)wheredfis empty, ortbl.delete('x = 10'), you may still want to produce a snapshot to indicate that something happened. My opinion is that it should result in a no-op, but Java creates a snapshot:
I believe @sungwy and I had a similar discussion.
<!--
Thanks for opening a pull request!
-->
<!-- In the case this PR will resolve an issue, please replace
${GITHUB_ISSUE_ID} below with the actual Github issue id. -->
<!-- Closes #${GITHUB_ISSUE_ID} -->
# Rationale for this change
Yikes! This makes sure to only produce a snapshot when there is anything
to update or append.
# Are these changes tested?
Yes, by checking the snapshots that are being produced.
# Are there any user-facing changes?
Smaller metadata and faster commits when there is nothing to
append/update :)
<!-- In the case of user-facing changes, please add the changelog label.
-->
<!--
Thanks for opening a pull request!
-->
<!-- In the case this PR will resolve an issue, please replace
${GITHUB_ISSUE_ID} below with the actual Github issue id. -->
<!-- Closes #${GITHUB_ISSUE_ID} -->
# Rationale for this change
Yikes! This makes sure to only produce a snapshot when there is anything
to update or append.
# Are these changes tested?
Yes, by checking the snapshots that are being produced.
# Are there any user-facing changes?
Smaller metadata and faster commits when there is nothing to
append/update :)
<!-- In the case of user-facing changes, please add the changelog label.
-->

Rationale for this change
Yikes! This makes sure to only produce a snapshot when there is anything to update or append.
Are these changes tested?
Yes, by checking the snapshots that are being produced.
Are there any user-facing changes?
Smaller metadata and faster commits when there is nothing to append/update :)