infra: use spark base image for docker #2540
Conversation
| .config("spark.sql.shuffle.partitions", "1") | ||
| .config("spark.default.parallelism", "1") |
I think we've set these to avoid creating multiple files with just one row. This way, when a shuffle is performed, it will be coalesced into a single file. This affects tests such as positional deletes, because when all the rows in a single file are marked for deletion, the whole file is dropped instead of creating merge-on-read deletes.
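For context, here is a minimal sketch (not part of the PR) of a local session with both options applied, so that any shuffle collapses the output into a single file; the master and app name below are illustrative placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("single-file-writes")
    # One shuffle partition: shuffled writes end up in a single output file.
    .config("spark.sql.shuffle.partitions", "1")
    # One task for non-shuffled RDD operations as well.
    .config("spark.default.parallelism", "1")
    .getOrCreate()
)
```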
Yeah, I remember these. I can add them to spark-defaults, but the tests are passing now without them 🤷
The tests are passing, but we're no longer testing the positional deletes, since Spark throws away the whole file instead of creating the positional deletes. The following test illustrates the problem:
diff --git a/tests/integration/test_reads.py b/tests/integration/test_reads.py
index 375eb35b2..ed6e805e3 100644
--- a/tests/integration/test_reads.py
+++ b/tests/integration/test_reads.py
@@ -432,6 +432,11 @@ def test_pyarrow_deletes(catalog: Catalog, format_version: int) -> None:
# (11, 'k'),
# (12, 'l')
test_positional_mor_deletes = catalog.load_table(f"default.test_positional_mor_deletes_v{format_version}")
+
+ if format_version == 2:
+ files = test_positional_mor_deletes.scan().plan_files()
+ assert all([len(file.delete_files) > 0 for file in files])
+
arrow_table = test_positional_mor_deletes.scan().to_arrow()
assert arrow_table["number"].to_pylist() == [1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12]
This one passes on main but fails on this branch.
Looks like Spark Connect doesn't support these options:
.config("spark.sql.shuffle.partitions", "1")
.config("spark.default.parallelism", "1")
And `INSERT INTO` writes one data file per row. To force a single data file, I'm using:
.coalesce(1).writeTo(identifier).append()
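A rough sketch of that pattern, assuming a table `identifier` that already exists; the table name and sample rows below are placeholders, not taken from the PR:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
identifier = "default.test_positional_mor_deletes_v2"  # hypothetical table name

df = spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c")],
    schema="number int, letter string",
)

# Without coalesce(1), each row may land in its own data file; coalescing to
# one partition before the DataFrameWriterV2 append yields a single data file.
df.coalesce(1).writeTo(identifier).append()
```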
Fokko left a comment:
Looks good, thanks @kevinjqliu for adding the checks 👍
Follow-up to #2540
Related to #1527
# Rationale for this change
Put all the files related to Spark inside `dev/spark/`.
## Are these changes tested?
## Are there any user-facing changes?
# Rationale for this change
Closes #1527
This PR modifies the `dev/Dockerfile` file to use Spark as the base image. This should be better than downloading Spark from source, and likely faster on the GitHub runner. This PR also
## Are these changes tested?
## Are there any user-facing changes?