
Conversation

@kevinjqliu (Contributor) commented Sep 28, 2025

Rationale for this change

Closes #1527

This PR modifies dev/Dockerfile to use the official Spark image as the base image. This should be better than downloading Spark from source, and is likely faster on GitHub runners.
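
A minimal sketch of the idea, assuming the official apache/spark image; the exact tag, extra layers, and healthcheck command in dev/Dockerfile may differ:

    FROM apache/spark:3.5.3

    # Illustrative healthcheck: a plain TCP probe of the Spark Connect port
    # (15002 by default) so dependent services can wait until the server is up.
    HEALTHCHECK --interval=10s --timeout=5s --retries=12 \
      CMD bash -c 'exec 3<>/dev/tcp/localhost/15002' || exit 1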

This PR also:

  • modifies provision.py to use Spark Connect (a rough sketch follows this list)
  • adds a healthcheck to the Spark Docker container
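
A rough sketch of what the Spark Connect change looks like from provision.py's side; the endpoint is illustrative (15002 is Spark Connect's default port), and the real value depends on the container setup:

    from pyspark.sql import SparkSession

    # Instead of building a local session, connect to the Spark Connect server
    # running inside the Docker container.
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()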

Are these changes tested?

Are there any user-facing changes?

@kevinjqliu marked this pull request as ready for review September 28, 2025 07:29
@kevinjqliu requested a review from Fokko September 28, 2025 07:29
Comment on lines -33 to -34
.config("spark.sql.shuffle.partitions", "1")
.config("spark.default.parallelism", "1")
Contributor commented:

I think we've set these to avoid creating multiple files with just one row. This way, when a shuffle is performed, it will be coalesced into a single file. This affects tests such as positional deletes, because when all the rows in a single file are marked for deletion, the whole file is dropped instead of creating merge-on-read deletes.
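
For context, a hedged PySpark sketch of that behavior; the table name mirrors the integration tests, but the actual provisioning statements live in provision.py:

    # With merge-on-read deletes, removing *some* rows of a data file writes
    # positional delete files. If *all* rows in a file match the predicate,
    # Spark drops the file outright and no delete files are created.
    spark.sql(
        "ALTER TABLE default.test_positional_mor_deletes_v2 "
        "SET TBLPROPERTIES ('write.delete.mode' = 'merge-on-read')"
    )
    spark.sql("DELETE FROM default.test_positional_mor_deletes_v2 WHERE number = 9")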

@kevinjqliu (Contributor Author) replied:

Yeah, I remember these. I can add them to spark-defaults, but the tests are passing now without them 🤷
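
(For reference, the spark-defaults equivalent would be roughly the following two lines in a spark-defaults.conf, mirroring the removed .config(...) calls; the exact file location depends on the image:)

    spark.sql.shuffle.partitions  1
    spark.default.parallelism     1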

Contributor replied:

The tests are passing, but we're no longer testing positional deletes, since Spark throws away the whole file instead of creating the positional delete files:

The following test illustrates the problem:

diff --git a/tests/integration/test_reads.py b/tests/integration/test_reads.py
index 375eb35b2..ed6e805e3 100644
--- a/tests/integration/test_reads.py
+++ b/tests/integration/test_reads.py
@@ -432,6 +432,11 @@ def test_pyarrow_deletes(catalog: Catalog, format_version: int) -> None:
     #  (11, 'k'),
     #  (12, 'l')
     test_positional_mor_deletes = catalog.load_table(f"default.test_positional_mor_deletes_v{format_version}")
+
+    if format_version == 2:
+        files = test_positional_mor_deletes.scan().plan_files()
+        assert all([len(file.delete_files) > 0 for file in files])
+
     arrow_table = test_positional_mor_deletes.scan().to_arrow()
     assert arrow_table["number"].to_pylist() == [1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12]

This one passes on main but fails on this branch.

@kevinjqliu (Contributor Author) replied:

Looks like Spark Connect doesn't support these options:

        .config("spark.sql.shuffle.partitions", "1")
        .config("spark.default.parallelism", "1")

And INSERT INTO writes one data file per row. To force a single data file, I'm using:

.coalesce(1).writeTo(identifier).append()
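
Roughly, the write path becomes something like this; the DataFrame contents and table identifier are illustrative:

    # coalesce(1) collapses the DataFrame into a single partition so the append
    # produces exactly one data file, which keeps the delete tests meaningful.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["number", "letter"])
    df.coalesce(1).writeTo("default.my_table").append()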

@kevinjqliu requested a review from Fokko October 1, 2025 17:12
@Fokko (Contributor) left a comment:

Looks good, thanks @kevinjqliu for adding the checks 👍

@Fokko merged commit 5ee5eea into apache:main Oct 2, 2025
10 checks passed
@kevinjqliu deleted the kevinjqliu/use-spark-docker branch October 3, 2025 03:18
Fokko pushed a commit that referenced this pull request Oct 3, 2025
Follow up to #2540
Related to #1527

# Rationale for this change
Put all the files related to spark inside `dev/spark/`

## Are these changes tested?

## Are there any user-facing changes?

