CI: Limit parallelism #1764

Fokko · 2025-03-04T22:19:10Z

For the tests, we want to limit parallelism to avoid creating 1-row Parquet files.

kevinjqliu

LGTM

We previously added this config in dev/provision.py

Lines 25 to 35 in 9945f83

    
    # The configuration is important, otherwise we get many small
    # parquet files with a single row. When a positional delete
    # hits the Parquet file with one row, the parquet file gets
    # dropped instead of having a merge-on-read delete file.
    spark = (
        SparkSession
            .builder
            .config("spark.sql.shuffle.partitions", "1")
            .config("spark.default.parallelism", "1")
            .getOrCreate()
    )

It looks like these are the only two places where we create a SparkSession:
https://grep.app/search?f.repo.pattern=iceberg-python&q=getOrCreate%28%29
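
Since the same two settings are needed wherever a SparkSession is created, one option is to keep them in a single place. The following is only a minimal sketch, not what this PR does: `LOW_PARALLELISM_CONF` and `apply_low_parallelism` are hypothetical names, and the helper assumes nothing more than a builder object exposing the chainable `.config(key, value)` method that `SparkSession.builder` provides.

```python
from typing import Any, Dict

# The two settings quoted from dev/provision.py above. With a single
# shuffle partition and a default parallelism of 1, Spark is less likely
# to split small test tables into many 1-row Parquet files.
LOW_PARALLELISM_CONF: Dict[str, str] = {
    "spark.sql.shuffle.partitions": "1",
    "spark.default.parallelism": "1",
}


def apply_low_parallelism(builder: Any) -> Any:
    """Chain .config(key, value) onto a SparkSession.builder-like object.

    Hypothetical helper for illustration; it returns the builder so the
    caller can continue chaining and finish with .getOrCreate().
    """
    for key, value in LOW_PARALLELISM_CONF.items():
        builder = builder.config(key, value)
    return builder
```

Assuming pyspark is installed, usage would look like `spark = apply_low_parallelism(SparkSession.builder).getOrCreate()`. Note that `getOrCreate()` returns an already running session if one exists, so these settings should be applied before the first session is created.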

CI: Limit parallelism

7c2f5e8

kevinjqliu approved these changes Mar 4, 2025

kevinjqliu merged commit e3a5c3b into main Mar 4, 2025
7 checks passed

kevinjqliu deleted the fd-config branch March 4, 2025 22:48

gabeiglio pushed a commit to Netflix/iceberg-python that referenced this pull request Aug 13, 2025

CI: Limit parallelism (apache#1764)

44363e9
