Apache Iceberg version
1.10.0
Query engine
Spark
Please describe the bug 🐞
We are seeing severe outbound socket exhaustion (TIME_WAIT) when running Iceberg maintenance operations (specifically CALL system.rewrite_data_files) on large tables stored on S3.
This happens even with Apache HTTP client + connection pooling enabled and after removing any Hadoop/S3A usage.
The issue seems to correlate strongly with metadata / manifest (.avro) downloads, not with large data file reads.
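For reference, the call that triggers this looks roughly like the sketch below (the table name is a placeholder; `lakehouse` matches the catalog name in the configuration further down):

```python
from pyspark.sql import SparkSession

# Minimal sketch of the maintenance call; assumes the active session already
# carries the catalog configuration listed under "Relevant configuration" below.
spark = SparkSession.builder.getOrCreate()

# 'lakehouse' is our catalog name; 'db.large_table' is a placeholder table name.
spark.sql("CALL lakehouse.system.rewrite_data_files(table => 'db.large_table')")
```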
Environment
- Iceberg version: 1.10.0
- Spark version: 4.0.1
- Spark on Kubernetes (Spark Operator / SparkApplication CRD)
- Storage: Amazon S3
- FileIO: org.apache.iceberg.aws.s3.S3FileIO
- HTTP client: Apache HTTP client (Iceberg shaded)
- AWS SDK: via iceberg-aws-bundle
- No s3a://, no hadoop-aws in use
- REST Catalog (Lakekeeper), but REST traffic is minimal; sockets are clearly to S3
Observed behavior
During rewrite_data_files on a table with ~26k data files:
- Outbound connections to S3 explode to 40k–45k sockets in TIME_WAIT
- Remote IPs are public S3 endpoints (3.x, 52.x)
- Happens primarily while reading metadata files: metadata/*.json, manifest-list.avro, snap-*.avro
- Kernel ephemeral ports get exhausted, causing job instability
Socket inspection from inside the executor pod shows ~43k TIME_WAIT sockets. Top destinations:
- 3.5.x.x
- 52.218.x.x
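For reference, a minimal sketch of how such TIME_WAIT counts can be reproduced from inside the executor pod (this parses /proc/net/tcp directly; IPv4 and Linux only, and not necessarily the exact tooling used above):

```python
# Tally TIME_WAIT sockets per remote IPv4 address by parsing /proc/net/tcp.
# Run inside the executor pod; IPv4 only, Linux only.
from collections import Counter

TIME_WAIT = "06"  # TCP state code for TIME_WAIT in /proc/net/tcp

def time_wait_by_remote_ip(path="/proc/net/tcp"):
    counts = Counter()
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            rem_address, state = fields[2], fields[3]
            if state != TIME_WAIT:
                continue
            hex_ip = rem_address.split(":")[0]
            # the kernel prints the address in host byte order, so reverse the bytes
            ip = ".".join(str(int(hex_ip[i:i + 2], 16)) for i in range(6, -2, -2))
            counts[ip] += 1
    return counts

if __name__ == "__main__":
    counts = time_wait_by_remote_ip()
    print("TIME_WAIT total:", sum(counts.values()))
    for ip, n in counts.most_common(10):
        print(f"{ip:>15}  {n}")
```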
Relevant configuration
spark.sql.catalog.lakehouse.io-impl=org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.lakehouse.http-client.type=apache
spark.sql.catalog.lakehouse.http-client.apache.max-connections=200
spark.sql.catalog.lakehouse.http-client.apache.connection-max-idle-time-ms=300000
spark.sql.catalog.lakehouse.http-client.apache.connection-time-to-live-ms=3600000
spark.sql.iceberg.planning.max-threads=4   # reducing to 1 helps but does not eliminate the issue
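For completeness, a sketch of how these properties are wired into the Spark session (the REST catalog URI, warehouse, app name, and table are placeholders; the Iceberg Spark runtime and iceberg-aws-bundle jars are assumed to be on the classpath):

```python
from pyspark.sql import SparkSession

# Illustrative only: placeholder URI/warehouse; the remaining settings mirror the config above.
spark = (
    SparkSession.builder
    .appName("iceberg-rewrite-repro")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "rest")
    .config("spark.sql.catalog.lakehouse.uri", "https://lakekeeper.internal/catalog")  # placeholder
    .config("spark.sql.catalog.lakehouse.warehouse", "lakehouse")                      # placeholder
    .config("spark.sql.catalog.lakehouse.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.lakehouse.http-client.type", "apache")
    .config("spark.sql.catalog.lakehouse.http-client.apache.max-connections", "200")
    .config("spark.sql.catalog.lakehouse.http-client.apache.connection-max-idle-time-ms", "300000")
    .config("spark.sql.catalog.lakehouse.http-client.apache.connection-time-to-live-ms", "3600000")
    .config("spark.sql.iceberg.planning.max-threads", "4")
    .getOrCreate()
)

# The TIME_WAIT spike appears while the rewrite_data_files call from the description
# runs against this session.
```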
Why this looks like an Iceberg-level issue
- The connection explosion correlates with manifest/metadata access, not data file I/O
- Planning and rewrite phases appear to trigger bursty, highly parallel small-object GETs
- Even with pooling, connections are frequently closed and recreated
This suggests:
- Metadata access patterns may be too aggressively parallel
- Manifest downloads may bypass or defeat effective connection reuse
- Planning threads / metadata splits may cause connection churn beyond what pooling can absorb
Questions / possible directions
- Is metadata/manifest I/O intentionally parallelized at this level?
- Are there known issues with connection reuse during manifest reads?
- Should planning.max-threads or metadata split behavior be auto-throttled?
- Are there additional cache knobs or client reuse guarantees for metadata reads?
- Has similar behavior been observed or addressed in newer versions?
We’re happy to provide:
- Additional logs (with request paths)
- Repro steps
- Packet/socket stats
- A minimal test case if needed
Thanks — this one is pretty brutal in production environments with strict networking limits.
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time