Massive TIME_WAIT socket exhaustion during metadata (manifest/avro) reads with S3FileIO + Apache HTTP client #14951

Apache Iceberg version

1.10.0

Query engine

Spark

Please describe the bug 🐞

We are seeing severe outbound socket exhaustion (TIME_WAIT) when running Iceberg maintenance operations (specifically CALL system.rewrite_data_files) on large tables stored on S3.

This happens even with the Apache HTTP client and connection pooling enabled, and after removing all Hadoop/S3A usage.
The issue correlates strongly with metadata/manifest (.avro) downloads, not with large data file reads.
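For reference, a minimal PySpark sketch of the call that triggers this (the catalog name "lakehouse" matches the configuration further down; the table name is a placeholder):

# Hypothetical driver snippet; db.big_table stands in for a table with ~26k data files.
spark.sql("""
    CALL lakehouse.system.rewrite_data_files(table => 'db.big_table')
""")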


Environment

  • Iceberg version: 1.10.0
  • Spark version: 4.0.1
  • Spark on Kubernetes (Spark Operator / SparkApplication CRD)
  • Storage: Amazon S3
  • FileIO: org.apache.iceberg.aws.s3.S3FileIO
  • HTTP client: Apache HTTP client (Iceberg shaded)
  • AWS SDK: via iceberg-aws-bundle
  • No s3a://, no hadoop-aws in use
  • REST Catalog (Lakekeeper), but REST traffic is minimal; sockets are clearly to S3

Observed behavior

During rewrite_data_files on a table with ~26k data files:

  • Outbound connections to S3 explode to 40k–45k sockets in TIME_WAIT
  • Remote IPs are public S3 endpoints (3.x, 52.x)
  • Happens primarily while reading metadata files:
    • metadata/*.json
    • manifest-list.avro
    • snap-*.avro
  • Kernel ephemeral ports get exhausted, causing job instability

Socket inspection from inside the executor pod shows:

~43k TIME_WAIT sockets
Top destinations:
- 3.5.x.x
- 52.218.x.x
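
For reference, a minimal sketch of how these counts can be reproduced from inside the pod. It reads /proc/net/tcp directly; TCP state 06 is TIME_WAIT, and the address decoding assumes a little-endian host (the usual case for x86_64/arm64 executors):

# Count TIME_WAIT sockets grouped by remote IPv4 address (Linux only).
from collections import Counter

def time_wait_by_remote(path="/proc/net/tcp"):
    counts = Counter()
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            remote, state = fields[2], fields[3]
            if state != "06":  # 06 == TIME_WAIT
                continue
            hex_ip = remote.split(":")[0]
            # addresses are hex-encoded 32-bit words, byte-reversed on little-endian hosts
            octets = [str(int(hex_ip[i:i + 2], 16)) for i in (6, 4, 2, 0)]
            counts[".".join(octets)] += 1
    return counts

for ip, n in time_wait_by_remote().most_common(5):
    print(n, ip)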

Relevant configuration

spark.sql.catalog.lakehouse.io-impl=org.apache.iceberg.aws.s3.S3FileIO

spark.sql.catalog.lakehouse.http-client.type=apache
spark.sql.catalog.lakehouse.http-client.apache.max-connections=200
spark.sql.catalog.lakehouse.http-client.apache.connection-max-idle-time-ms=300000
spark.sql.catalog.lakehouse.http-client.apache.connection-time-to-live-ms=3600000

spark.sql.iceberg.planning.max-threads=4   # reducing this to 1 helps but does not eliminate the problem
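
For completeness, a sketch of how these properties could be applied in a PySpark session builder for a standalone repro (the catalog implementation, REST URI, and warehouse settings are omitted here, just as they are in the snippet above):

from pyspark.sql import SparkSession

# Session setup sketch; values mirror the properties listed above.
spark = (
    SparkSession.builder
    .appName("iceberg-maintenance-repro")
    .config("spark.sql.catalog.lakehouse.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.lakehouse.http-client.type", "apache")
    .config("spark.sql.catalog.lakehouse.http-client.apache.max-connections", "200")
    .config("spark.sql.catalog.lakehouse.http-client.apache.connection-max-idle-time-ms", "300000")
    .config("spark.sql.catalog.lakehouse.http-client.apache.connection-time-to-live-ms", "3600000")
    .config("spark.sql.iceberg.planning.max-threads", "4")
    .getOrCreate()
)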

Why this looks like an Iceberg-level issue

  • The connection explosion correlates with manifest/metadata access, not data file I/O
  • Planning and rewrite phases appear to trigger bursty, highly parallel small-object GETs
  • Even with pooling, connections are frequently closed and recreated

This suggests:

  • Metadata access patterns may be too aggressively parallel
  • Manifest downloads may bypass or defeat effective connection reuse
  • Planning threads / metadata splits may cause connection churn beyond what pooling can absorb

Questions / possible directions

  • Is metadata/manifest I/O intentionally parallelized at this level?
  • Are there known issues with connection reuse during manifest reads?
  • Should planning.max-threads or metadata split behavior be auto-throttled?
  • Are there additional cache knobs or client reuse guarantees for metadata reads?
  • Has similar behavior been observed or addressed in newer versions?

We’re happy to provide:

  • Additional logs (with request paths)
  • Repro steps
  • Packet/socket stats
  • A minimal test case if needed

Thanks — this one is pretty brutal in production environments with strict networking limits.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
