Skip to content

Conversation

@Fokko
Copy link
Contributor

@Fokko Fokko commented Jun 18, 2025

Rationale for this change

This is problematic if you try to implement your own FileIO. Then Streams are opened both through the FileIO and the FileSystem directly.

Are these changes tested?

Yes, existing tests.

Are there any user-facing changes?

No, but I think this makes the code esthetically also more pleasing by removing complexity.

Numbers

A while ago I did some inspection of the calls being made to S3, so just to be sure that we don't alter anything, I've collected some stats using a small "benchmark" locally:

def test_fokko(session_catalog: RestCatalog):
    parquet_file = "/Users/fokko.driesprong/Downloads/yellow_tripdata_2024-01.parquet"

    from pyarrow import parquet as pq

    df = pq.read_table(parquet_file)

    try:
        session_catalog.drop_table("default.taxi")
    except Exception:
        pass

    tbl = session_catalog.create_table("default.taxi", schema=df.schema)

    with tbl.update_spec() as tx:
        tx.add_field("tpep_pickup_datetime", "hour")

    tbl.append(df)

    rounds = []
    for _ in range(22):
        start = round(time.time() * 1000)
        assert len(tbl.scan().to_arrow()) == 2964624
        stop = round(time.time() * 1000)
        rounds.append(stop - start)

    print(f"Took: {sum(rounds) / len(rounds)} ms on average")

Main:
Took: 1715.1818181818182 ms on average

> mc admin trace --stats minio

Call                        Count       RPM    Avg Time  Min Time  Max Time  Avg TTFB  Max TTFB  Avg Size     Rate /min    Errors  
s3.GetObject                77 (29.2%)  697.9  701µs     153µs     1.6ms     463µs     838µs     ↑159B ↓712K  ↑108K ↓485M  0       
s3.HeadObject               73 (27.7%)  661.6  192µs     107µs     735µs     177µs     719µs     ↑153B        ↑99K         0       
s3.CompleteMultipartUpload  37 (14.0%)  335.4  8.2ms     1.9ms     17.5ms    8.2ms     17.5ms    ↑397B ↓507B  ↑130K ↓166K  0       
s3.NewMultipartUpload       37 (14.0%)  335.4  6.2ms     2.1ms     14.2ms    6.1ms     14.1ms    ↑153B ↓437B  ↑50K ↓143K   0       
s3.PutObjectPart            37 (14.0%)  335.4  18.4ms    5.1ms     38.8ms    18.4ms    38.8ms    ↑1.4M        ↑469M        0       
s3.PutObject                3 (1.1%)    27.2   5.4ms     3.4ms     8.8ms     5.3ms     8.8ms     ↑2.8K        ↑75K         0  

Branch:
Took: 1723.1818181818182 ms on average

> mc admin trace --stats minio

Call                        Count       RPM    Avg Time  Min Time  Max Time  Avg TTFB  Max TTFB  Avg Size     Rate /min    Errors  
s3.GetObject                77 (29.2%)  696.3  927µs     171µs     4.5ms     610µs     3.5ms     ↑159B ↓712K  ↑108K ↓484M  0       
s3.HeadObject               73 (27.7%)  660.1  222µs     109µs     1.2ms     205µs     1.2ms     ↑153B        ↑99K         0       
s3.CompleteMultipartUpload  37 (14.0%)  334.6  4.4ms     1.2ms     14.2ms    4.4ms     14.2ms    ↑397B ↓507B  ↑130K ↓166K  0       
s3.NewMultipartUpload       37 (14.0%)  334.6  4.3ms     1.2ms     15ms      4.3ms     15ms      ↑153B ↓437B  ↑50K ↓143K   0       
s3.PutObjectPart            37 (14.0%)  334.6  14.5ms    2.6ms     30.7ms    14.5ms    30.7ms    ↑1.4M        ↑468M        0       
s3.PutObject                3 (1.1%)    27.1   6.6ms     2.8ms     10.4ms    6.5ms     10.3ms    ↑2.8K        ↑75K         0  

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@kevinjqliu kevinjqliu merged commit f71806e into apache:main Jun 18, 2025
10 checks passed
@Fokko Fokko deleted the fd-rely-on-io-rather-than-filesystem branch July 2, 2025 10:37
amitgilad3 pushed a commit to amitgilad3/iceberg-python that referenced this pull request Jul 7, 2025
# Rationale for this change

This is problematic if you try to implement your own `FileIO`. Then
Streams are opened both through the FileIO and the FileSystem directly.

# Are these changes tested?

Yes, existing tests.

# Are there any user-facing changes?

No, but I think this makes the code esthetically also more pleasing by
removing complexity.

<!-- In the case of user-facing changes, please add the changelog label.
-->

# Numbers

A while ago I did some inspection of the calls being made to S3, so just
to be sure that we don't alter anything, I've collected some stats using
a small "benchmark" locally:

```python
def test_fokko(session_catalog: RestCatalog):
    parquet_file = "/Users/fokko.driesprong/Downloads/yellow_tripdata_2024-01.parquet"

    from pyarrow import parquet as pq

    df = pq.read_table(parquet_file)

    try:
        session_catalog.drop_table("default.taxi")
    except Exception:
        pass

    tbl = session_catalog.create_table("default.taxi", schema=df.schema)

    with tbl.update_spec() as tx:
        tx.add_field("tpep_pickup_datetime", "hour")

    tbl.append(df)

    rounds = []
    for _ in range(22):
        start = round(time.time() * 1000)
        assert len(tbl.scan().to_arrow()) == 2964624
        stop = round(time.time() * 1000)
        rounds.append(stop - start)

    print(f"Took: {sum(rounds) / len(rounds)} ms on average")
```

Main:
Took: 1715.1818181818182 ms on average

```
> mc admin trace --stats minio

Call                        Count       RPM    Avg Time  Min Time  Max Time  Avg TTFB  Max TTFB  Avg Size     Rate /min    Errors  
s3.GetObject                77 (29.2%)  697.9  701µs     153µs     1.6ms     463µs     838µs     ↑159B ↓712K  ↑108K ↓485M  0       
s3.HeadObject               73 (27.7%)  661.6  192µs     107µs     735µs     177µs     719µs     ↑153B        ↑99K         0       
s3.CompleteMultipartUpload  37 (14.0%)  335.4  8.2ms     1.9ms     17.5ms    8.2ms     17.5ms    ↑397B ↓507B  ↑130K ↓166K  0       
s3.NewMultipartUpload       37 (14.0%)  335.4  6.2ms     2.1ms     14.2ms    6.1ms     14.1ms    ↑153B ↓437B  ↑50K ↓143K   0       
s3.PutObjectPart            37 (14.0%)  335.4  18.4ms    5.1ms     38.8ms    18.4ms    38.8ms    ↑1.4M        ↑469M        0       
s3.PutObject                3 (1.1%)    27.2   5.4ms     3.4ms     8.8ms     5.3ms     8.8ms     ↑2.8K        ↑75K         0  
```

Branch:
Took: 1723.1818181818182 ms on average

```
> mc admin trace --stats minio

Call                        Count       RPM    Avg Time  Min Time  Max Time  Avg TTFB  Max TTFB  Avg Size     Rate /min    Errors  
s3.GetObject                77 (29.2%)  696.3  927µs     171µs     4.5ms     610µs     3.5ms     ↑159B ↓712K  ↑108K ↓484M  0       
s3.HeadObject               73 (27.7%)  660.1  222µs     109µs     1.2ms     205µs     1.2ms     ↑153B        ↑99K         0       
s3.CompleteMultipartUpload  37 (14.0%)  334.6  4.4ms     1.2ms     14.2ms    4.4ms     14.2ms    ↑397B ↓507B  ↑130K ↓166K  0       
s3.NewMultipartUpload       37 (14.0%)  334.6  4.3ms     1.2ms     15ms      4.3ms     15ms      ↑153B ↓437B  ↑50K ↓143K   0       
s3.PutObjectPart            37 (14.0%)  334.6  14.5ms    2.6ms     30.7ms    14.5ms    30.7ms    ↑1.4M        ↑468M        0       
s3.PutObject                3 (1.1%)    27.1   6.6ms     2.8ms     10.4ms    6.5ms     10.3ms    ↑2.8K        ↑75K         0  
```
gabeiglio pushed a commit to Netflix/iceberg-python that referenced this pull request Aug 13, 2025
# Rationale for this change

This is problematic if you try to implement your own `FileIO`. Then
Streams are opened both through the FileIO and the FileSystem directly.

# Are these changes tested?

Yes, existing tests.

# Are there any user-facing changes?

No, but I think this makes the code esthetically also more pleasing by
removing complexity.

<!-- In the case of user-facing changes, please add the changelog label.
-->

# Numbers

A while ago I did some inspection of the calls being made to S3, so just
to be sure that we don't alter anything, I've collected some stats using
a small "benchmark" locally:

```python
def test_fokko(session_catalog: RestCatalog):
    parquet_file = "/Users/fokko.driesprong/Downloads/yellow_tripdata_2024-01.parquet"

    from pyarrow import parquet as pq

    df = pq.read_table(parquet_file)

    try:
        session_catalog.drop_table("default.taxi")
    except Exception:
        pass

    tbl = session_catalog.create_table("default.taxi", schema=df.schema)

    with tbl.update_spec() as tx:
        tx.add_field("tpep_pickup_datetime", "hour")

    tbl.append(df)

    rounds = []
    for _ in range(22):
        start = round(time.time() * 1000)
        assert len(tbl.scan().to_arrow()) == 2964624
        stop = round(time.time() * 1000)
        rounds.append(stop - start)

    print(f"Took: {sum(rounds) / len(rounds)} ms on average")
```

Main:
Took: 1715.1818181818182 ms on average

```
> mc admin trace --stats minio

Call                        Count       RPM    Avg Time  Min Time  Max Time  Avg TTFB  Max TTFB  Avg Size     Rate /min    Errors  
s3.GetObject                77 (29.2%)  697.9  701µs     153µs     1.6ms     463µs     838µs     ↑159B ↓712K  ↑108K ↓485M  0       
s3.HeadObject               73 (27.7%)  661.6  192µs     107µs     735µs     177µs     719µs     ↑153B        ↑99K         0       
s3.CompleteMultipartUpload  37 (14.0%)  335.4  8.2ms     1.9ms     17.5ms    8.2ms     17.5ms    ↑397B ↓507B  ↑130K ↓166K  0       
s3.NewMultipartUpload       37 (14.0%)  335.4  6.2ms     2.1ms     14.2ms    6.1ms     14.1ms    ↑153B ↓437B  ↑50K ↓143K   0       
s3.PutObjectPart            37 (14.0%)  335.4  18.4ms    5.1ms     38.8ms    18.4ms    38.8ms    ↑1.4M        ↑469M        0       
s3.PutObject                3 (1.1%)    27.2   5.4ms     3.4ms     8.8ms     5.3ms     8.8ms     ↑2.8K        ↑75K         0  
```

Branch:
Took: 1723.1818181818182 ms on average

```
> mc admin trace --stats minio

Call                        Count       RPM    Avg Time  Min Time  Max Time  Avg TTFB  Max TTFB  Avg Size     Rate /min    Errors  
s3.GetObject                77 (29.2%)  696.3  927µs     171µs     4.5ms     610µs     3.5ms     ↑159B ↓712K  ↑108K ↓484M  0       
s3.HeadObject               73 (27.7%)  660.1  222µs     109µs     1.2ms     205µs     1.2ms     ↑153B        ↑99K         0       
s3.CompleteMultipartUpload  37 (14.0%)  334.6  4.4ms     1.2ms     14.2ms    4.4ms     14.2ms    ↑397B ↓507B  ↑130K ↓166K  0       
s3.NewMultipartUpload       37 (14.0%)  334.6  4.3ms     1.2ms     15ms      4.3ms     15ms      ↑153B ↓437B  ↑50K ↓143K   0       
s3.PutObjectPart            37 (14.0%)  334.6  14.5ms    2.6ms     30.7ms    14.5ms    30.7ms    ↑1.4M        ↑468M        0       
s3.PutObject                3 (1.1%)    27.1   6.6ms     2.8ms     10.4ms    6.5ms     10.3ms    ↑2.8K        ↑75K         0  
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants