Description
The code for the block cache is located here.
The relevant code, pulled from the above file:
```python
def _fetch(self, start: int | None, end: int | None) -> bytes:
    ...
    # these are cached, so safe to do multiple calls for the same start and end.
    for block_number in range(start_block_number, end_block_number + 1):
        self._fetch_block_cached(block_number)
    return self._read_cache(
        start,
        end,
        start_block_number=start_block_number,
        end_block_number=end_block_number,
    )
```
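For context, the elided lines normalize the None bounds and map byte offsets to block indices. Roughly (a paraphrased sketch, not a verbatim copy of the fsspec source):

```python
# Paraphrased sketch of the elided lines: normalize the byte range, then
# map it onto the blocks that contain its endpoints (inclusive of the
# block containing `end`).
if start is None:
    start = 0
if end is None:
    end = self.size
start_block_number = start // self.blocksize
end_block_number = end // self.blocksize
```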
Problem
- BlockCache is an LRU-based cache with a default capacity of 32 blocks. Given the default block size of 5MB, the maximum cache size is 160MB at any given time.
- In the code above, we first iterate through the required blocks to fetch them. If the requested data exceeds 160MB (more than 32 blocks), the blocks fetched at the start of the loop are evicted to make room for the later blocks before the loop even finishes (the sketch after this list models the resulting cache state).
- When _read_cache is called immediately after the loop, it iterates again from the first block, assuming the data is already present. However, because the first few blocks were evicted during the fetch loop, _read_cache forces a new network request to retrieve them. This causes a cascading eviction cycle in which every block ends up being requested twice.
Example Scenario
- If we request 33 blocks (165MB of data):
  - The loop fetches blocks 1 through 33. By the time it fetches block 33, the cache has evicted block 1 to maintain the limit of 32 items. The cache now holds [2, 3, ..., 33].
  - We then enter self._read_cache, which attempts to read block 1 first. Since block 1 is missing, it triggers a new network call, and fetching block 1 evicts block 2. The cache now holds [1, 3, 4, ..., 33].
  - _read_cache then tries to read block 2. Since it was just evicted, another network call is made, evicting block 3. The cache becomes [1, 2, 4, ..., 33].

This cycle continues for the entire read. Consequently, for any request where size > block_size * max_blocks (currently 160MB), we effectively download the data twice, as the simulation below demonstrates.
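The double download can be demonstrated with a plain LRU cache, independent of gcsfs. The sketch below is hypothetical code (fetch_block and network_calls are made-up names, not gcsfs APIs) that models the two phases of _fetch with functools.lru_cache and counts simulated network calls:

```python
from functools import lru_cache

network_calls = 0

@lru_cache(maxsize=32)             # mirrors BlockCache's default of 32 blocks
def fetch_block(block_number: int) -> bytes:
    global network_calls
    network_calls += 1             # every cache miss costs one network round trip
    return b"x"                    # stand-in for a 5MB block

blocks = range(1, 34)              # 33 blocks, one more than the cache can hold

# Phase 1: the prefetch loop warms the cache, but inserting block 33
# evicts block 1 before the loop finishes.
for b in blocks:
    fetch_block(b)

# Phase 2: the read pass walks the same range from the start; each lookup
# misses and evicts the very next block it is about to need.
for b in blocks:
    fetch_block(b)

print(network_calls)               # 66: all 33 blocks were fetched twice
```

Running this prints 66: every block is downloaded once in the warm-up pass and again in the read pass, which is the same pattern the cache statistics below show at a larger scale.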
Reproduction Script

```python
import gcsfs

client = gcsfs.GCSFileSystem('seventhsky')
with client.open('rahman-bucket/20gb-file', cache_type='blockcache') as f:
    print(f.cache)                    # fresh cache: no hits or misses yet
    data = f.read(165 * 1024 * 1024)  # spans more blocks than the cache can hold
    print(f.cache)                    # misses roughly double the blocks needed
```
Output
```
(env) margubur@instance-lin:~/gcsfs$ python3 check-gcsfs-caching-bug.py
<BlockCache:
block size : 5242880
block count : 4096
file size : 21474836480
cache hits : 0
cache misses: 0
total requested bytes: 0>
<BlockCache:
block size : 5242880
block count : 4096
file size : 21474836480
cache hits : 1
cache misses: 68
total requested bytes: 356515840>
```
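The numbers are consistent with a full double download: 68 misses × 5,242,880 bytes per block = 356,515,840 total requested bytes, exactly the figure reported. (The 165MB read appears to span an inclusive range of 34 blocks here, presumably because its end offset falls on a block boundary, and 2 × 34 = 68: every block in the range was fetched once in the loop and once more in _read_cache.)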