Description
The code for the block cache is located here.
The relevant code, pulled from the above file:
```python
def _fetch(self, start: int | None, end: int | None) -> bytes:
    ...
    # these are cached, so safe to do multiple calls for the same start and end.
    for block_number in range(start_block_number, end_block_number + 1):
        self._fetch_block_cached(block_number)
    return self._read_cache(
        start,
        end,
        start_block_number=start_block_number,
        end_block_number=end_block_number,
    )
```
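For context, the elided lines normalize the None bounds and map byte offsets to block indices. Roughly (a paraphrased sketch, not a verbatim copy of the fsspec source):

```python
# Paraphrased sketch of the elided lines: normalize the byte range, then
# map it onto the blocks that contain its endpoints (inclusive of the
# block containing `end`).
if start is None:
    start = 0
if end is None:
    end = self.size
start_block_number = start // self.blocksize
end_block_number = end // self.blocksize
```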
Problem
- BlockCache is an LRU-based cache with a default capacity of 32 blocks. Given the default block size of 5MB, the maximum cache size is 160MB at any given time.
- In the code above, we first iterate through the required blocks to fetch them. If the requested data exceeds 160MB (more than 32 blocks), the blocks fetched at the start of the loop are evicted to make room for the later blocks before the loop even finishes (the sketch after this list models the resulting cache state).
- When _read_cache is called immediately after the loop, it iterates again from the first block, assuming the data is already present. However, because the first few blocks were evicted during the fetch loop, _read_cache forces a new network request to retrieve them. This causes a cascading eviction cycle in which every block ends up being requested twice.
Example Scenario
- If we request 33 blocks (165MB of data):
  - The loop fetches blocks 1 through 33. By the time it fetches block 33, the cache has evicted block 1 to maintain the limit of 32 items. The cache now holds [2, 3, ..., 33].
  - We then enter self._read_cache, which attempts to read block 1 first. Since block 1 is missing, it triggers a new network call, and fetching block 1 evicts block 2. The cache now holds [1, 3, 4, ..., 33].
  - _read_cache then tries to read block 2. Since it was just evicted, another network call is made, evicting block 3. The cache becomes [1, 2, 4, ..., 33].

This cycle continues for the entire read. Consequently, for any request where size > block_size * max_blocks (currently 160MB), we effectively download the data twice, as the simulation below demonstrates.
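The double download can be demonstrated with a plain LRU cache, independent of gcsfs. The sketch below is hypothetical code (fetch_block and network_calls are made-up names, not gcsfs APIs) that models the two phases of _fetch with functools.lru_cache and counts simulated network calls:

```python
from functools import lru_cache

network_calls = 0

@lru_cache(maxsize=32)             # mirrors BlockCache's default of 32 blocks
def fetch_block(block_number: int) -> bytes:
    global network_calls
    network_calls += 1             # every cache miss costs one network round trip
    return b"x"                    # stand-in for a 5MB block

blocks = range(1, 34)              # 33 blocks, one more than the cache can hold

# Phase 1: the prefetch loop warms the cache, but inserting block 33
# evicts block 1 before the loop finishes.
for b in blocks:
    fetch_block(b)

# Phase 2: the read pass walks the same range from the start; each lookup
# misses and evicts the very next block it is about to need.
for b in blocks:
    fetch_block(b)

print(network_calls)               # 66: all 33 blocks were fetched twice
```

Running this prints 66: every block is downloaded once in the warm-up pass and again in the read pass, which is the same pattern the cache statistics below show at a larger scale.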
Reproduction Script

```python
import gcsfs

client = gcsfs.GCSFileSystem('seventhsky')
with client.open('rahman-bucket/20gb-file', cache_type='blockcache') as f:
    print(f.cache)                    # fresh cache: no hits or misses yet
    data = f.read(165 * 1024 * 1024)  # spans more blocks than the cache can hold
    print(f.cache)                    # misses roughly double the blocks needed
```
Output
```
(env) margubur@instance-lin:~/gcsfs$ python3 check-gcsfs-caching-bug.py
<BlockCache:
block size : 5242880
block count : 4096
file size : 21474836480
cache hits : 0
cache misses: 0
total requested bytes: 0>
<BlockCache:
block size : 5242880
block count : 4096
file size : 21474836480
cache hits : 1
cache misses: 68
total requested bytes: 356515840>
```
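The numbers are consistent with a full double download: 68 misses × 5,242,880 bytes per block = 356,515,840 total requested bytes, exactly the figure reported. (The 165MB read appears to span an inclusive range of 34 blocks here, presumably because its end offset falls on a block boundary, and 2 × 34 = 68: every block in the range was fetched once in the loop and once more in _read_cache.)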