Skip to content

Conversation

@adriangb
Copy link
Contributor

@adriangb adriangb commented Jan 1, 2026

Which issue does this PR close?

Part of #19433

Rationale for this change

In preparation for ordering inference from Parquet metadata, the cache system needs refactoring to:

  1. Support storing ordering information alongside statistics
  2. Simplify the CacheAccessor trait by removing the Extra associated type and *_with_extra methods
  3. Move validation logic into typed wrapper structs with explicit is_valid_for() methods

What changes are included in this PR?

Simplify CacheAccessor trait

Before:

pub trait CacheAccessor<K, V>: Send + Sync {
    type Extra: Clone;
    fn get(&self, k: &K) -> Option<V>;
    fn get_with_extra(&self, k: &K, e: &Self::Extra) -> Option<V>;
    fn put(&self, key: &K, value: V) -> Option<V>;
    fn put_with_extra(&self, key: &K, value: V, e: &Self::Extra) -> Option<V>;
    // ... other methods
}

After:

pub trait CacheAccessor<K, V>: Send + Sync {
    fn get(&self, key: &K) -> Option<V>;
    fn put(&self, key: &K, value: V) -> Option<V>;
    // ... other methods (no Extra type, no *_with_extra methods)
}

Introduce typed wrapper structs for cached values

Instead of passing validation metadata separately via Extra, embed it in the cached value type:

  • CachedFileMetadata - contains meta: ObjectMeta, statistics: Arc<Statistics>, ordering: Option<LexOrdering>
  • CachedFileList - contains files: Arc<Vec<ObjectMeta>> with filter_by_prefix() helper
  • CachedFileMetadataEntry - contains meta: ObjectMeta, file_metadata: Arc<dyn FileMetadata>

Each wrapper has an is_valid_for(&ObjectMeta) method that checks if the cached entry is still valid (size and last_modified match).

New validation pattern

The typical usage pattern changes from:

// Before: validation hidden in get_with_extra
if let Some(stats) = cache.get_with_extra(&path, &object_meta) {
    // use stats
}

To:

// After: explicit validation
if let Some(cached) = cache.get(&path) {
    if cached.is_valid_for(&object_meta) {
        // use cached.statistics
    }
}

Add ordering support

  • CachedFileMetadata has new ordering: Option<LexOrdering> field
  • FileStatisticsCacheEntry has new has_ordering: bool field for introspection

Are these changes tested?

Yes, existing cache tests pass plus new tests for ordering support.

Are there any user-facing changes?

Breaking change to cache traits. Users with custom cache implementations will need to:

  1. Update CacheAccessor impl to remove Extra type and *_with_extra methods
  2. Update cached value types to the new wrappers (CachedFileMetadata, etc.)
  3. Update callsites to use the new validation pattern with is_valid_for()

🤖 Generated with Claude Code

@github-actions github-actions bot added catalog Related to the catalog crate execution Related to the execution crate datasource Changes to the datasource crate labels Jan 1, 2026
@adriangb adriangb force-pushed the cache-ordering-refactor branch 2 times, most recently from 3697e42 to 12e3efe Compare January 1, 2026 21:39
@github-actions github-actions bot added the core Core DataFusion crate label Jan 1, 2026
@adriangb adriangb force-pushed the cache-ordering-refactor branch from 12e3efe to aa3f29c Compare January 1, 2026 22:12
@github-actions github-actions bot removed the core Core DataFusion crate label Jan 1, 2026
@adriangb adriangb requested a review from Copilot January 1, 2026 22:24
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the cache API trait hierarchy to prepare for ordering inference from Parquet metadata. The refactoring eliminates the awkward Extra generic parameter from CacheAccessor, introduces wrapper types for cached data (CachedFileMetadata, CachedFileList, CachedFileMetadataEntry), and establishes a cleaner trait hierarchy where specific cache traits extend the base CacheAccessor trait.

Key changes:

  • Removed Extra associated type from CacheAccessor, simplifying the trait with unified get/put methods
  • Introduced CachedFileMetadata struct to store statistics and ordering information together
  • Changed cache key types from ObjectMeta to Path for consistency, with validation now handled by wrapper structs

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated no comments.

Show a summary per file
File Description
datafusion/execution/src/cache/mod.rs Refactored CacheAccessor trait to remove Extra type and get_with_extra/put_with_extra methods; improved documentation
datafusion/execution/src/cache/cache_manager.rs Added CachedFileMetadata, CachedFileList, and CachedFileMetadataEntry wrapper types with validation methods; updated trait definitions to extend CacheAccessor
datafusion/execution/src/cache/cache_unit.rs Updated DefaultFileStatisticsCache to use new CachedFileMetadata type and implement split trait hierarchy; added ordering support tests
datafusion/execution/src/cache/file_metadata_cache.rs Changed cache key from ObjectMeta to Path; validation moved to CachedFileMetadataEntry::is_valid_for; updated all tests
datafusion/execution/src/cache/list_files_cache.rs Removed inline prefix filtering from cache internals; moved to CachedFileList::filter_by_prefix helper method; simplified cache API
datafusion/datasource/src/url.rs Updated list_with_cache to use CachedFileList and apply prefix filtering after cache retrieval
datafusion/datasource-parquet/src/metadata.rs Updated to use CachedFileMetadataEntry with Path keys and explicit validation checks
datafusion/catalog-listing/src/table.rs Updated do_collect_statistics to use new cache API with CachedFileMetadata wrapper
datafusion/execution/Cargo.toml Added datafusion-physical-expr-common dependency for LexOrdering support
Comments suppressed due to low confidence (1)

datafusion/datasource/src/url.rs:402

  • The prefix filtering logic is duplicated in lines 370-382 and lines 393-402. Consider using the CachedFileList::filter_by_prefix helper method instead. Replace the inline filtering with cached.filter_by_prefix(&Some(full_prefix)) and CachedFileList::new(vec.clone()).filter_by_prefix(&Some(full_prefix)) respectively.
                if prefix.is_some() {
                    let full_prefix_str = full_prefix.as_ref();
                    cached
                        .files
                        .iter()
                        .filter(|meta| {
                            meta.location.as_ref().starts_with(full_prefix_str)
                        })
                        .cloned()
                        .collect()
                } else {
                    cached.files.as_ref().clone()
                }
            } else {
                // Cache miss - always list and cache the full table
                // This ensures we have complete data for future prefix queries
                let vec = store
                    .list(Some(table_base_path))
                    .try_collect::<Vec<ObjectMeta>>()
                    .await?;
                cache.put(table_base_path, CachedFileList::new(vec.clone()));

                // If a prefix filter was requested, apply it to the results
                if prefix.is_some() {
                    let full_prefix_str = full_prefix.as_ref();
                    vec.into_iter()
                        .filter(|meta| {
                            meta.location.as_ref().starts_with(full_prefix_str)
                        })
                        .collect()
                } else {
                    vec
                }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@adriangb
Copy link
Contributor Author

adriangb commented Jan 2, 2026

@mkleen since you're working on these caches curious what you think of this change

@adriangb
Copy link
Contributor Author

adriangb commented Jan 7, 2026

@zhuqi-lucas are you able to review this?

Copy link
Contributor

@zhuqi-lucas zhuqi-lucas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @adriangb , minor comments left.

}

/// Filter the files by prefix.
pub fn filter_by_prefix(&self, prefix: &Option<Path>) -> Vec<ObjectMeta> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible we don't need to clone here, something like:

pub fn files_matching_prefix(&self, prefix: &Option<Path>) -> Arc<Vec<ObjectMeta>> {
    match prefix {
        None => Arc::clone(&self.files),
        Some(p) => Arc::new(self.filter_by_prefix(&Some(p.clone())))
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

/// This struct embeds the [`ObjectMeta`] used for cache validation,
/// along with the cached statistics and ordering information.
#[derive(Debug, Clone)]
pub struct CachedFileMetadata {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor question, do we need to add memory_size function for this struct?

impl CachedFileMetadata {
    pub fn memory_size(&self) -> usize {
        size_of::<Self>() 
            + self.statistics.memory_size()
            + self.ordering.as_ref().map_or(0, |o| o.memory_size())
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think Statistics has memory_size() yet - see #19599

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

@adriangb adriangb force-pushed the cache-ordering-refactor branch 3 times, most recently from 94ccafa to a6c6682 Compare January 8, 2026 15:16
@adriangb
Copy link
Contributor Author

adriangb commented Jan 8, 2026

I've sadly had to resolve major conflicts multiple times on this branch 😢

@jizezhang I'm now resolving conflicts with #19616. Could you please take a look at this change and the latest commit resolving conflicts especially so I don't have to do another round of conflict resolution? Thanks!

@adriangb adriangb force-pushed the cache-ordering-refactor branch from a6c6682 to 0bd093b Compare January 8, 2026 15:41
@jizezhang
Copy link
Contributor

I've sadly had to resolve major conflicts multiple times on this branch 😢

@jizezhang I'm now resolving conflicts with #19616. Could you please take a look at this change and the latest commit resolving conflicts especially so I don't have to do another round of conflict resolution? Thanks!

I checked that TableScopedPath is unchanged and used for putting and getting entries to the cache. It looks good to me and thank you.

@adriangb adriangb force-pushed the cache-ordering-refactor branch from 71eb8d3 to 39e9650 Compare January 8, 2026 17:26
@adriangb
Copy link
Contributor Author

adriangb commented Jan 8, 2026

I plan to merge this once CI passes

@adriangb adriangb added this pull request to the merge queue Jan 8, 2026
Merged via the queue into apache:main with commit 0cf45ca Jan 8, 2026
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

catalog Related to the catalog crate datasource Changes to the datasource crate execution Related to the execution crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants