Simplify provider & data set resolution logic & API #495

@rvagg

Related:

Summary

The provider and data set resolution logic in StorageContext has grown to accommodate many potential use cases. This flexibility has come at the cost of maintainability, testability, and user comprehension. This proposal extracts resolution logic into a focused ProviderResolver interface with simpler implementations that cover the use cases we need to support, plus basic options to support golden-path top-level APIs.

Problem

The logic inside createContext() and createContexts() has become difficult to maintain, test, and document. We've introduced flexibility (much of it present from the beginning) to solve for imagined user needs, at the cost of significant code complexity.

Most users fall into one of three categories:

  1. "Just store my data": no opinions about providers or data sets
  2. "Use these specific providers": explicit provider selection
  3. "Use these specific data sets": Filecoin Pin Demo is an example of this

The current implementation tries to accommodate partial specifications, cascading fallbacks, and mixing of options in ways that are hard to reason about and harder to explain.

Current complexity

createContexts() implements a cascading three-tier resolution:

  1. If dataSetIds provided -> resolve each via resolveByDataSetId() (up to count)
  2. If still need more AND providerIds provided -> resolve remaining via resolveByProviderId() (filtering out already-resolved providers)
  3. If still need more -> fill remaining slots via smartSelectProvider() (excluding all previously resolved)

Each tier maintains its own exclusion tracking and conditionally hands off to the next.
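
To make the handoffs concrete, the cascade can be sketched roughly as follows. This is a simplified, synchronous illustration and not the actual createContexts() code: the Tiers interface and its method names are hypothetical stand-ins for resolveByDataSetId(), resolveByProviderId(), and smartSelectProvider(), and the Resolved shape is an assumption.

```typescript
// Hypothetical, simplified shapes; the real methods are async and query the chain.
type Resolved = { providerId: number; dataSetId: number | null }

interface Tiers {
  byDataSetId(id: number): Resolved
  byProviderId(id: number): Resolved
  smartSelect(excludeProviderIds: number[]): Resolved | null
}

function cascade(
  tiers: Tiers,
  count: number,
  dataSetIds: number[] = [],
  providerIds: number[] = []
): Resolved[] {
  const out: Resolved[] = []
  // Exclusion tracking: every tier has to know what earlier tiers resolved.
  const used = (): number[] => out.map((r) => r.providerId)

  // Tier 1: explicit data sets, up to count
  for (const id of dataSetIds) {
    if (out.length >= count) break
    out.push(tiers.byDataSetId(id))
  }
  // Tier 2: explicit providers, skipping providers already resolved
  for (const id of providerIds) {
    if (out.length >= count) break
    if (!used().includes(id)) out.push(tiers.byProviderId(id))
  }
  // Tier 3: automatic selection for whatever slots are left
  while (out.length < count) {
    const next = tiers.smartSelect(used())
    if (next === null) break
    out.push(next)
  }
  return out
}
```

Even in this reduced form, each tier depends on the state left behind by the previous one, which is the coupling the proposal aims to remove.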

Introducing the concept of endorsed storage providers (SPs) adds even more complexity, with two new tiers within smartSelectProvider().

resolveProviderAndDataSet() (for single context) has its own parallel logic tree:

  • Checks dataSetId -> resolveByDataSetId()
  • Else checks providerId -> resolveByProviderId()
  • Else checks providerAddress -> resolveByProviderAddress()
  • Else -> smartSelectProvider()

Additional complexity:

  • forceCreateDataSet / forceCreateDataSets flags alter behaviour at multiple levels
  • excludeProviderIds adds another dimension of filtering
  • Singular options (providerId, providerAddress, dataSetId) vs plural (providerIds, dataSetIds) have different code paths
  • Metadata matching, provider ping validation, and data set preference sorting are duplicated across methods
  • dataSetId = -1 sentinel value indicates "create new" vs existing

Further, if we move to an API with separate interaction modes for a primary SP ("endorsed") and one or more secondary SPs (see multi-copy upload via SP-to-SP fetch), we'd need to either introduce another selection tier or strictly enforce ordering in the selection process.

Proposal

Trim down the options and focus on three use cases that represent how users actually interact with the SDK:

  1. User has no opinions: using upload() or createContexts() with no options. We figure out what to do based on what we find on-chain for their wallet.
  2. User has opinions about providers: supply provider IDs, count must match, we find or create data sets for those providers.
  3. User has opinions about data sets: supply data set IDs, count must match, we validate ownership and use those data sets.

In all cases, we identify an "endorsed" provider from the resolved set and return it first. If no endorsed provider is available (e.g., user specified non-endorsed providers), the first result is treated as "primary" for upload purposes.
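
A minimal sketch of that ordering rule, assuming each resolved entry carries a boolean endorsed flag (the field name and the helper names are hypothetical; how endorsement is actually detected is not specified here):

```typescript
// Stable partition: endorsed entries first, original order otherwise preserved.
function orderByEndorsement<T extends { endorsed: boolean }>(resolved: T[]): T[] {
  return [
    ...resolved.filter((r) => r.endorsed),
    ...resolved.filter((r) => !r.endorsed),
  ]
}

// The primary is simply the first entry after ordering, so the fallback to
// "first result" when nothing is endorsed needs no special case.
function pickPrimary<T extends { endorsed: boolean }>(resolved: T[]): T | undefined {
  return orderByEndorsement(resolved)[0]
}
```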

Simplified options

Remove:

  • providerId (singular): use providerIds: [id]
  • providerAddress: can query registry by ID if needed
  • dataSetId (singular): use dataSetIds: [id]
  • excludeProviderIds: no longer needed with explicit selection model
  • dev, withIpni: not needed
  • forceCreateDataSet / forceCreateDataSets: can be achieved with a custom ProviderResolver (below)

Keep:

  • count: number of contexts (default: 2)
  • dataSetIds: explicit data set selection
  • providerIds: explicit provider selection
  • metadata: for data set matching and creation
  • withCDN: sugar for metadata

Validation rules:

  • dataSetIds and providerIds are mutually exclusive, error if both provided
  • If dataSetIds provided: length must equal count
  • If providerIds provided: length must equal count
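
These rules are small enough to check upfront in one place. The sketch below assumes an options shape built from the "Keep" list; the CreateContextsOptions and validateOptions names, the numeric ID types, and the error messages are illustrative assumptions only.

```typescript
// Hypothetical options shape based on the "Keep" list above.
interface CreateContextsOptions {
  count?: number
  dataSetIds?: number[]
  providerIds?: number[]
  metadata?: Record<string, string>
  withCDN?: boolean
}

const DEFAULT_COUNT = 2

// Returns the effective count, or throws if the options are inconsistent.
function validateOptions(options: CreateContextsOptions = {}): number {
  const count = options.count ?? DEFAULT_COUNT
  const { dataSetIds, providerIds } = options

  if (dataSetIds !== undefined && providerIds !== undefined) {
    throw new Error('dataSetIds and providerIds are mutually exclusive')
  }
  if (dataSetIds !== undefined && dataSetIds.length !== count) {
    throw new Error(`dataSetIds length (${dataSetIds.length}) must equal count (${count})`)
  }
  if (providerIds !== undefined && providerIds.length !== count) {
    throw new Error(`providerIds length (${providerIds.length}) must equal count (${count})`)
  }
  return count
}
```

Read literally, the rules mean that passing three providerIds without an explicit count is rejected, because count defaults to 2.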

Resolver interface

(Thanks to @hugomrdias for seeding this idea)

Extract resolution logic into a simple interface:

interface ProviderResolver {
  resolveNext(): Promise<ProviderSelectionResult | null>
}
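
As a rough illustration of how a caller might drive this interface, the sketch below pulls results until it has enough or the resolver runs dry. It assumes that null means "no more candidates", which the interface itself does not state, and it uses a simplified, hypothetical ProviderSelectionResult shape. The collect helper and FixedListResolver are illustrative names only.

```typescript
// Simplified, hypothetical result shape; the real type likely carries more.
interface ProviderSelectionResult {
  providerId: number
  dataSetId: number | null // null here means "create a new data set"
}

interface ProviderResolver {
  resolveNext(): Promise<ProviderSelectionResult | null>
}

// Pull up to `count` results, stopping early if the resolver is exhausted.
async function collect(
  resolver: ProviderResolver,
  count: number
): Promise<ProviderSelectionResult[]> {
  const results: ProviderSelectionResult[] = []
  while (results.length < count) {
    const next = await resolver.resolveNext()
    if (next === null) break // assumption: null signals exhaustion
    results.push(next)
  }
  return results
}

// Trivial resolver over a fixed list, handy as a test double.
class FixedListResolver implements ProviderResolver {
  private index = 0
  private readonly items: ProviderSelectionResult[]

  constructor(items: ProviderSelectionResult[]) {
    this.items = items
  }

  async resolveNext(): Promise<ProviderSelectionResult | null> {
    return this.items[this.index++] ?? null
  }
}
```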

Three focused implementations we will use internally:

| Resolver | Input | Behaviour |
| --- | --- | --- |
| SmartResolver | nothing | Query chain for existing data sets and approved providers, prefer endorsed, ping validate |
| ProviderIdsResolver | provider IDs | Validate providers exist and are approved, find matching data sets or mark for creation, order by endorsement |
| DataSetIdsResolver | data set IDs | Validate ownership/live/managed, get providers from data sets, order by endorsement |

A factory function selects the appropriate implementation based on the options provided.
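
One way the dispatch could look, as a sketch under assumptions: ResolverImpls and its three functions are hypothetical placeholders for constructing SmartResolver, ProviderIdsResolver, and DataSetIdsResolver, and the result shape is simplified.

```typescript
interface ProviderSelectionResult {
  providerId: number
  dataSetId: number | null
}

interface ProviderResolver {
  resolveNext(): Promise<ProviderSelectionResult | null>
}

interface ResolverOptions {
  dataSetIds?: number[]
  providerIds?: number[]
}

// Hypothetical constructors for the three internal implementations.
interface ResolverImpls {
  smart(): ProviderResolver
  byProviderIds(ids: number[]): ProviderResolver
  byDataSetIds(ids: number[]): ProviderResolver
}

function createResolver(options: ResolverOptions, impls: ResolverImpls): ProviderResolver {
  const { dataSetIds, providerIds } = options
  // Mutual exclusivity is enforced before any resolver is built.
  if (dataSetIds !== undefined && providerIds !== undefined) {
    throw new Error('dataSetIds and providerIds are mutually exclusive')
  }
  if (dataSetIds !== undefined) return impls.byDataSetIds(dataSetIds)
  if (providerIds !== undefined) return impls.byProviderIds(providerIds)
  return impls.smart()
}
```

Because exactly one branch is taken, no resolver needs to know about the others.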

What we keep

Useful logic that remains, potentially shared across resolver implementations:

  • Metadata matching for data set reuse
  • Provider ping validation for health checking
  • Data set preference ordering: with pieces > without pieces, older first
  • Endorsement detection and ordering
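
The data set preference ordering could be expressed as a single comparator. This sketch assumes a candidate exposes a piece count and that a lower data set ID means an older data set; both the DataSetCandidate shape and the ID-as-age proxy are assumptions for illustration.

```typescript
// Hypothetical candidate shape.
interface DataSetCandidate {
  dataSetId: number
  pieceCount: number
}

// Data sets with pieces come first; within each group, older (lower ID) first.
function sortByPreference(candidates: DataSetCandidate[]): DataSetCandidate[] {
  return [...candidates].sort((a, b) => {
    const aHasPieces = a.pieceCount > 0 ? 1 : 0
    const bHasPieces = b.pieceCount > 0 ? 1 : 0
    if (aHasPieces !== bHasPieces) return bHasPieces - aHasPieces
    return a.dataSetId - b.dataSetId
  })
}
```

Keeping this as a pure function over plain data would let every resolver implementation share it without duplicating the logic.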

User-provided resolvers

The ProviderResolver interface is simple enough that advanced users could provide their own implementation if they have needs beyond the three standard cases (perhaps an external reputation service, doing per-country filtering yourself, or even just implementing forceCreateDataSets). This is an escape hatch for edge cases, not a primary API surface. We add a resolver option that overrides much of the default behaviour and lets you control it yourself.
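
For example, the forceCreateDataSets behaviour might be reproduced with a resolver along these lines. The AlwaysCreateResolver name is hypothetical, and the result shape is a simplified assumption in which a null dataSetId stands for "create a new data set".

```typescript
// Simplified, hypothetical result shape.
interface ProviderSelectionResult {
  providerId: number
  dataSetId: number | null // null: always create a new data set
}

interface ProviderResolver {
  resolveNext(): Promise<ProviderSelectionResult | null>
}

// Yields each given provider once and never reuses an existing data set.
class AlwaysCreateResolver implements ProviderResolver {
  private readonly remaining: number[]

  constructor(providerIds: number[]) {
    this.remaining = [...providerIds]
  }

  async resolveNext(): Promise<ProviderSelectionResult | null> {
    const providerId = this.remaining.shift()
    if (providerId === undefined) return null // nothing left to offer
    return { providerId, dataSetId: null }
  }
}
```

An instance of such a class would then be passed through the proposed resolver option in place of the built-in selection.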

Benefits

  • Each resolver is small, focused, and independently testable
  • No cascading tier logic or conditional handoffs between resolution strategies
  • Clear validation rules that are easy to document and explain
  • Mutual exclusivity enforced upfront rather than through complex interactions
  • Easier to add endorsement ordering without further complicating existing logic
  • Simpler mental model: users either specify what they want or let us figure it out
