DRA: Reserved Capacity #5719

@johnbelamaric

Enhancement Description

Accelerators fail. For multi-node devices in particular, it can be useful to "hold back" a certain number of accelerators so that a failure can be "repaired" by substituting in one of the held-back devices. We should consider whether this warrants any special handling in DRA. For example, we could model reserved resources so that they require a special device toleration. Reserved devices could then be scheduled with low-priority or preemptible workloads until they are needed to repair a multi-node device.
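To make the toleration idea concrete, here is a minimal sketch of the matching logic, assuming reserved devices carry a taint and only requests with a matching toleration may use them. Everything in it is hypothetical: the types are simplified stand-ins rather than the real `resource.k8s.io` API types, and the taint key `example.com/reserved` and the function names are made up for illustration.

```go
package main

// Hypothetical, simplified types for illustration only. They resemble the
// general shape of taints and tolerations but are NOT the real DRA API types.

// Taint marks a device as restricted.
type Taint struct {
	Key    string
	Value  string
	Effect string // e.g. "NoSchedule"
}

// Toleration lets a request use devices that carry a matching taint.
type Toleration struct {
	Key      string
	Operator string // "Exists" or "Equal" in this sketch
	Value    string
	Effect   string // empty matches any effect
}

// Device is one allocatable accelerator.
type Device struct {
	Name   string
	Taints []Taint
}

// ReservedTaintKey is an assumed key a vendor might put on held-back devices.
const ReservedTaintKey = "example.com/reserved"

// tolerates reports whether a single toleration covers a single taint.
// Simplification: the key must always match exactly.
func tolerates(tol Toleration, t Taint) bool {
	if tol.Key != t.Key {
		return false
	}
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "Equal":
		return tol.Value == t.Value
	default:
		return false
	}
}

// Allocatable reports whether every taint on the device is covered by at
// least one of the request's tolerations.
func Allocatable(d Device, tols []Toleration) bool {
	for _, t := range d.Taints {
		covered := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				covered = true
				break
			}
		}
		if !covered {
			return false
		}
	}
	return true
}

// Candidates returns the names of the devices in pool that a request with
// the given tolerations could be allocated, in pool order.
func Candidates(pool []Device, tols []Toleration) []string {
	var names []string
	for _, d := range pool {
		if Allocatable(d, tols) {
			names = append(names, d.Name)
		}
	}
	return names
}
```

Under these assumptions, an ordinary request with no tolerations never sees the reserved devices, while a repair request (or a preemptible workload) that tolerates `example.com/reserved` sees the whole pool. How the reserved devices get reclaimed from preemptible workloads when a repair is needed is not covered by this sketch.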

  • One-line enhancement description (can be used as a release note): DRA now allows vendors to model "reserved" capacity, which can be used to manage failures in collections of accelerators or other devices.
  • Kubernetes Enhancement Proposal: TBD
  • Discussion Link:
  • PRs by stage and milestone:
    • Alpha - v1.xx
      • KEP (k/enhancements) update PR(s):
      • Code (k/k) update PR(s):
      • Docs (k/website) update PR(s):

/wg device-management
/sig scheduling
/sig node
/cc @pohly @klueska @mortent

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

Metadata

Assignees

No one assigned

    Labels

    sig/node: Categorizes an issue or PR as relevant to SIG Node.
    sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.
    wg/device-management: Categorizes an issue or PR as relevant to WG Device Management.

    Projects

    Status

    🆕 New

    Status

    Needs Triage
