Open
Labels
sig/node: Categorizes an issue or PR as relevant to SIG Node.
sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.
wg/device-management: Categorizes an issue or PR as relevant to WG Device Management.
Description
Enhancement Description
Accelerators fail. Especially for multi-node devices, it can be useful to hold back a certain number of accelerators as spares, so that a failure can be repaired by substituting in one of the held-back devices. We should consider whether this warrants any special handling in DRA. For example, we could model reserved devices so that they require a special device toleration. Low-priority or preemptible workloads could then be scheduled onto them unless they are needed to repair a multi-node device.
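To make the toleration idea concrete, here is a minimal sketch in Python. It is not DRA's API: the `Device` and `Claim` types, the `pick_device` helper, and the `example.com/reserved` taint key are all hypothetical names invented for illustration. The sketch only assumes that a reserved device carries a taint and that a claim can use a device only if it tolerates every taint on that device.

```python
from dataclasses import dataclass

# Hypothetical taint key marking held-back capacity (an assumption, not a real DRA name).
RESERVED_TAINT = "example.com/reserved"


@dataclass(frozen=True)
class Device:
    name: str
    taints: frozenset = frozenset()  # taint keys carried by the device


@dataclass(frozen=True)
class Claim:
    name: str
    tolerations: frozenset = frozenset()  # taint keys the claim tolerates


def allocatable(device: Device, claim: Claim) -> bool:
    """A device fits a claim only if the claim tolerates every taint on it."""
    return device.taints <= claim.tolerations


def pick_device(devices, claim, in_use):
    """Return the first free device the claim may use, preferring untainted ones."""
    free = [d for d in devices if d.name not in in_use and allocatable(d, claim)]
    free.sort(key=lambda d: len(d.taints))  # stable sort keeps untainted devices first
    return free[0] if free else None


devices = [
    Device("gpu-0"),
    Device("gpu-1"),
    Device("gpu-spare", frozenset({RESERVED_TAINT})),
]
in_use = {"gpu-0", "gpu-1"}  # all regular devices are busy

ordinary = Claim("training")  # no toleration: cannot touch the spare
repair = Claim("repair", frozenset({RESERVED_TAINT}))  # tolerates the reserved taint

for claim in (ordinary, repair):
    chosen = pick_device(devices, claim, in_use)
    print(claim.name, "->", chosen.name if chosen else "unschedulable")
```

In this model, a preemptible workload would carry the same toleration as the repair claim, and a separate preemption step (not shown) would evict it when a repair needs the spare.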
- One-line enhancement description (can be used as a release note): DRA now allows vendors to model "reserved" capacity, which can be used to manage failures in collections of accelerators or other devices.
- Kubernetes Enhancement Proposal: TBD
- Discussion Link:
- PRs by stage and milestone:
  - Alpha - v1.xx
    - KEP (k/enhancements) update PR(s):
    - Code (k/k) update PR(s):
    - Docs (k/website) update PR(s):
/wg device-management
/sig scheduling
/sig node
/cc @pohly @klueska @mortent
Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.
Status: 🆕 New
Status: Needs Triage