Commit 7bf7669

Add admission controller feature
1 parent ec88832 commit 7bf7669

3 files changed: +905 -201 lines changed

Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
# Kubernetes Admission Controller

## Overview

The K8s Admission Controller prevents the Kubernetes scheduler from being overwhelmed when large numbers of workflow tasks are submitted as Pods.

### Problem

When thousands of tasks are ready to execute, creating thousands of Pods at once can overwhelm the Kubernetes scheduler, leaving many Pods stuck in the Pending state and triggering API throttling (HTTP 429 errors).

### Solution

The admission controller implements a local gate using:

1. **Watch-based state tracking** - Monitors Pod state via a long-lived watch (no polling)
2. **Token bucket rate limiter** - Smoothly paces Pod creation requests
3. **Concurrency window** - Caps the number of pending Pods
4. **Adaptive tuning** - Automatically adjusts limits based on observed cluster throughput

20+
## How It Works
21+
22+
### Two-Gate Model
23+
24+
Every task submission must pass through two gates:
25+
26+
**Gate 1: Concurrency Window**
27+
```
28+
IF pendingCount >= pendingMax THEN wait
29+
```
30+
- Prevents too many Pods stuck in Pending state
31+
- `pendingCount` tracked in real-time via Kubernetes Watch API
32+
- `pendingMax` is the maximum allowed pending Pods
33+
34+
**Gate 2: Token Bucket**
35+
```
36+
IF tokens < 1 THEN wait
37+
tokens = tokens - 1
38+
```
39+
- Smooths out bursts and paces submissions over time
40+
- Tokens refill continuously at rate `fillRate` (tokens/second)
41+
- Maximum token accumulation capped at `burst`
42+
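The two gates can be sketched as a single non-blocking check. This is an illustrative sketch only: `GateState` and `try_admit` are hypothetical names, and returning `False` stands in for the "wait" in the pseudocode. The pending count is not incremented here, on the assumption that it is updated separately from watch events.

```python
from dataclasses import dataclass


@dataclass
class GateState:
    pending_count: int  # Pods currently Pending (fed by the watch)
    pending_max: int    # concurrency window size
    tokens: float       # tokens currently in the bucket


def try_admit(state: GateState) -> bool:
    """Return True and consume one token if both gates are open."""
    # Gate 1: concurrency window.
    if state.pending_count >= state.pending_max:
        return False
    # Gate 2: token bucket.
    if state.tokens < 1:
        return False
    state.tokens -= 1
    return True
```

Checking the concurrency window first means a submission blocked by Gate 1 does not consume a token.
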
### Token Bucket Refill

```
elapsed = currentTime - lastRefillTime
newTokens = elapsed * fillRate
tokens = min(burst, tokens + newTokens)
lastRefillTime = currentTime
```

**Parameters:**
- `fillRate` - Tokens added per second (controls the sustained submission rate)
- `burst` - Maximum token accumulation (bounds the size of a burst)
- `tokens` - Currently available tokens

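The refill rule above can be sketched as a small class. This is a minimal sketch under stated assumptions: the `TokenBucket` name, the explicit `now` argument (seconds, supplied by the caller), and starting with a full bucket are all illustrative choices, not confirmed details of the controller.

```python
class TokenBucket:
    """Illustrative token bucket with continuous refill."""

    def __init__(self, fill_rate: float, burst: float, now: float = 0.0) -> None:
        self.fill_rate = fill_rate   # tokens added per second
        self.burst = burst           # cap on accumulated tokens
        self.tokens = float(burst)   # assumption: the bucket starts full
        self.last_refill = now

    def refill(self, now: float) -> None:
        # Add tokens for the elapsed time, never exceeding the burst cap.
        elapsed = now - self.last_refill
        self.tokens = min(self.burst, self.tokens + elapsed * self.fill_rate)
        self.last_refill = now

    def try_take(self, now: float) -> bool:
        # Refill first, then consume one token if available.
        self.refill(now)
        if self.tokens < 1:
            return False
        self.tokens -= 1
        return True
```

With `fill_rate=2` and `burst=3`, three requests at time 0 succeed, a fourth is refused, and one more token becomes available half a second later.
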
### Adaptive Tuning

Every second, the controller measures cluster performance and adjusts its parameters:

**Throughput Measurement:**
```
currentRate = (Pending→Running transitions) / elapsedTime   // pods/sec
runningRateEWMA = α × currentRate + (1 - α) × runningRateEWMA
```

**Fill Rate Adaptation:**
```
targetRate = max(1.0, 1.2 × runningRateEWMA)
errorPenalty = max(0.5, 1 - 2 × createErrorEWMA)
fillRate = max(1.0, min(configuredRate × 2, targetRate × errorPenalty))
```

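The throughput and fill-rate formulas translate directly into two pure functions. This is a sketch for experimenting with the numbers, not the controller's implementation; the function names are hypothetical, and `configured_rate` is assumed to be the statically configured fill rate.

```python
ALPHA = 0.2  # EWMA smoothing factor from the parameter list


def update_ewma(previous: float, sample: float, alpha: float = ALPHA) -> float:
    """Blend a new sample into an exponentially weighted moving average."""
    return alpha * sample + (1 - alpha) * previous


def adapt_fill_rate(running_rate_ewma: float,
                    create_error_ewma: float,
                    configured_rate: float) -> float:
    """Compute the next fill rate (tokens/sec) from observed throughput."""
    # Aim 20% above the observed scheduling rate, but at least 1 token/sec.
    target_rate = max(1.0, 1.2 * running_rate_ewma)
    # Back off when creation errors occur; the penalty never drops below 0.5.
    error_penalty = max(0.5, 1 - 2 * create_error_ewma)
    # Clamp the result to [1.0, 2 x configured rate].
    return max(1.0, min(configured_rate * 2, target_rate * error_penalty))
```

For example, with a smoothed rate of 6 pods/sec and no errors, a configured rate of 10 yields a fill rate of about 7.2, whereas a configured rate of 1 caps the result at 2.
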
**Pending Max Adaptation (with Hysteresis):**
```
minuteBuffer = round(60 × max(0.5, runningRateEWMA))
targetPendingMax = clamp(minuteBuffer, minPendingMax, maxPendingMax)

IF targetPendingMax > pendingMax THEN
  pendingMax = targetPendingMax   // Increase immediately (aggressive)
ELSE IF targetPendingMax < pendingMax AND timeSinceLastDecrease > hysteresisWindow THEN
  pendingMax = targetPendingMax   // Decrease at most once per hysteresis window (conservative)
  lastPendingMaxDecrease = now
ELSE
  // Keep current pendingMax (ignore temporary dip)
END
```

**Parameters:**
- `α` (alpha) = 0.2 - EWMA smoothing factor (higher = more reactive)
- `runningRateEWMA` - Exponentially weighted moving average of the scheduling rate (Pending→Running transitions, pods/sec)
- `createErrorEWMA` - EWMA of the API error rate (0-1)
- `minPendingMax` = 50 - Minimum allowed pending limit
- `maxPendingMax` = 2000 - Maximum allowed pending limit
- `hysteresisWindow` = 2 minutes - Minimum time between decreases of `pendingMax`

**Adaptive behavior:**
- **High throughput observed** → increase `fillRate` to submit more
- **API errors detected** → reduce `fillRate` via the error penalty
- **Fast scheduling** → increase `pendingMax` immediately (aggressive)
- **Temporary slowdown** → keep `pendingMax` high (prevents under-utilization)
- **Sustained slowdown (>2 min)** → decrease `pendingMax` conservatively

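The hysteresis rule for `pendingMax` can be sketched as a pure function that returns the new limit together with the updated time of the last decrease. The function name and the convention of passing times in seconds are assumptions made for this sketch; the constants come from the parameter list above.

```python
from typing import Tuple

MIN_PENDING_MAX = 50
MAX_PENDING_MAX = 2000
HYSTERESIS_WINDOW_S = 120.0  # 2 minutes


def adapt_pending_max(pending_max: int,
                      running_rate_ewma: float,
                      now: float,
                      last_decrease: float) -> Tuple[int, float]:
    """Return (new pending_max, new last_decrease timestamp)."""
    # Size the window to roughly one minute of observed scheduling throughput.
    minute_buffer = round(60 * max(0.5, running_rate_ewma))
    target = max(MIN_PENDING_MAX, min(MAX_PENDING_MAX, minute_buffer))

    if target > pending_max:
        # Increases take effect immediately.
        return target, last_decrease
    if target < pending_max and now - last_decrease > HYSTERESIS_WINDOW_S:
        # Decreases are allowed only after the hysteresis window has passed.
        return target, now
    # Otherwise keep the current limit.
    return pending_max, last_decrease
```

Returning the timestamp alongside the limit keeps the function free of hidden state, which makes the increase, hold, and decrease branches easy to exercise in isolation.
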
## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `HF_VAR_ADMISSION_CONTROLLER` | `1` | Set to `0` to disable the admission controller |
| `HF_VAR_ADMISSION_PENDING_MAX` | `200` | Maximum number of pending Pods |
| `HF_VAR_ADMISSION_FILL_RATE` | `1` | Token fill rate (tokens/sec) |
| `HF_VAR_ADMISSION_BURST` | `20` | Maximum token bucket size |
| `HF_VAR_ADMISSION_ADAPTIVE` | `1` | Set to `0` to disable adaptive tuning |
| `HF_VAR_ADMISSION_DEBUG` | `0` | Set to `1` to enable debug logging |

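To make the defaults in the table concrete, the sketch below reads these variables from an environment mapping. It only illustrates the documented defaults: the `AdmissionConfig` type, the `load_config` function, and the exact parsing rules (for example, treating any value other than `0` as "enabled") are assumptions, not the executor's actual parsing code.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AdmissionConfig:
    enabled: bool
    pending_max: int
    fill_rate: float
    burst: int
    adaptive: bool
    debug: bool


def load_config(env: Mapping[str, str] = os.environ) -> AdmissionConfig:
    """Build a config from HF_VAR_ADMISSION_* variables, using table defaults."""
    return AdmissionConfig(
        # "Set to 0 to disable": assumed on unless explicitly "0".
        enabled=env.get("HF_VAR_ADMISSION_CONTROLLER", "1") != "0",
        pending_max=int(env.get("HF_VAR_ADMISSION_PENDING_MAX", "200")),
        fill_rate=float(env.get("HF_VAR_ADMISSION_FILL_RATE", "1")),
        burst=int(env.get("HF_VAR_ADMISSION_BURST", "20")),
        adaptive=env.get("HF_VAR_ADMISSION_ADAPTIVE", "1") != "0",
        # "Set to 1 to enable": assumed off unless explicitly "1".
        debug=env.get("HF_VAR_ADMISSION_DEBUG", "0") == "1",
    )
```

Passing a plain dictionary instead of `os.environ` makes it easy to check how a given set of overrides would be interpreted.
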
## Usage

The admission controller is **automatically enabled** when using the standard HyperFlow k8s executor. No code changes are required.

To disable it:
```bash
export HF_VAR_ADMISSION_CONTROLLER=0
```

## Tuning Examples

### Conservative (Small Clusters)
```bash
export HF_VAR_ADMISSION_PENDING_MAX=50
export HF_VAR_ADMISSION_FILL_RATE=1
export HF_VAR_ADMISSION_BURST=10
```

### Aggressive (Large Clusters)
```bash
export HF_VAR_ADMISSION_PENDING_MAX=500
export HF_VAR_ADMISSION_FILL_RATE=10
export HF_VAR_ADMISSION_BURST=50
```

### Adaptive (Recommended)
```bash
# Adaptive tuning is on by default (see the table above); set explicitly here for clarity
export HF_VAR_ADMISSION_ADAPTIVE=1
# The controller automatically adjusts limits based on observed cluster performance
```
