README
======

Optimal-Execution Toy Benchmark for OpenEvolve
----------------------------------------------

This repository contains a **minimal yet complete** benchmark that lets an evolutionary-search engine learn how to execute a fixed quantity of shares in an order book with market impact.
It mirrors the structure of the earlier “function-minimisation” example but replaces the mathematical objective with a *trading* objective:

*Minimise implementation shortfall / slippage when buying or selling a random volume over a short horizon.*

The benchmark is intentionally lightweight – short Python, no external dependencies – yet it shows every building block you would find in a realistic execution engine:

1. synthetic order-book generation
2. execution-schedule parameterisation
3. a search / learning loop confined to an `EVOLVE-BLOCK`
4. an **independent evaluator** that scores candidates on unseen market scenarios.

-------------------------------------------------------------------------------

Repository Layout
-----------------

```
.
├── initial_program.py   # candidate – contains the EVOLVE-BLOCK
├── evaluator.py         # ground-truth evaluator
└── README.md            # ← you are here
```

Why two files?
• `initial_program.py` is what the evolutionary framework mutates.
• `evaluator.py` is trusted, *never* mutated, and imports nothing except the
  candidate’s public `run_search()` function.

-------------------------------------------------------------------------------

Quick-start
-----------

```
python initial_program.py
  # Runs the candidate’s own training loop (random search on α)

python evaluator.py initial_program.py
  # Scores the candidate on fresh market scenarios
```

Typical console output:

```
Best alpha: 1.482 | Estimated average slippage: 0.00834
{'value_score': 0.213, 'speed_score': 0.667,
 'reliability': 1.0, 'overall_score': 0.269}
```

-------------------------------------------------------------------------------

1. Mechanics – Inside the Candidate (`initial_program.py`)
----------------------------------------------------------

The file is split into two parts:

### 1.1 EVOLVE-BLOCK (mutable)

```python
# EVOLVE-BLOCK-START … EVOLVE-BLOCK-END
```

Only the code between those delimiters will be altered by OpenEvolve.
Everything else is *frozen*; it plays the role of a “library.”

Current strategy:

1. **Parameter** – a single scalar `alpha` (α)
   • α < 0 → front-loads the schedule
   • α = 0 → uniform (TWAP)
   • α > 0 → back-loads the schedule

2. **Search** – naïve random search over α
   (`search_algorithm()` evaluates ~250 random α’s and keeps the best.)

3. **Fitness** – measured by `evaluate_alpha()`, which in turn calls the
   **fixed** simulator (`simulate_execution`) for many random scenarios and
   averages per-share slippage.
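The random-search step can be sketched as follows. This is a minimal stand-in, not the repository’s actual code: `evaluate_alpha` here is a toy quadratic fitness, whereas the real candidate averages simulated per-share slippage over many random scenarios.

```python
import random

def evaluate_alpha(alpha):
    # toy stand-in fitness, minimised at alpha = 1.5 (illustration only;
    # the real evaluate_alpha averages simulated per-share slippage)
    return (alpha - 1.5) ** 2

def search_algorithm(n_trials=250, bounds=(-3.0, 3.0)):
    # naive random search: sample alpha uniformly, keep the best seen
    best_alpha, best_cost = 0.0, evaluate_alpha(0.0)
    for _ in range(n_trials):
        alpha = random.uniform(*bounds)
        cost = evaluate_alpha(alpha)
        if cost < best_cost:
            best_alpha, best_cost = alpha, cost
    return best_alpha, best_cost
```

With ~250 uniform samples over a small interval, the best α lands very close to the optimum – which is exactly why this baseline leaves room for smarter search to shine on harder objectives.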

Return signature required by the evaluator:

```python
def run_search() -> tuple[float, float]:
    return best_alpha, estimated_cost
```

The first element (α) is mandatory; anything after it is ignored by the
evaluator but can be useful for debugging.

### 1.2 Fixed “library” code (non-mutable)

* `create_schedule(volume, horizon, alpha)`
  Weights each slice by `(t+1)^α`, then normalises so the slices sum to the total volume.

* `simulate_execution(...)`
  Ultra-simplified micro-structure:

  • The mid-price `P_t` follows a Gaussian random walk
  • The spread is constant (`±spread/2`)
  • Market impact grows linearly with child-order size relative to
    book depth:
    `impact = (size / depth) * spread/2`

  Execution price for each slice:

  ```
  BUY : P_t + spread/2 + impact
  SELL: P_t - spread/2 - impact
  ```

  Slippage is summed over the horizon and returned *per share*.
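A sketch of what that fixed library looks like, assuming illustrative defaults (`horizon`, `p0`, `sigma`, `spread`, `depth` are assumptions, not the repository’s actual values; slippage is measured against the concurrent mid, a simplification):

```python
import random

def create_schedule(volume, horizon, alpha):
    # weight slice t by (t+1)^alpha, then normalise so slices sum to volume
    weights = [(t + 1) ** alpha for t in range(horizon)]
    total = sum(weights)
    return [volume * w / total for w in weights]

def simulate_execution(volume, side, alpha, horizon=10, p0=100.0,
                       sigma=0.05, spread=0.02, depth=5_000, seed=None):
    # Gaussian mid-price walk, constant spread, linear impact vs book depth
    rng = random.Random(seed)
    price = p0
    slippage = 0.0
    sign = 1.0 if side == "buy" else -1.0
    for size in create_schedule(volume, horizon, alpha):
        price += rng.gauss(0.0, sigma)            # mid-price random walk
        impact = (size / depth) * spread / 2      # linear in size/depth
        exec_price = price + sign * (spread / 2 + impact)
        slippage += sign * (exec_price - price) * size
    return slippage / volume                      # per-share slippage
```

Note that per-share slippage is always at least `spread/2` in this model: crossing the spread is the irreducible cost, and the schedule only controls the impact term on top of it.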

-------------------------------------------------------------------------------

2. Mechanics – The Evaluator (`evaluator.py`)
---------------------------------------------

The evaluator is the **oracle**; it owns the test scenarios and the scoring
function. A successful candidate must *generalise*: the random numbers in
the evaluator are independent of those inside the candidate.

### 2.1 Process flow

For each of `NUM_TRIALS = 10`:

1. Draw a *fresh* `(volume, side)` pair
   `volume ∈ [100, 1000]`, `side ∈ {buy, sell}`

2. Call `run_search()` **once** (time-limited to 8 s)

3. Extract α and compute:

   ```
   cost_candidate = simulate_execution(vol, side, α)
   cost_baseline  = simulate_execution(vol, side, 0.0)   # uniform TWAP
   improvement    = (cost_baseline - cost_candidate)
                    / max(cost_baseline, 1e-9)
   ```

4. Store the runtime and improvement.
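One trial of that loop can be sketched like this (a hypothetical helper, not the evaluator’s actual code; runtime measurement and the 8-second limit are omitted for brevity):

```python
def run_one_trial(run_search, simulate_execution, rng):
    # draw a fresh scenario, then score the candidate's alpha against TWAP
    volume = rng.uniform(100, 1000)
    side = rng.choice(["buy", "sell"])
    alpha = float(run_search()[0])                # first element must be alpha
    cost_candidate = simulate_execution(volume, side, alpha)
    cost_baseline = simulate_execution(volume, side, 0.0)   # uniform TWAP
    return (cost_baseline - cost_candidate) / max(cost_baseline, 1e-9)
```

A candidate that simply returns α = 0 scores zero improvement by construction, since it reproduces the TWAP baseline exactly.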

### 2.2 Scores

After the 10 trials:

```
value_score       = mean(max(0, improvement))         ∈ [0, 1]
speed_score       = min(10, 1/mean(runtime)) / 10     ∈ [0, 1]
reliability_score = success / 10                      ∈ [0, 1]

overall_score = 0.8·value + 0.1·speed + 0.1·reliability
```
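Written out in code, the recipe is a direct transcription of the formulas above (a sketch; the dict keys match the sample console output shown earlier):

```python
def combine_scores(improvements, runtimes, successes, num_trials=10):
    # value: mean clipped improvement; speed: capped inverse mean runtime;
    # reliability: fraction of trials that completed without error/time-out
    value = sum(max(0.0, i) for i in improvements) / num_trials
    mean_rt = sum(runtimes) / len(runtimes)
    speed = min(10.0, 1.0 / mean_rt) / 10.0
    reliability = successes / num_trials
    return {
        "value_score": value,
        "speed_score": speed,
        "reliability": reliability,
        "overall_score": 0.8 * value + 0.1 * speed + 0.1 * reliability,
    }
```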

Intuition:

* **Value** (quality of execution) dominates.
* **Speed** rewards fast optimisation but is capped.
* **Reliability** ensures the candidate rarely crashes or times out.

### 2.3 Stage-based evaluation (optional)

* `evaluate_stage1()` – smoke test; passes if `overall_score > 0.05`
* `evaluate_stage2()` – identical to `evaluate()`

These mirror the two-stage funnel from the previous demo.

-------------------------------------------------------------------------------

3. Extending the Benchmark
--------------------------

The framework is deliberately tiny so you can experiment.

Ideas:

1. **Richer parameterisation**
   • Add `beta` for a *U-shaped* schedule
   • Add a *child-order participation cap* (%ADV)

2. **Better search / learning**
   • Replace random search with gradient-free CMA-ES, Bayesian optimisation, or
     even RL inside the EVOLVE-BLOCK.

3. **Enhanced market model**
   • Stochastic spread
   • Non-linear impact (`impact ∝ volume^γ`)
   • Resilience (price reverts after each child order)

4. **Multi-objective scoring**
   Mix risk metrics (variance of slippage) into the evaluator.
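For instance, the non-linear impact idea is a one-line change to the impact formula (a hypothetical helper for illustration; γ = 1 recovers the linear model used in the current simulator):

```python
def impact_power_law(size, depth, spread, gamma=0.6):
    # power-law impact, concave in size when gamma < 1: impact grows
    # sublinearly with child-order size; gamma = 1 gives the linear model
    return (spread / 2) * (size / depth) ** gamma
```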

When you add knobs, remember:

* All **simulation logic for evaluation must live in `evaluator.py`**.
  Candidates cannot peek at or tamper with it.
* The evaluator must still be able to extract the *decision variables* from
  the tuple returned by `run_search()`.

-------------------------------------------------------------------------------

4. Known Limitations
--------------------

1. **Impact model is linear & memoryless**
   Good for demonstration; unrealistic for real-world HFT.

2. **No order-book micro-structure**
   We do not simulate queue positions, cancellations, hidden liquidity, etc.

3. **Single parameter α**
   Optimal execution in reality depends on volatility, spread forecasts,
   order-book imbalance, and so forth. Here we sidestep all that for clarity.

4. **Random-search baseline**
   Evolutionary engines will easily outperform it; that is the point – we
   want a hill to climb.

-------------------------------------------------------------------------------

5. FAQ
------

Q: **Why does the evaluator re-implement `simulate_execution`?**
A: To guarantee the candidate cannot cheat by hard-coding answers from its own
RNG realisations.

Q: **What happens if my `run_search()` returns something weird?**
A: The evaluator casts the *first* item to `float`. Non-numeric or `NaN`
values yield a zero score.

Q: **Is it okay to import heavy libraries (pandas, torch) inside the EVOLVE-BLOCK?**
A: Technically yes, but remember the 8-second time-out, and the judge’s machine
may not have a GPU or large RAM.

-------------------------------------------------------------------------------

6. License
----------

This example is released under the MIT License – do whatever you like, but
please keep references to the original authorship when redistributing.