| 1 | +# GitHub Copilot Reasoning Comparison Results |
| 2 | + |
| 3 | +**Date**: 2025-12-21 |
| 4 | +**Model**: gpt-5.2 |
| 5 | +**Provider**: github-copilot |
| 6 | + |
| 7 | +## Summary |
| 8 | + |
| 9 | +| API | Avg Duration | Total Reasoning Tokens | |
| 10 | +|-----|--------------|------------------------| |
| 11 | +| **Chat Completions** | **4.53s** | N/A (not reported) | |
| 12 | +| Responses (low) | 4.10s | 21 | |
| 13 | +| Responses (medium) | 5.71s | 259 | |
| 14 | +| Responses (high) | 5.37s | 356 | |
| 15 | + |
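The summary figures can be re-derived from the per-test rows in the Detailed Results tables. The sketch below is only an illustration of that arithmetic, not the comparison script itself; the `Row` type, the config labels, and the `summarize` helper are hypothetical names introduced here.

```typescript
// Per-test measurements copied from the Detailed Results tables.
// reasoningTokens is null where the API reported nothing (Chat Completions).
type Row = { config: string; durationMs: number; reasoningTokens: number | null };
type Summary = { avgSeconds: number; totalReasoningTokens: number };

const rows: Row[] = [
  { config: "chat", durationMs: 2001, reasoningTokens: null },
  { config: "low", durationMs: 1432, reasoningTokens: 21 },
  { config: "medium", durationMs: 3249, reasoningTokens: 49 },
  { config: "high", durationMs: 2196, reasoningTokens: 79 },
  { config: "chat", durationMs: 5006, reasoningTokens: null },
  { config: "low", durationMs: 5227, reasoningTokens: 0 },
  { config: "medium", durationMs: 5790, reasoningTokens: 0 },
  { config: "high", durationMs: 5940, reasoningTokens: 0 },
  { config: "chat", durationMs: 6585, reasoningTokens: null },
  { config: "low", durationMs: 5637, reasoningTokens: 0 },
  { config: "medium", durationMs: 8080, reasoningTokens: 210 },
  { config: "high", durationMs: 7969, reasoningTokens: 277 },
];

// Group rows by config, then average the durations and sum the tokens.
function summarize(input: Row[]): Record<string, Summary> {
  const totals: Record<string, { ms: number; tokens: number; count: number }> = {};
  for (const row of input) {
    if (!totals[row.config]) totals[row.config] = { ms: 0, tokens: 0, count: 0 };
    const t = totals[row.config];
    t.ms += row.durationMs;
    t.tokens += row.reasoningTokens === null ? 0 : row.reasoningTokens;
    t.count += 1;
  }
  const out: Record<string, Summary> = {};
  for (const config of Object.keys(totals)) {
    const t = totals[config];
    out[config] = { avgSeconds: t.ms / t.count / 1000, totalReasoningTokens: t.tokens };
  }
  return out;
}

const summary = summarize(rows);
for (const config of Object.keys(summary)) {
  const s = summary[config];
  console.log(`${config}: ${s.avgSeconds.toFixed(2)}s, ${s.totalReasoningTokens} reasoning tokens`);
}
```

Each average is taken over three prompts, so a single slow request moves it noticeably.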
| 16 | +## Key Finding: Chat May Use Hidden Reasoning |
| 17 | + |
| 18 | +**Chat Completions averages about 10% slower than Responses at low reasoning effort** (4.53s vs 4.10s), yet reports no reasoning tokens. |
| 19 | + |
| 20 | +This suggests: |
| 21 | +1. **Chat may use hidden reasoning**: some default reasoning level that is not exposed in usage stats |
| 22 | +2. Chat's average duration falls **between low and medium** (4.10s < 4.53s < 5.71s), so any internal reasoning is probably at roughly low-to-medium effort |
| 23 | + |
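The "10% slower" figure is the relative difference between the two averages. A minimal sketch of that calculation, using a hypothetical helper name:

```typescript
// Relative slowdown of `otherSeconds` compared with `baselineSeconds`, in percent.
function relativeSlowdownPercent(baselineSeconds: number, otherSeconds: number): number {
  return ((otherSeconds - baselineSeconds) / baselineSeconds) * 100;
}

// Averages from the Summary table: Responses (low) vs Chat Completions.
const chatVsLow = relativeSlowdownPercent(4.10, 4.53);
console.log(`Chat is ${chatVsLow.toFixed(1)}% slower than low`);
```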
| 24 | +## Detailed Results |
| 25 | + |
| 26 | +### Test 1: Math Reasoning |
| 27 | + |
| 28 | +| API | Reasoning Effort | Reasoning Tokens | Duration | |
| 29 | +|-----|------------------|------------------|----------| |
| 30 | +| chat | none | N/A | 2001ms | |
| 31 | +| responses | low | 21 | 1432ms | |
| 32 | +| responses | medium | 49 | 3249ms | |
| 33 | +| responses | high | 79 | 2196ms | |
| 34 | + |
| 35 | +### Test 2: Logic Puzzle |
| 36 | + |
| 37 | +| API | Reasoning Effort | Reasoning Tokens | Duration | |
| 38 | +|-----|------------------|------------------|----------| |
| 39 | +| chat | none | N/A | 5006ms | |
| 40 | +| responses | low | 0 | 5227ms | |
| 41 | +| responses | medium | 0 | 5790ms | |
| 42 | +| responses | high | 0 | 5940ms | |
| 43 | + |
| 44 | +*Note: This puzzle produced no reasoning tokens at any effort level, possibly because it is a well-known problem.* |
| 45 | + |
| 46 | +### Test 3: Code Analysis |
| 47 | + |
| 48 | +| API | Reasoning Effort | Reasoning Tokens | Duration | |
| 49 | +|-----|------------------|------------------|----------| |
| 50 | +| chat | none | N/A | 6585ms | |
| 51 | +| responses | low | 0 | 5637ms | |
| 52 | +| responses | medium | **210** | 8080ms | |
| 53 | +| responses | high | **277** | 7969ms | |
| 54 | + |
| 55 | +**Key Observation**: Chat (6585ms) is slower than low (5637ms) but faster than medium (8080ms) and high (7969ms). This pattern suggests Chat may use hidden reasoning between low and medium effort. |
| 56 | + |
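Whether Chat lands between low and medium can be checked per test rather than only on the averages. The sketch below does that with the durations from the three tables above; the type and function names are hypothetical.

```typescript
// Durations (ms) copied from the Test 1, Test 2 and Test 3 tables.
type TestTimings = { name: string; chatMs: number; lowMs: number; mediumMs: number };

const timings: TestTimings[] = [
  { name: "Math Reasoning", chatMs: 2001, lowMs: 1432, mediumMs: 3249 },
  { name: "Logic Puzzle", chatMs: 5006, lowMs: 5227, mediumMs: 5790 },
  { name: "Code Analysis", chatMs: 6585, lowMs: 5637, mediumMs: 8080 },
];

// True when Chat is strictly slower than low and strictly faster than medium.
function chatBetweenLowAndMedium(t: TestTimings): boolean {
  return t.lowMs < t.chatMs && t.chatMs < t.mediumMs;
}

for (const t of timings) {
  console.log(`${t.name}: ${chatBetweenLowAndMedium(t)}`);
}
```

The pattern holds for the math and code tests but not for the logic puzzle, where Chat (5006ms) was faster than low (5227ms), so the evidence is suggestive rather than conclusive.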
| 57 | +## Conclusions |
| 58 | + |
| 59 | +1. **Chat Completions does NOT report reasoning tokens** but on average takes longer than explicit low reasoning |
| 60 | + |
| 61 | +2. **Duration pattern suggests hidden reasoning**: |
| 62 | + - Chat: 4.53s average (slower than low at 4.10s, faster than medium at 5.71s) |
| 63 | + - This is consistent with Chat using some default reasoning level |
| 64 | + |
| 65 | +3. **Responses API reasoning tokens scale with effort**: |
| 66 | + - Low: 21 tokens |
| 67 | + - Medium: 259 tokens (about 12x low) |
| 68 | + - High: 356 tokens (about 17x low) |
| 69 | + |
| 70 | +4. **For maximum transparency, use the Responses API** - it reports reasoning tokens in usage stats |
| 71 | + |
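To keep "not reported" distinct from "reported as 0", a reader of the usage stats can return `null` when the field is absent. The sketch below assumes the usage payload follows the OpenAI shapes (`output_tokens_details.reasoning_tokens` for Responses, `completion_tokens_details.reasoning_tokens` for Chat Completions); whether the github-copilot provider passes these fields through unchanged is an assumption, and `extractReasoningTokens` is a hypothetical helper, not part of the comparison script.

```typescript
// Assumed usage shapes, modeled on the OpenAI APIs; the Copilot provider may differ.
type UsageLike = {
  output_tokens_details?: { reasoning_tokens?: number };     // Responses API shape
  completion_tokens_details?: { reasoning_tokens?: number }; // Chat Completions shape
};

// Returns the reported reasoning token count, or null when none was reported.
function extractReasoningTokens(usage: UsageLike | undefined): number | null {
  if (!usage) return null;
  const fromResponses = usage.output_tokens_details
    ? usage.output_tokens_details.reasoning_tokens
    : undefined;
  if (typeof fromResponses === "number") return fromResponses;
  const fromChat = usage.completion_tokens_details
    ? usage.completion_tokens_details.reasoning_tokens
    : undefined;
  if (typeof fromChat === "number") return fromChat;
  return null;
}

console.log(extractReasoningTokens({ output_tokens_details: { reasoning_tokens: 21 } }));
console.log(extractReasoningTokens({}));
```

Returning `null` instead of `0` lets a results table show N/A for Chat Completions while still showing a genuine 0 for Responses calls that did no reasoning.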
| 72 | +## How to Run |
| 73 | + |
| 74 | +```bash |
| 75 | +bun run fork/copilot-reasoning/compare-copilot-reasoning.ts |
| 76 | + |
| 77 | +# With options |
| 78 | +bun run fork/copilot-reasoning/compare-copilot-reasoning.ts -m gpt-5.2 --timeout-ms 120000 |
| 79 | +``` |