Commit e536d99

compare copilot reasoning
1 parent 11852dd commit e536d99

2 files changed: +772 -0 lines changed

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
# GitHub Copilot Reasoning Comparison Results

**Date**: 2025-12-21
**Model**: gpt-5.2
**Provider**: github-copilot

## Summary

| API | Avg Duration | Total Reasoning Tokens |
|-----|--------------|------------------------|
| **Chat Completions** | **4.53s** | 0 (not reported) |
| Responses (low) | 4.10s | 21 |
| Responses (medium) | 5.71s | 259 |
| Responses (high) | 5.37s | 356 |

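The averages and token totals in the summary table can be reproduced from the per-test rows listed under Detailed Results. The following is a minimal sketch of that aggregation, assuming each run is stored as a plain object; the `Run` type and `summarize` helper are illustrative names and are not taken from the comparison script.

```typescript
// Hypothetical record shape; the actual script's types may differ.
type Run = { config: string; durationMs: number; reasoningTokens: number | null };

// Measurements copied from the three Detailed Results tables.
const runs: Run[] = [
  { config: "chat", durationMs: 2001, reasoningTokens: null },
  { config: "low", durationMs: 1432, reasoningTokens: 21 },
  { config: "medium", durationMs: 3249, reasoningTokens: 49 },
  { config: "high", durationMs: 2196, reasoningTokens: 79 },
  { config: "chat", durationMs: 5006, reasoningTokens: null },
  { config: "low", durationMs: 5227, reasoningTokens: 0 },
  { config: "medium", durationMs: 5790, reasoningTokens: 0 },
  { config: "high", durationMs: 5940, reasoningTokens: 0 },
  { config: "chat", durationMs: 6585, reasoningTokens: null },
  { config: "low", durationMs: 5637, reasoningTokens: 0 },
  { config: "medium", durationMs: 8080, reasoningTokens: 210 },
  { config: "high", durationMs: 7969, reasoningTokens: 277 },
];

type Summary = { avgSeconds: number; totalReasoningTokens: number };

// Group runs by configuration, then average durations and sum reported tokens.
function summarize(all: Run[]): Map<string, Summary> {
  const grouped = new Map<string, Run[]>();
  for (const run of all) {
    const bucket = grouped.get(run.config) ?? [];
    bucket.push(run);
    grouped.set(run.config, bucket);
  }
  const out = new Map<string, Summary>();
  grouped.forEach((group, config) => {
    const totalMs = group.reduce((sum, r) => sum + r.durationMs, 0);
    // Unreported counts (null) are treated as 0 for the total.
    const tokens = group.reduce((sum, r) => sum + (r.reasoningTokens ?? 0), 0);
    out.set(config, { avgSeconds: totalMs / group.length / 1000, totalReasoningTokens: tokens });
  });
  return out;
}

const summary = summarize(runs);
summary.forEach((s, config) => {
  console.log(`${config}: ${s.avgSeconds.toFixed(2)}s, ${s.totalReasoningTokens} reasoning tokens`);
});
```

Rounded to two decimals, this yields the same four averages and the same token totals as the table.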
## Key Finding: Chat May Use Hidden Reasoning

**Chat Completions is about 10% slower than low reasoning effort** on average (4.53s vs 4.10s), yet reports 0 reasoning tokens.

This suggests:
1. **Chat uses hidden reasoning** - some default reasoning level that isn't exposed in usage stats
2. Chat's average duration falls **between low and medium**, suggesting it may use roughly low-to-medium reasoning internally

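The "10% slower" figure is the relative difference between the two averages. A small sketch of the arithmetic, where `percentSlower` is an illustrative helper name:

```typescript
// Relative difference of `value` against `baseline`, as a percentage.
function percentSlower(value: number, baseline: number): number {
  return ((value - baseline) / baseline) * 100;
}

const chatAvgSeconds = 4.53; // Chat Completions average from the summary table
const lowAvgSeconds = 4.10;  // Responses (low) average from the summary table

const slowdown = percentSlower(chatAvgSeconds, lowAvgSeconds);
console.log(`${slowdown.toFixed(1)}%`); // prints "10.5%"
```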
## Detailed Results

### Test 1: Math Reasoning

| API | Reasoning Effort | Reasoning Tokens | Duration |
|-----|------------------|------------------|----------|
| chat | none | N/A | 2001ms |
| responses | low | 21 | 1432ms |
| responses | medium | 49 | 3249ms |
| responses | high | 79 | 2196ms |

### Test 2: Logic Puzzle

| API | Reasoning Effort | Reasoning Tokens | Duration |
|-----|------------------|------------------|----------|
| chat | none | N/A | 5006ms |
| responses | low | 0 | 5227ms |
| responses | medium | 0 | 5790ms |
| responses | high | 0 | 5940ms |

*Note: This puzzle reported no reasoning tokens at any effort level - likely a well-known problem.*

### Test 3: Code Analysis

| API | Reasoning Effort | Reasoning Tokens | Duration |
|-----|------------------|------------------|----------|
| chat | none | N/A | 6585ms |
| responses | low | 0 | 5637ms |
| responses | medium | **210** | 8080ms |
| responses | high | **277** | 7969ms |

**Key Observation**: Chat (6585ms) is slower than low (5637ms) but faster than medium (8080ms) and high (7969ms). This pattern suggests Chat may use hidden reasoning between low and medium effort.

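The same "between low and medium" check can be applied to each test rather than only to Test 3. A minimal sketch using the durations from the three tables; `TestDurations` and `chatBetweenLowAndMedium` are illustrative names:

```typescript
type TestDurations = { name: string; chat: number; low: number; medium: number; high: number };

// Durations in milliseconds, copied from the Detailed Results tables.
const tests: TestDurations[] = [
  { name: "Math Reasoning", chat: 2001, low: 1432, medium: 3249, high: 2196 },
  { name: "Logic Puzzle", chat: 5006, low: 5227, medium: 5790, high: 5940 },
  { name: "Code Analysis", chat: 6585, low: 5637, medium: 8080, high: 7969 },
];

// True when the chat duration lies strictly between low and medium.
function chatBetweenLowAndMedium(t: TestDurations): boolean {
  return t.chat > t.low && t.chat < t.medium;
}

const verdicts = tests.map((t) => ({ name: t.name, between: chatBetweenLowAndMedium(t) }));
for (const v of verdicts) {
  console.log(`${v.name}: ${v.between ? "between low and medium" : "outside low..medium"}`);
}
```

With these numbers the check holds for Math Reasoning and Code Analysis but not for the Logic Puzzle, where Chat was the fastest of the four configurations.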
## Conclusions

1. **Chat Completions does NOT report reasoning tokens**, yet on average it takes longer than explicit low reasoning (slower than low in Tests 1 and 3, faster in Test 2)

2. **Duration pattern suggests hidden reasoning**:
   - Chat: 4.53s on average (slower than low at 4.10s, faster than medium at 5.71s)
   - This is consistent with Chat using some default reasoning level

3. **Responses API reasoning tokens scale with effort**:
   - Low: 21 tokens
   - Medium: 259 tokens (about 12x low)
   - High: 356 tokens (about 17x low)

4. **For maximum transparency, use Responses API** - it reports reasoning tokens in usage stats

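Reading the reasoning-token count out of a response can be sketched as follows. This assumes usage objects shaped like OpenAI's public APIs (`usage.output_tokens_details.reasoning_tokens` for Responses, `usage.completion_tokens_details.reasoning_tokens` for Chat Completions); whether the github-copilot provider populates either field is exactly what the results above probe, and the `reasoningTokens` helper is an illustrative name rather than part of the comparison script.

```typescript
// Assumed usage shapes, modeled on OpenAI's public API responses.
type ResponsesUsage = { output_tokens_details?: { reasoning_tokens?: number } };
type ChatUsage = { completion_tokens_details?: { reasoning_tokens?: number } };

// Returns the reported reasoning-token count, or null when the field is absent.
// Keeping null separate from 0 distinguishes "not reported" from "reported as zero".
function reasoningTokens(usage: (ResponsesUsage & ChatUsage) | undefined): number | null {
  const reported =
    usage?.output_tokens_details?.reasoning_tokens ??
    usage?.completion_tokens_details?.reasoning_tokens;
  return reported ?? null;
}

console.log(reasoningTokens({ output_tokens_details: { reasoning_tokens: 21 } })); // 21
console.log(reasoningTokens({ completion_tokens_details: {} })); // null
```

Returning `null` instead of `0` for a missing field keeps the Chat Completions case ("0 (not reported)") separate from runs such as Test 2, where the Responses API explicitly reported zero.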
## How to Run

```bash
bun run fork/copilot-reasoning/compare-copilot-reasoning.ts

# With options
bun run fork/copilot-reasoning/compare-copilot-reasoning.ts -m gpt-5.2 --timeout-ms 120000
```
