12 changes: 12 additions & 0 deletions .env.example
@@ -0,0 +1,12 @@
# Copy this file to .env and fill in your API keys
# .env is gitignored and will not be committed

# OpenRouter API Key (default provider)
OPENROUTER_API_KEY=your-openrouter-api-key-here

# Optional: Use OpenAI directly instead of OpenRouter
# LLM_PROVIDER=openai
# OPENAI_API_KEY=your-openai-api-key-here

# Optional: Override the default model
# LLM_MODEL=openai/gpt-4o
9 changes: 3 additions & 6 deletions .gitignore
@@ -1,11 +1,8 @@
__pycache__/
*.xml
.env
venv/venv/
__pycache__/
*.xml
.DS_Store
.env
venv/
venv/
venv/
myenv/
logs/
window_dump.xml
14 changes: 11 additions & 3 deletions README.md
@@ -157,7 +157,7 @@ Browser agents can't reach these. Desktop agents don't fit. **Android Use is the
- Python 3.10+
- Android device or emulator (USB debugging enabled)
- ADB (Android Debug Bridge)
- OpenAI API key
- OpenRouter API key (default) **or** OpenAI API key

### Installation

@@ -176,13 +176,21 @@ brew install android-platform-tools # macOS
# 4. Connect device & verify
adb devices

# 5. Set API key
export OPENAI_API_KEY="sk-..."
# 5. Set API key (OpenRouter is the default provider)
export OPENROUTER_API_KEY="sk-or-..."

# 6. Run your first agent
python kernel.py
```

### Alternative: Use OpenAI Directly

```bash
# Override to use OpenAI instead of OpenRouter
export LLM_PROVIDER=openai
export OPENAI_API_KEY="sk-..."
```

### Try It: Logistics Example

```python
62 changes: 62 additions & 0 deletions docs/IMPLEMENTATION_PLAN.md
@@ -0,0 +1,62 @@
# Implementation Plan: OpenRouter Default (GPT-4o via OpenRouter)

## Goal
Make **OpenRouter** the default LLM provider while preserving the current agent loop as documented in `README.md` and implemented in `kernel.py`:

- Perception: dump Android accessibility tree via `uiautomator` and sanitize it
- Reasoning: ask an LLM for the next action as **a single JSON object**
- Action: execute via ADB (`tap`, `type`, `home`, `back`, `wait`, `done`)

## Non-Goals
- Changing the agent UX (still `python kernel.py` → prompts for goal)
- Adding new actions/tool calling
- Rewriting the sanitizer logic

## Default Provider Decision
- Default provider: **OpenRouter**
- Default model via OpenRouter: **`openai/gpt-4o`**

## New Configuration (env vars)
- `OPENROUTER_API_KEY` (required by default)
- `LLM_PROVIDER` (optional override; values: `openrouter`, `openai`)
- `LLM_MODEL` (optional override; default depends on provider)
- `OPENAI_API_KEY` (only required if `LLM_PROVIDER=openai`)

## Work Breakdown (milestones)

### Milestone 1 — Add docs-first implementation instructions
- Create docs structure:
- `docs/features/openrouter-default.md`
- `docs/bugs/kernel-known-bugs.md`
- Ensure instructions are atomic and include “why” for each step.

### Milestone 2 — Implement provider abstraction (small refactor)
- Add a small “LLM client factory” that chooses:
- OpenRouter client (default)
- OpenAI client (opt-in)
- Keep the call site `client.chat.completions.create(...)` unchanged.

### Milestone 3 — Preserve JSON-action contract across models/providers
- Keep `response_format={"type":"json_object"}`.
- Add parse/validation + 1 retry if output is invalid JSON.

### Milestone 4 — Fix correctness bugs discovered during review
- Fix issues documented in `docs/bugs/kernel-known-bugs.md`.

### Milestone 5 — Update README and do a smoke test
- Update `README.md` Quick Start to prefer OpenRouter.
- Manual smoke test:
- Run `python kernel.py` with a simple goal (e.g. “go home”).
- Confirm ADB commands work and the model returns valid JSON actions.

## Acceptance Criteria
- Running with **only** `OPENROUTER_API_KEY` set works (OpenRouter default).
- Setting `LLM_PROVIDER=openai` with `OPENAI_API_KEY` works.
- Actions returned by the model are validated (no crashes on missing fields).
- Key ADB actions (`home`, `back`) use correct keycodes.

## Rollback Plan
- If OpenRouter routing/model output is unstable, keep OpenRouter default but allow fallback:
- `LLM_PROVIDER=openai`
- `LLM_MODEL=gpt-4o`

109 changes: 109 additions & 0 deletions docs/bugs/kernel-known-bugs.md
@@ -0,0 +1,109 @@
# Bugs: Known Issues in `kernel.py` (and Proposed Fixes)

This document lists bugs discovered during review that will impact correctness and/or stability. Each bug includes a proposed fix and the reason it matters.

## 1) Missing import: `List` used but not imported
**Where**
- `kernel.py`: `def run_adb_command(command: List[str]):`

**Problem**
- `List` is not imported from `typing`, so Python raises a `NameError` as soon as the `def` line is evaluated at import time.

**Proposed Fix**
- Change typing import to include `List`:
- `from typing import Dict, Any, List`

**Why it matters**
- This prevents the script from running at all.

## 2) Wrong ADB keyevent constants for Home/Back
**Where**
- `kernel.py`:
- `KEYWORDS_HOME`
- `KEYWORDS_BACK`

**Problem**
- The Android keyevent constants are `KEYCODE_HOME` and `KEYCODE_BACK`.
- Current constants will cause ADB to fail (or do nothing) when trying to go home/back.

**Proposed Fix**
- Replace with:
- `KEYCODE_HOME`
- `KEYCODE_BACK`

**Why it matters**
- Navigation actions are core to the agent loop.
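
For reference, a minimal sketch of the corrected ADB invocations (in practice these would go through `run_adb_command()`):

```python
import subprocess

# Correct Android keyevent constants for the navigation actions.
subprocess.run(["adb", "shell", "input", "keyevent", "KEYCODE_HOME"], check=True)
subprocess.run(["adb", "shell", "input", "keyevent", "KEYCODE_BACK"], check=True)
```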

## 3) Potential crash: `tap` coordinates unpacking without validation
**Where**
- `execute_action()`:
- `x, y = action.get("coordinates")`

**Problem**
- If `coordinates` is missing or malformed, unpacking throws an exception.

**Proposed Fix**
- Validate the action schema before executing:
- Ensure `coordinates` exists
- Ensure it is a 2-item list/tuple
- Ensure each value can be converted to int

**Why it matters**
- LLMs occasionally return malformed payloads; the agent should fail gracefully.
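
A minimal sketch of the defensive unpacking (the helper name is illustrative, not from `kernel.py`):

```python
def get_tap_coordinates(action: dict):
    """Return (x, y) as ints, or None if the payload is malformed."""
    coords = action.get("coordinates")
    if not isinstance(coords, (list, tuple)) or len(coords) != 2:
        return None
    try:
        return int(coords[0]), int(coords[1])
    except (TypeError, ValueError):
        return None
```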

## 4) Potential crash: `type` action assumes `text` exists
**Where**
- `execute_action()`:
- `text = action.get("text").replace(" ", "%s")`

**Problem**
- If `text` is missing, `action.get("text")` returns `None` and `.replace(...)` crashes.

**Proposed Fix**
- Validate `text` exists and is a string before calling `.replace`.

**Why it matters**
- Prevents agent from crashing mid-run.

## 5) Hard exit inside library function (`exit(0)`) reduces reusability
**Where**
- `execute_action()` on `done`:
- `exit(0)`

**Problem**
- If `run_agent()` is imported and used by another module, `exit(0)` will terminate the entire host process.

**Proposed Fix**
- Prefer returning a sentinel (e.g. `True` for completed) or raising a specific exception that `run_agent()` catches.

**Why it matters**
- Enables embedding this library into other tools/services without unexpected process termination.
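
One possible shape for the exception-based approach (`AgentDone` is an illustrative name):

```python
class AgentDone(Exception):
    """Raised when the model returns the `done` action."""

def execute_action(action: dict) -> None:
    if action.get("action") == "done":
        # Let run_agent() decide how to finish instead of killing the process.
        raise AgentDone()
    # ... handle tap / type / home / back / wait as before ...
```

`run_agent()` would then wrap `execute_action()` in a try/except and break out of its loop when `AgentDone` is raised.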

## 6) ADB error detection is brittle
**Where**
- `run_adb_command()`:
- checks `if result.stderr and "error" in result.stderr.lower()`

**Problem**
- Many ADB failures show up in stdout or return codes.
- Ignoring `returncode` can hide failures.

**Proposed Fix**
- Check `result.returncode != 0` and include both stdout/stderr in the error message.

**Why it matters**
- Makes debugging device connectivity and ADB issues far easier.
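
A sketch of the hardened helper, assuming the current signature and that callers only need stdout:

```python
import subprocess
from typing import List

def run_adb_command(command: List[str]) -> str:
    """Run an ADB command, failing loudly on any non-zero exit code."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"ADB command failed ({result.returncode}): {' '.join(command)}\n"
            f"stdout: {result.stdout}\nstderr: {result.stderr}"
        )
    return result.stdout
```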

## 7) Ambiguous `focus` usage in sanitizer (minor)
**Where**
- `sanitizer.py`:
- `is_editable = node.attrib.get("focus") == "true" or node.attrib.get("focusable") == "true"`

**Problem**
- `focus/focusable` is not the same as "editable".

**Proposed Fix**
- (Optional) Use attributes like `class` (`EditText`) or `long-clickable`/`enabled` to identify text fields more accurately.

**Why it matters**
- Better context improves LLM decision quality; not required for OpenRouter switch.
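
A possible refinement inside the sanitizer's per-node loop, assuming the attribute names produced by `uiautomator` dumps:

```python
# Treat a node as an editable text field only when its widget class says so,
# rather than inferring it from focus/focusable.
is_editable = "EditText" in node.attrib.get("class", "")
```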
121 changes: 121 additions & 0 deletions docs/features/openrouter-default.md
@@ -0,0 +1,121 @@
# Feature: Make OpenRouter the Default LLM Provider (GPT-4o)

## Summary
Refactor `kernel.py` so the default LLM provider is **OpenRouter**, using model **`openai/gpt-4o`**, while keeping the current agent loop and JSON action contract.

## Target Behavior
- Running `python kernel.py` should work with only:
- `OPENROUTER_API_KEY` set
- OpenAI remains available as an override:
- `LLM_PROVIDER=openai` + `OPENAI_API_KEY`

## Atomic Steps (with “Why”)

### 1) Decide and document env var contract
**Do**
- Define these env vars:
- `OPENROUTER_API_KEY` (required by default)
- `LLM_PROVIDER` (optional; default `openrouter`)
- `LLM_MODEL` (optional; default depends on provider)
- `OPENAI_API_KEY` (only required if `LLM_PROVIDER=openai`)

**Why**
- A junior engineer needs a single source of truth for configuration.
- Keeping OpenAI as opt-in reduces risk and makes debugging easier.

### 2) Replace the global `MODEL` constant with provider-aware defaults
**Do**
- Introduce a provider-aware model selection:
- If provider is `openrouter`: default `openai/gpt-4o`
- If provider is `openai`: default `gpt-4o`
- Allow `LLM_MODEL` to override in both cases.

**Why**
- OpenRouter uses namespaced model IDs; OpenAI does not.
- This prevents confusing “model not found” errors.

### 3) Create a tiny “LLM client factory” in `kernel.py`
**Do**
- Add a function, e.g. `get_llm_client_and_model()` that returns:
- `client`
- `model`
- Build the OpenAI SDK client like:
- OpenRouter default:
- `OpenAI(api_key=OPENROUTER_API_KEY, base_url="https://openrouter.ai/api/v1")`
- OpenAI override:
- `OpenAI(api_key=OPENAI_API_KEY)`

**Why**
- Centralizes provider logic.
- Avoids littering conditionals across `get_llm_decision()`.
- Makes future provider additions (Claude/Gemini via OpenRouter, etc.) straightforward.
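
A minimal sketch of the factory, assuming the OpenAI Python SDK (v1-style client) already used by `kernel.py`:

```python
import os
from openai import OpenAI

def get_llm_client_and_model():
    """Return a (client, model) pair based on LLM_PROVIDER / LLM_MODEL."""
    provider = os.getenv("LLM_PROVIDER", "openrouter").lower()
    if provider == "openai":
        client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
        model = os.getenv("LLM_MODEL", "gpt-4o")
    else:  # default: route through OpenRouter
        client = OpenAI(
            api_key=os.environ["OPENROUTER_API_KEY"],
            base_url="https://openrouter.ai/api/v1",
        )
        model = os.getenv("LLM_MODEL", "openai/gpt-4o")
    return client, model
```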

### 4) Add OpenRouter optional headers (non-blocking)
**Do**
- If the OpenAI SDK version in this repo supports default headers:
- Add `HTTP-Referer` and `X-Title` for OpenRouter requests.
- If it does not, skip this step.

**Why**
- OpenRouter recommends these headers for attribution/analytics.
- Not required for correctness; keep it optional to reduce implementation risk.
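
If the installed SDK version accepts a `default_headers` argument, the OpenRouter branch of the factory could pass the headers at construction time (the referer URL below is a placeholder):

```python
# Only if the installed openai SDK supports default_headers; otherwise skip.
client = OpenAI(
    api_key=os.environ["OPENROUTER_API_KEY"],
    base_url="https://openrouter.ai/api/v1",
    default_headers={
        "HTTP-Referer": "https://example.com/android-use",  # placeholder URL
        "X-Title": "Android Use",
    },
)
```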

### 5) Keep JSON response mode, but add a fallback parsing strategy
**Do**
- Keep `response_format={"type": "json_object"}`.
- Wrap JSON parsing in a try/except.
- If parsing fails:
- Retry once with a stricter prompt (still requiring only JSON output)
- If it still fails, raise a clear error that includes the raw response text.

**Why**
- Different routed models can be slightly less strict about JSON-only output.
- A single retry often fixes transient “formatting drift” without changing the UX.
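
A sketch of the parse-and-retry wrapper; `call_llm(messages)` stands in for the existing `client.chat.completions.create(...)` call and is a hypothetical helper:

```python
import json

def get_json_decision(call_llm, messages):
    """Parse the model reply as JSON; retry once with a stricter reminder."""
    raw = call_llm(messages)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        stricter = messages + [
            {"role": "system", "content": "Respond with a single valid JSON object and nothing else."}
        ]
        raw = call_llm(stricter)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Model did not return valid JSON: {raw!r}") from exc
```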

### 6) Validate the returned action schema before executing
**Do**
- Before `execute_action(decision)`:
- Validate `decision["action"]` is one of:
- `tap`, `type`, `home`, `back`, `wait`, `done`
- If `tap`, require `coordinates` as a 2-item list of ints.
- If `type`, require `text` as a non-empty string.

**Why**
- Prevents crashes and device misclicks.
- Makes the behavior consistent even when the LLM is imperfect.
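
One way the pre-execution check could look (`validate_action` is an illustrative name):

```python
VALID_ACTIONS = {"tap", "type", "home", "back", "wait", "done"}

def validate_action(decision: dict) -> None:
    """Raise ValueError if the decision violates the JSON action contract."""
    action = decision.get("action")
    if action not in VALID_ACTIONS:
        raise ValueError(f"Unknown action: {action!r}")
    if action == "tap":
        coords = decision.get("coordinates")
        if not (isinstance(coords, (list, tuple)) and len(coords) == 2
                and all(isinstance(c, int) for c in coords)):
            raise ValueError(f"'tap' requires a 2-item list of ints, got: {coords!r}")
    if action == "type":
        text = decision.get("text")
        if not isinstance(text, str) or not text:
            raise ValueError(f"'type' requires a non-empty string, got: {text!r}")
```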

### 7) Update README “Quick Start” to prefer OpenRouter
**Do**
- Replace or augment the existing OpenAI setup section with:
- `export OPENROUTER_API_KEY="..."`
- (optional) `export LLM_MODEL="openai/gpt-4o"`
- Add an “OpenAI override” snippet:
- `export LLM_PROVIDER=openai`
- `export OPENAI_API_KEY="..."`

**Why**
- Docs should match the new default so new users don’t get blocked.

### 8) Add a minimal manual smoke test checklist
**Do**
- Validate both modes:
- OpenRouter default
- OpenAI override
- Use a simple goal and verify at least one valid action executes.

**Why**
- Prevents regressions before merging.
- Junior engineers get confidence quickly with concrete steps.

## Expected Code Touch Points
- `kernel.py`
- Add provider config + client factory
- Update model constant usage
- Add JSON parsing fallback + action validation
- `README.md`
- Update environment variable setup instructions

## Definition of Done
- With `OPENROUTER_API_KEY` set, `python kernel.py` starts and makes LLM calls successfully.
- The LLM output is parsed into a JSON dict and validated.
- Actions execute without runtime exceptions for missing fields.