diff --git a/README.md b/README.md index c1e34f762..fdb66a053 100644 --- a/README.md +++ b/README.md @@ -212,11 +212,19 @@ flowchart TB %% MODEL LAYER %% ═══════════════════════════════════════════════════════════════════════ subgraph Models["Model Layer (VLMs)"] - direction LR - CLAUDE["Claude"] - GPT["GPT-4o"] - GEMINI["Gemini"] - QWEN["Qwen-VL"] + direction TB + subgraph APIModels["API Models"] + direction LR + CLAUDE["Claude"] + GPT["GPT-4o"] + GEMINI["Gemini"] + end + subgraph OpenSource["Open Source / Fine-tuned"] + direction LR + QWEN3["Qwen3-VL"] + UITARS["UI-TARS"] + OPENCUA["OpenCUA"] + end end %% ═══════════════════════════════════════════════════════════════════════ diff --git a/docs/architecture-evolution.md b/docs/architecture-evolution.md index 4426782a1..683229491 100644 --- a/docs/architecture-evolution.md +++ b/docs/architecture-evolution.md @@ -1,6 +1,6 @@ # OpenAdapt Architecture Evolution -**Version**: 2.0 +**Version**: 3.0 **Date**: January 2026 **Status**: Living Document @@ -8,131 +8,349 @@ ## Executive Summary -This document synthesizes OpenAdapt's original alpha vision with modern GUI agent state-of-the-art (SOTA) research. It defines the architectural principles, implementation status, and roadmap for OpenAdapt as the leading open-source demonstration-conditioned GUI automation framework. +This document traces the evolution of OpenAdapt from its original alpha vision through the modern modular implementation, synthesizing state-of-the-art GUI agent research into a unified framework. OpenAdapt's core innovation is **demonstration-conditioned automation**: "show, don't tell." --- ## Table of Contents -1. [Core Insight: Demonstration-Conditioned Automation](#1-core-insight-demonstration-conditioned-automation) +1. [Original Alpha Vision](#1-original-alpha-vision) 2. [The Abstraction Ladder](#2-the-abstraction-ladder) -3. [Three-Phase Architecture](#3-three-phase-architecture) -4. [Package Responsibilities](#4-package-responsibilities) -5. [Feedback Loops](#5-feedback-loops) -6. [Model Layer](#6-model-layer) -7. [Implementation Status](#7-implementation-status) -8. [Architecture Diagrams](#8-architecture-diagrams) -9. [Key Design Principles](#9-key-design-principles) -10. [Research Alignment](#10-research-alignment) -11. [Future Directions](#11-future-directions) +3. [Core Innovation: Demo-Conditioned Agents](#3-core-innovation-demo-conditioned-agents) +4. [Modern Architecture](#4-modern-architecture) +5. [SOTA GUI Agent Integration](#5-sota-gui-agent-integration) +6. [Package Responsibilities](#6-package-responsibilities) +7. [Feedback Loops](#7-feedback-loops) +8. [Implementation Status](#8-implementation-status) +9. [Architecture Evolution Diagrams](#9-architecture-evolution-diagrams) +10. [Future Directions](#10-future-directions) --- -## 1. Core Insight: Demonstration-Conditioned Automation +## 1. Original Alpha Vision -### The Fundamental Differentiator +### The Three-Stage Pipeline (2023) -OpenAdapt's fundamental differentiator is **demonstration-conditioned automation**: "show, don't tell." 
+OpenAdapt was conceived as a three-stage pipeline for AI-first process automation: -| Approach | Description | Example | -|----------|-------------|---------| -| **Prompt-Driven** (Traditional) | User describes what to do in natural language | "Book a flight from NYC to LA for next Tuesday" | -| **Demo-Conditioned** (OpenAdapt) | Agent learns from watching user perform the task | Record user booking a flight, replay with new parameters | +``` ++=====================+ +=====================+ +=====================+ +| | | | | | +| RECORDING | --> | ANALYSIS | --> | REPLAY | +| | | | | | +| Capture human | | Convert to | | Generate and | +| demonstrations: | | tokenized format | | replay synthetic | +| - Screenshots | | for LMM | | input via model | +| - User input | | processing | | completions | +| | | | | | ++=====================+ +=====================+ +=====================+ +``` -### Why This Matters +### Original Design Goals -1. **Reduced Ambiguity**: Demonstrations capture implicit knowledge that's hard to verbalize -2. **Grounded in Reality**: Agents learn from actual UI interactions, not abstract descriptions -3. **Lower Barrier to Entry**: Users don't need prompt engineering skills -4. **Validated Improvement**: 33% to 100% first-action accuracy with demo conditioning (internal benchmarks) +From the legacy README: -### The "Show, Don't Tell" Principle +> "The goal is similar to that of Robotic Process Automation (RPA), except that we use Large Multimodal Models instead of conventional RPA tools." -``` -Traditional Agent: - User: "Click the submit button" - Agent: [Which submit button? What context? What state?] +**Key Differentiators (Alpha)**: +1. **Model Agnostic** - Works with any LMM +2. **Auto-Prompted** - Learns from demonstration, not user prompts +3. **Grounded in Existing Processes** - Mitigates hallucinations +4. **Universal GUI Support** - Desktop, web, and virtualized (Citrix) +5. **Open Source** - MIT license + +### Legacy Monolithic Implementation -Demo-Conditioned Agent: - User: [Records clicking the blue "Submit Order" button after filling form] - Agent: [Learns the full context: form state, button appearance, preceding actions] +The alpha codebase (`legacy/openadapt/`) implemented: + +``` +openadapt/ + record.py # Screenshot/event capture + replay.py # Strategy-based playback + models.py # Recording, ActionEvent, Screenshot, WindowEvent + events.py # Event aggregation/processing + strategies/ + base.py # BaseReplayStrategy abstract class + naive.py # Direct literal replay + stateful.py # GPT-4 + OS-level window data + vanilla.py # Full VLM reasoning per step + visual.py # FastSAM segmentation + visual_browser.py # DOM-based segments + adapters/ + anthropic.py # Claude API integration + openai.py # GPT API integration + replicate.py # Open-source model hosting + privacy/ + base.py # Scrubbing provider interface + providers/ # Presidio, AWS Comprehend, Private AI ``` ---- +### The Strategy Pattern (Original) -## 2. The Abstraction Ladder +The original architecture used a `BaseReplayStrategy` abstract class: -OpenAdapt processes demonstrations through progressive abstraction levels, enabling generalization, transfer learning, and explainability. 
+```python +class BaseReplayStrategy(ABC): + """Base class for implementing replay strategies.""" -### Abstraction Levels + def __init__(self, recording: Recording) -> None: + self.recording = recording + self.action_events = [] + self.screenshots = [] + self.window_events = [] + @abstractmethod + def get_next_action_event( + self, + screenshot: Screenshot, + window_event: WindowEvent, + ) -> ActionEvent: + """Get the next action based on current observation.""" + pass + + def run(self) -> None: + """Execute the replay loop.""" + while True: + screenshot = Screenshot.take_screenshot() + window_event = WindowEvent.get_active_window_event() + action_event = self.get_next_action_event(screenshot, window_event) + if action_event: + playback.play_action_event(action_event, ...) ``` -Level 0 - LITERAL (Raw Events) - { press: "h", press: "i", press: " ", press: "b", press: "o", press: "b" } - | Reduction (aggregate consecutive events) - v +This pattern evolved into the modern policy/grounding separation. -Level 1 - SYMBOLIC (Semantic Actions) - { type: "hi bob" } +### Alpha Data Model - | Anonymization (extract parameters) - v +```python +class Recording: + """Container for a demonstration session.""" + id: int + timestamp: float + task_description: str + action_events: list[ActionEvent] + screenshots: list[Screenshot] + window_events: list[WindowEvent] + +class ActionEvent: + """A single user action (click, type, scroll, etc.).""" + name: str # "click", "type", "scroll", "press", "release" + timestamp: float + screenshot: Screenshot # Screenshot just before action + window_event: WindowEvent # Active window state + mouse_x, mouse_y: int # Mouse coordinates + key_char, key_name: str # Keyboard input + element_state: dict # Accessibility info + +class Screenshot: + """A captured screen image.""" + timestamp: float + png_data: bytes + image: PIL.Image +``` -Level 2 - TEMPLATE (Parameterized Actions) - { type: "hi " } +--- - | Process Mining (discover patterns) - v +## 2. The Abstraction Ladder -Level 3 - SEMANTIC (Intent Recognition) - { greet: user } +### Core Concept: Progressive Abstraction - | Goal Composition (high-level planning) - v +OpenAdapt processes demonstrations through ascending levels of abstraction, enabling generalization and transfer learning. -Level 4 - GOAL (Task Specification) - "Say hello to the customer" ``` ++=========================================================================+ +| | +| Level 4: GOAL (Task Specification) FUTURE | +| "Say hello to the customer" | +| | +| ^ | +| | Goal Composition (high-level planning) | +| | | ++=========================================================================+ +| | +| Level 3: SEMANTIC (Intent Recognition) FUTURE | +| { action: "greet", target: "user" } | +| | +| ^ | +| | Process Mining (discover patterns) | +| | | ++=========================================================================+ +| | +| Level 2: TEMPLATE (Parameterized Actions) PARTIAL | +| { type: "hi " } | +| | +| ^ | +| | Anonymization (extract parameters) | +| | | ++=========================================================================+ +| | +| Level 1: SYMBOLIC (Semantic Actions) IMPLEMENTED | +| { type: "hi bob" } | +| | +| ^ | +| | Reduction (aggregate consecutive events) | +| | | ++=========================================================================+ +| | +| Level 0: LITERAL (Raw Events) IMPLEMENTED | +| { press: "h" }, { press: "i" }, { press: " " }, { press: "b" }, ... 
| +| | ++=========================================================================+ +``` + +### Abstraction Level Details -### Abstraction Benefits +| Level | Name | Representation | Transformation | Status | +|-------|------|----------------|----------------|--------| +| 0 | **Literal** | Raw keypresses, mouse coords | None (raw capture) | **Implemented** | +| 1 | **Symbolic** | Aggregated actions (`type "hello"`) | Event reduction | **Implemented** | +| 2 | **Template** | Parameterized (`type ""`) | Regex extraction | **Partial** | +| 3 | **Semantic** | Intent-level (`greet user`) | LLM intent recognition | **Research** | +| 4 | **Goal** | Task description ("Welcome customer") | Goal composition | **Future** | + +### Why Abstraction Matters | Level | Enables | Example Use Case | |-------|---------|------------------| -| Literal | Exact replay | Debugging, audit trails | +| Literal | Exact replay, debugging | Audit trails, regression tests | | Symbolic | Human-readable logs | Training data visualization | | Template | Parameterized replay | Same task, different data | | Semantic | Cross-application transfer | Greeting in any messaging app | | Goal | Natural language control | "Greet the next customer" | -### Current Implementation Status +### Current Implementation + +**Literal to Symbolic** (`openadapt-capture`): +- Event aggregation in `events.py` +- Consecutive keypresses become `type` actions +- Mouse drags become `drag` actions +- Click sequences become `doubleclick` or `tripleclick` -- **Literal to Symbolic**: Implemented in `openadapt-capture` (event aggregation) -- **Symbolic to Template**: Partially implemented (regex-based extraction) -- **Template to Semantic**: Research stage (LLM-based intent recognition) -- **Semantic to Goal**: Future work (requires process mining) +**Symbolic to Template** (Partial): +- Regex-based parameter extraction +- User-defined placeholders + +**Template to Semantic** (Research): +- LLM-based intent recognition +- Pattern library discovery + +**Semantic to Goal** (Future): +- Process mining algorithms +- Cross-demo pattern extraction --- -## 3. Three-Phase Architecture +## 3. Core Innovation: Demo-Conditioned Agents -OpenAdapt operates in three distinct phases, each with dedicated packages and responsibilities. +### The Fundamental Differentiator -### Phase Overview +OpenAdapt's core insight is **demonstration-conditioned automation**: "show, don't tell." ``` -+------------------+ +------------------+ +------------------+ -| | | | | | -| DEMONSTRATE | --> | LEARN | --> | EXECUTE | -| | | | | | -| (Observation | | (Policy | | (Agent | -| Collection) | | Acquisition) | | Deployment) | -| | | | | | -+------------------+ +------------------+ +------------------+ ++-------------------------------------------------------------------+ +| TRADITIONAL APPROACH | ++-------------------------------------------------------------------+ +| | +| User: "Click the submit button" | +| | +| Agent: [Which submit button? What context? What state?] | +| [Multiple submit buttons on page?] 
| +| [Different applications have different buttons] | +| | +| Result: AMBIGUOUS -> Requires prompt engineering | +| | ++-------------------------------------------------------------------+ + ++-------------------------------------------------------------------+ +| DEMO-CONDITIONED APPROACH | ++-------------------------------------------------------------------+ +| | +| User: [Records clicking the blue "Submit Order" button | +| after filling out form fields] | +| | +| Agent: [Learns full context: | +| - Form state before action | +| - Button appearance and location | +| - Preceding actions in sequence | +| - Window/application context] | +| | +| Result: GROUNDED -> No prompt engineering needed | +| | ++-------------------------------------------------------------------+ +``` + +### Why Demo-Conditioning Works + +1. **Captures Implicit Knowledge**: Users demonstrate things they can't easily verbalize +2. **Grounded in Reality**: Actions tied to actual UI states, not abstract descriptions +3. **Reduces Ambiguity**: Visual context eliminates interpretation errors +4. **Lower Barrier**: No prompt engineering skills required + +### Empirical Results + +Demo conditioning improves first-action accuracy: + +| Approach | First-Action Accuracy | Notes | +|----------|----------------------|-------| +| Prompt-only | ~33% | Ambiguity in action selection | +| Demo-conditioned | ~100% | Full context from demonstration | + +### The "Show, Don't Tell" Principle + +```python +# Traditional: Prompt-driven +agent.execute("Click the submit button") +# -> Which submit button? What state? What context? + +# Demo-Conditioned: Demonstration-driven +demo = capture_demonstration() # User clicks specific submit button +agent = train_policy(demo) # Agent learns the full context +agent.execute(new_context) # Agent adapts to variations ``` --- +## 4. Modern Architecture + +### Evolution: Monolith to Meta-Package + +``` +ALPHA (2023-2024) MODERN (2025+) ++====================+ +====================+ +| | | openadapt | +| openadapt | | (meta-pkg) | +| (monolithic) | +=========+=========+ +| | | +| - record.py | +-----------------+-----------------+ +| - replay.py | | | | | | +| - strategies/ | +----+----+ +--+--+ +--+--+ +--+--+ +----+----+ +| - models.py | |capture | | ml | |evals| |viewer| |optional | +| - adapters/ | +---------+ +-----+ +-----+ +------+ +---------+ +| - privacy/ | +| - visualize.py | + grounding, retrieval, privacy +| | ++====================+ +``` + +### The Modern Three-Phase Architecture + +Building on the alpha vision, the modern architecture formalizes three phases: + +``` ++=======================+ +=======================+ +=======================+ +|| || || || || || +|| DEMONSTRATE || --> || LEARN || --> || EXECUTE || +|| || || || || || +|| (Observation || || (Policy || || (Agent || +|| Collection) || || Acquisition) || || Deployment) || +|| || || || || || +|| Packages: || || Packages: || || Packages: || +|| - capture || || - ml || || - evals || +|| - privacy || || - retrieval || || - grounding || +|| || || || || || ++=======================+ +=======================+ +=======================+ +``` + ### Phase 1: DEMONSTRATE (Observation Collection) **Purpose**: Capture rich trajectories from human demonstrations. 
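As a concrete illustration of what this phase produces, the sketch below assembles a minimal trajectory record from captured steps and serializes it to JSON. It uses only the standard library; the `Action`, `Observation`, and `Trajectory` names echo the exports listed for `openadapt-capture` elsewhere in this document, but the fields and methods shown here are assumptions for illustration, not the package's actual API.

```python
from dataclasses import dataclass, field, asdict
import json
import time


@dataclass
class Action:
    """A single user action, already aggregated to the symbolic level."""
    name: str                      # e.g. "click", "type", "scroll"
    timestamp: float
    mouse_x: int | None = None
    mouse_y: int | None = None
    text: str | None = None


@dataclass
class Observation:
    """What was on screen just before the action."""
    screenshot_path: str           # PNG stored alongside the metadata
    window_title: str
    a11y_tree: dict = field(default_factory=dict)


@dataclass
class Trajectory:
    """A demonstration: an ordered sequence of (observation, action) pairs."""
    task_description: str
    steps: list[tuple[Observation, Action]] = field(default_factory=list)

    def record_step(self, observation: Observation, action: Action) -> None:
        self.steps.append((observation, action))

    def to_json(self) -> str:
        return json.dumps(
            {
                "task_description": self.task_description,
                "steps": [
                    {"observation": asdict(obs), "action": asdict(act)}
                    for obs, act in self.steps
                ],
            },
            indent=2,
        )


if __name__ == "__main__":
    demo = Trajectory(task_description="Submit the order form")
    demo.record_step(
        Observation("step_000.png", "Orders - Web Browser"),
        Action(name="click", timestamp=time.time(), mouse_x=512, mouse_y=304),
    )
    demo.record_step(
        Observation("step_001.png", "Orders - Web Browser"),
        Action(name="type", timestamp=time.time(), text="bob@example.com"),
    )
    print(demo.to_json())
```

A real capture session additionally persists the screenshots themselves and batches metadata into Parquet, per the storage formats noted in the implementation status tables.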
@@ -148,672 +366,445 @@ OpenAdapt operates in three distinct phases, each with dedicated packages and re - Window metadata (title, bounds, process) - Audio transcription (optional) -**Privacy Integration**: -- Optional PII/PHI scrubbing before storage -- Configurable redaction levels - -**Storage Format**: -- JSON for metadata and events -- Parquet for efficient batch access -- PNG/JPEG for screenshots - **Packages**: `openadapt-capture`, `openadapt-privacy` ---- - ### Phase 2: LEARN (Policy Acquisition) **Purpose**: Transform demonstrations into executable agent policies. **Three Learning Paths**: -#### Path A: Retrieval-Augmented Prompting -- Index demonstrations in vector database -- At inference, retrieve similar demos as context -- Condition API agent (Claude, GPT, Gemini) on retrieved examples -- **Advantage**: Works with any VLM, no training required -- **Package**: `openadapt-retrieval` - -#### Path B: Fine-Tuning -- Train/fine-tune VLM on demonstration dataset -- Use LoRA for parameter-efficient training -- Deploy locally or via inference API -- **Advantage**: Specialized performance, privacy, lower inference cost -- **Package**: `openadapt-ml` - -#### Path C: Process Mining -- Extract reusable action patterns across demonstrations -- Build abstraction hierarchy (template, semantic, goal) -- Enable cross-task transfer learning -- **Status**: Research/Future -- **Package**: `openadapt-ml` (future) - -**Outputs**: -- Vector embeddings for retrieval -- Model checkpoints for fine-tuned models -- Process graphs for abstraction (future) - ---- +| Path | Mechanism | Advantage | Package | +|------|-----------|-----------|---------| +| **A: Retrieval-Augmented** | Index demos, retrieve similar | No training needed | `openadapt-retrieval` | +| **B: Fine-Tuning** | Train VLM on demo dataset | Specialized performance | `openadapt-ml` | +| **C: Process Mining** | Extract reusable patterns | Cross-task transfer | `openadapt-ml` (future) | ### Phase 3: EXECUTE (Agent Deployment) -**Purpose**: Run trained/conditioned agents to perform tasks autonomously. +**Purpose**: Run trained/conditioned agents autonomously. **Execution Loop**: - ``` while not task_complete: - 1. OBSERVE - - Capture current screenshot - - Extract accessibility tree - - Build observation state - - 2. GROUND - - Localize UI elements (bounding boxes) - - Apply Set-of-Mark (SoM) annotation - - Map elements to coordinates or IDs - - 3. PLAN - - Encode observation with VLM - - Condition on goal + history + retrieved demos - - Generate action prediction - - 4. ACT - - Parse action (click, type, scroll, etc.) - - Execute via input synthesis - - Record action for history - - 5. EVALUATE - - Check for success indicators - - Detect failure patterns - - Decide: continue, retry, or escalate + 1. OBSERVE - Capture screenshot + a11y tree + 2. GROUND - Localize UI elements (SoM, OmniParser) + 3. PLAN - VLM reasoning with demo context + 4. ACT - Execute via input synthesis + 5. 
EVALUATE - Check success, decide next step ``` -**Grounding Modes**: - -| Mode | Description | Accuracy | Use Case | -|------|-------------|----------|----------| -| **Direct** | VLM predicts raw (x, y) coordinates | Variable | Simple, fast | -| **Set-of-Mark (SoM)** | UI elements labeled with IDs, VLM selects ID | High | Complex UIs | -| **Hybrid** | SoM for elements, Direct for fine positioning | Highest | Production | - -**Packages**: `openadapt-grounding`, `openadapt-evals`, `openadapt-ml` +**Packages**: `openadapt-evals`, `openadapt-grounding`, `openadapt-ml` --- -## 4. Package Responsibilities +## 5. SOTA GUI Agent Integration -### Core Packages +### Policy/Grounding Separation -| Package | Phase | Responsibility | Key Exports | -|---------|-------|----------------|-------------| -| `openadapt-capture` | DEMONSTRATE | GUI recording, event capture, storage | `Recorder`, `CaptureSession`, `Action`, `Screenshot` | -| `openadapt-ml` | LEARN | Model training, inference, adapters | `Trainer`, `AgentPolicy`, `VLMAdapter` | -| `openadapt-evals` | EXECUTE | Benchmark evaluation, metrics | `BenchmarkAdapter`, `ApiAgent`, `evaluate_agent` | -| `openadapt-viewer` | Cross-cutting | HTML visualization, replay | `PageBuilder`, `HTMLBuilder`, `TrajectoryViewer` | - -### Optional Packages - -| Package | Phase | Responsibility | Key Exports | -|---------|-------|----------------|-------------| -| `openadapt-grounding` | EXECUTE | UI element localization | `OmniParser`, `Florence2`, `GeminiGrounder` | -| `openadapt-retrieval` | LEARN | Multimodal demo search | `DemoRetriever`, `VectorIndex`, `Embedder` | -| `openadapt-privacy` | DEMONSTRATE | PII/PHI scrubbing | `Scrubber`, `Redactor`, `PrivacyFilter` | - -### Package Dependency Matrix +From Claude Computer Use, UFO, and SeeAct research: ``` - capture ml evals viewer grounding retrieval privacy -openadapt-capture - - - - - - O -openadapt-ml R - - - O O - -openadapt-evals - R - O O O - -openadapt-viewer O O O - - - O -openadapt-grounding - - - - - - - -openadapt-retrieval R - - - - - - -openadapt-privacy - - - - - - - - -Legend: R = Required, O = Optional, - = None ++====================+ +====================+ +| | | | +| POLICY | --> | GROUNDING | +| | | | +| "What to do" | | "Where to do" | +| | | | +| - Observation | | - Element | +| encoding | | detection | +| - Action | | - Coordinate | +| selection | | mapping | +| - History | | - Bounding | +| context | | boxes | +| | | | ++====================+ +====================+ ``` ---- - -## 5. Feedback Loops +**OpenAdapt Implementation**: +- **Policy**: `openadapt-ml` adapters (Claude, GPT-4V, Qwen-VL) +- **Grounding**: `openadapt-grounding` providers (OmniParser, Florence2, Gemini) -OpenAdapt implements continuous improvement through three feedback loops. 
+### Set-of-Mark (SoM) Prompting -### System Diagram +From Microsoft's Set-of-Mark paper: ``` - DEMONSTRATE - | - | Human demonstrations - v -+--------------------------> LEARN <--------------------------+ -| | | -| | Trained policies | -| +--------------------------|---------------------+ | -| | v | | -| | +----------------> EXECUTE <--------------+ | | -| | | | | | | -| | | Retry on | Success/Failure | | | -| | | recoverable | outcomes | | | -| | | errors v | | | -| | | +-------+-------+ | | | -| | | | | | | | -| | +--------------+ EVALUATE +----------+ | | -| | | | | | -| | +-------+-------+ | | -| | | | | -| | | Execution traces | | -| | v | | -| | Demo library grows | | -| | | | | -| +--------------------------+ | | -| | | -| Failure analysis identifies gaps | | -| | | | -| v | | -| New demonstrations | | -| | | | -+--------------------+ | | - | | - Self-improvement loop | | - (execution traces -> training) | | - | | | - +----------------------+ | - | - Benchmark-driven development | - (eval results -> architecture improvements) | - | | - +--------------------------------+ +Original Screenshot SoM-Annotated Screenshot ++---------------------+ +---------------------+ +| [Login] [Help] | | [1] [2] | +| | -> | | +| Email: [________] | | Email: [3] | +| Pass: [________] | | Pass: [4] | +| [Submit] | | [5] | ++---------------------+ +---------------------+ + +Prompt: "Enter email in element [3], password in [4], click [5]" ``` -### Loop Details - -#### Loop 1: Demonstration Library Growth -- Successful executions are stored as new demonstrations -- Failed executions trigger gap analysis -- Human reviews and corrects failures -- Corrections become new training data +**OpenAdapt Implementation**: `openadapt-grounding.SoMPrompt` -#### Loop 2: Self-Improvement (Future) -- Agent traces its own execution -- Successful traces fine-tune the policy -- Automatic curriculum: easy to hard tasks -- Reduces need for human demonstrations over time +### Safety Gates -#### Loop 3: Benchmark-Driven Development -- Regular evaluation on standard benchmarks -- Failure modes inform architecture changes -- New capabilities tested before merge -- Regression detection prevents quality drops - ---- +From responsible AI patterns: -## 6. Model Layer - -OpenAdapt is model-agnostic, supporting multiple foundation models through a unified adapter interface. +``` ++------------------+ +------------------+ +------------------+ +| | | | | | +| OBSERVE | --> | VALIDATE | --> | ACT | +| | | | | | +| Get current | | - Check bounds | | Execute if | +| state | | - Verify perms | | validated | +| | | - Rate limit | | | ++------------------+ +--------+---------+ +------------------+ + | + v (rejected) + +------------------+ + | ESCALATE | + | Human review | + +------------------+ +``` -### Supported Models +**Status**: Planned in `openadapt-evals` safety module. 
-#### API Providers (Cloud) +### Research Alignment -| Provider | Model | Status | Best For | -|----------|-------|--------|----------| -| Anthropic | Claude 3.5 Sonnet | Implemented | General GUI tasks | -| OpenAI | GPT-4o | Implemented | Complex reasoning | -| Google | Gemini 2.0 Flash | Implemented | Cost-efficient | +| Research Paper | Key Contribution | OpenAdapt Integration | +|----------------|------------------|----------------------| +| **Claude Computer Use** (Anthropic, 2024) | Production VLM agent API | API adapter in `openadapt-ml` | +| **UFO** (Microsoft, 2024) | Windows agent architecture | Prompt patterns adopted | +| **OSWorld** (CMU, 2024) | Cross-platform benchmark | Benchmark adapter planned | +| **Set-of-Mark** (Microsoft, 2023) | Visual grounding via labels | Core grounding mode | +| **OmniParser** (Microsoft, 2024) | Pure-vision UI parsing | Provider in `openadapt-grounding` | +| **SeeAct** (OSU, 2024) | Grounded action generation | Action space design | +| **WebArena** (CMU, 2023) | Web automation benchmark | Benchmark adapter implemented | +| **AppAgent** (Tencent, 2024) | Mobile GUI agent | Mobile support planned | -#### Local Models (Self-Hosted) +--- -| Model | Parameters | Status | Best For | -|-------|------------|--------|----------| -| Qwen2-VL | 2B-72B | Implemented | Fine-tuning, privacy | -| Qwen2.5-VL | 3B-72B | Planned | Next-gen local | -| Molmo | 7B | Research | Efficiency | +## 6. Package Responsibilities -### Adapter Interface +### Package-to-Phase Mapping -```python -class VLMAdapter(Protocol): - """Protocol for VLM model adapters.""" - - def predict( - self, - screenshot: Image, - task: str, - history: list[Action], - context: Optional[list[Demo]] = None, - ) -> Action: - """Predict next action given observation.""" - ... - - def get_grounding( - self, - screenshot: Image, - element_description: str, - ) -> BoundingBox: - """Ground element description to coordinates.""" - ... 
``` - -### Prompt Architecture - -OpenAdapt uses a structured prompting approach combining SOTA patterns: - ++===============================================================================+ +| DEMONSTRATE PHASE | ++===============================================================================+ +| Package | Responsibility | Key Exports | ++-------------------+----------------------------+------------------------------+ +| openadapt-capture | GUI recording, storage | Recorder, CaptureSession | +| | | Action, Screenshot, Trajectory| ++-------------------+----------------------------+------------------------------+ +| openadapt-privacy | PII/PHI scrubbing | Scrubber, Redactor | +| | (integrates at capture) | PrivacyFilter | ++===============================================================================+ + ++===============================================================================+ +| LEARN PHASE | ++===============================================================================+ +| Package | Responsibility | Key Exports | ++---------------------+--------------------------+------------------------------+ +| openadapt-ml | Model training, | Trainer, AgentPolicy | +| | inference, adapters | QwenVLAdapter, ClaudeAdapter | ++---------------------+--------------------------+------------------------------+ +| openadapt-retrieval | Demo embedding, | DemoIndex, Embedder | +| | similarity search | SearchResult | ++===============================================================================+ + ++===============================================================================+ +| EXECUTE PHASE | ++===============================================================================+ +| Package | Responsibility | Key Exports | ++----------------------+-------------------------+------------------------------+ +| openadapt-evals | Benchmark evaluation, | BenchmarkAdapter, ApiAgent | +| | metrics collection | evaluate_agent_on_benchmark | ++----------------------+-------------------------+------------------------------+ +| openadapt-grounding | UI element detection, | ElementDetector, SoMPrompt | +| | coordinate mapping | OmniParser, GeminiGrounder | ++===============================================================================+ + ++===============================================================================+ +| CROSS-CUTTING | ++===============================================================================+ +| Package | Responsibility | Key Exports | ++-------------------+----------------------------+------------------------------+ +| openadapt-viewer | HTML visualization, | PageBuilder, HTMLBuilder | +| | trajectory replay | TrajectoryViewer | ++-------------------+----------------------------+------------------------------+ +| openadapt | Unified CLI, | cli.main, lazy imports | +| (meta-package) | dependency coordination | | ++===============================================================================+ ``` -SYSTEM: {role_definition} - -CONTEXT: -- Retrieved demonstrations (if available) -- Task description -- Success criteria -OBSERVATION: -- Current screenshot (base64 or URL) -- Accessibility tree (structured) -- Element annotations (Set-of-Mark) - -HISTORY: -- Previous N actions and their outcomes -- Current step number +### Package Dependency Matrix -INSTRUCTION: -- Action space definition -- Output format specification +``` + capture ml evals viewer grounding retrieval privacy +openadapt-capture - - - - - - O +openadapt-ml R - - - O O - +openadapt-evals - R - O O O - +openadapt-viewer 
O O O - - - O +openadapt-grounding - - - - - - - +openadapt-retrieval R - - - - - - +openadapt-privacy - - - - - - - -USER: What action should be taken next? +Legend: R = Required, O = Optional, - = None ``` --- -## 7. Implementation Status +## 7. Feedback Loops -### Status Legend - -| Symbol | Meaning | -|--------|---------| -| Solid | Implemented and tested | -| Dashed | In progress or partial | -| Dotted | Planned/Future | - -### Component Status Matrix +### System-Level Feedback Architecture ``` -+----------------------+------------------+------------------+ -| Component | Status | Package | -+----------------------+------------------+------------------+ -| DEMONSTRATE PHASE | -+----------------------+------------------+------------------+ -| Screen capture | Solid | capture | -| Event recording | Solid | capture | -| A11y tree capture | Solid | capture | -| Audio transcription | Dashed | capture | -| Privacy scrubbing | Solid | privacy | -| Demo library storage | Solid | capture | -+----------------------+------------------+------------------+ -| LEARN PHASE | -+----------------------+------------------+------------------+ -| Demo embedding | Solid | retrieval | -| Vector indexing | Solid | retrieval | -| Similarity search | Solid | retrieval | -| API model adapters | Solid | ml | -| Training pipeline | Dashed | ml | -| LoRA fine-tuning | Dashed | ml | -| Process mining | Dotted | ml (future) | -+----------------------+------------------+------------------+ -| EXECUTE PHASE | -+----------------------+------------------+------------------+ -| Action execution | Solid | capture | -| Direct grounding | Solid | grounding | -| SoM grounding | Solid | grounding | -| OmniParser provider | Solid | grounding | -| Florence2 provider | Solid | grounding | -| Gemini grounding | Solid | grounding | -| WAA benchmark | Solid | evals | -| WebArena benchmark | Dashed | evals | -| OSWorld benchmark | Dotted | evals | -| Mock benchmark | Solid | evals | -+----------------------+------------------+------------------+ -| CROSS-CUTTING | -+----------------------+------------------+------------------+ -| Viewer HTML output | Solid | viewer | -| Trajectory replay | Solid | viewer | -| Training dashboard | Dashed | viewer | -| Benchmark viewer | Dashed | viewer | -| Telemetry | Dotted | telemetry (new) | -+----------------------+------------------+------------------+ + DEMONSTRATE + | + | Human demonstrations + v ++-----------------------------> LEARN <----------------------------+ +| | | +| | Trained policies | +| +-----------------------------|---------------------+ | +| | v | | +| | +-----------------> EXECUTE <--------------+ | | +| | | | | | | +| | | Retry on | Success/Failure | | | +| | | recoverable | outcomes | | | +| | | errors v | | | +| | | +-------+-------+ | | | +| | | | | | | | +| | +---------------+ EVALUATE +-----------+ | | +| | (Loop 1: Retry) | | | | +| | +-------+-------+ | | +| | | | | +| | | Execution traces | | +| | v | | +| | Demo library grows | | +| | | | | +| +---------------------------+ | | +| (Loop 2: Library Growth) | | +| | | +| Failure analysis identifies gaps | | +| | | | +| v | | +| Human correction | | +| | | | ++--------------------+ | | +(Loop 3: Human-in-Loop) | | + | | + Self-improvement loop | | + (execution traces -> training) | | + | | | + +------------------------+ | + (Loop 4: Self-Improvement) | + | + Benchmark-driven development | + (eval results -> architecture improvements) | + | | + +-----------------------------------+ + (Loop 5: Benchmark-Driven) ``` -### 
Priority Roadmap - -#### P0 - This Week -- [x] Capture package with Recorder -- [x] Retrieval with embedding and search -- [x] Evals with WAA benchmark + mock -- [x] Grounding providers (OmniParser, Florence, Gemini) -- [x] Viewer component library -- [x] API baselines (Claude, GPT, Gemini) -- [ ] PyPI releases for all packages -- [ ] WAA baseline metrics - -#### P1 - Next 2 Weeks -- [ ] Fine-tuning pipeline validation -- [ ] Demo conditioning integration in evals -- [ ] Multi-track evaluation (Direct, ReAct, SoM) -- [ ] docs.openadapt.ai launch - -#### P2 - This Month -- [ ] Training dashboard in viewer -- [ ] WebArena benchmark integration -- [ ] Cloud GPU training (Lambda Labs) -- [ ] v1.0.0 meta-package release - -#### P3 - Future -- [ ] Process mining / abstraction -- [ ] Self-improvement from execution traces -- [ ] Multi-agent collaboration -- [ ] Active learning with human feedback -- [ ] OSWorld benchmark integration - ---- - -## 8. Architecture Diagrams - -### Master Architecture Diagram (Evolved) - -This diagram synthesizes the three-phase pipeline with all key concepts: demo-conditioned prompting, policy/grounding separation, safety gate, multi-source data ingestion, the abstraction ladder, and evaluation-driven feedback loops. - -```mermaid -flowchart TB - %% ═══════════════════════════════════════════════════════════════════════ - %% USER LAYER - %% ═══════════════════════════════════════════════════════════════════════ - subgraph UserLayer["User Layer"] - CLI["openadapt CLI"] - UI["Desktop/Web GUI"] - end - - %% ═══════════════════════════════════════════════════════════════════════ - %% MULTI-SOURCE DATA INGESTION - %% ═══════════════════════════════════════════════════════════════════════ - subgraph DataSources["Multi-Source Data Ingestion"] - direction LR - HUMAN["Human
Demonstrations"] - SYNTH["Synthetic
Data"]:::future - BENCH_DATA["Benchmark
Tasks"] - EXTERNAL["External
Datasets"]:::future - end - - %% ═══════════════════════════════════════════════════════════════════════ - %% PHASE 1: DEMONSTRATE (Observation Collection) - %% ═══════════════════════════════════════════════════════════════════════ - subgraph Phase1["DEMONSTRATE (Observation Collection)"] - direction TB - - subgraph CaptureLayer["Capture"] - REC["Recorder
openadapt-capture"] - A11Y["A11y Tree"] - SCREENSHOT["Screenshots"] - EVENTS["Input Events"] - - REC --> A11Y - REC --> SCREENSHOT - REC --> EVENTS - end - - subgraph PrivacyLayer["Privacy"] - SCRUB["Scrubber
openadapt-privacy"] - REDACT["PII/PHI Redaction"] - SCRUB --> REDACT - end - - STORE[("Demo Library
(JSON/Parquet)")] - - A11Y --> SCRUB - SCREENSHOT --> SCRUB - EVENTS --> SCRUB - REDACT --> STORE - end - - %% ═══════════════════════════════════════════════════════════════════════ - %% PHASE 2: LEARN (Policy Acquisition) - %% ═══════════════════════════════════════════════════════════════════════ - subgraph Phase2["LEARN (Policy Acquisition)"] - direction TB - - subgraph RetrievalPath["Path A: Retrieval-Augmented Prompting"] - EMB["Embedder
openadapt-retrieval"] - IDX[("Vector Index")] - SEARCH["Similarity Search"] - - EMB --> IDX - IDX --> SEARCH - end - - subgraph TrainingPath["Path B: Fine-Tuning"] - LOADER["Data Loader"] - TRAINER["Model Trainer
openadapt-ml"] - LORA["LoRA Adapters"] - CKPT[("Model Checkpoints")] - - LOADER --> TRAINER - TRAINER --> LORA - LORA --> CKPT - end - - subgraph MiningPath["Path C: Process Mining"]:::futureBlock - ABSTRACT["Abstractor"]:::future - PATTERNS["Pattern Library"]:::future - - ABSTRACT --> PATTERNS - end - end - - %% ═══════════════════════════════════════════════════════════════════════ - %% PHASE 3: EXECUTE (Agent Deployment) - %% ═══════════════════════════════════════════════════════════════════════ - subgraph Phase3["EXECUTE (Agent Deployment)"] - direction TB - - subgraph AgentLoop["Agent Execution Loop"] - OBS["1. OBSERVE
(Screenshot + A11y)"] - GROUND["2. GROUND
openadapt-grounding"] - PLAN["3. PLAN
(Demo-Conditioned Policy)"] - ACT["4. ACT
(Input Synthesis)"] - - OBS --> GROUND - GROUND --> PLAN - PLAN --> ACT - end - - subgraph SafetyGate["Safety Gate (Runtime Layer)"] - VALIDATE["Action Validation"] - RISK["Risk Assessment"] - CONFIRM["Human Confirm"]:::future - - VALIDATE --> RISK - RISK --> CONFIRM - end - - subgraph Evaluation["Evaluation"] - EVALS["Benchmark Runner
openadapt-evals"] - METRICS["Metrics
(Success, Steps, Time)"] - COMPARE["Model Comparison"] - - EVALS --> METRICS - METRICS --> COMPARE - end - - ACT --> VALIDATE - CONFIRM --> EVALS - end +### Feedback Loop Details - %% ═══════════════════════════════════════════════════════════════════════ - %% THE ABSTRACTION LADDER - %% ═══════════════════════════════════════════════════════════════════════ - subgraph AbstractionLadder["The Abstraction Ladder"] - direction TB - L0["Level 0: LITERAL
(Raw Events)
{ press: 'h', press: 'i' }"] - L1["Level 1: SYMBOLIC
(Semantic Actions)
{ type: 'hi bob' }"] - L2["Level 2: TEMPLATE
(Parameterized)
{ type: 'hi <name>' }"] - L3["Level 3: SEMANTIC
(Intent Recognition)
{ greet: user }"]:::future - L4["Level 4: GOAL
(Task Specification)
'Greet customer'"]:::future - - L0 -->|"Reduction"| L1 - L1 -->|"Anonymization"| L2 - L2 -.->|"Process Mining"| L3 - L3 -.->|"Goal Composition"| L4 - end - - %% ═══════════════════════════════════════════════════════════════════════ - %% MODEL LAYER (VLM Adapters) - %% ═══════════════════════════════════════════════════════════════════════ - subgraph Models["Model Layer (VLM Adapters)"] - direction LR - - subgraph CloudModels["Cloud APIs"] - CLAUDE["Claude 3.5"] - GPT["GPT-4o"] - GEMINI["Gemini 2.0"] - end - - subgraph LocalModels["Local Models"] - QWEN["Qwen2-VL"] - CUSTOM["Custom Fine-tuned"] - end - end - - %% ═══════════════════════════════════════════════════════════════════════ - %% VIEWER (Cross-Cutting) - %% ═══════════════════════════════════════════════════════════════════════ - subgraph Viewer["Cross-Cutting: Viewer"] - VIZ["Trajectory
Visualization"] - REPLAY["Demo
Replay"] - DASH["Training
Dashboard"]:::partialImpl - end +| Loop | Name | Trigger | Outcome | Status | +|------|------|---------|---------|--------| +| 1 | **Retry** | Recoverable error | Re-attempt action | **Implemented** | +| 2 | **Library Growth** | Successful execution | New demo added | **Implemented** | +| 3 | **Human-in-Loop** | Unrecoverable failure | Human correction -> demo | **Implemented** | +| 4 | **Self-Improvement** | Execution traces | Fine-tuning | **Research** | +| 5 | **Benchmark-Driven** | Eval metrics | Architecture changes | **Active** | - %% ═══════════════════════════════════════════════════════════════════════ - %% DATA FLOW CONNECTIONS - %% ═══════════════════════════════════════════════════════════════════════ - - %% User interactions - CLI --> REC - UI --> REC - CLI --> TRAINER - CLI --> EVALS - - %% Multi-source ingestion - HUMAN --> REC - SYNTH -.-> LOADER - BENCH_DATA --> EVALS - EXTERNAL -.-> LOADER +--- - %% Demo flow to learning - STORE --> EMB - STORE --> LOADER - STORE -.-> ABSTRACT - - %% ═══════════════════════════════════════════════════════════════════════ - %% DEMO-CONDITIONED PROMPTING (Core Innovation) - %% Retrieval used in BOTH training AND evaluation - %% ═══════════════════════════════════════════════════════════════════════ - SEARCH -->|"demo context
(training)"| PLAN - SEARCH -->|"demo context
(evaluation)"| EVALS - CKPT -->|"trained policy"| PLAN - PATTERNS -.->|"templates"| PLAN - - %% Model connections (Policy/Grounding Separation) - PLAN -->|"action prediction"| Models - GROUND -->|"element localization"| Models - - %% ═══════════════════════════════════════════════════════════════════════ - %% EVALUATION-DRIVEN FEEDBACK LOOPS - %% ═══════════════════════════════════════════════════════════════════════ - METRICS -->|"success traces
(new demos)"| STORE - METRICS -.->|"training signal
(self-improvement)"| TRAINER - COMPARE -->|"failure analysis"| UserLayer +## 8. Implementation Status - %% Viewer connections - STORE -.-> VIZ - STORE -.-> REPLAY - CKPT -.-> DASH - METRICS -.-> DASH +### What's Implemented vs Future Work - %% ═══════════════════════════════════════════════════════════════════════ - %% STYLING - %% ═══════════════════════════════════════════════════════════════════════ - - %% Layer colors - classDef userLayer fill:#E74C3C,stroke:#A93226,color:#fff - classDef dataSource fill:#16A085,stroke:#0E6655,color:#fff - classDef phase1 fill:#3498DB,stroke:#1A5276,color:#fff - classDef phase2 fill:#27AE60,stroke:#1E8449,color:#fff - classDef phase3 fill:#9B59B6,stroke:#6C3483,color:#fff - classDef models fill:#F39C12,stroke:#B7950B,color:#fff - classDef viewer fill:#1ABC9C,stroke:#148F77,color:#fff - classDef safetyGate fill:#E74C3C,stroke:#922B21,color:#fff - - %% Implementation status - classDef implemented fill:#2ECC71,stroke:#1E8449,color:#fff - classDef partialImpl fill:#F4D03F,stroke:#B7950B,color:#000 - classDef future fill:#95A5A6,stroke:#707B7C,color:#fff,stroke-dasharray: 5 5 - classDef futureBlock fill:#EAECEE,stroke:#95A5A6,stroke-dasharray: 5 5 - - %% Apply layer styles - class CLI,UI userLayer - class HUMAN,BENCH_DATA dataSource - class REC,A11Y,SCREENSHOT,EVENTS,SCRUB,REDACT,STORE phase1 - class EMB,IDX,SEARCH,LOADER,TRAINER,LORA,CKPT phase2 - class OBS,GROUND,PLAN,ACT,VALIDATE,RISK,EVALS,METRICS,COMPARE phase3 - class CLAUDE,GPT,GEMINI,QWEN,CUSTOM models - class VIZ,REPLAY viewer - - %% Apply abstraction ladder styles (implemented vs future) - class L0,L1,L2 implemented +``` ++==============================================================================+ +| IMPLEMENTED (Solid) | ++==============================================================================+ +| Component | Package | Notes | ++--------------------------+------------------+--------------------------------+ +| Screen capture | capture | macOS, Windows, Linux | +| Event recording | capture | Mouse, keyboard, touch | +| Event aggregation | capture | Literal -> Symbolic | +| A11y tree capture | capture | macOS, Windows | +| Demo storage | capture | JSON/Parquet/PNG | +| Privacy scrubbing | privacy | Presidio, AWS Comprehend | +| Demo embedding | retrieval | CLIP, SigLIP | +| Vector indexing | retrieval | FAISS, Annoy | +| Similarity search | retrieval | Top-k retrieval | +| API model adapters | ml | Claude, GPT-4V, Gemini | +| Element detection | grounding | OmniParser, Florence2 | +| SoM annotation | grounding | Numbered element labels | +| WAA benchmark | evals | Full integration | +| Mock benchmark | evals | Testing infrastructure | +| HTML visualization | viewer | Trajectory replay | +| Unified CLI | openadapt | capture/train/eval/view | ++==============================================================================+ + ++==============================================================================+ +| IN PROGRESS (Dashed) | ++==============================================================================+ +| Component | Package | Notes | ++--------------------------+------------------+--------------------------------+ +| Training pipeline | ml | Qwen-VL fine-tuning | +| LoRA adapters | ml | Parameter-efficient training | +| Template extraction | capture | Regex-based parameterization | +| WebArena benchmark | evals | Browser automation | +| Training dashboard | viewer | Loss/metrics visualization | +| Audio transcription | capture | Whisper integration | 
++--------------------------+------------------+--------------------------------+ + ++==============================================================================+ +| FUTURE WORK (Dotted) | ++==============================================================================+ +| Component | Package | Notes | ++--------------------------+------------------+--------------------------------+ +| Process mining | ml (future) | Semantic action discovery | +| Goal composition | ml (future) | High-level task planning | +| Self-improvement | ml (future) | Training on execution traces | +| OSWorld benchmark | evals | Cross-platform desktop | +| Multi-agent collaboration| ml (future) | Agent coordination | +| Active learning | ml (future) | Human feedback integration | +| Mobile platform | capture | iOS, Android | +| Safety gates | evals | Action validation layer | ++==============================================================================+ ``` -### Key Architectural Insights - -#### 1. Demo-Conditioned Prompting (Core Innovation) - -The diagram shows how **retrieval** feeds into BOTH: -- **Training path**: Similar demos condition the fine-tuning process -- **Evaluation path**: Retrieved demos provide in-context examples for API agents - -This "show, don't tell" approach improves first-action accuracy from 33% to 100%. - -#### 2. Policy/Grounding Separation +### Abstraction Ladder Implementation Status -The EXECUTE phase clearly separates: -- **Policy** (PLAN): Decides *what* action to take (uses VLM reasoning) -- **Grounding**: Determines *where* to execute (UI element localization via SoM, OmniParser, etc.) +| Level | Name | Status | Implementation | +|-------|------|--------|----------------| +| 0 | Literal | **Implemented** | Raw event recording in `capture` | +| 1 | Symbolic | **Implemented** | Event aggregation in `capture` | +| 2 | Template | **Partial** | Regex extraction in `capture` | +| 3 | Semantic | **Research** | LLM intent recognition | +| 4 | Goal | **Future** | Process mining | -#### 3. Safety Gate as Runtime Layer +--- -Before action execution, the Safety Gate provides: -- Action validation (sanity checks) -- Risk assessment (destructive action detection) -- Human confirmation (future: for high-risk actions) +## 9. Architecture Evolution Diagrams -#### 4. 
The Abstraction Ladder +### Era 1: Alpha Monolith (2023) -Progressive generalization from raw events to goals: -- **Implemented**: Literal -> Symbolic -> Template -- **Future**: Semantic -> Goal (requires process mining) +``` ++=========================================================================+ +| ALPHA ARCHITECTURE (2023) | ++=========================================================================+ +| | +| +------------------------------------------------------------------+ | +| | openadapt (monolithic) | | +| +------------------------------------------------------------------+ | +| | | | +| | +-------------+ +-------------+ +-------------+ | | +| | | record | -> | visualize | -> | replay | | | +| | +-------------+ +-------------+ +-------------+ | | +| | | | | | | +| | v v v | | +| | +-------------+ +-------------+ +------------------+ | | +| | | models | | plotting | | strategies/ | | | +| | | - Recording | | - HTML gen | | - base.py | | | +| | | - ActionEvt | | | | - naive.py | | | +| | | - Screenshot| | | | - vanilla.py | | | +| | | - WindowEvt | | | | - visual.py | | | +| | +-------------+ +-------------+ +------------------+ | | +| | | | | | +| | v v | | +| | +-------------+ +---------------+ | | +| | | db/ | | adapters/ | | | +| | | - SQLite | | - anthropic | | | +| | | - CRUD ops | | - openai | | | +| | +-------------+ | - replicate | | | +| | +---------------+ | | +| +------------------------------------------------------------------+ | +| | ++=========================================================================+ + +Characteristics: +- Single repository, single package +- Tightly coupled components +- Strategy pattern for replay variants +- SQLite + Alembic migrations +- Prompts embedded in code +``` -#### 5. Evaluation-Driven Feedback Loops +### Era 2: Transition (2024) -Three feedback mechanisms: -1. **Demo Library Growth**: Success traces become new training data -2. **Self-Improvement**: Training signal from execution metrics (future) -3. 
**Failure Analysis**: Human review of failed executions +``` ++=========================================================================+ +| TRANSITION ARCHITECTURE (2024) | ++=========================================================================+ +| | +| Legacy codebase frozen -> /legacy/ | +| | +| New modular packages designed: | +| | +| +-------------+ +-------------+ +-------------+ +-------------+ | +| | capture | | ml | | evals | | viewer | | +| +-------------+ +-------------+ +-------------+ +-------------+ | +| | privacy | | retrieval | | grounding | | +| +-------------+ +-------------+ +-------------+ | +| | +| Key changes: | +| - Separate PyPI packages | +| - Lazy imports for optional deps | +| - Unified CLI in meta-package | +| - Policy/grounding separation | +| - Benchmark-first development | +| | ++=========================================================================+ +``` ---- +### Era 3: Modern Meta-Package (2025+) -### Legacy Master Architecture Diagram +``` ++=========================================================================+ +| MODERN ARCHITECTURE (2025+) | ++=========================================================================+ +| | +| +------------------+ | +| | User Layer | | +| | CLI / Web UI | | +| +--------+---------+ | +| | | +| v | +| +------------------+ | +| | openadapt | | +| | (meta-package) | | +| +--------+---------+ | +| | | +| +------------------------+------------------------+ | +| | | | | | | +| v v v v v | +| +---------+ +---------+ +---------+ +---------+ +--------+ | +| | capture | | ml | | evals | | viewer | |optional| | +| +---------+ +---------+ +---------+ +---------+ +--------+ | +| | | | | | | +| v v v v v | +| +---------------------------------------------------------------+ | +| | Shared Interfaces | | +| | - Trajectory format (JSON/Parquet) | | +| | - Action space specification | | +| | - Observation schema | | +| | - Benchmark protocols | | +| +---------------------------------------------------------------+ | +| | | +| v | +| +---------------------------------------------------------------+ | +| | Model Layer | | +| | +----------+ +----------+ +----------+ +----------+ | | +| | | Claude | | GPT-4V | | Gemini | | Qwen-VL | | | +| | +----------+ +----------+ +----------+ +----------+ | | +| +---------------------------------------------------------------+ | +| | ++=========================================================================+ +``` -For reference, the previous architecture diagram: +### Full System Architecture (Mermaid) ```mermaid flowchart TB @@ -824,8 +815,8 @@ flowchart TB subgraph Phase1["DEMONSTRATE"] direction TB - REC[Recorder] - SCRUB[Privacy Scrubber] + REC[Recorder
openadapt-capture] + SCRUB[Privacy Scrubber
openadapt-privacy] STORE[(Demo Library)] REC --> SCRUB @@ -855,11 +846,11 @@ flowchart TB subgraph Phase3["EXECUTE"] direction TB - OBS[Observer] - GROUND[Grounder] - PLAN[Planner] - ACT[Actuator] - EVAL[Evaluator] + OBS[1. OBSERVE] + GROUND[2. GROUND
openadapt-grounding] + PLAN[3. PLAN
Demo-Conditioned] + ACT[4. ACT] + EVAL[5. EVALUATE
openadapt-evals] OBS --> GROUND GROUND --> PLAN @@ -874,7 +865,6 @@ flowchart TB GPT[GPT-4o] GEMINI[Gemini] QWEN[Qwen-VL] - CUSTOM[Custom] end subgraph Viewer["Cross-Cutting: Viewer"] @@ -900,8 +890,8 @@ flowchart TB TRAINER --> CKPT ABSTRACT --> PATTERNS - %% Execution flow - SEARCH -.->|context| PLAN + %% Execution flow (demo-conditioning) + SEARCH -->|demo context| PLAN CKPT -->|policy| PLAN PATTERNS -.->|templates| PLAN @@ -932,10 +922,51 @@ flowchart TB class EMB,IDX,SEARCH,LOADER,TRAINER,CKPT phase2 class ABSTRACT,PATTERNS future class OBS,GROUND,PLAN,ACT,EVAL phase3 - class CLAUDE,GPT,GEMINI,QWEN,CUSTOM models + class CLAUDE,GPT,GEMINI,QWEN models class VIZ,REPLAY,DASH viewer ``` +### Execution Loop Evolution + +``` +ALPHA: Strategy-Based MODERN: Policy/Grounding +================================ ================================ + ++------------------+ +------------------+ +| BaseReplay | | OBSERVE | +| Strategy | | (Screenshot + | +| | | A11y tree) | +| while True: | +--------+---------+ +| screenshot = | | +| take() | v +| action = | +------------------+ +| get_next() | ------> | GROUND | +| play(action) | | (Element detect | +| | | + SoM annotate)| ++------------------+ +--------+---------+ + | + v + +------------------+ + | PLAN | + | (VLM reasoning | + | + demo context)| + +--------+---------+ + | + v + +------------------+ + | ACT | + | (Input synth + | + | safety check) | + +--------+---------+ + | + v + +------------------+ + | EVALUATE | + | (Success check | + | + feedback) | + +------------------+ +``` + ### Package Responsibility Diagram ```mermaid @@ -1001,57 +1032,6 @@ flowchart LR class OA meta ``` -### Execution Loop Diagram - -```mermaid -stateDiagram-v2 - [*] --> Observe - - Observe --> Ground: screenshot + a11y - Ground --> Plan: elements + coordinates - Plan --> Act: action prediction - Act --> Evaluate: action result - - Evaluate --> Observe: continue - Evaluate --> Success: task complete - Evaluate --> Retry: recoverable error - Evaluate --> Escalate: unrecoverable - - Retry --> Observe - Escalate --> [*] - Success --> [*] - - note right of Observe - Capture screenshot - Extract a11y tree - Build observation - end note - - note right of Ground - Detect UI elements - Apply SoM labels - Get coordinates - end note - - note right of Plan - Encode with VLM - Retrieve similar demos - Generate action - end note - - note right of Act - Parse action type - Execute input - Record for history - end note - - note right of Evaluate - Check success - Detect failures - Decide next step - end note -``` - ### Feedback Loop Diagram ```mermaid @@ -1094,151 +1074,42 @@ flowchart TB --- -## 9. Key Design Principles - -### Principle 1: Model Agnostic - -OpenAdapt works with any VLM that can process images and generate text. - -**Implementation**: -- Adapter pattern for model integration -- Unified prompt format across providers -- Switchable at runtime via configuration - -**Rationale**: -- Avoid vendor lock-in -- Enable cost optimization -- Future-proof against model evolution - -### Principle 2: Demo-Conditioned - -Agents learn from human examples, not just prompts. - -**Implementation**: -- Retrieval-augmented prompting -- Fine-tuning on demonstration datasets -- Context windows include similar past examples - -**Rationale**: -- Captures implicit knowledge -- Reduces ambiguity -- Enables transfer learning - -### Principle 3: Abstraction-Aware - -Progress from literal replay to semantic understanding. 
- -**Implementation**: -- Abstraction ladder (literal -> symbolic -> template -> semantic -> goal) -- Incremental abstraction during processing -- Human-readable intermediate representations - -**Rationale**: -- Enables generalization -- Supports explanation and debugging -- Allows cross-application transfer - -### Principle 4: Evaluation-Driven - -Rigorous benchmarking on standard tasks. - -**Implementation**: -- WAA, WebArena, OSWorld benchmark integrations -- Automated regression detection -- Public leaderboard metrics - -**Rationale**: -- Objective progress measurement -- Community comparability -- Quality assurance - -### Principle 5: Privacy-First - -Optional PII/PHI scrubbing at every stage. - -**Implementation**: -- `openadapt-privacy` package -- Configurable scrubbing levels -- Local-only deployment option - -**Rationale**: -- Enterprise compliance (HIPAA, GDPR) -- User trust -- Responsible AI - -### Principle 6: Open Source - -MIT license, community-driven development. - -**Implementation**: -- All packages on GitHub -- Public roadmap and issues -- Contribution guidelines - -**Rationale**: -- Transparency -- Community innovation -- No vendor lock-in - ---- - -## 10. Research Alignment - -OpenAdapt's architecture aligns with and builds upon recent GUI agent research. - -### Key Research Papers - -| Paper | Contribution | OpenAdapt Integration | -|-------|--------------|----------------------| -| Claude Computer Use (Anthropic, 2024) | Production VLM agent API | API adapter in `openadapt-ml` | -| UFO (Microsoft, 2024) | Windows agent architecture | Prompt patterns adopted | -| OSWorld (CMU, 2024) | Cross-platform benchmark | Benchmark adapter planned | -| Set-of-Mark (Microsoft, 2023) | Visual grounding via labels | Core grounding mode | -| OmniParser (Microsoft, 2024) | Pure-vision UI parsing | Provider in `openadapt-grounding` | -| WebArena (CMU, 2023) | Web automation benchmark | Benchmark adapter implemented | -| Mind2Web (OSU, 2023) | Web action prediction | Dataset format compatible | - -### Research Contributions - -OpenAdapt contributes to the research community through: - -1. **Open Benchmark Infrastructure**: Standardized evaluation setup -2. **Demonstration Dataset Format**: Interoperable trajectory format -3. **Retrieval-Augmented Agents**: Demo conditioning research -4. **Grounding Comparison**: Multi-provider benchmarks -5. **Abstraction Research**: Process mining for GUI agents - ---- - -## 11. Future Directions +## 10. 
Future Directions ### Near-Term (Q1 2026) -- Complete fine-tuning pipeline validation -- Achieve competitive WAA benchmark scores -- Launch docs.openadapt.ai -- Release v1.0.0 meta-package +| Priority | Goal | Package | Status | +|----------|------|---------|--------| +| P0 | PyPI releases for all packages | all | In progress | +| P0 | WAA baseline metrics established | evals | Pending | +| P1 | Fine-tuning pipeline validated | ml | In progress | +| P1 | Demo conditioning in evals | evals + retrieval | Pending | +| P2 | docs.openadapt.ai launched | docs | Pending | ### Medium-Term (2026) -- Process mining implementation -- Self-improvement loop activation -- Multi-benchmark evaluation suite -- Enterprise deployment guides +| Goal | Description | +|------|-------------| +| **Process Mining** | Automatic extraction of semantic actions from demos | +| **Self-Improvement** | Training on successful execution traces | +| **Multi-Benchmark** | WebArena + OSWorld integration | +| **Enterprise Deployment** | Production deployment guides | ### Long-Term (2026+) -- Multi-agent collaboration -- Active learning with human feedback -- Mobile platform support -- Cross-platform transfer learning +| Goal | Description | +|------|-------------| +| **Cross-App Transfer** | Demos from Excel help with Google Sheets | +| **Multi-Agent** | Coordinated agents for complex workflows | +| **Active Learning** | Agents request human help strategically | +| **Mobile Platforms** | iOS and Android capture/replay | -### Research Agenda +### Research Questions -1. **Abstraction Hierarchy**: Can we automatically extract semantic actions from demonstrations? -2. **Transfer Learning**: How do demos from one app help in another? -3. **Active Learning**: When should the agent ask for human help? -4. **Explanation**: How do we make agent decisions interpretable? +1. **Abstraction Discovery**: Can we automatically extract semantic actions from literal event sequences? +2. **Transfer Learning**: How much does demo conditioning help across different applications? +3. **Explanation**: How do we make agent decisions interpretable to users? +4. **Safety**: What guardrails prevent harmful autonomous actions? 
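+
+To make the first question concrete, the toy sketch below (illustrative only, not part of any OpenAdapt package; `LiteralEvent`, `SemanticAction`, and `abstract` are hypothetical names) shows one way a run of literal click/type events could be collapsed into a single semantic `fill_field` action:
+
+```python
+from dataclasses import dataclass, field
+
+@dataclass
+class LiteralEvent:
+    kind: str        # "click" or "type"
+    target: str      # element under the cursor
+    text: str = ""   # typed characters, if any
+
+@dataclass
+class SemanticAction:
+    name: str
+    arguments: dict = field(default_factory=dict)
+
+def abstract(events: list[LiteralEvent]) -> list[SemanticAction]:
+    """Collapse each click-then-type run into a single 'fill_field' action."""
+    actions, i = [], 0
+    while i < len(events):
+        ev = events[i]
+        if ev.kind == "click" and i + 1 < len(events) and events[i + 1].kind == "type":
+            j = i + 1
+            typed = []
+            while j < len(events) and events[j].kind == "type":
+                typed.append(events[j].text)
+                j += 1
+            actions.append(SemanticAction("fill_field", {"field": ev.target, "value": "".join(typed)}))
+            i = j
+        else:
+            actions.append(SemanticAction(ev.kind, {"target": ev.target}))
+            i += 1
+    return actions
+```
+
+Abstraction discovery would replace the hand-written rule above with patterns mined automatically from many demonstrations.
+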
--- @@ -1246,12 +1117,13 @@ OpenAdapt contributes to the research community through: | Term | Definition | |------|------------| -| **A11y Tree** | Accessibility tree - structured representation of UI elements | +| **A11y Tree** | Accessibility tree - structured UI element representation | | **Demo** | Recorded human demonstration (trajectory) | -| **Grounding** | Mapping text descriptions to UI coordinates | +| **Grounding** | Mapping text/intent to specific UI coordinates | | **LoRA** | Low-Rank Adaptation - efficient fine-tuning method | -| **SoM** | Set-of-Mark - visual grounding via element labels | -| **Trajectory** | Sequence of observations and actions | +| **Policy** | Decision function mapping observations to actions | +| **SoM** | Set-of-Mark - visual grounding via numbered labels | +| **Trajectory** | Sequence of (observation, action) pairs | | **VLM** | Vision-Language Model | | **WAA** | Windows Agent Arena benchmark | @@ -1259,14 +1131,15 @@ OpenAdapt contributes to the research community through: - [Architecture Overview](./architecture.md) - Package structure and data flow - [Roadmap Priorities](./roadmap-priorities.md) - Current development priorities -- [Telemetry Design](./design/telemetry-design.md) - Telemetry implementation -- [Landing Page Strategy](./design/landing-page-strategy.md) - Messaging and positioning +- [Package Documentation](./packages/index.md) - Individual package guides +- [Legacy Freeze](./legacy/freeze.md) - Migration from monolith ## Appendix C: Version History | Version | Date | Changes | |---------|------|---------| -| 2.0 | Jan 2026 | Comprehensive redesign, SOTA alignment | +| 3.0 | Jan 2026 | Alpha vision synthesis, evolution diagrams, SOTA alignment | +| 2.0 | Jan 2026 | Comprehensive redesign, modular architecture | | 1.0 | Dec 2025 | Initial modular architecture | | 0.x | 2023-2024 | Legacy monolithic design | diff --git a/docs/architecture.md b/docs/architecture.md index 6604fbb5e..664eddcdc 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -79,26 +79,26 @@ flowchart TB ```mermaid flowchart LR - subgraph Record["1. Record"] - A[User Demo] --> B[Capture Session] - B --> C[Screenshots + Events] + subgraph Demonstrate["1. Demonstrate"] + A[Human Trajectory] --> B[Capture Session] + B --> C[Observations + Actions] end subgraph Store["2. Store"] C --> D[JSON/Parquet Files] - D --> E[Demo Library] + D --> E[Demonstration Library] end - subgraph Train["3. Train"] - E --> F[Data Loading] - F --> G[Model Training] + subgraph Learn["3. Learn"] + E --> F[Trajectory Abstraction] + F --> G[Policy Learning] G --> H[Checkpoint] end - subgraph Deploy["4. Deploy"] - H --> I[Agent Policy] + subgraph Execute["4. Execute"] + H --> I[Trained Policy] I --> J[Inference] - J --> K[Action Replay] + J --> K[Agent Deployment] end subgraph Evaluate["5. 
Evaluate"] @@ -164,17 +164,17 @@ graph TD | Package | Responsibility | Key Exports | |---------|---------------|-------------| -| **openadapt-capture** | GUI recording, event capture, storage | `CaptureSession`, `Recorder`, `Action` | -| **openadapt-ml** | Model training, inference, adapters | `QwenVLAdapter`, `Trainer`, `AgentPolicy` | +| **openadapt-capture** | Demonstration collection, observation-action capture, storage | `CaptureSession`, `Recorder`, `Action` | +| **openadapt-ml** | Policy learning, training, inference | `QwenVLAdapter`, `Trainer`, `AgentPolicy` | | **openadapt-evals** | Benchmark evaluation, metrics | `ApiAgent`, `BenchmarkAdapter`, `evaluate_agent_on_benchmark` | -| **openadapt-viewer** | HTML visualization, replay viewer | `PageBuilder`, `HTMLBuilder` | +| **openadapt-viewer** | Trajectory visualization | `PageBuilder`, `HTMLBuilder` | ### Optional Packages | Package | Responsibility | Use Case | |---------|---------------|----------| -| **openadapt-grounding** | UI element localization | Improved click accuracy with element detection | -| **openadapt-retrieval** | Multimodal demo search | Find similar demonstrations for few-shot prompting | +| **openadapt-grounding** | UI element grounding | Improved action accuracy with element detection | +| **openadapt-retrieval** | Multimodal trajectory search | Find similar demonstrations for few-shot policy learning | | **openadapt-privacy** | PII/PHI scrubbing | Redact sensitive data before storage/training | ## Evaluation Loop @@ -275,14 +275,14 @@ graph LR pip install openadapt # Individual packages -pip install openadapt[capture] # GUI capture/recording -pip install openadapt[ml] # ML training and inference +pip install openadapt[capture] # Demonstration collection +pip install openadapt[ml] # Policy learning and inference pip install openadapt[evals] # Benchmark evaluation -pip install openadapt[viewer] # HTML visualization +pip install openadapt[viewer] # Trajectory visualization # Optional packages -pip install openadapt[grounding] # UI element localization -pip install openadapt[retrieval] # Demo search/retrieval +pip install openadapt[grounding] # UI element grounding +pip install openadapt[retrieval] # Trajectory retrieval pip install openadapt[privacy] # PII/PHI scrubbing # Bundles diff --git a/docs/assets/architecture-diagram.png b/docs/assets/architecture-diagram.png index 1232c988d..2a8ebf1d3 100644 Binary files a/docs/assets/architecture-diagram.png and b/docs/assets/architecture-diagram.png differ diff --git a/docs/cli.md b/docs/cli.md index 996cbf21b..02ac8bdc4 100644 --- a/docs/cli.md +++ b/docs/cli.md @@ -42,11 +42,11 @@ This verifies: ## Capture Commands -Commands for recording user demonstrations. +Commands for collecting human demonstrations. ### capture start -Start a new recording session. +Start a new demonstration collection session. ```bash openadapt capture start --name [options] @@ -64,25 +64,25 @@ openadapt capture start --name [options] **Examples:** ```bash -# Basic recording +# Basic demonstration collection openadapt capture start --name login-task -# Recording without screenshots +# Demonstration collection without screenshots openadapt capture start --name audio-task --no-screenshots -# Recording with slower screenshot interval +# Demonstration collection with slower screenshot interval openadapt capture start --name slow-task --interval 1.0 ``` ### capture stop -Stop the current recording. +Stop the current demonstration collection. 
```bash openadapt capture stop ``` -Alternatively, press `Ctrl+C` in the recording terminal. +Alternatively, press `Ctrl+C` in the capture terminal. ### capture list @@ -103,7 +103,7 @@ form-fill 89 5m 42s 2026-01-14 ### capture view -Open the viewer for a capture. +Open the trajectory viewer for a demonstration. ```bash openadapt capture view [options] @@ -113,13 +113,13 @@ openadapt capture view [options] | Argument | Required | Description | |----------|----------|-------------| -| `` | Yes | Name of the capture to view | +| `` | Yes | Name of the demonstration to view | | `--port` | No | Server port (default: 8080) | | `--no-browser` | No | Don't open browser automatically | ### capture delete -Delete a capture. +Delete a demonstration. ```bash openadapt capture delete @@ -129,11 +129,11 @@ openadapt capture delete ## Train Commands -Commands for training ML models. +Commands for policy learning from demonstrations. ### train start -Start training a model on a capture. +Start policy learning from a demonstration. ```bash openadapt train start --capture --model [options] @@ -143,7 +143,7 @@ openadapt train start --capture --model [options] | Argument | Required | Description | |----------|----------|-------------| -| `--capture` | Yes | Name of the capture to train on | +| `--capture` | Yes | Name of the demonstration to train on | | `--model` | Yes | Model architecture | | `--epochs` | No | Number of training epochs (default: 10) | | `--batch-size` | No | Batch size (default: 4) | @@ -159,10 +159,10 @@ openadapt train start --capture --model [options] **Examples:** ```bash -# Basic training +# Basic policy learning openadapt train start --capture login-task --model qwen3vl-2b -# Training with custom parameters +# Policy learning with custom parameters openadapt train start \ --capture login-task \ --model qwen3vl-7b \ @@ -173,7 +173,7 @@ openadapt train start \ ### train status -Check training progress. +Check policy learning progress. ```bash openadapt train status @@ -191,7 +191,7 @@ ETA: 15 minutes ### train stop -Stop the current training. +Stop the current policy learning. ```bash openadapt train stop diff --git a/docs/design/landing-page-strategy.md b/docs/design/landing-page-strategy.md new file mode 100644 index 000000000..ca6e31556 --- /dev/null +++ b/docs/design/landing-page-strategy.md @@ -0,0 +1,712 @@ +# OpenAdapt.ai Landing Page Strategy + +**Document Version**: 1.0 +**Date**: January 2026 +**Author**: Generated with AI assistance +**Status**: Proposal for Review + +--- + +## Table of Contents + +1. [Current State Analysis](#1-current-state-analysis) +2. [Target Audience Definitions](#2-target-audience-definitions) +3. [Core Messaging Strategy](#3-core-messaging-strategy) +4. [Competitive Positioning](#4-competitive-positioning) +5. [Page Section Recommendations](#5-page-section-recommendations) +6. [Copy Suggestions](#6-copy-suggestions) +7. [Wireframe Concepts](#7-wireframe-concepts) +8. [Social Proof Strategy](#8-social-proof-strategy) +9. [Call-to-Action Strategy](#9-call-to-action-strategy) +10. [Implementation Priorities](#10-implementation-priorities) + +--- + +## 1. Current State Analysis + +### 1.1 What OpenAdapt IS Today + +OpenAdapt has evolved from a monolithic application (v0.46.0) to a **modular meta-package architecture** (v1.0+). This is a significant architectural maturation that should be reflected in messaging. 
+ +**Core Value Proposition (Current Reality)**: +- The **open** source software **adapt**er between Large Multimodal Models (LMMs) and desktop/web GUIs +- Record demonstrations, train models, evaluate agents via unified CLI +- Works with any VLM: Claude, GPT-4V, Gemini, Qwen, or custom fine-tuned models + +**Technical Differentiators (Verified)**: +1. **Model Agnostic**: Not locked to one AI provider +2. **Demo-Prompted, Not User-Prompted**: Learn from human demonstration, not complex prompt engineering +3. **Universal GUI Support**: Native apps, web browsers, virtualized environments +4. **Open Source (MIT License)**: Full transparency, no vendor lock-in + +**Key Innovation**: +- **Trajectory-conditioned disambiguation of UI affordances** - validated experiment showing 33% -> 100% first-action accuracy with demo conditioning +- **Set-of-Marks (SoM) mode**: 100% accuracy on synthetic benchmarks using element IDs instead of coordinates + +### 1.2 Current Landing Page Assessment + +**What's Working Well**: +- Clean, professional design with dark theme +- Video demo at hero section +- GitHub star/fork buttons for social proof +- Platform-specific installation instructions (auto-detects OS) +- PyPI download statistics showing traction +- Industry use cases grid (HR, Law, Insurance, etc.) +- Email signup for updates + +**What's Missing or Unclear**: +1. **No clear "what is this?"** - Visitors need to watch a video to understand +2. **Tagline "AI for Desktops" is vague** - Doesn't differentiate from competitors +3. **No comparison to alternatives** - Why choose OpenAdapt over Anthropic Computer Use? +4. **No technical credibility indicators** - No benchmark scores, no research citations +5. **Industry grid is generic** - Same features could apply to any automation tool +6. **No developer/researcher angle** - Focuses only on end-user automation +7. **Architecture transition is hidden** - v1.0+ modular design is a major selling point +8. **No clear "Who is this for?"** - Tries to appeal to everyone, resonates with no one + +**Carousel Messages Analysis**: +- "Show, don't tell." - Good but cryptic +- "Perform, don't prompt." - Best differentiator, should be prominent +- "Record, replay, and share." - Functional but not compelling + +### 1.3 Technical Accuracy Issues + +The current site doesn't reflect: +- The modular package architecture (7 focused sub-packages) +- The evaluation infrastructure (WAA, WebArena benchmarks) +- The ML training capabilities (VLM fine-tuning, LoRA) +- The retrieval-augmented prompting (demo library search) +- The privacy scrubbing capabilities (PII/PHI redaction) + +--- + +## 2. Target Audience Definitions + +### 2.1 Primary Audiences + +#### A. Developers Building Automation Agents + +**Profile**: +- Building AI-powered tools that interact with GUIs +- May be creating internal tools, startup products, or client solutions +- Comfortable with Python, CLI tools, ML concepts +- Want flexibility, not black-box solutions + +**Pain Points**: +- API-only agents (Claude Computer Use) lack customization +- Building from scratch is too slow +- Need to run locally for privacy/security +- Want to fine-tune models on specific workflows + +**What They Need to See**: +- Clear architecture diagrams +- Code examples (pip install, quick start) +- Benchmark scores vs. alternatives +- Extensibility points (adapters, plugins) + +**Key Message**: "The open source SDK for building GUI automation agents" + +#### B. 
Enterprise Process Automation Buyers + +**Profile**: +- Looking to automate repetitive knowledge work +- Concerned about security, privacy, compliance +- Need to justify ROI and integrate with existing systems +- Often have IT/security review requirements + +**Pain Points**: +- Existing RPA is brittle and expensive to maintain +- Cloud-only AI raises data privacy concerns +- Need clear enterprise support options +- Require compliance with industry regulations + +**What They Need to See**: +- Privacy features (PII/PHI scrubbing) +- On-premise deployment options +- Enterprise support/contact information +- Industry-specific use case studies +- Security architecture information + +**Key Message**: "AI-first automation that runs where your data lives" + +#### C. ML Researchers Studying GUI Agents + +**Profile**: +- Academic researchers or industry R&D teams +- Working on VLM capabilities, agent architectures, benchmarks +- Need reproducible baselines and evaluation infrastructure +- Want to contribute to or build upon open research + +**Pain Points**: +- Existing benchmarks are hard to set up +- Need standardized evaluation metrics +- Want to compare models fairly +- Limited open-source alternatives to proprietary agent frameworks + +**What They Need to See**: +- Benchmark integration (WAA, WebArena, OSWorld) +- Published metrics and methodology +- Research paper citations (if any) +- Clear contribution pathways +- Schema/data format documentation + +**Key Message**: "Open infrastructure for GUI agent research and benchmarking" + +#### D. ML Engineers Interested in VLM Fine-Tuning + +**Profile**: +- Want to train custom models for specific GUI tasks +- Familiar with training infrastructure (LoRA, PEFT, etc.) +- Looking for training data and pipelines +- Want efficient local or cloud training options + +**Pain Points**: +- Collecting GUI interaction data is tedious +- Setting up VLM training pipelines is complex +- Need baselines to compare against +- Cloud GPU costs add up quickly + +**What They Need to See**: +- Training pipeline documentation +- Supported models (Qwen3-VL, etc.) +- Training results (before/after fine-tuning) +- Cloud GPU integration (Lambda Labs, Azure) +- Data format specifications + +**Key Message**: "Record demonstrations, train specialized GUI agents" + +### 2.2 Audience Prioritization + +For the landing page, prioritize in this order: +1. **Developers** (highest volume, most likely to convert to users/contributors) +2. **Enterprise buyers** (revenue potential, require dedicated section) +3. **ML engineers** (overlaps with developers, training angle) +4. **Researchers** (smaller audience, but important for credibility) + +--- + +## 3. Core Messaging Strategy + +### 3.1 Primary Tagline Options + +**Option A (Recommended)**: +> **"Teach AI to use any software."** + +Why: Simple, benefit-focused, implies the key differentiator (demonstration-based learning) + +**Option B**: +> **"The open source adapter between AI and any GUI."** + +Why: Explains the technical position, highlights open source + +**Option C**: +> **"Perform, don't prompt."** + +Why: Clever contrast to prompt engineering, memorable + +**Option D**: +> **"Record. Train. Automate."** + +Why: Clear 3-step process, action-oriented + +### 3.2 Supporting Taglines (Subheadlines) + +- "Show AI how to do a task once. Let it handle the rest." +- "From human demonstration to AI automation in minutes." +- "Open source GUI automation with the AI model of your choice." 
+- "Works with Claude, GPT-4V, Gemini, Qwen, or your own fine-tuned models." + +### 3.3 Key Differentiators to Emphasize + +1. **Demonstration-Based Learning** + - Not: "Use natural language to describe tasks" + - But: "Just do the task and OpenAdapt learns from watching" + - Proof: 33% -> 100% first-action accuracy with demo conditioning + +2. **Model Agnostic** + - Not: "Works with [specific AI]" + - But: "Your choice: Claude, GPT-4V, Gemini, Qwen, or custom models" + - Proof: Adapters for multiple VLM backends + +3. **Runs Anywhere** + - Not: "Cloud-powered automation" + - But: "Run locally, in the cloud, or hybrid" + - Proof: CLI-based, works offline + +4. **Open Source** + - Not: "Try our free tier" + - But: "MIT licensed, fully transparent, community-driven" + - Proof: GitHub, PyPI, active Discord + +### 3.4 Messaging Framework + +**For Developers**: +> "Build GUI automation agents with a modular Python SDK. Record demonstrations, train models, evaluate on benchmarks. Works with any VLM." + +**For Enterprise**: +> "AI-first process automation that learns from your team. Privacy-first architecture with PII/PHI scrubbing. Deploy where your data lives." + +**For Researchers**: +> "Open infrastructure for GUI agent research. Standardized benchmarks, reproducible baselines, extensible architecture." + +**For ML Engineers**: +> "Fine-tune VLMs on real GUI workflows. Record data, train with LoRA, evaluate accuracy. Local or cloud training." + +--- + +## 4. Competitive Positioning + +### 4.1 Primary Competitors + +| Competitor | Strengths | Weaknesses | Our Advantage | +|------------|-----------|------------|---------------| +| **Anthropic Computer Use** | First-mover, Claude integration, simple API | Proprietary, cloud-only, no customization | Open source, model-agnostic, trainable | +| **UI-TARS (ByteDance)** | Strong benchmark scores, research backing | Closed source, not productized | Open source, deployable, extensible | +| **Traditional RPA (UiPath, etc.)** | Enterprise-proven, large ecosystems | Brittle selectors, no AI reasoning, expensive | AI-first, learns from demos, affordable | +| **GPT-4V + Custom Code** | Powerful model, flexibility | Requires building everything, no structure | Ready-made SDK, training pipeline, benchmarks | + +### 4.2 Positioning Statement + +> "OpenAdapt is the **open source alternative** to proprietary GUI automation APIs. Unlike cloud-only solutions, OpenAdapt lets you **train custom models** on your workflows and **deploy anywhere**. Unlike traditional RPA, OpenAdapt uses **AI reasoning** and **learns from demonstrations** instead of brittle scripts." + +### 4.3 Comparison Talking Points + +**vs. Anthropic Computer Use**: +- "Model-agnostic - not locked to one provider" +- "Fine-tune on your specific workflows" +- "Run locally for privacy-sensitive data" +- "Open source with community contributions" + +**vs. Traditional RPA**: +- "AI understands intent, not just element selectors" +- "Adapts to UI changes without manual updates" +- "Learn from demonstrations, not scripting" +- "Fraction of the cost, faster to deploy" + +--- + +## 5. Page Section Recommendations + +### 5.1 Proposed Page Structure + +1. **Hero Section** (Above the fold) +2. **How It Works** (3-step process) +3. **Key Differentiators** (3-4 value props) +4. **For Developers** (SDK/CLI features) +5. **For Enterprise** (Security, privacy, support) +6. **Use Cases** (Specific, concrete examples) +7. **Comparison** (Why OpenAdapt) +8. **Social Proof** (Metrics, testimonials, logos) +9. 
**Getting Started** (Install, docs, community) +10. **Footer** (Links, legal, contact) + +### 5.2 Hero Section Redesign + +**Current**: "OpenAdapt.AI - AI for Desktops. Automate your workflows. No coding required." + +**Proposed**: + +``` +[Logo] OpenAdapt.AI + +# Teach AI to use any software. + +Show it once. Let it handle the rest. + +[Video Demo - Keep current] + +[Install in 30 seconds] [View on GitHub] [Join Discord] + +"Works with Claude, GPT-4V, Gemini, Qwen, or your own fine-tuned models" + +{GitHub Stars} {PyPI Downloads} {Discord Members} +``` + +### 5.3 How It Works Section + +**Current**: Carousel with "Show, don't tell" / "Perform, don't prompt" / "Record, replay, share" + +**Proposed**: Clear 3-step process with visuals + +``` +## How OpenAdapt Works + +1. RECORD + [Icon: Screen recording] + Demonstrate the task by doing it yourself. + OpenAdapt captures screenshots, mouse clicks, and keystrokes. + +2. TRAIN + [Icon: Neural network] + Train an AI model on your demonstration. + Fine-tune Qwen-VL, use Claude/GPT-4V, or bring your own model. + +3. DEPLOY + [Icon: Play button] + Run the trained agent to automate the task. + Evaluate with standardized benchmarks. +``` + +### 5.4 Differentiators Section + +``` +## Why OpenAdapt? + +### Demonstration-Based Learning +No prompt engineering required. OpenAdapt learns from how you actually do tasks. +[Stat: 33% -> 100% first-action accuracy with demo conditioning] + +### Model Agnostic +Your choice of AI: Claude, GPT-4V, Gemini, Qwen-VL, or fine-tune your own. +Not locked to any single provider. + +### Run Anywhere +CLI-based, works offline. Deploy locally, in the cloud, or hybrid. +Your data stays where you want it. + +### Fully Open Source +MIT licensed. Transparent, auditable, community-driven. +No vendor lock-in, ever. +``` + +### 5.5 For Developers Section + +``` +## Built for Developers + +### Modular Architecture +Seven focused packages you can install individually: +- openadapt-capture: Recording +- openadapt-ml: Training & inference +- openadapt-evals: Benchmarking +- openadapt-viewer: Visualization +- openadapt-grounding: UI element detection +- openadapt-retrieval: Demo library search +- openadapt-privacy: PII/PHI scrubbing + +### Quick Start +```bash +# Install +pip install openadapt[all] + +# Record a demonstration +openadapt capture start --name my-task + +# Train a model +openadapt train start --capture my-task --model qwen3vl-2b + +# Evaluate +openadapt eval run --checkpoint model.pt --benchmark waa +``` + +### Benchmark Ready +Integrated with Windows Agent Arena (WAA), WebArena, and OSWorld. +Compare your models against published baselines. + +[View Documentation] [GitHub Repository] +``` + +### 5.6 For Enterprise Section + +``` +## Enterprise-Ready Automation + +### Privacy First +Built-in PII/PHI scrubbing with AWS Comprehend, Microsoft Presidio, or Private AI. +Your sensitive data never leaves your infrastructure. + +### Deploy Your Way +Run entirely on-premise, in your cloud, or hybrid. +No data leaves your environment unless you want it to. + +### Compliance Ready +Audit logging, reproducible recordings, explainable AI decisions. +Built for regulated industries. + +### Enterprise Support +Custom development, training, and support packages available. 
+ +[Contact Sales: sales@openadapt.ai] +``` + +### 5.7 Use Cases Section (Refined) + +**Current**: Generic industry grid + +**Proposed**: Specific, concrete use cases with workflows + +``` +## Real-World Automation + +### Data Entry Across Systems +Transfer information between applications that don't integrate. +Example: Copy customer data from CRM to billing system. + +### Report Generation +Compile data from multiple sources into standardized reports. +Example: Monthly sales reports from Salesforce + Excel + internal tools. + +### Legacy System Integration +Automate workflows in applications without APIs. +Example: Mainframe data entry, proprietary healthcare systems. + +### Quality Assurance Testing +Record manual test procedures, replay with validation. +Example: Regression testing across UI updates. + +### Process Documentation +Record workflows to create training materials automatically. +Example: Onboarding guides for complex internal tools. +``` + +--- + +## 6. Copy Suggestions + +### 6.1 Headlines + +| Section | Headline | Subheadline | +|---------|----------|-------------| +| Hero | "Teach AI to use any software." | "Show it once. Let it handle the rest." | +| How It Works | "Three Steps to Automation" | "Record, train, deploy." | +| Differentiators | "Why OpenAdapt?" | "Open source, model-agnostic, demonstration-based." | +| Developers | "Built for Developers" | "A modular SDK for building GUI automation agents." | +| Enterprise | "Enterprise-Ready" | "AI automation that runs where your data lives." | +| Use Cases | "Automate Any Workflow" | "From data entry to testing to legacy integration." | +| Install | "Get Started in 30 Seconds" | "One command installs everything you need." | + +### 6.2 CTAs (Calls to Action) + +| Context | Primary CTA | Secondary CTA | +|---------|-------------|---------------| +| Hero | "Get Started" | "View Demo" | +| Developers | "View Documentation" | "Star on GitHub" | +| Enterprise | "Contact Sales" | "Download Whitepaper" | +| Footer | "Join Discord" | "View on GitHub" | + +### 6.3 Proof Points to Include + +- "33% -> 100% first-action accuracy with demonstration conditioning" +- "[X,XXX] PyPI downloads this month" (dynamic) +- "[XXX] GitHub stars" (dynamic) +- "7 modular packages, 1 unified CLI" +- "Integrated with Windows Agent Arena, WebArena, OSWorld benchmarks" +- "MIT licensed, fully open source" + +--- + +## 7. Wireframe Concepts + +### 7.1 Desktop Layout + +``` ++------------------------------------------------------------------+ +| [Logo] [Docs] [GitHub] [Discord] [Enterprise] | ++------------------------------------------------------------------+ +| | +| # Teach AI to use any software. | +| Show it once. Let it handle the rest. | +| | +| [==================== Video Demo ====================] | +| | +| [Get Started] [View on GitHub] | +| | +| Works with: [Claude] [GPT-4V] [Gemini] [Qwen] [Custom] | +| | +| [GitHub Stars] [PyPI Downloads] [Discord Members] | +| | ++------------------------------------------------------------------+ +| | +| ## How OpenAdapt Works | +| | +| [1. RECORD] [2. TRAIN] [3. DEPLOY] | +| [Screenshot] [Neural Net] [Automation] | +| Demonstrate Train on your Run the agent | +| the task. demonstration. to automate. | +| | ++------------------------------------------------------------------+ +| | +| ## Why OpenAdapt? | +| | +| [Demo-Based] [Model Agnostic] [Run Anywhere] [Open Source] | +| Learn from Your choice of Local, cloud, MIT licensed | +| examples. AI provider. or hybrid. forever. 
| +| | ++------------------------------------------------------------------+ +| | +| [For Developers Tab] [For Enterprise Tab] [For Researchers Tab]| +| | +| Content switches based on selected audience... | +| | ++------------------------------------------------------------------+ +| | +| ## Get Started | +| | +| [macOS] [Windows] [Linux] | +| | +| $ curl -LsSf https://astral.sh/uv/install.sh | sh | +| $ uv tool install openadapt | +| $ openadapt --help | +| | +| [X,XXX installs this month] | +| | ++------------------------------------------------------------------+ +| | +| [Footer: Links, Social, Legal] | +| | ++------------------------------------------------------------------+ +``` + +### 7.2 Mobile Considerations + +- Stack hero elements vertically +- Collapse model logos into scrollable row +- Use accordion for audience tabs +- Keep video demo prominent +- Simplify code blocks (single command with copy button) + +--- + +## 8. Social Proof Strategy + +### 8.1 Metrics to Display + +**Live Metrics** (fetch from APIs): +- GitHub stars (currently showing, keep) +- PyPI downloads per month (currently showing, keep) +- Discord member count (add if available) +- Number of GitHub contributors (add) + +**Static Metrics** (update manually): +- "7 modular packages" +- "100% synthetic benchmark accuracy (SoM mode)" +- "3 benchmark integrations (WAA, WebArena, OSWorld)" + +### 8.2 Testimonials Strategy + +**Priority Order**: +1. Named enterprise user quotes (if available) +2. Named developer testimonials from Discord +3. Anonymous industry testimonials +4. Community member quotes + +**Template for Gathering**: +> "How has OpenAdapt helped you? Reply to be featured on our website." + +### 8.3 Logo Wall + +**Target logos to seek permission for**: +- Companies using OpenAdapt in production +- Universities using for research +- Partner organizations + +**Fallback** (if no logos available): +- Featured in media logos (if covered) +- Integration partner logos (AWS, Azure, etc.) +- "Trusted by teams at Fortune 500 companies" (if true) + +--- + +## 9. Call-to-Action Strategy + +### 9.1 Primary Conversion Goals + +1. **GitHub star** (low friction, high visibility) +2. **PyPI install** (product usage) +3. **Discord join** (community engagement) +4. **Email signup** (for updates) +5. **Enterprise contact** (revenue) + +### 9.2 CTA Placement + +| Location | Primary CTA | Secondary CTA | +|----------|-------------|---------------| +| Hero | "Get Started" -> Install section | "View on GitHub" | +| After video | "Try it yourself" -> Install | "Join Discord" | +| Developers section | "View Docs" | "Star on GitHub" | +| Enterprise section | "Contact Sales" | "Request Demo" | +| Bottom of page | "Join Discord" | "View Documentation" | +| Sticky header (scroll) | "Get Started" | | + +### 9.3 Email Capture Strategy + +**Current**: "Register for updates" + +**Proposed**: More specific value prop +- "Get early access to new features" +- "Join [X,XXX] developers automating with AI" +- "Subscribe to the OpenAdapt newsletter (monthly, no spam)" + +--- + +## 10. Implementation Priorities + +### 10.1 Phase 1: Quick Wins (1-2 weeks) + +1. **Update hero tagline** to "Teach AI to use any software." +2. **Add "How It Works" section** with 3-step process +3. **Update differentiators** to 4-card grid (current features but better copy) +4. **Add Discord member count** to social proof +5. **Add GitHub contributors count** + +### 10.2 Phase 2: Messaging Clarity (2-4 weeks) + +1. 
**Add "For Developers" section** with code examples and architecture +2. **Add "For Enterprise" section** with privacy/security messaging +3. **Replace generic industry grid** with specific use case examples +4. **Add comparison table** vs. alternatives +5. **Update email signup copy** to be more specific + +### 10.3 Phase 3: Credibility Building (4-8 weeks) + +1. **Add benchmark scores** (once published) +2. **Collect and display testimonials** +3. **Create case studies** (1-2 real examples) +4. **Add logo wall** (if logos available) +5. **Add "Research" or "Publications" section** (if applicable) + +### 10.4 Phase 4: Conversion Optimization (Ongoing) + +1. **A/B test hero messaging** +2. **Track install conversion rates** +3. **Optimize CTA placement** +4. **Add video transcripts/captions for SEO** +5. **Create landing page variants** for different audiences (developer vs. enterprise) + +--- + +## Appendix A: Messaging Don'ts + +- **Don't say "AI for Desktops"** - too vague, doesn't differentiate +- **Don't say "No coding required"** - true for end users, but alienates developers +- **Don't list every industry** - pick 3-4 with real stories +- **Don't hide the CLI** - developers want to see it +- **Don't over-promise** - be honest about current capabilities + +## Appendix B: Technical Content to Add + +1. **Architecture diagram** showing package relationships +2. **Mermaid flowchart** of Record -> Train -> Deploy cycle +3. **Comparison table** of model backends (Claude, GPT, Qwen, etc.) +4. **Benchmark table** showing accuracy scores +5. **API reference link** to documentation site + +## Appendix C: SEO Keywords + +Primary: +- "GUI automation AI" +- "desktop automation AI" +- "RPA alternative AI" +- "VLM GUI agent" +- "open source computer use" + +Secondary: +- "train AI on screenshots" +- "demonstration-based automation" +- "model-agnostic automation" +- "Claude computer use alternative" +- "AI workflow automation" + +--- + +*This document is a living strategy guide. Updates should be made as OpenAdapt capabilities evolve and as user feedback is collected.* diff --git a/docs/design/openadapt-tray.md b/docs/design/openadapt-tray.md new file mode 100644 index 000000000..6e347814d --- /dev/null +++ b/docs/design/openadapt-tray.md @@ -0,0 +1,1220 @@ +# openadapt-tray Package Design + +## Overview + +`openadapt-tray` is a cross-platform system tray application that provides a graphical interface for the OpenAdapt ecosystem. It serves as a thin orchestration layer, allowing users to control recording, monitor training, view captures, and access settings without using the command line. 
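+
+For orientation, the intended install-and-launch flow is sketched below. This is a usage sketch, not published instructions: the `openadapt-tray` console script and the `python -m openadapt_tray` entry point are the ones declared later in this document (see `pyproject.toml` and `__main__.py`).
+
+```bash
+# Install the tray package and launch it (assumes the package is published to PyPI)
+pip install openadapt-tray
+openadapt-tray            # console-script entry point (openadapt_tray.app:main)
+# or, equivalently:
+python -m openadapt_tray
+```
+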
+ +## Legacy Implementation Analysis + +### Current Features (Legacy `openadapt/app/tray.py`) + +The legacy implementation uses **PySide6/Qt** for cross-platform system tray functionality: + +**Architecture:** +- `QSystemTrayIcon` for the system tray icon +- `QMenu` for context menu +- `QDialog` for configuration dialogs (replay strategy, delete confirmation) +- `pyqttoast` for toast notifications +- Multiprocessing pipes (`multiprocessing.Pipe`) for IPC with recording process +- `QThread` + `Worker` pattern for async signal handling +- Platform-specific Dock hiding on macOS via `AppKit` + +**Menu Structure:** +- Record / Stop Recording (toggle) +- Visualize submenu (lists all recordings) +- Replay submenu (lists all recordings, opens strategy dialog) +- Delete submenu (lists all recordings, confirms deletion) +- Quit + +**Key Patterns:** +- `TrackedQAction` - wraps `QAction` to send analytics events via PostHog +- Signal-based state updates (`record.starting`, `record.started`, `record.stopping`, `record.stopped`, `replay.*`) +- Toast notifications for status updates (recording started/stopped, etc.) +- Dashboard launched automatically as a background thread +- Recording process runs in a separate `multiprocessing.Process` + +**Stop Sequences:** +- Typing `oa.stop` or pressing `Ctrl` three times stops recording +- Configurable via `STOP_SEQUENCES` in config + +### Limitations of Legacy Implementation + +1. **Heavyweight dependency** - PySide6 is a large dependency (~100MB+) +2. **No global hotkeys** - Recording can only be stopped via stop sequences or tray menu +3. **Tightly coupled** - Direct imports of internal modules (crud, models, etc.) +4. **No status icons** - Same icon regardless of state +5. **No auto-start** - Manual setup required for login startup +6. **Single dashboard** - Only supports the legacy Next.js dashboard + +## New Architecture Design + +### Design Principles + +1. **Thin wrapper** - Minimal business logic; delegate to CLI or sub-packages +2. **Cross-platform first** - Consistent behavior on macOS, Windows, and Linux +3. **Lightweight** - Prefer smaller dependencies (pystray ~50KB vs PySide6 ~100MB) +4. **Event-driven** - Async status updates via IPC +5. 
**Configurable** - User-customizable hotkeys, icons, and behaviors + +### Package Structure + +``` +openadapt-tray/ +├── src/openadapt_tray/ +│ ├── __init__.py # Package exports, version +│ ├── __main__.py # Entry point: python -m openadapt_tray +│ ├── app.py # Main TrayApplication class +│ ├── menu.py # Menu construction and actions +│ ├── icons.py # Icon loading and status icons +│ ├── notifications.py # Cross-platform notifications +│ ├── shortcuts.py # Global hotkey handling +│ ├── config.py # Tray-specific configuration +│ ├── ipc.py # Inter-process communication +│ ├── state.py # Application state machine +│ └── platform/ +│ ├── __init__.py # Platform detection and abstraction +│ ├── base.py # Abstract base class +│ ├── macos.py # macOS-specific (AppKit, rumps optional) +│ ├── windows.py # Windows-specific (win32api) +│ └── linux.py # Linux-specific (AppIndicator) +├── assets/ +│ ├── icons/ +│ │ ├── idle.png # Default state +│ │ ├── idle@2x.png # Retina support +│ │ ├── recording.png # Recording active +│ │ ├── recording@2x.png +│ │ ├── training.png # Training in progress +│ │ ├── training@2x.png +│ │ ├── error.png # Error state +│ │ └── error@2x.png +│ └── logo.ico # Windows icon format +├── pyproject.toml +├── README.md +└── tests/ + ├── test_app.py + ├── test_menu.py + ├── test_shortcuts.py + └── test_platform.py +``` + +### Dependencies + +**Required:** +```toml +[project] +dependencies = [ + "pystray>=0.19.0", # Cross-platform system tray + "Pillow>=9.0.0", # Icon handling + "pynput>=1.7.0", # Global hotkeys + "click>=8.0.0", # CLI integration (consistent with meta-package) +] +``` + +**Optional Platform Enhancements:** +```toml +[project.optional-dependencies] +macos-native = [ + "rumps>=0.4.0", # Native macOS menu bar +] +all = [ + "openadapt-tray[macos-native]", +] +``` + +**Why pystray over PySide6/Qt:** +- Dramatically smaller (~50KB vs ~100MB) +- Pure Python, easier to install +- Sufficient for system tray use case +- Works well with pynput for hotkeys + +### Core Components + +#### 1. State Machine (`state.py`) + +```python +from enum import Enum, auto +from dataclasses import dataclass +from typing import Optional, Callable + +class TrayState(Enum): + """Application states.""" + IDLE = auto() + RECORDING_STARTING = auto() + RECORDING = auto() + RECORDING_STOPPING = auto() + TRAINING = auto() + TRAINING_PAUSED = auto() + ERROR = auto() + +@dataclass +class AppState: + """Current application state.""" + state: TrayState = TrayState.IDLE + current_capture: Optional[str] = None + training_progress: Optional[float] = None + error_message: Optional[str] = None + + def can_start_recording(self) -> bool: + return self.state == TrayState.IDLE + + def can_stop_recording(self) -> bool: + return self.state == TrayState.RECORDING + +class StateManager: + """Manages application state transitions.""" + + def __init__(self): + self._state = AppState() + self._listeners: list[Callable[[AppState], None]] = [] + + def add_listener(self, callback: Callable[[AppState], None]): + self._listeners.append(callback) + + def transition(self, new_state: TrayState, **kwargs): + """Transition to a new state and notify listeners.""" + self._state = AppState(state=new_state, **kwargs) + for listener in self._listeners: + listener(self._state) + + @property + def current(self) -> AppState: + return self._state +``` + +#### 2. 
Main Application (`app.py`) + +```python +import sys +import threading +from typing import Optional + +import pystray +from PIL import Image + +from openadapt_tray.state import StateManager, TrayState +from openadapt_tray.menu import MenuBuilder +from openadapt_tray.icons import IconManager +from openadapt_tray.shortcuts import HotkeyManager +from openadapt_tray.notifications import NotificationManager +from openadapt_tray.ipc import IPCClient +from openadapt_tray.config import TrayConfig +from openadapt_tray.platform import get_platform_handler + +class TrayApplication: + """Main system tray application.""" + + def __init__(self, config: Optional[TrayConfig] = None): + self.config = config or TrayConfig.load() + self.state = StateManager() + self.platform = get_platform_handler() + + # Initialize components + self.icons = IconManager() + self.notifications = NotificationManager() + self.menu_builder = MenuBuilder(self) + self.hotkeys = HotkeyManager(self.config.hotkeys) + self.ipc = IPCClient() + + # Create tray icon + self.icon = pystray.Icon( + name="openadapt", + icon=self.icons.get(TrayState.IDLE), + title="OpenAdapt", + menu=self.menu_builder.build(), + ) + + # Register state change handler + self.state.add_listener(self._on_state_change) + + # Register hotkey handlers + self._setup_hotkeys() + + def _setup_hotkeys(self): + """Configure global hotkeys.""" + self.hotkeys.register( + self.config.hotkeys.toggle_recording, + self._toggle_recording + ) + self.hotkeys.register( + self.config.hotkeys.open_dashboard, + self._open_dashboard + ) + + def _on_state_change(self, state): + """Handle state changes.""" + # Update icon + self.icon.icon = self.icons.get(state.state) + + # Update menu + self.icon.menu = self.menu_builder.build() + + # Show notification if appropriate + self._show_state_notification(state) + + def _show_state_notification(self, state): + """Show notification for state transitions.""" + messages = { + TrayState.RECORDING: ("Recording Started", f"Capturing: {state.current_capture}"), + TrayState.IDLE: ("Recording Stopped", "Capture saved"), + TrayState.TRAINING: ("Training Started", "Model training in progress"), + TrayState.ERROR: ("Error", state.error_message or "An error occurred"), + } + if state.state in messages: + title, body = messages[state.state] + self.notifications.show(title, body) + + def _toggle_recording(self): + """Toggle recording state.""" + if self.state.current.can_start_recording(): + self.start_recording() + elif self.state.current.can_stop_recording(): + self.stop_recording() + + def start_recording(self, name: Optional[str] = None): + """Start a new capture session.""" + if not self.state.current.can_start_recording(): + return + + # Prompt for name if not provided (platform-specific) + if name is None: + name = self.platform.prompt_input( + "New Recording", + "Enter a name for this capture:" + ) + if not name: + return + + self.state.transition(TrayState.RECORDING_STARTING, current_capture=name) + + # Start capture via CLI subprocess or direct API + threading.Thread( + target=self._run_capture, + args=(name,), + daemon=True + ).start() + + def _run_capture(self, name: str): + """Run capture in background thread.""" + try: + # Option 1: Via subprocess (preferred for isolation) + import subprocess + self.capture_process = subprocess.Popen( + ["openadapt", "capture", "start", "--name", name], + stdout=subprocess.PIPE, + stderr=subprocess.PIPE, + ) + self.state.transition(TrayState.RECORDING, current_capture=name) + + except Exception as e: + 
self.state.transition(TrayState.ERROR, error_message=str(e)) + + def stop_recording(self): + """Stop the current capture session.""" + if not self.state.current.can_stop_recording(): + return + + self.state.transition(TrayState.RECORDING_STOPPING) + + # Send stop signal to capture process + if hasattr(self, 'capture_process') and self.capture_process: + self.capture_process.terminate() + + self.state.transition(TrayState.IDLE) + + def _open_dashboard(self): + """Open the web dashboard.""" + import webbrowser + webbrowser.open(f"http://localhost:{self.config.dashboard_port}") + + def run(self): + """Run the application.""" + # Start hotkey listener + self.hotkeys.start() + + # Platform-specific setup + self.platform.setup() + + # Run the tray icon (blocks) + self.icon.run() + + def quit(self): + """Quit the application.""" + self.hotkeys.stop() + self.ipc.close() + self.icon.stop() + +def main(): + """Entry point.""" + app = TrayApplication() + try: + app.run() + except KeyboardInterrupt: + app.quit() + +if __name__ == "__main__": + main() +``` + +#### 3. Menu Builder (`menu.py`) + +```python +from typing import TYPE_CHECKING, Callable, Optional +from functools import partial + +import pystray +from pystray import MenuItem as Item, Menu + +if TYPE_CHECKING: + from openadapt_tray.app import TrayApplication + +from openadapt_tray.state import TrayState + +class MenuBuilder: + """Builds the system tray context menu.""" + + def __init__(self, app: "TrayApplication"): + self.app = app + + def build(self) -> Menu: + """Build the current menu based on application state.""" + state = self.app.state.current + + items = [ + self._build_recording_item(state), + Menu.SEPARATOR, + self._build_captures_submenu(), + self._build_training_item(state), + Menu.SEPARATOR, + Item("Open Dashboard", self._open_dashboard), + Item("Settings...", self._open_settings), + Menu.SEPARATOR, + Item("Quit", self._quit), + ] + + return Menu(*items) + + def _build_recording_item(self, state) -> Item: + """Build record/stop recording menu item.""" + if state.state == TrayState.RECORDING: + return Item( + f"Stop Recording ({state.current_capture})", + self.app.stop_recording, + ) + elif state.state in (TrayState.RECORDING_STARTING, TrayState.RECORDING_STOPPING): + return Item( + "Recording..." 
if state.state == TrayState.RECORDING_STARTING else "Stopping...", + None, + enabled=False, + ) + else: + return Item( + f"Start Recording ({self.app.config.hotkeys.toggle_recording})", + self.app.start_recording, + enabled=state.can_start_recording(), + ) + + def _build_captures_submenu(self) -> Item: + """Build captures submenu.""" + captures = self._get_recent_captures() + + if not captures: + return Item( + "Recent Captures", + Menu(Item("No captures", None, enabled=False)), + ) + + capture_items = [ + Item( + f"{c.name} ({c.timestamp})", + Menu( + Item("View", partial(self._view_capture, c.path)), + Item("Delete", partial(self._delete_capture, c.path)), + ), + ) + for c in captures[:10] # Limit to 10 most recent + ] + + capture_items.append(Menu.SEPARATOR) + capture_items.append(Item("View All...", self._open_captures_list)) + + return Item("Recent Captures", Menu(*capture_items)) + + def _build_training_item(self, state) -> Item: + """Build training status/control item.""" + if state.state == TrayState.TRAINING: + progress = state.training_progress or 0 + return Item( + f"Training: {progress:.0%}", + Menu( + Item("View Progress", self._open_training_dashboard), + Item("Stop Training", self._stop_training), + ), + ) + else: + return Item( + "Training", + Menu( + Item("Start Training...", self._start_training), + Item("View Last Results", self._view_training_results), + ), + ) + + def _get_recent_captures(self): + """Get list of recent captures.""" + try: + from pathlib import Path + from openadapt_tray.config import TrayConfig + + captures_dir = Path(TrayConfig.load().captures_directory) + if not captures_dir.exists(): + return [] + + # Simple capture detection - look for capture directories + captures = [] + for d in sorted(captures_dir.iterdir(), key=lambda x: x.stat().st_mtime, reverse=True): + if d.is_dir() and (d / "metadata.json").exists(): + from dataclasses import dataclass + from datetime import datetime + + @dataclass + class CaptureInfo: + name: str + path: str + timestamp: str + + mtime = datetime.fromtimestamp(d.stat().st_mtime) + captures.append(CaptureInfo( + name=d.name, + path=str(d), + timestamp=mtime.strftime("%Y-%m-%d %H:%M"), + )) + + return captures + except Exception: + return [] + + def _open_dashboard(self): + self.app._open_dashboard() + + def _open_settings(self): + """Open settings dialog.""" + self.app.platform.open_settings_dialog(self.app.config) + + def _quit(self): + self.app.quit() + + def _view_capture(self, path: str): + """View a capture.""" + import subprocess + subprocess.run(["openadapt", "capture", "view", path]) + + def _delete_capture(self, path: str): + """Delete a capture after confirmation.""" + if self.app.platform.confirm_dialog( + "Delete Capture", + f"Are you sure you want to delete this capture?\n{path}" + ): + import shutil + shutil.rmtree(path) + self.app.notifications.show("Capture Deleted", "The capture has been removed.") + + def _open_captures_list(self): + """Open captures list in dashboard.""" + import webbrowser + webbrowser.open(f"http://localhost:{self.app.config.dashboard_port}/captures") + + def _open_training_dashboard(self): + """Open training dashboard.""" + import webbrowser + webbrowser.open(f"http://localhost:{self.app.config.dashboard_port}/training") + + def _start_training(self): + """Open training configuration dialog.""" + # This would open a dialog to select capture and model + self.app.platform.open_training_dialog() + + def _stop_training(self): + """Stop current training.""" + import subprocess + 
subprocess.run(["openadapt", "train", "stop"])
+        self.app.state.transition(TrayState.IDLE)
+
+    def _view_training_results(self):
+        """View last training results."""
+        import subprocess
+        subprocess.run(["openadapt", "train", "status"])
+```
+
+#### 4. Global Hotkeys (`shortcuts.py`)
+
+```python
+from dataclasses import dataclass
+from typing import Callable, Dict, Optional
+import threading
+
+from pynput import keyboard
+
+@dataclass
+class HotkeyConfig:
+    """Hotkey configuration (pynput GlobalHotKeys combo strings)."""
+    toggle_recording: str = "<ctrl>+<shift>+r"
+    open_dashboard: str = "<ctrl>+<shift>+d"
+    stop_recording: str = "<ctrl>+<ctrl>+<ctrl>"  # Triple ctrl (legacy compat)
+
+class HotkeyManager:
+    """Manages global hotkeys."""
+
+    def __init__(self, config: Optional[HotkeyConfig] = None):
+        self.config = config or HotkeyConfig()
+        self._handlers: Dict[str, Callable] = {}
+        self._listener: Optional[keyboard.GlobalHotKeys] = None
+        self._ctrl_count = 0
+        self._ctrl_timer: Optional[threading.Timer] = None
+
+    def register(self, hotkey: str, handler: Callable):
+        """Register a hotkey handler."""
+        self._handlers[hotkey] = handler
+
+    def start(self):
+        """Start listening for hotkeys."""
+        # Build hotkey dict for pynput
+        hotkeys = {}
+        for combo, handler in self._handlers.items():
+            if combo == "<ctrl>+<ctrl>+<ctrl>":
+                # Special handling for triple-ctrl (not a valid GlobalHotKeys combo)
+                continue
+            hotkeys[combo] = handler
+
+        self._listener = keyboard.GlobalHotKeys(hotkeys)
+        self._listener.start()
+
+        # Also listen for triple-ctrl pattern
+        if "<ctrl>+<ctrl>+<ctrl>" in self._handlers:
+            self._start_ctrl_listener()
+
+    def _start_ctrl_listener(self):
+        """Start listener for triple-ctrl pattern."""
+        def on_press(key):
+            if key == keyboard.Key.ctrl_l or key == keyboard.Key.ctrl_r:
+                self._on_ctrl_press()
+
+        def on_release(key):
+            pass
+
+        self._key_listener = keyboard.Listener(
+            on_press=on_press,
+            on_release=on_release,
+        )
+        self._key_listener.start()
+
+    def _on_ctrl_press(self):
+        """Handle ctrl key press for triple-ctrl detection."""
+        self._ctrl_count += 1
+
+        # Reset timer
+        if self._ctrl_timer:
+            self._ctrl_timer.cancel()
+
+        if self._ctrl_count >= 3:
+            self._ctrl_count = 0
+            handler = self._handlers.get("<ctrl>+<ctrl>+<ctrl>")
+            if handler:
+                handler()
+        else:
+            # Reset count after 500ms
+            self._ctrl_timer = threading.Timer(0.5, self._reset_ctrl_count)
+            self._ctrl_timer.start()
+
+    def _reset_ctrl_count(self):
+        self._ctrl_count = 0
+
+    def stop(self):
+        """Stop listening for hotkeys."""
+        if self._listener:
+            self._listener.stop()
+        if hasattr(self, '_key_listener'):
+            self._key_listener.stop()
+        if self._ctrl_timer:
+            self._ctrl_timer.cancel()
+```
+
+#### 5. Platform Abstraction (`platform/`)
+
+**Base class (`platform/base.py`):**
+
+```python
+from abc import ABC, abstractmethod
+from typing import Optional
+
+class PlatformHandler(ABC):
+    """Abstract base class for platform-specific functionality."""
+
+    @abstractmethod
+    def setup(self):
+        """Platform-specific setup."""
+        pass
+
+    @abstractmethod
+    def prompt_input(self, title: str, message: str) -> Optional[str]:
+        """Show input dialog and return user input."""
+        pass
+
+    @abstractmethod
+    def confirm_dialog(self, title: str, message: str) -> bool:
+        """Show confirmation dialog and return result."""
+        pass
+
+    @abstractmethod
+    def open_settings_dialog(self, config):
+        """Open settings dialog."""
+        pass
+
+    @abstractmethod
+    def open_training_dialog(self):
+        """Open training configuration dialog."""
+        pass
+
+    def setup_autostart(self, enabled: bool):
+        """Configure auto-start on login."""
+        pass
+```
+
+**macOS implementation (`platform/macos.py`):**
+
+```python
+import subprocess
+from typing import Optional
+
+from .base import PlatformHandler
+
+class MacOSHandler(PlatformHandler):
+    """macOS-specific functionality."""
+
+    def setup(self):
+        """Hide from Dock, show only in menu bar."""
+        try:
+            from AppKit import NSApplication, NSApplicationActivationPolicyAccessory
+            NSApplication.sharedApplication().setActivationPolicy_(
+                NSApplicationActivationPolicyAccessory
+            )
+        except ImportError:
+            pass  # AppKit not available
+
+    def prompt_input(self, title: str, message: str) -> Optional[str]:
+        """Show native macOS input dialog."""
+        script = f'''
+        tell application "System Events"
+            display dialog "{message}" default answer "" with title "{title}"
+            return text returned of result
+        end tell
+        '''
+        try:
+            result = subprocess.run(
+                ["osascript", "-e", script],
+                capture_output=True,
+                text=True,
+            )
+            if result.returncode == 0:
+                return result.stdout.strip()
+        except Exception:
+            pass
+        return None
+
+    def confirm_dialog(self, title: str, message: str) -> bool:
+        """Show native macOS confirmation dialog."""
+        script = f'''
+        tell application "System Events"
+            display dialog "{message}" with title "{title}" buttons {{"Cancel", "OK"}} default button "OK"
+            return button returned of result
+        end tell
+        '''
+        try:
+            result = subprocess.run(
+                ["osascript", "-e", script],
+                capture_output=True,
+                text=True,
+            )
+            return result.returncode == 0 and "OK" in result.stdout
+        except Exception:
+            return False
+
+    def open_settings_dialog(self, config):
+        """Open settings in default browser."""
+        import webbrowser
+        webbrowser.open(f"http://localhost:{config.dashboard_port}/settings")
+
+    def open_training_dialog(self):
+        """Open training dialog in browser."""
+        import webbrowser
+        webbrowser.open("http://localhost:8080/training/new")
+
+    def setup_autostart(self, enabled: bool):
+        """Configure Launch Agent for auto-start."""
+        from pathlib import Path
+
+        plist_path = Path.home() / "Library/LaunchAgents/ai.openadapt.tray.plist"
+
+        if enabled:
+            plist_content = '''<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
+<plist version="1.0">
+<dict>
+    <key>Label</key>
+    <string>ai.openadapt.tray</string>
+    <key>ProgramArguments</key>
+    <array>
+        <string>/usr/local/bin/openadapt-tray</string>
+    </array>
+    <key>RunAtLoad</key>
+    <true/>
+    <key>KeepAlive</key>
+    <false/>
+</dict>
+</plist>
+'''
+            plist_path.parent.mkdir(parents=True, exist_ok=True)
+            plist_path.write_text(plist_content)
+            subprocess.run(["launchctl", "load", str(plist_path)])
+        else:
+            if plist_path.exists():
+                subprocess.run(["launchctl", "unload", str(plist_path)])
+                plist_path.unlink()
+```
+
+**Windows implementation (`platform/windows.py`):**
+
+```python
+import ctypes
+from typing import Optional
+
+from .base import
PlatformHandler + +class WindowsHandler(PlatformHandler): + """Windows-specific functionality.""" + + def setup(self): + """Windows-specific setup.""" + pass # No special setup needed + + def prompt_input(self, title: str, message: str) -> Optional[str]: + """Show Windows input dialog using ctypes.""" + try: + import tkinter as tk + from tkinter import simpledialog + + root = tk.Tk() + root.withdraw() + result = simpledialog.askstring(title, message) + root.destroy() + return result + except Exception: + return None + + def confirm_dialog(self, title: str, message: str) -> bool: + """Show Windows confirmation dialog.""" + MB_OKCANCEL = 0x01 + MB_ICONQUESTION = 0x20 + IDOK = 1 + + result = ctypes.windll.user32.MessageBoxW( + 0, message, title, MB_OKCANCEL | MB_ICONQUESTION + ) + return result == IDOK + + def open_settings_dialog(self, config): + import webbrowser + webbrowser.open(f"http://localhost:{config.dashboard_port}/settings") + + def open_training_dialog(self): + import webbrowser + webbrowser.open("http://localhost:8080/training/new") + + def setup_autostart(self, enabled: bool): + """Configure Windows Registry for auto-start.""" + import winreg + + key_path = r"Software\Microsoft\Windows\CurrentVersion\Run" + app_name = "OpenAdapt" + + try: + key = winreg.OpenKey(winreg.HKEY_CURRENT_USER, key_path, 0, winreg.KEY_ALL_ACCESS) + + if enabled: + import sys + exe_path = sys.executable.replace("python.exe", "Scripts\\openadapt-tray.exe") + winreg.SetValueEx(key, app_name, 0, winreg.REG_SZ, exe_path) + else: + try: + winreg.DeleteValue(key, app_name) + except FileNotFoundError: + pass + + winreg.CloseKey(key) + except Exception: + pass +``` + +#### 6. Configuration (`config.py`) + +```python +from dataclasses import dataclass, field +from pathlib import Path +from typing import Optional +import json + +from openadapt_tray.shortcuts import HotkeyConfig + +@dataclass +class TrayConfig: + """Tray application configuration.""" + + # Hotkeys + hotkeys: HotkeyConfig = field(default_factory=HotkeyConfig) + + # Paths + captures_directory: str = "~/openadapt/captures" + training_output_directory: str = "~/openadapt/training" + + # Dashboard + dashboard_port: int = 8080 + auto_launch_dashboard: bool = True + + # Behavior + auto_start_on_login: bool = False + minimize_to_tray: bool = True + show_notifications: bool = True + notification_duration_ms: int = 5000 + + # Recording + default_record_audio: bool = True + default_transcribe: bool = True + stop_on_triple_ctrl: bool = True + + # Appearance + use_native_dialogs: bool = True + + @classmethod + def config_path(cls) -> Path: + """Get configuration file path.""" + return Path.home() / ".config" / "openadapt" / "tray.json" + + @classmethod + def load(cls) -> "TrayConfig": + """Load configuration from file.""" + path = cls.config_path() + if path.exists(): + try: + data = json.loads(path.read_text()) + hotkeys_data = data.pop("hotkeys", {}) + return cls( + hotkeys=HotkeyConfig(**hotkeys_data), + **data + ) + except Exception: + pass + return cls() + + def save(self): + """Save configuration to file.""" + path = self.config_path() + path.parent.mkdir(parents=True, exist_ok=True) + + data = { + "hotkeys": { + "toggle_recording": self.hotkeys.toggle_recording, + "open_dashboard": self.hotkeys.open_dashboard, + "stop_recording": self.hotkeys.stop_recording, + }, + "captures_directory": self.captures_directory, + "training_output_directory": self.training_output_directory, + "dashboard_port": self.dashboard_port, + "auto_launch_dashboard": 
self.auto_launch_dashboard, + "auto_start_on_login": self.auto_start_on_login, + "minimize_to_tray": self.minimize_to_tray, + "show_notifications": self.show_notifications, + "notification_duration_ms": self.notification_duration_ms, + "default_record_audio": self.default_record_audio, + "default_transcribe": self.default_transcribe, + "stop_on_triple_ctrl": self.stop_on_triple_ctrl, + "use_native_dialogs": self.use_native_dialogs, + } + + path.write_text(json.dumps(data, indent=2)) +``` + +#### 7. Notifications (`notifications.py`) + +```python +import sys +from typing import Optional + +class NotificationManager: + """Cross-platform notification manager.""" + + def __init__(self): + self._backend = self._detect_backend() + + def _detect_backend(self) -> str: + """Detect best notification backend for platform.""" + if sys.platform == "darwin": + return "macos" + elif sys.platform == "win32": + return "windows" + else: + return "linux" + + def show( + self, + title: str, + body: str, + icon_path: Optional[str] = None, + duration_ms: int = 5000, + ): + """Show a notification.""" + if self._backend == "macos": + self._show_macos(title, body) + elif self._backend == "windows": + self._show_windows(title, body, icon_path, duration_ms) + else: + self._show_linux(title, body, icon_path) + + def _show_macos(self, title: str, body: str): + """Show notification on macOS.""" + import subprocess + script = f''' + display notification "{body}" with title "{title}" + ''' + subprocess.run(["osascript", "-e", script], capture_output=True) + + def _show_windows(self, title: str, body: str, icon_path: Optional[str], duration_ms: int): + """Show notification on Windows using pystray's built-in notify.""" + # pystray handles this via icon.notify() + pass + + def _show_linux(self, title: str, body: str, icon_path: Optional[str]): + """Show notification on Linux.""" + try: + import subprocess + cmd = ["notify-send", title, body] + if icon_path: + cmd.extend(["-i", icon_path]) + subprocess.run(cmd, capture_output=True) + except Exception: + pass +``` + +### pyproject.toml + +```toml +[project] +name = "openadapt-tray" +version = "0.1.0" +description = "System tray application for OpenAdapt" +readme = "README.md" +requires-python = ">=3.10" +license = "MIT" +authors = [ + {name = "MLDSAI Inc.", email = "richard@mldsai.com"} +] +keywords = ["gui", "system-tray", "menu-bar", "openadapt"] +classifiers = [ + "Development Status :: 3 - Alpha", + "Intended Audience :: Developers", + "License :: OSI Approved :: MIT License", + "Operating System :: MacOS", + "Operating System :: Microsoft :: Windows", + "Operating System :: POSIX :: Linux", + "Programming Language :: Python :: 3", + "Programming Language :: Python :: 3.10", + "Programming Language :: Python :: 3.11", + "Programming Language :: Python :: 3.12", +] + +dependencies = [ + "pystray>=0.19.0", + "Pillow>=9.0.0", + "pynput>=1.7.0", + "click>=8.0.0", +] + +[project.optional-dependencies] +macos-native = [ + "rumps>=0.4.0", + "pyobjc-framework-Cocoa>=9.0", +] +dev = [ + "pytest>=8.0.0", + "pytest-mock>=3.10.0", + "ruff>=0.1.0", +] +all = [ + "openadapt-tray[macos-native]", +] + +[project.scripts] +openadapt-tray = "openadapt_tray.app:main" + +[project.gui-scripts] +openadapt-tray-gui = "openadapt_tray.app:main" + +[project.urls] +Homepage = "https://openadapt.ai" +Documentation = "https://docs.openadapt.ai" +Repository = "https://github.com/OpenAdaptAI/openadapt-tray" + +[build-system] +requires = ["hatchling"] +build-backend = "hatchling.build" + 
+[tool.hatch.build.targets.wheel] +packages = ["src/openadapt_tray"] + +[tool.ruff] +line-length = 88 +target-version = "py310" + +[tool.pytest.ini_options] +testpaths = ["tests"] +``` + +## User Experience + +### First-Run Experience + +1. **Installation**: `pip install openadapt-tray` +2. **Launch**: `openadapt-tray` or via Applications menu +3. **First Run Dialog** (if no config exists): + - Welcome message + - Option to configure hotkeys + - Option to enable auto-start + - Link to documentation +4. **Tray Icon**: Appears in system tray/menu bar +5. **Dashboard**: Auto-opens (configurable) + +### Menu Structure + +``` +[OpenAdapt Tray Icon] +├── Start Recording (Ctrl+Shift+R) +│ └── [When recording: "Stop Recording (task-name)"] +├── ───────────── +├── Recent Captures +│ ├── login-flow (2024-01-15 14:30) +│ │ ├── View +│ │ └── Delete +│ ├── checkout (2024-01-15 10:15) +│ │ ├── View +│ │ └── Delete +│ ├── ... (up to 10 items) +│ ├── ───────────── +│ └── View All... +├── Training +│ ├── Start Training... +│ └── View Last Results +│ └── [When training: "Training: 45% | View Progress | Stop"] +├── ───────────── +├── Open Dashboard (Ctrl+Shift+D) +├── Settings... +├── ───────────── +└── Quit +``` + +### Status Icons + +| State | Icon Description | Color | +|-------|------------------|-------| +| Idle | OpenAdapt logo | Blue/Gray | +| Recording | Pulsing red dot overlay | Red | +| Recording Starting | Spinning indicator | Yellow | +| Training | Gear icon | Purple | +| Error | Exclamation mark | Red | + +### Keyboard Shortcuts + +| Action | Default Shortcut | Configurable | +|--------|------------------|--------------| +| Toggle Recording | `Ctrl+Shift+R` | Yes | +| Open Dashboard | `Ctrl+Shift+D` | Yes | +| Stop Recording | `Ctrl Ctrl Ctrl` (triple tap) | Yes | + +### Notifications + +| Event | Title | Body | +|-------|-------|------| +| Recording Started | "Recording Started" | "Capturing: {task-name}" | +| Recording Stopped | "Recording Stopped" | "Capture saved" | +| Training Started | "Training Started" | "Model training in progress" | +| Training Complete | "Training Complete" | "Model saved to {path}" | +| Error | "Error" | "{error message}" | + +## Integration with Ecosystem + +### CLI Integration + +The tray app delegates to the `openadapt` CLI for all operations: + +```python +# Starting a capture +subprocess.Popen(["openadapt", "capture", "start", "--name", name]) + +# Stopping a capture +subprocess.Popen(["openadapt", "capture", "stop"]) + +# Starting training +subprocess.Popen(["openadapt", "train", "start", "--capture", capture_path]) + +# Checking training status +result = subprocess.run(["openadapt", "train", "status"], capture_output=True) +``` + +### Direct API Integration (Alternative) + +For tighter integration, the tray can import sub-packages directly: + +```python +try: + from openadapt_capture import CaptureSession + + session = CaptureSession(name=name, record_audio=True) + session.start() +except ImportError: + # Fall back to CLI + subprocess.Popen(["openadapt", "capture", "start", "--name", name]) +``` + +### Dashboard Integration + +- Auto-launches the dashboard web server on startup (configurable) +- "Open Dashboard" opens browser to `http://localhost:8080` +- Settings page accessible via tray menu + +## Future Enhancements + +1. **Native macOS app** using `rumps` for a more native feel +2. **Electron wrapper** for consistent cross-platform UI +3. **Recording preview** - show recent screenshot in menu +4. **Quick actions** - right-click for immediate actions +5. 
**Status bar text** - show recording duration on macOS +6. **Multi-monitor support** - select which monitor to record +7. **Cloud sync** - sync captures and settings across devices +8. **Plugin system** - allow third-party menu extensions + +## Migration from Legacy + +### Compatibility + +The new tray app maintains backward compatibility with: +- Legacy stop sequences (`oa.stop`, triple-ctrl) +- PostHog analytics events +- Configuration file locations + +### Migration Path + +1. Install `openadapt-tray` alongside legacy +2. Both can coexist (different process names) +3. Legacy can be deprecated when new tray is stable +4. Configuration migration script provided + +--- + +*This design enables a lightweight, cross-platform system tray experience while maintaining integration with the OpenAdapt ecosystem's CLI-first architecture.* diff --git a/docs/design/repo-rename-analysis.md b/docs/design/repo-rename-analysis.md new file mode 100644 index 000000000..e66dac023 --- /dev/null +++ b/docs/design/repo-rename-analysis.md @@ -0,0 +1,286 @@ +# Repository Rename Analysis: OpenAdapt to openadapt + +**Date:** January 2026 +**Status:** Decision Document +**Author:** Engineering Team + +--- + +## Executive Summary + +This document analyzes whether to rename the main OpenAdapt GitHub repository from `OpenAdapt` (mixed case) to `openadapt` (lowercase) to align with Python conventions and existing sub-packages. + +**Recommendation: DO NOT RENAME at this time.** + +The costs and risks of renaming outweigh the benefits. The minor consistency improvement does not justify the potential for broken links, documentation updates, and brand dilution. + +--- + +## Current State + +| Component | Current Name | Case | +|-----------|-------------|------| +| **Main Repository** | `OpenAdaptAI/OpenAdapt` | Mixed | +| **GitHub Organization** | `OpenAdaptAI` | Mixed | +| **Sub-packages** | `openadapt-ml`, `openadapt-capture`, etc. | Lowercase | +| **PyPI Package** | `openadapt` | Lowercase | +| **Python Imports** | `import openadapt` | Lowercase | +| **pyproject.toml Repository URL** | Already points to `openadapt` (lowercase) | Lowercase | + +**Key Observation:** The `pyproject.toml` already uses lowercase in the Repository URL: +```toml +Repository = "https://github.com/OpenAdaptAI/openadapt" +``` + +This suggests the team anticipated or intended lowercase naming, but GitHub currently shows `OpenAdapt`. + +--- + +## Industry Research: How Major Python Projects Handle Repository Naming + +| Project | Organization | Repository | PyPI Package | Notes | +|---------|-------------|------------|--------------|-------| +| **LangChain** | `langchain-ai` | `langchain` | `langchain` | All lowercase | +| **PyTorch** | `pytorch` | `pytorch` | `torch` | All lowercase | +| **TensorFlow** | `tensorflow` | `tensorflow` | `tensorflow` | All lowercase | +| **Hugging Face** | `huggingface` | `transformers` | `transformers` | All lowercase | +| **FastAPI** | `tiangolo` | `fastapi` | `fastapi` | All lowercase | +| **scikit-learn** | `scikit-learn` | `scikit-learn` | `scikit-learn` | All lowercase with hyphen | + +**Conclusion:** The overwhelming convention in Python open-source projects is **all lowercase** for repository names. 
+ +--- + +## GitHub Redirect Behavior + +Based on [GitHub's documentation](https://docs.github.com/en/repositories/creating-and-managing-repositories/renaming-a-repository): + +### What Gets Redirected (Indefinitely) +- Web traffic to the old URL +- `git clone`, `git fetch`, `git push` operations +- Issues, wikis, stars, followers + +### What Breaks Immediately +- **GitHub Actions** referencing the repository by name will fail with "repository not found" +- **GitHub Pages** custom domain URLs are not automatically redirected + +### Redirect Persistence +- Redirects persist **indefinitely** unless: + 1. A new repository is created with the old name + 2. GitHub support is asked to remove them + +### Important Warning +From [GitHub Community discussions](https://github.com/orgs/community/discussions/22669): "If you create a new repository under your account in the future, do not reuse the original name of the renamed repository. If you do, redirects to the renamed repository will no longer work." + +--- + +## Detailed Analysis + +### Arguments FOR Renaming to Lowercase + +| Argument | Weight | Rationale | +|----------|--------|-----------| +| **Consistency with sub-packages** | Medium | All sub-packages use lowercase (`openadapt-ml`, `openadapt-capture`, etc.) | +| **Python convention** | Medium | Standard practice in Python ecosystem (see industry research) | +| **PyPI alignment** | Medium | Package name is `openadapt` (lowercase) | +| **Import alignment** | Low | `import openadapt` works regardless of repo name | +| **URL simplicity** | Low | `github.com/OpenAdaptAI/openadapt` slightly cleaner | +| **Already in pyproject.toml** | High | Repository URL already shows lowercase intent | + +### Arguments AGAINST Renaming + +| Argument | Weight | Rationale | +|----------|--------|-----------| +| **Brand recognition** | High | "OpenAdapt" as two words (Open + Adapt) reinforces brand identity | +| **Breaking changes risk** | High | External links, bookmarks, documentation, blog posts, academic citations | +| **GitHub org inconsistency** | Medium | Organization is `OpenAdaptAI` (mixed case) - renaming repo creates inconsistency | +| **Documentation updates** | Medium | 1,343 occurrences of "OpenAdapt" across 78 files need review | +| **SEO impact** | Medium | Existing search rankings tied to "OpenAdapt" | +| **Minimal actual benefit** | High | GitHub URLs are case-insensitive for access purposes | +| **Legacy code references** | Medium | Legacy directory has extensive "OpenAdapt" references | + +--- + +## Technical Impact Assessment + +### Files Requiring Updates if Renamed + +Based on codebase analysis: + +| Category | File Count | Occurrences | Update Required? | +|----------|------------|-------------|------------------| +| Documentation (*.md) | 37 | ~200+ | Review each | +| GitHub workflows (*.yml) | 10 | ~50+ | Critical review | +| Python source files | 15 | ~50+ | Review imports | +| Configuration files | 5 | ~20+ | Review URLs | +| Legacy code | 20+ | ~900+ | May leave as-is | + +### CI/CD Impact + +Current workflows use relative paths and don't hard-code the repository name, so **minimal CI/CD impact expected**. + +However, any external workflows or actions referencing `OpenAdaptAI/OpenAdapt` would need updates. 
+ +### Impact on Forks and Clones + +- **Existing clones:** Continue working via redirects, but should update with `git remote set-url` +- **Existing forks:** Maintain their existing names and remotes +- **New forks:** Would fork from the new lowercase name + +--- + +## Risk Assessment + +| Risk | Likelihood | Impact | Mitigation | +|------|------------|--------|------------| +| Broken external links | Medium | Medium | GitHub redirects handle most cases | +| Academic citation issues | Low | Medium | Papers cite DOIs or specific versions | +| SEO ranking drop | Low | Low | Temporary if any; redirects preserve link equity | +| User confusion | Medium | Low | Clear communication and documentation | +| GitHub Actions failures | Low | High | Audit and update before rename | +| Brand dilution | Medium | Medium | None - cannot mitigate if lowercase chosen | + +--- + +## Alternative Approaches + +### Option A: Do Nothing (RECOMMENDED) +- Keep repository as `OpenAdapt` +- Accept minor inconsistency with sub-packages +- No risk, no disruption + +### Option B: Rename to Lowercase +- Change repository to `openadapt` +- Update documentation +- Communicate to users +- Accept brand/visual trade-off + +### Option C: Rename Organization and Repository +- Change `OpenAdaptAI` to `openadaptai` +- Change `OpenAdapt` to `openadapt` +- Complete consistency, but much higher disruption +- **NOT RECOMMENDED** - organization rename is significantly more disruptive + +### Option D: Create Alias via Transfer +- Transfer repository to a new `openadapt` repo +- Keep `OpenAdapt` as a redirect-only stub +- **NOT RECOMMENDED** - unnecessarily complex + +--- + +## Recommendation + +**Recommendation: Do Not Rename (Option A)** + +### Rationale + +1. **GitHub URLs are case-insensitive** - Users can access via `github.com/OpenAdaptAI/openadapt` or `github.com/openadaptai/OpenAdapt` interchangeably + +2. **Brand value** - "OpenAdapt" with capitalization clearly shows the "Open" + "Adapt" word composition, which is meaningful for the project's identity + +3. **Risk/benefit ratio** - The benefits are cosmetic while the risks (broken links, confusion, documentation churn) are concrete + +4. **Organization inconsistency** - Renaming only the repo while keeping `OpenAdaptAI` creates a new inconsistency + +5. **Industry examples** - While most Python projects use lowercase, several successful projects (like early versions of major projects) maintained mixed-case names without issue + +6. **pyproject.toml already lowercase** - The `Repository` URL in `pyproject.toml` already shows lowercase, providing implicit consistency for programmatic access + +--- + +## If Renaming is Chosen: Migration Plan + +Should the decision be made to rename despite the recommendation, here is the migration plan: + +### Phase 1: Preparation (1 week before) +1. Audit all GitHub Actions and CI/CD workflows +2. Document all external references (blog posts, papers, etc.) +3. Prepare communication for Discord and mailing lists +4. Create redirect documentation + +### Phase 2: Execution (Day of) +1. Perform the rename via GitHub Settings +2. Update `pyproject.toml` repository URL (if needed) +3. Update README.md badge URLs +4. Push updated documentation + +### Phase 3: Communication (Day of + 1 week) +1. Announce on Discord +2. Post on social media +3. Email contributors +4. Update any linked resources + +### Phase 4: Follow-up (1 month) +1. Monitor for broken links +2. Update external documentation (readthedocs, etc.) +3. 
Check Google Search Console for indexing issues + +--- + +## Timeline + +| Milestone | Date | Notes | +|-----------|------|-------| +| Decision | TBD | Pending team discussion | +| If renaming: Preparation | T+0 to T+7 days | Audit and documentation | +| If renaming: Execution | T+7 days | Actual rename | +| If renaming: Stabilization | T+7 to T+30 days | Monitor and fix issues | + +--- + +## Conclusion + +While lowercase repository naming is the Python convention and would create better consistency with sub-packages, the **costs outweigh the benefits** for the main OpenAdapt repository. The recommendation is to **keep the current `OpenAdapt` naming** for the following key reasons: + +1. Brand recognition and identity +2. Risk of breaking external references +3. GitHub URLs are case-insensitive anyway +4. Organization name would remain inconsistent regardless +5. The `pyproject.toml` already uses lowercase, providing programmatic consistency + +If consistency is deemed critical in the future, consider renaming the organization and all repositories together as a single coordinated effort, rather than piecemeal changes. + +--- + +## References + +- [GitHub: Renaming a repository](https://docs.github.com/en/repositories/creating-and-managing-repositories/renaming-a-repository) +- [GitHub Community: How long does GitHub forward renamed repos?](https://github.com/orgs/community/discussions/22669) +- [GitHub Community: Duration of Web Traffic Redirection](https://github.com/orgs/community/discussions/110367) +- [LangChain GitHub](https://github.com/langchain-ai/langchain) +- [Hugging Face Transformers](https://github.com/huggingface/transformers) + +--- + +## Appendix A: Files Containing "OpenAdapt" References + +Key files with the highest occurrence counts: + +| File | Count | Notes | +|------|-------|-------| +| `legacy/CHANGELOG.md` | 911 | Historical, may leave unchanged | +| `README.md` | 21 | Brand mentions, badges | +| `docs/contributing.md` | 18 | Contribution guidelines | +| `legacy/build.py` | 19 | Build scripts | +| `docs/design/landing-page-strategy.md` | 20 | Strategy document | +| `docs/architecture-evolution.md` | 14 | Architecture docs | + +Total: **1,343 occurrences across 78 files** + +--- + +## Appendix B: Sub-package Repository Naming + +All sub-packages follow lowercase convention: + +| Repository | PyPI Package | +|------------|--------------| +| `openadapt-capture` | `openadapt-capture` | +| `openadapt-ml` | `openadapt-ml` | +| `openadapt-evals` | `openadapt-evals` | +| `openadapt-viewer` | `openadapt-viewer` | +| `openadapt-grounding` | `openadapt-grounding` | +| `openadapt-retrieval` | `openadapt-retrieval` | +| `openadapt-privacy` | `openadapt-privacy` | + +This consistency is desirable but not critical enough to justify renaming the main repository. diff --git a/docs/design/telemetry-design.md b/docs/design/telemetry-design.md new file mode 100644 index 000000000..cd0ecc343 --- /dev/null +++ b/docs/design/telemetry-design.md @@ -0,0 +1,895 @@ +# Telemetry Design for OpenAdapt Packages + +## Overview + +This document outlines the design for adding optional telemetry to all OpenAdapt packages. The system is designed to be: + +- **Opt-in by default** (or easily disabled) +- **Privacy-respecting** (no PII, no screenshots, minimal data) +- **Developer-aware** (internal usage tagged for filtering) +- **Unified** (shared module across all packages) + +## Table of Contents + +1. [Service Recommendation](#service-recommendation) +2. [Architecture](#architecture) +3. 
[Implementation Approach](#implementation-approach) +4. [Configuration Options](#configuration-options) +5. [Privacy Considerations](#privacy-considerations) +6. [Internal Usage Tagging](#internal-usage-tagging) +7. [Code Examples](#code-examples) +8. [Migration Plan](#migration-plan) +9. [References](#references) + +--- + +## Service Recommendation + +### Recommendation: GlitchTip (Self-Hosted) + Sentry SDK + +After evaluating both options, we recommend **continuing with GlitchTip** (already in use in the legacy codebase) with the Sentry Python SDK. + +### Comparison + +| Feature | GlitchTip | Sentry | +|---------|-----------|--------| +| **Pricing** | Free (self-hosted) or $15/mo (100K errors) | Free tier limited, paid plans start higher | +| **Self-Hosting** | Simple (4 components: backend, workers, Redis, PostgreSQL) | Complex (12+ components including Kafka, Zookeeper, ClickHouse) | +| **Resource Requirements** | Minimal (1GB RAM, 1 CPU core) | Heavy (requires significant infrastructure) | +| **SDK Compatibility** | Uses Sentry SDK (drop-in compatible) | Native SDK | +| **Open Source** | Fully open source | Partially open source | +| **Features** | Error tracking, uptime monitoring, basic performance | Full APM, session replay, distributed tracing | +| **Privacy** | Self-hosted = full data control | Cloud = data sent to Sentry servers | + +### Rationale + +1. **Existing Integration**: The legacy OpenAdapt codebase already uses GlitchTip (DSN: `app.glitchtip.com`) +2. **Privacy-First**: Self-hosting ensures complete control over sensitive automation data +3. **Cost-Effective**: Free for self-hosted or very affordable cloud option +4. **SDK Compatibility**: Uses the battle-tested Sentry Python SDK +5. **Simplicity**: Easier to deploy and maintain than self-hosted Sentry +6. **Open Source Alignment**: Matches OpenAdapt's open-source philosophy + +### GlitchTip Cloud vs Self-Hosted + +| Option | Pros | Cons | +|--------|------|------| +| **Cloud (glitchtip.com)** | Zero maintenance, instant setup | Monthly cost, data leaves your infrastructure | +| **Self-Hosted** | Free, full data control, customizable | Requires server, maintenance overhead | + +**Recommendation**: Start with GlitchTip Cloud for simplicity, migrate to self-hosted if needed. + +--- + +## Architecture + +### Shared Telemetry Module + +We propose a new package `openadapt-telemetry` that provides a unified telemetry interface for all OpenAdapt packages. 
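+
+As a sketch of the intended developer-facing surface, a consuming package would make a single opt-out-aware initialization call and then use lightweight helpers; the snippet below is illustrative only, and the concrete client, decorators, and configuration are specified under [Code Examples](#code-examples) later in this document:
+
+```python
+# Illustrative consumer-side usage; names match the proposed API below.
+from openadapt_telemetry import get_telemetry, track_errors
+
+# One-time initialization at package import time (no-op if telemetry is disabled).
+get_telemetry().initialize(
+    package_name="openadapt-capture",
+    package_version="0.1.0",
+)
+
+@track_errors(reraise=True)
+def save_capture(path: str) -> None:
+    """Uncaught exceptions here are reported, then re-raised."""
+    ...
+```
+
+The proposed package layout: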
+ +``` +openadapt-telemetry/ +├── src/openadapt_telemetry/ +│ ├── __init__.py # Public API exports +│ ├── config.py # Configuration management +│ ├── client.py # Telemetry client (Sentry wrapper) +│ ├── events.py # Event types and helpers +│ ├── privacy.py # PII filtering and scrubbing +│ └── decorators.py # Convenience decorators +└── pyproject.toml +``` + +### Package Integration + +```mermaid +graph TD + subgraph Packages["OpenAdapt Packages"] + CAP[openadapt-capture] + ML[openadapt-ml] + EVL[openadapt-evals] + VWR[openadapt-viewer] + GRD[openadapt-grounding] + RET[openadapt-retrieval] + PRV[openadapt-privacy] + end + + subgraph Telemetry["Telemetry Layer"] + TEL[openadapt-telemetry] + CONFIG[Config Manager] + FILTER[Privacy Filter] + end + + subgraph Backend["Backend"] + GT[GlitchTip] + end + + CAP --> TEL + ML --> TEL + EVL --> TEL + VWR --> TEL + GRD --> TEL + RET --> TEL + PRV --> TEL + + TEL --> CONFIG + TEL --> FILTER + TEL --> GT +``` + +--- + +## Implementation Approach + +### Option A: Shared Package (Recommended) + +Create `openadapt-telemetry` as a dependency for all packages. + +**Pros:** +- Single source of truth for telemetry logic +- Consistent behavior across all packages +- Easy to update and maintain +- Centralized privacy controls + +**Cons:** +- Additional dependency +- Version coordination required + +### Option B: Per-Package Implementation + +Each package implements its own telemetry. + +**Pros:** +- Package independence +- No cross-package dependencies + +**Cons:** +- Code duplication +- Inconsistent implementations +- Harder to maintain privacy controls + +### Decision: Option A (Shared Package) + +The shared package approach aligns with the meta-package architecture and ensures consistency. + +--- + +## Configuration Options + +### Environment Variables + +```bash +# Primary opt-out mechanism (industry standard) +OPENADAPT_TELEMETRY_ENABLED=false # Disable all telemetry +DO_NOT_TRACK=1 # Universal opt-out (alternative) + +# Internal/developer mode +OPENADAPT_INTERNAL=true # Tag as internal usage +OPENADAPT_DEV=true # Development mode (alternative) + +# Configuration overrides +OPENADAPT_TELEMETRY_DSN= # Custom DSN +OPENADAPT_TELEMETRY_ENVIRONMENT=dev # Environment name +OPENADAPT_TELEMETRY_SAMPLE_RATE=0.1 # Sampling rate (0.0-1.0) +``` + +### Configuration File + +```json +// ~/.config/openadapt/telemetry.json +{ + "enabled": true, + "internal": false, + "dsn": null, + "environment": "production", + "sample_rate": 1.0, + "error_tracking": true, + "performance_tracking": false, + "feature_usage": true +} +``` + +### Priority Order + +1. Environment variables (highest priority) +2. Configuration file +3. 
Package defaults (lowest priority) + +### Default Configuration + +```python +DEFAULTS = { + "enabled": True, # Enabled by default, easy opt-out + "internal": False, # External user by default + "dsn": "https://xxx@app.glitchtip.com/XXXX", + "environment": "production", + "sample_rate": 1.0, # 100% for errors + "traces_sample_rate": 0.01, # 1% for performance + "error_tracking": True, + "performance_tracking": True, + "feature_usage": True, + "send_default_pii": False, # Never send PII by default +} +``` + +--- + +## Privacy Considerations + +### What We Collect (Ethical Data) + +| Category | Data Collected | Purpose | +|----------|---------------|---------| +| **Error Tracking** | Exception type, stack trace, error message | Bug fixing, stability monitoring | +| **Performance** | Function timing, memory usage | Optimization, bottleneck detection | +| **Feature Usage** | Feature names, operation counts | Prioritize development, understand needs | +| **Environment** | OS, Python version, package versions | Compatibility testing, support | +| **Session** | Anonymous session ID, duration | Usage patterns, engagement | + +### What We Never Collect + +| Category | Data NOT Collected | Reason | +|----------|-------------------|--------| +| **PII** | Names, emails, IP addresses | Privacy violation | +| **Screenshots** | Screen captures, images | Highly sensitive | +| **User Content** | Text typed, file contents | Privacy violation | +| **Credentials** | API keys, passwords, tokens | Security risk | +| **File Paths** | Full paths (especially with usernames) | PII leakage | +| **Network Data** | URLs, request bodies | Sensitive information | +| **Biometrics** | Mouse patterns, typing cadence | Privacy violation | + +### PII Scrubbing + +```python +# Automatically scrubbed from all events +PII_DENYLIST = [ + "password", + "secret", + "token", + "api_key", + "authorization", + "cookie", + "session", + "email", + "phone", + "address", + "ssn", + "credit_card", +] + +# Path sanitization +def sanitize_path(path: str) -> str: + """Remove username from file paths.""" + # /Users/john/code/file.py -> /Users//code/file.py + return re.sub(r'/Users/[^/]+/', '/Users//', path) +``` + +### GDPR Compliance + +1. **Consent**: Telemetry is opt-in or easily disabled +2. **Data Minimization**: Collect only necessary data +3. **Purpose Limitation**: Use only for stated purposes +4. **Transparency**: Document what is collected +5. **Right to Erasure**: Provide way to request data deletion +6. **Data Protection**: Self-hosted option for full control + +--- + +## Internal Usage Tagging + +### Tagging Strategy + +Internal OpenAdapt developers and testers should be tagged so their usage can be filtered out when analyzing real user behavior. 
+ +### Detection Methods + +```python +def is_internal_user() -> bool: + """Determine if current usage is from internal team.""" + + # Method 1: Explicit environment variable + if os.getenv("OPENADAPT_INTERNAL", "").lower() in ("true", "1", "yes"): + return True + + # Method 2: Development environment + if os.getenv("OPENADAPT_DEV", "").lower() in ("true", "1", "yes"): + return True + + # Method 3: Not running from executable (dev mode) + if not is_running_from_executable(): + return True + + # Method 4: Git repository present (development checkout) + if Path(".git").exists(): + return True + + # Method 5: Known internal email domain (if user identified) + # Note: Only if user voluntarily provided email + + # Method 6: CI/CD environment + ci_env_vars = ["CI", "GITHUB_ACTIONS", "GITLAB_CI", "JENKINS_URL"] + if any(os.getenv(var) for var in ci_env_vars): + return True + + return False +``` + +### Tag Application + +```python +def get_telemetry_tags() -> dict: + """Get standard tags for all telemetry events.""" + return { + "internal": is_internal_user(), + "environment": get_environment(), + "package_version": get_version(), + "python_version": platform.python_version(), + "os": platform.system(), + "os_version": platform.release(), + } +``` + +### Filtering in GlitchTip + +``` +# Filter out internal usage +tag:internal IS false + +# View only internal usage +tag:internal IS true + +# Combine with environment +tag:environment IS production AND tag:internal IS false +``` + +--- + +## Code Examples + +### Package Installation + +```toml +# pyproject.toml for any OpenAdapt package +[project] +dependencies = [ + "openadapt-telemetry>=0.1.0", +] + +[project.optional-dependencies] +# Telemetry is optional for those who want zero tracking +minimal = [] # Install without telemetry +``` + +### Telemetry Client Implementation + +```python +# src/openadapt_telemetry/client.py +"""Telemetry client for OpenAdapt packages.""" + +from __future__ import annotations + +import os +import platform +from functools import lru_cache +from pathlib import Path +from typing import Any, Callable, Optional + +import sentry_sdk +from sentry_sdk.types import Event, Hint + + +class TelemetryClient: + """Unified telemetry client for all OpenAdapt packages.""" + + _instance: Optional["TelemetryClient"] = None + + def __init__(self): + self._initialized = False + self._enabled = self._check_enabled() + self._internal = self._check_internal() + + @classmethod + def get_instance(cls) -> "TelemetryClient": + """Get singleton instance.""" + if cls._instance is None: + cls._instance = cls() + return cls._instance + + def _check_enabled(self) -> bool: + """Check if telemetry should be enabled.""" + # Universal opt-out + if os.getenv("DO_NOT_TRACK", "").lower() in ("1", "true"): + return False + + # Package-specific opt-out + if os.getenv("OPENADAPT_TELEMETRY_ENABLED", "").lower() in ("false", "0", "no"): + return False + + return True + + def _check_internal(self) -> bool: + """Check if this is internal usage.""" + # Explicit flag + if os.getenv("OPENADAPT_INTERNAL", "").lower() in ("true", "1", "yes"): + return True + + # Development mode + if os.getenv("OPENADAPT_DEV", "").lower() in ("true", "1", "yes"): + return True + + # Git repo present (development checkout) + if Path(".git").exists(): + return True + + # CI environment + ci_vars = ["CI", "GITHUB_ACTIONS", "GITLAB_CI", "JENKINS_URL", "TRAVIS"] + if any(os.getenv(var) for var in ci_vars): + return True + + return False + + def initialize( + self, + dsn: Optional[str] = 
None, + package_name: str = "openadapt", + package_version: str = "unknown", + **kwargs, + ) -> None: + """Initialize the telemetry client.""" + if not self._enabled: + return + + if self._initialized: + return + + dsn = dsn or os.getenv( + "OPENADAPT_TELEMETRY_DSN", + "https://xxx@app.glitchtip.com/XXXX" # Default DSN + ) + + environment = os.getenv("OPENADAPT_TELEMETRY_ENVIRONMENT", "production") + sample_rate = float(os.getenv("OPENADAPT_TELEMETRY_SAMPLE_RATE", "1.0")) + traces_sample_rate = float(os.getenv("OPENADAPT_TELEMETRY_TRACES_SAMPLE_RATE", "0.01")) + + sentry_sdk.init( + dsn=dsn, + environment=environment, + sample_rate=sample_rate, + traces_sample_rate=traces_sample_rate, + send_default_pii=False, + before_send=self._before_send, + before_send_transaction=self._before_send_transaction, + **kwargs, + ) + + # Set default tags + sentry_sdk.set_tag("internal", self._internal) + sentry_sdk.set_tag("package", package_name) + sentry_sdk.set_tag("package_version", package_version) + sentry_sdk.set_tag("python_version", platform.python_version()) + sentry_sdk.set_tag("os", platform.system()) + sentry_sdk.set_tag("os_version", platform.release()) + + self._initialized = True + + def _before_send(self, event: Event, hint: Hint) -> Optional[Event]: + """Filter and sanitize events before sending.""" + # Scrub PII from stack traces + if "exception" in event: + self._scrub_exception(event["exception"]) + + return event + + def _before_send_transaction(self, event: Event, hint: Hint) -> Optional[Event]: + """Filter performance events.""" + return event + + def _scrub_exception(self, exception_data: dict) -> None: + """Remove PII from exception data.""" + if "values" not in exception_data: + return + + for value in exception_data["values"]: + if "stacktrace" in value and "frames" in value["stacktrace"]: + for frame in value["stacktrace"]["frames"]: + # Sanitize file paths + if "filename" in frame: + frame["filename"] = self._sanitize_path(frame["filename"]) + if "abs_path" in frame: + frame["abs_path"] = self._sanitize_path(frame["abs_path"]) + + @staticmethod + def _sanitize_path(path: str) -> str: + """Remove username from file paths.""" + import re + # macOS/Linux: /Users/username/ or /home/username/ + path = re.sub(r'/Users/[^/]+/', '/Users//', path) + path = re.sub(r'/home/[^/]+/', '/home//', path) + # Windows: C:\Users\username\ + path = re.sub(r'C:\\Users\\[^\\]+\\', 'C:\\Users\\\\', path) + return path + + def capture_exception(self, exception: Optional[Exception] = None, **kwargs) -> None: + """Capture an exception.""" + if not self._enabled: + return + sentry_sdk.capture_exception(exception, **kwargs) + + def capture_message(self, message: str, level: str = "info", **kwargs) -> None: + """Capture a message.""" + if not self._enabled: + return + sentry_sdk.capture_message(message, level=level, **kwargs) + + def capture_event(self, event_name: str, properties: Optional[dict] = None) -> None: + """Capture a custom event (feature usage).""" + if not self._enabled: + return + + properties = properties or {} + properties["event_name"] = event_name + sentry_sdk.capture_message( + f"event:{event_name}", + level="info", + extras=properties, + ) + + def set_user(self, user_id: str, **kwargs) -> None: + """Set user context (anonymous ID only).""" + if not self._enabled: + return + sentry_sdk.set_user({"id": user_id, **kwargs}) + + def set_tag(self, key: str, value: str) -> None: + """Set a custom tag.""" + if not self._enabled: + return + sentry_sdk.set_tag(key, value) + + def 
add_breadcrumb(self, message: str, category: str = "default", **kwargs) -> None: + """Add a breadcrumb for context.""" + if not self._enabled: + return + sentry_sdk.add_breadcrumb(message=message, category=category, **kwargs) + + +# Convenience singleton access +def get_telemetry() -> TelemetryClient: + """Get the telemetry client instance.""" + return TelemetryClient.get_instance() +``` + +### Decorator for Function Tracking + +```python +# src/openadapt_telemetry/decorators.py +"""Convenience decorators for telemetry.""" + +import functools +import time +from typing import Callable, Optional + +import sentry_sdk + +from .client import get_telemetry + + +def track_performance(name: Optional[str] = None): + """Decorator to track function performance.""" + def decorator(func: Callable) -> Callable: + operation_name = name or func.__name__ + + @functools.wraps(func) + def wrapper(*args, **kwargs): + telemetry = get_telemetry() + + with sentry_sdk.start_transaction(op="function", name=operation_name): + start = time.perf_counter() + try: + return func(*args, **kwargs) + finally: + duration = time.perf_counter() - start + sentry_sdk.set_measurement("duration_ms", duration * 1000) + + return wrapper + return decorator + + +def track_errors(reraise: bool = True): + """Decorator to automatically capture exceptions.""" + def decorator(func: Callable) -> Callable: + @functools.wraps(func) + def wrapper(*args, **kwargs): + try: + return func(*args, **kwargs) + except Exception as e: + get_telemetry().capture_exception(e) + if reraise: + raise + return wrapper + return decorator + + +def track_feature(feature_name: str): + """Decorator to track feature usage.""" + def decorator(func: Callable) -> Callable: + @functools.wraps(func) + def wrapper(*args, **kwargs): + get_telemetry().capture_event( + f"feature:{feature_name}", + {"function": func.__name__}, + ) + return func(*args, **kwargs) + return wrapper + return decorator +``` + +### Package Integration Example + +```python +# src/openadapt_retrieval/__init__.py +"""OpenAdapt Retrieval - Multimodal demo retrieval.""" + +from openadapt_retrieval.embeddings import ( + BaseEmbedder, + CLIPEmbedder, + Qwen3VLEmbedder, + get_embedder, +) +from openadapt_retrieval.retriever import ( + DemoMetadata, + MultimodalDemoRetriever, + RetrievalResult, + VectorIndex, +) +from openadapt_retrieval.storage import EmbeddingStorage + +__version__ = "0.1.0" + +# Initialize telemetry on import (lazy, respects opt-out) +try: + from openadapt_telemetry import get_telemetry + get_telemetry().initialize( + package_name="openadapt-retrieval", + package_version=__version__, + ) +except ImportError: + # Telemetry package not installed (minimal install) + pass + +__all__ = [ + "BaseEmbedder", + "Qwen3VLEmbedder", + "CLIPEmbedder", + "get_embedder", + "MultimodalDemoRetriever", + "VectorIndex", + "RetrievalResult", + "DemoMetadata", + "EmbeddingStorage", +] +``` + +### Feature Usage Tracking Example + +```python +# In openadapt-retrieval/retriever/demo_retriever.py + +from openadapt_telemetry import get_telemetry, track_feature, track_performance + + +class MultimodalDemoRetriever: + """Retriever for multimodal demo search.""" + + @track_feature("retrieval.add_demo") + def add_demo( + self, + demo_id: str, + task: str, + screenshot: Optional[Union[str, Path, Image.Image]] = None, + **metadata, + ) -> None: + """Add a demo to the retrieval library.""" + # Implementation... 
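+        # Note: the @track_feature decorator above emits a
+        # "feature:retrieval.add_demo" event via get_telemetry().capture_event()
+        # before this body runs; it is a no-op when telemetry is disabled
+        # (DO_NOT_TRACK=1 or OPENADAPT_TELEMETRY_ENABLED=false).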
+ + @track_performance("retrieval.build_index") + def build_index(self) -> None: + """Build the FAISS index from stored demos.""" + try: + # Implementation... + get_telemetry().capture_event( + "retrieval.index_built", + {"num_demos": len(self._demos)}, + ) + except Exception as e: + get_telemetry().capture_exception(e) + raise + + @track_performance("retrieval.search") + def retrieve( + self, + task: str, + screenshot: Optional[Union[str, Path, Image.Image]] = None, + top_k: int = 5, + ) -> List[RetrievalResult]: + """Find similar demos for a given query.""" + # Implementation... +``` + +### CLI Opt-Out Information + +```python +# In CLI help text + +TELEMETRY_HELP = """ +OpenAdapt collects anonymous usage data to improve the software. + +What we collect: + - Error reports (exception types, stack traces) + - Performance metrics (timing, memory usage) + - Feature usage counts (which features are popular) + +What we NEVER collect: + - Screenshots or images + - Text you type or file contents + - Personal information (names, emails, IPs) + - API keys or passwords + +To disable telemetry: + - Set OPENADAPT_TELEMETRY_ENABLED=false + - Or set DO_NOT_TRACK=1 (universal standard) + +For more info: https://docs.openadapt.ai/telemetry +""" +``` + +--- + +## Migration Plan + +### Phase 1: Create Telemetry Package + +1. Create `openadapt-telemetry` package +2. Implement core client with GlitchTip/Sentry SDK +3. Add privacy filtering and scrubbing +4. Write comprehensive tests +5. Publish to PyPI + +### Phase 2: Update Meta-Package + +1. Add `openadapt-telemetry` as optional dependency +2. Update documentation +3. Add CLI telemetry status command + +### Phase 3: Integrate with Packages + +For each package (`capture`, `ml`, `evals`, `viewer`, `grounding`, `retrieval`, `privacy`): + +1. Add `openadapt-telemetry` dependency +2. Initialize telemetry in `__init__.py` +3. Add tracking to key operations +4. Test with telemetry enabled/disabled + +### Phase 4: Legacy Migration + +1. Update legacy error_reporting.py to use new module +2. Migrate PostHog events to unified system +3. 
Deprecate old telemetry code + +### Timeline + +| Phase | Duration | Milestone | +|-------|----------|-----------| +| Phase 1 | 1 week | Telemetry package published | +| Phase 2 | 2 days | Meta-package updated | +| Phase 3 | 2 weeks | All packages integrated | +| Phase 4 | 1 week | Legacy migration complete | + +--- + +## Testing Strategy + +### Unit Tests + +```python +# tests/test_telemetry.py + +import os +from unittest.mock import patch, MagicMock + +import pytest + +from openadapt_telemetry import TelemetryClient, get_telemetry + + +class TestTelemetryOptOut: + """Test that telemetry respects opt-out settings.""" + + def test_do_not_track_env(self): + """DO_NOT_TRACK=1 should disable telemetry.""" + with patch.dict(os.environ, {"DO_NOT_TRACK": "1"}): + client = TelemetryClient() + assert not client._enabled + + def test_explicit_disable(self): + """OPENADAPT_TELEMETRY_ENABLED=false should disable.""" + with patch.dict(os.environ, {"OPENADAPT_TELEMETRY_ENABLED": "false"}): + client = TelemetryClient() + assert not client._enabled + + def test_internal_detection(self): + """Internal users should be detected.""" + with patch.dict(os.environ, {"OPENADAPT_INTERNAL": "true"}): + client = TelemetryClient() + assert client._internal + + +class TestPrivacyScrubbing: + """Test that PII is properly scrubbed.""" + + def test_path_sanitization(self): + """File paths should have usernames removed.""" + client = TelemetryClient() + + assert client._sanitize_path("/Users/john/code/file.py") == "/Users//code/file.py" + assert client._sanitize_path("/home/alice/app/main.py") == "/home//app/main.py" + assert client._sanitize_path("C:\\Users\\bob\\code\\file.py") == "C:\\Users\\\\code\\file.py" +``` + +--- + +## References + +### GlitchTip + +- [GlitchTip Documentation](https://glitchtip.com/documentation/) +- [GlitchTip Installation Guide](https://glitchtip.com/documentation/install/) +- [Sentry SDK Documentation (GlitchTip compatible)](https://glitchtip.com/sdkdocs/python/) + +### Privacy & Ethics + +- [GDPR Telemetry Data Guidelines](https://www.activemind.legal/guides/telemetry-data/) +- [Linux Foundation Telemetry Policy](https://www.linuxfoundation.org/legal/telemetry-data-policy) +- [OpenTelemetry Handling Sensitive Data](https://opentelemetry.io/docs/security/handling-sensitive-data/) + +### Industry Standards + +- [DO_NOT_TRACK Environment Variable](https://consoledonottrack.com/) +- [Kedro Telemetry Plugin](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry) + +### Sentry SDK + +- [Sentry Python SDK](https://docs.sentry.io/platforms/python/) +- [Sentry Filtering](https://docs.sentry.io/platforms/python/configuration/filtering/) +- [Sentry Tags](https://docs.sentry.io/platforms/python/enriching-events/tags/) + +--- + +## Appendix: Configuration Reference + +### All Environment Variables + +| Variable | Default | Description | +|----------|---------|-------------| +| `DO_NOT_TRACK` | - | Universal opt-out (1 = disabled) | +| `OPENADAPT_TELEMETRY_ENABLED` | `true` | Enable/disable telemetry | +| `OPENADAPT_INTERNAL` | `false` | Tag as internal usage | +| `OPENADAPT_DEV` | `false` | Development mode | +| `OPENADAPT_TELEMETRY_DSN` | (default) | GlitchTip DSN | +| `OPENADAPT_TELEMETRY_ENVIRONMENT` | `production` | Environment name | +| `OPENADAPT_TELEMETRY_SAMPLE_RATE` | `1.0` | Error sampling rate | +| `OPENADAPT_TELEMETRY_TRACES_SAMPLE_RATE` | `0.01` | Performance sampling rate | + +### DSN Configuration + +The DSN (Data Source Name) should be stored securely and not committed to 
version control: + +```bash +# Development (use separate project) +export OPENADAPT_TELEMETRY_DSN="https://xxx@app.glitchtip.com/dev-project" + +# Production (use production project) +export OPENADAPT_TELEMETRY_DSN="https://xxx@app.glitchtip.com/prod-project" + +# Self-hosted +export OPENADAPT_TELEMETRY_DSN="https://xxx@glitchtip.your-domain.com/project" +``` diff --git a/docs/design/tray-logging.md b/docs/design/tray-logging.md new file mode 100644 index 000000000..b8ae7d937 --- /dev/null +++ b/docs/design/tray-logging.md @@ -0,0 +1,801 @@ +# OpenAdapt Tray: Logging & Action Storage + +This document supplements the main `openadapt-tray` design document with detailed specifications for logging, action history, telemetry integration, and storage considerations. + +## Table of Contents + +1. [Local Logging](#local-logging) +2. [Action History](#action-history) +3. [Telemetry Integration](#telemetry-integration) +4. [Privacy Considerations](#privacy-considerations) +5. [Storage Locations](#storage-locations) +6. [Integration with Existing Packages](#integration-with-existing-packages) + +--- + +## Local Logging + +### Platform-Specific Log Paths + +The tray application stores logs in platform-appropriate locations following OS conventions: + +| Platform | Log Directory | +|----------|---------------| +| macOS | `~/Library/Application Support/OpenAdapt/logs/` | +| Windows | `%APPDATA%/OpenAdapt/logs/` | +| Linux | `~/.local/share/openadapt/logs/` | + +### Log File Naming + +``` +openadapt-tray.log # Current log file +openadapt-tray.log.1 # Previous rotation (newest) +openadapt-tray.log.2 # Older rotation +... +openadapt-tray.log.5 # Oldest rotation +``` + +### Log Rotation Policy + +| Setting | Value | Rationale | +|---------|-------|-----------| +| **Max File Size** | 10 MB | Prevents disk space issues | +| **Max Backup Count** | 5 files | ~50 MB total log storage | +| **Rotation Trigger** | Size-based | Predictable disk usage | +| **Compression** | gzip for backups | Reduces storage footprint | + +### Log Retention Policy + +- **Active logs**: Rotated based on size (10 MB threshold) +- **Rotated logs**: Kept for 30 days or 5 rotations, whichever comes first +- **Crash logs**: Retained for 90 days for debugging +- **Automatic cleanup**: Old logs purged on app startup + +### Log Levels + +| Environment | Level | Description | +|-------------|-------|-------------| +| **Production** | `INFO` | Normal operations, errors, warnings | +| **Debug** | `DEBUG` | Verbose output including state changes | +| **Trace** | `TRACE` | Extremely verbose, including IPC messages | + +### Log Level Configuration + +```python +# Environment variable override +OPENADAPT_TRAY_LOG_LEVEL=DEBUG + +# Or via config.json +{ + "logging": { + "level": "INFO", + "console": false, + "file": true + } +} +``` + +### Log Format + +``` +2024-01-15 10:30:45.123 | INFO | tray.main:start_recording:42 - Recording session started +2024-01-15 10:30:45.456 | DEBUG | tray.menu:update_state:78 - Menu state updated: recording=True +2024-01-15 10:31:12.789 | ERROR | tray.capture:on_error:156 - Capture failed: Permission denied +``` + +Format specification: +``` +{timestamp} | {level:8} | {module}:{function}:{line} - {message} +``` + +### Implementation Example + +```python +import logging +from logging.handlers import RotatingFileHandler +from pathlib import Path +import platform +import sys + +def get_log_directory() -> Path: + """Get platform-appropriate log directory.""" + if platform.system() == "Darwin": + base = Path.home() / 
"Library" / "Application Support" + elif platform.system() == "Windows": + base = Path(os.environ.get("APPDATA", Path.home() / "AppData" / "Roaming")) + else: # Linux and others + base = Path(os.environ.get("XDG_DATA_HOME", Path.home() / ".local" / "share")) + + log_dir = base / "OpenAdapt" / "logs" + log_dir.mkdir(parents=True, exist_ok=True) + return log_dir + +def setup_logging(level: str = "INFO") -> logging.Logger: + """Configure logging for the tray application.""" + logger = logging.getLogger("openadapt.tray") + logger.setLevel(getattr(logging, level.upper())) + + # File handler with rotation + log_file = get_log_directory() / "openadapt-tray.log" + file_handler = RotatingFileHandler( + log_file, + maxBytes=10 * 1024 * 1024, # 10 MB + backupCount=5, + encoding="utf-8", + ) + file_handler.setFormatter(logging.Formatter( + "{asctime} | {levelname:8} | {name}:{funcName}:{lineno} - {message}", + style="{", + datefmt="%Y-%m-%d %H:%M:%S", + )) + logger.addHandler(file_handler) + + return logger +``` + +--- + +## Action History + +### Overview + +The tray app maintains a local history of user interactions for: +- Auditing user actions +- Supporting undo/redo functionality +- Debugging session issues +- Syncing state with other OpenAdapt components + +### Tracked Actions + +| Action Type | Data Captured | Purpose | +|-------------|---------------|---------| +| `recording.start` | timestamp, task_name, settings | Session tracking | +| `recording.stop` | timestamp, duration, frame_count | Session completion | +| `recording.pause` | timestamp | Session state | +| `recording.resume` | timestamp | Session state | +| `training.start` | timestamp, model_type, demo_ids | Training tracking | +| `training.complete` | timestamp, duration, success | Training outcomes | +| `training.cancel` | timestamp, reason | Training interruptions | +| `settings.changed` | key, old_value, new_value | Configuration audit | +| `app.start` | timestamp, version, os_info | Lifecycle tracking | +| `app.stop` | timestamp, exit_reason | Lifecycle tracking | +| `error.occurred` | timestamp, error_type, context | Error tracking | + +### Storage Format + +Action history is stored in a local SQLite database for efficient querying and reliable storage. 
+ +#### Database Schema + +```sql +-- Action history table +CREATE TABLE action_history ( + id INTEGER PRIMARY KEY AUTOINCREMENT, + timestamp TEXT NOT NULL, -- ISO 8601 format + action_type TEXT NOT NULL, -- e.g., 'recording.start' + session_id TEXT, -- Groups related actions + data TEXT, -- JSON blob for action-specific data + synced INTEGER DEFAULT 0, -- Sync status with capture DB + created_at TEXT DEFAULT CURRENT_TIMESTAMP +); + +-- Index for common queries +CREATE INDEX idx_action_timestamp ON action_history(timestamp); +CREATE INDEX idx_action_type ON action_history(action_type); +CREATE INDEX idx_session_id ON action_history(session_id); +CREATE INDEX idx_synced ON action_history(synced); + +-- Session metadata table +CREATE TABLE sessions ( + id TEXT PRIMARY KEY, -- UUID + task_name TEXT, + started_at TEXT NOT NULL, + ended_at TEXT, + status TEXT DEFAULT 'active', -- active, completed, cancelled, error + frame_count INTEGER DEFAULT 0, + duration_seconds REAL, + capture_db_id TEXT -- Reference to openadapt-capture DB +); +``` + +#### Example Records + +```json +{ + "id": 1, + "timestamp": "2024-01-15T10:30:45.123Z", + "action_type": "recording.start", + "session_id": "550e8400-e29b-41d4-a716-446655440000", + "data": { + "task_name": "Fill out expense report", + "settings": { + "capture_screenshots": true, + "capture_audio": false, + "fps": 1 + } + }, + "synced": 0 +} +``` + +### Sync with openadapt-capture Database + +The tray app synchronizes action history with the capture package's database to maintain a unified record: + +```python +from pathlib import Path +import sqlite3 +from typing import Optional +import json + +class ActionHistorySync: + """Sync tray action history with capture database.""" + + def __init__(self, tray_db_path: Path, capture_db_path: Optional[Path] = None): + self.tray_db = tray_db_path + self.capture_db = capture_db_path + + def sync_session(self, session_id: str) -> bool: + """Sync a completed session to capture database.""" + if not self.capture_db or not self.capture_db.exists(): + return False + + with sqlite3.connect(self.tray_db) as tray_conn: + # Get unsynced actions for this session + actions = tray_conn.execute( + """ + SELECT id, timestamp, action_type, data + FROM action_history + WHERE session_id = ? AND synced = 0 + ORDER BY timestamp + """, + (session_id,) + ).fetchall() + + if not actions: + return True + + with sqlite3.connect(self.capture_db) as capture_conn: + # Insert into capture database's session_events table + for action_id, timestamp, action_type, data in actions: + capture_conn.execute( + """ + INSERT INTO session_events (timestamp, event_type, data, source) + VALUES (?, ?, ?, 'tray') + """, + (timestamp, action_type, data) + ) + capture_conn.commit() + + # Mark as synced + with sqlite3.connect(self.tray_db) as tray_conn: + tray_conn.executemany( + "UPDATE action_history SET synced = 1 WHERE id = ?", + [(a[0],) for a in actions] + ) + tray_conn.commit() + + return True +``` + +### Retention Policy + +| Data Type | Retention Period | Rationale | +|-----------|------------------|-----------| +| Action history | 90 days | Debugging and audit trail | +| Session metadata | 1 year | Long-term usage patterns | +| Synced records | 30 days (then delete) | Reduce redundancy | + +--- + +## Telemetry Integration + +### Reference Design + +For detailed telemetry implementation, see the comprehensive telemetry design at [docs/design/telemetry-design.md](./telemetry-design.md). 
+ +### GlitchTip/Sentry Integration + +The tray app uses the shared `openadapt-telemetry` module for crash reporting and error tracking. + +```python +# Initialize telemetry in tray app +from openadapt_telemetry import get_telemetry + +def init_app(): + """Initialize the tray application.""" + telemetry = get_telemetry() + telemetry.initialize( + package_name="openadapt-tray", + package_version=__version__, + ) +``` + +### Error and Crash Reporting + +```python +from openadapt_telemetry import get_telemetry, track_errors + +class TrayApp: + @track_errors(reraise=True) + def start_recording(self, task_name: str) -> None: + """Start a recording session.""" + try: + # Recording logic... + pass + except PermissionError as e: + get_telemetry().capture_exception(e, tags={ + "action": "start_recording", + "platform": platform.system(), + }) + raise +``` + +### Anonymous Usage Analytics (Opt-In) + +Usage analytics are strictly opt-in and collect only aggregate, non-identifying data. + +#### Events Tracked + +| Event | Data Collected | Purpose | +|-------|----------------|---------| +| `tray.app_start` | timestamp, version, os, internal_flag | App lifecycle | +| `tray.app_stop` | timestamp, uptime_seconds, exit_reason | App lifecycle | +| `tray.recording_session` | duration_seconds, success, frame_count | Usage patterns | +| `tray.training_initiated` | model_type, demo_count | Feature usage | +| `tray.error` | error_type (no message), context | Error patterns | + +#### Event Implementation + +```python +from openadapt_telemetry import get_telemetry + +def track_recording_session(duration: float, success: bool, frame_count: int): + """Track recording session metrics (opt-in only).""" + telemetry = get_telemetry() + + if not telemetry.is_analytics_enabled(): + return + + telemetry.capture_event( + "tray.recording_session", + { + "duration_seconds": round(duration, 1), + "success": success, + "frame_count_bucket": bucket_count(frame_count), # 0-10, 10-50, 50-100, 100+ + } + ) + +def bucket_count(count: int) -> str: + """Bucket counts to avoid exact numbers (privacy).""" + if count <= 10: + return "0-10" + elif count <= 50: + return "10-50" + elif count <= 100: + return "50-100" + else: + return "100+" +``` + +--- + +## Privacy Considerations + +### Core Principles + +1. **Local-First**: All data stored locally by default +2. **No PII**: Never collect personally identifiable information +3. **No Content**: Never collect screenshots, recordings, or user input +4. **Explicit Consent**: Cloud sync and analytics require opt-in +5. 
**Transparency**: Users can inspect all stored data + +### What Is Never Collected or Transmitted + +| Data Type | Reason | +|-----------|--------| +| Screenshots | Highly sensitive, potential PII | +| Recorded actions | Contains user behavior data | +| Typed text | PII and sensitive content | +| File paths with usernames | PII leakage | +| IP addresses | Location identification | +| Hardware identifiers | Device fingerprinting | +| Window titles | May contain sensitive info | + +### Opt-In/Opt-Out Settings + +```json +// config.json +{ + "telemetry": { + "crash_reporting": true, // Enabled by default, can disable + "anonymous_analytics": false, // Disabled by default, opt-in + "cloud_sync": false // Disabled by default, opt-in + } +} +``` + +### Settings UI Integration + +The tray app settings menu should include clear telemetry controls: + +``` +Settings > Privacy +├── [x] Send crash reports (helps improve stability) +├── [ ] Share anonymous usage statistics +├── [ ] Sync settings across devices +└── [View collected data...] -> Opens local data directory +``` + +### Data Inspection + +Users can inspect all locally stored data: + +```python +def open_data_directory(): + """Open the OpenAdapt data directory in file explorer.""" + import subprocess + import platform + + data_dir = get_data_directory() + + if platform.system() == "Darwin": + subprocess.run(["open", str(data_dir)]) + elif platform.system() == "Windows": + subprocess.run(["explorer", str(data_dir)]) + else: + subprocess.run(["xdg-open", str(data_dir)]) +``` + +### Data Deletion + +Users can delete all local data: + +```python +def clear_all_data(keep_config: bool = True): + """Delete all OpenAdapt local data.""" + data_dir = get_data_directory() + + for item in data_dir.iterdir(): + if keep_config and item.name == "config.json": + continue + if item.is_dir(): + shutil.rmtree(item) + else: + item.unlink() + + logger.info("All local data cleared") +``` + +--- + +## Storage Locations + +### Directory Structure + +``` +macOS: ~/Library/Application Support/OpenAdapt/ +Windows: %APPDATA%/OpenAdapt/ +Linux: ~/.local/share/openadapt/ + +Contents: +├── logs/ # Application logs +│ ├── openadapt-tray.log # Current tray app log +│ ├── openadapt-tray.log.1 # Rotated logs +│ └── crash/ # Crash dumps +├── config.json # User settings and preferences +├── history.db # Action history (SQLite) +├── cache/ # Temporary files +│ ├── icons/ # Cached tray icons +│ └── temp/ # Temporary processing files +└── state/ # Persistent state + └── session.json # Current session state (for crash recovery) +``` + +### Storage Path Resolution + +```python +import os +import platform +from pathlib import Path +from typing import Dict + +def get_storage_paths() -> Dict[str, Path]: + """Get all storage paths for the current platform.""" + + if platform.system() == "Darwin": + base = Path.home() / "Library" / "Application Support" / "OpenAdapt" + elif platform.system() == "Windows": + appdata = os.environ.get("APPDATA", Path.home() / "AppData" / "Roaming") + base = Path(appdata) / "OpenAdapt" + else: # Linux and others + xdg_data = os.environ.get("XDG_DATA_HOME", Path.home() / ".local" / "share") + base = Path(xdg_data) / "openadapt" + + paths = { + "base": base, + "logs": base / "logs", + "crash_logs": base / "logs" / "crash", + "config": base / "config.json", + "history_db": base / "history.db", + "cache": base / "cache", + "state": base / "state", + } + + # Ensure directories exist + for key, path in paths.items(): + if key not in ("config", "history_db"): # 
Don't create files + path.mkdir(parents=True, exist_ok=True) + + return paths +``` + +### Config File Schema + +```json +{ + "$schema": "https://openadapt.ai/schemas/tray-config-v1.json", + "version": 1, + "logging": { + "level": "INFO", + "console": false, + "file": true, + "max_size_mb": 10, + "backup_count": 5 + }, + "telemetry": { + "crash_reporting": true, + "anonymous_analytics": false, + "cloud_sync": false + }, + "recording": { + "default_fps": 1, + "capture_audio": false, + "capture_screenshots": true, + "auto_pause_on_idle": true, + "idle_threshold_seconds": 30 + }, + "ui": { + "show_notifications": true, + "start_minimized": false, + "start_on_login": false + }, + "advanced": { + "capture_db_path": null, + "ml_model_path": null + } +} +``` + +--- + +## Integration with Existing Packages + +### Shared Telemetry Module + +The tray app uses the shared `openadapt-telemetry` module (see [telemetry-design.md](./telemetry-design.md)) for consistent telemetry across all OpenAdapt packages. + +```python +# pyproject.toml +[project] +dependencies = [ + "openadapt-telemetry>=0.1.0", +] +``` + +### Coordination with openadapt-capture + +The tray app coordinates with `openadapt-capture` for recording functionality: + +```python +from openadapt_capture import RecordingSession, CaptureConfig +from openadapt_tray.history import ActionHistory + +class TrayRecordingController: + """Bridge between tray UI and capture backend.""" + + def __init__(self): + self.history = ActionHistory() + self.current_session: Optional[RecordingSession] = None + + def start_recording(self, task_name: str, config: CaptureConfig) -> str: + """Start a new recording session.""" + import uuid + + session_id = str(uuid.uuid4()) + + # Log to action history + self.history.log_action( + action_type="recording.start", + session_id=session_id, + data={"task_name": task_name, "config": config.to_dict()} + ) + + # Start capture backend + self.current_session = RecordingSession( + session_id=session_id, + task_name=task_name, + config=config, + on_error=self._on_capture_error, + ) + self.current_session.start() + + return session_id + + def stop_recording(self) -> dict: + """Stop the current recording session.""" + if not self.current_session: + return {"error": "No active session"} + + result = self.current_session.stop() + + # Log completion + self.history.log_action( + action_type="recording.stop", + session_id=self.current_session.session_id, + data={ + "duration": result.duration, + "frame_count": result.frame_count, + "success": result.success, + } + ) + + # Sync with capture database + self.history.sync_session(self.current_session.session_id) + + self.current_session = None + return result.to_dict() + + def _on_capture_error(self, error: Exception): + """Handle capture errors.""" + get_telemetry().capture_exception(error) + self.history.log_action( + action_type="error.occurred", + session_id=self.current_session.session_id if self.current_session else None, + data={"error_type": type(error).__name__} + ) +``` + +### Surfacing Training Logs from openadapt-ml + +The tray app can display training progress and logs from the ML package: + +```python +from openadapt_ml import TrainingJob, TrainingStatus +from openadapt_tray.notifications import show_notification + +class TrayTrainingController: + """Bridge between tray UI and ML training backend.""" + + def __init__(self): + self.history = ActionHistory() + self.current_job: Optional[TrainingJob] = None + + def start_training(self, model_type: str, demo_ids: list[str]) -> str: 
+ """Start a training job.""" + job_id = str(uuid.uuid4()) + + self.history.log_action( + action_type="training.start", + data={ + "job_id": job_id, + "model_type": model_type, + "demo_count": len(demo_ids), + } + ) + + self.current_job = TrainingJob( + job_id=job_id, + model_type=model_type, + demo_ids=demo_ids, + on_progress=self._on_training_progress, + on_complete=self._on_training_complete, + on_error=self._on_training_error, + ) + self.current_job.start() + + return job_id + + def _on_training_progress(self, progress: float, message: str): + """Handle training progress updates.""" + # Update tray icon or menu with progress + pass + + def _on_training_complete(self, result: TrainingStatus): + """Handle training completion.""" + self.history.log_action( + action_type="training.complete", + data={ + "job_id": self.current_job.job_id, + "duration": result.duration, + "success": result.success, + } + ) + + show_notification( + title="Training Complete", + message=f"Model trained successfully in {result.duration:.1f}s" + ) + + # Track telemetry (anonymous) + get_telemetry().capture_event( + "tray.training_complete", + {"model_type": self.current_job.model_type, "success": True} + ) + + def _on_training_error(self, error: Exception): + """Handle training errors.""" + get_telemetry().capture_exception(error) + + self.history.log_action( + action_type="training.error", + data={ + "job_id": self.current_job.job_id if self.current_job else None, + "error_type": type(error).__name__, + } + ) + + show_notification( + title="Training Failed", + message="An error occurred during training. Check logs for details." + ) +``` + +### Log Aggregation View + +The tray app can provide a unified view of logs from all OpenAdapt components: + +```python +from pathlib import Path +from typing import Iterator, NamedTuple +from datetime import datetime + +class LogEntry(NamedTuple): + timestamp: datetime + level: str + source: str # tray, capture, ml, etc. + message: str + +def aggregate_logs(max_entries: int = 1000) -> Iterator[LogEntry]: + """Aggregate logs from all OpenAdapt components.""" + + log_sources = { + "tray": get_storage_paths()["logs"] / "openadapt-tray.log", + "capture": get_capture_log_path(), # From openadapt-capture + "ml": get_ml_log_path(), # From openadapt-ml + } + + entries = [] + + for source, log_path in log_sources.items(): + if not log_path.exists(): + continue + + with open(log_path, "r") as f: + for line in f: + try: + entry = parse_log_line(line, source) + if entry: + entries.append(entry) + except Exception: + continue + + # Sort by timestamp and return most recent + entries.sort(key=lambda e: e.timestamp, reverse=True) + return iter(entries[:max_entries]) +``` + +--- + +## Summary + +This document defines the logging and storage architecture for the OpenAdapt tray application: + +1. **Local Logging**: Platform-specific paths with rotation and retention policies +2. **Action History**: SQLite-based storage for user interactions, synced with capture database +3. **Telemetry**: Integration with shared telemetry module for crash reporting and opt-in analytics +4. **Privacy**: Local-first approach with no PII collection and clear opt-in/opt-out controls +5. **Storage**: Organized directory structure following OS conventions +6. **Integration**: Seamless coordination with capture, ML, and telemetry packages + +For telemetry implementation details, refer to the comprehensive [telemetry design document](./telemetry-design.md). 
diff --git a/docs/getting-started/quickstart.md b/docs/getting-started/quickstart.md index 3ac7ccc46..29e139fdf 100644 --- a/docs/getting-started/quickstart.md +++ b/docs/getting-started/quickstart.md @@ -1,15 +1,15 @@ # Quick Start -This guide walks you through recording a demonstration, training a model, and evaluating it. +This guide walks you through collecting a demonstration, learning a policy, and evaluating the agent. ## Prerequisites - OpenAdapt installed with required packages: `pip install openadapt[all]` - macOS users: [Grant required permissions](permissions.md) -## 1. Record a Demonstration +## 1. Collect a Demonstration -Start recording your screen and inputs: +Start capturing your screen and inputs: ```bash openadapt capture start --name my-task @@ -22,14 +22,14 @@ Now perform the task you want to automate: 3. Navigate menus 4. Complete your workflow -When finished, stop recording: +When finished, stop the capture: ```bash # Press Ctrl+C in the terminal, or: openadapt capture stop ``` -## 2. View the Recording +## 2. View the Trajectory Inspect what was captured: @@ -37,15 +37,15 @@ Inspect what was captured: openadapt capture view my-task ``` -This opens an HTML viewer showing: +This opens a trajectory viewer showing: -- Screenshots at each step -- Mouse and keyboard events +- Observations (screenshots) at each step +- Actions (mouse and keyboard events) - Timing information -## 3. List Your Captures +## 3. List Your Demonstrations -See all recorded demonstrations: +See all collected demonstrations: ```bash openadapt capture list @@ -59,25 +59,25 @@ my-task 45 2m 30s 2026-01-16 login-demo 23 1m 15s 2026-01-15 ``` -## 4. Train a Model +## 4. Learn a Policy -Train a model on your recorded demonstration: +Learn an agent policy from your demonstration trajectory: ```bash openadapt train start --capture my-task --model qwen3vl-2b ``` -Monitor training progress: +Monitor policy learning progress: ```bash openadapt train status ``` -Training creates a checkpoint file in `training_output/`. +Policy learning creates a checkpoint file in `training_output/`. -## 5. Evaluate the Model +## 5. Evaluate the Agent -Test your trained model on a benchmark: +Test your trained policy on a benchmark: ```bash openadapt eval run --checkpoint training_output/model.pt --benchmark waa @@ -103,7 +103,7 @@ openadapt eval run --agent api-claude --benchmark waa ## Complete Workflow Example -Here is a complete example from start to finish: +Here is a complete example demonstrating the full pipeline: ```bash # 1. Install OpenAdapt @@ -112,21 +112,21 @@ pip install openadapt[all] # 2. Check system requirements openadapt doctor -# 3. Record a task +# 3. Collect a demonstration openadapt capture start --name email-reply # ... perform the task ... # Press Ctrl+C to stop -# 4. View the recording +# 4. View the trajectory openadapt capture view email-reply -# 5. Train a model +# 5. Learn a policy openadapt train start --capture email-reply --model qwen3vl-2b -# 6. Wait for training to complete +# 6. Wait for policy learning to complete openadapt train status -# 7. Evaluate +# 7. Evaluate the agent openadapt eval run --checkpoint training_output/model.pt --benchmark waa ``` diff --git a/docs/index.md b/docs/index.md index 4aeb4c448..99f8bc665 100644 --- a/docs/index.md +++ b/docs/index.md @@ -4,7 +4,7 @@ OpenAdapt is the **open** source software **adapt**er between Large Multimodal Models (LMMs) and traditional desktop and web GUIs. 
-Record GUI demonstrations, train ML models, and evaluate agents - all from a unified CLI. +Collect human demonstrations, learn agent policies, and evaluate autonomous execution - all from a unified CLI. [Join Discord](https://discord.gg/yF527cQbDG){ .md-button .md-button--primary } [View on GitHub](https://github.com/OpenAdaptAI/OpenAdapt){ .md-button } @@ -15,24 +15,24 @@ Record GUI demonstrations, train ML models, and evaluate agents - all from a uni OpenAdapt bridges the gap between powerful AI models and everyday software automation. Instead of writing complex scripts or learning APIs, you simply: -1. **Record** - Demonstrate a task by doing it yourself -2. **Train** - Let OpenAdapt learn from your demonstration -3. **Deploy** - Run your trained agent to automate the task -4. **Evaluate** - Measure performance on standardized benchmarks +1. **Demonstrate** - Show the agent how to perform a task by doing it yourself +2. **Learn** - Let OpenAdapt learn an agent policy from your demonstration trajectory +3. **Execute** - Deploy your trained agent to autonomously perform the task +4. **Evaluate** - Measure agent performance on standardized benchmarks ```mermaid flowchart LR - subgraph Record["1. Record"] - A[User Demo] --> B[Capture] + subgraph Demonstrate["1. Demonstrate"] + A[Human Trajectory] --> B[Capture] end - subgraph Train["2. Train"] - B --> C[ML Model] + subgraph Learn["2. Learn"] + B --> C[Policy Learning] end - subgraph Deploy["3. Deploy"] - C --> D[Agent Policy] - D --> E[Action Replay] + subgraph Execute["3. Execute"] + C --> D[Trained Policy] + D --> E[Agent Deployment] end subgraph Evaluate["4. Evaluate"] @@ -53,7 +53,7 @@ flowchart LR Works with any Large Multimodal Model - Claude, GPT-4V, Gemini, Qwen-VL, or your own fine-tuned models. ### Learn from Demonstration -No prompting required. OpenAdapt learns directly from how you perform tasks, automatically generating the right prompts. +No manual prompt engineering required. OpenAdapt learns agent policies directly from your demonstration trajectories. ### Universal GUI Support Works with all desktop GUIs including native applications, web browsers, and virtualized environments. @@ -71,14 +71,14 @@ Install OpenAdapt with the features you need: pip install openadapt[all] # Everything ``` -Record a demonstration: +Collect a demonstration: ```bash openadapt capture start --name my-task # Perform your task, then press Ctrl+C ``` -Train a model: +Learn a policy: ```bash openadapt train start --capture my-task --model qwen3vl-2b @@ -100,12 +100,12 @@ OpenAdapt v1.0+ uses a **modular meta-package architecture**. 
The main `openadap | Package | Description | |---------|-------------| -| [openadapt-capture](packages/capture.md) | Event recording and storage | -| [openadapt-ml](packages/ml.md) | ML engine, training, inference | +| [openadapt-capture](packages/capture.md) | Demonstration collection and storage | +| [openadapt-ml](packages/ml.md) | Policy learning, training, inference | | [openadapt-evals](packages/evals.md) | Benchmark evaluation | -| [openadapt-viewer](packages/viewer.md) | HTML visualization | -| [openadapt-grounding](packages/grounding.md) | UI element localization | -| [openadapt-retrieval](packages/retrieval.md) | Multimodal demo retrieval | +| [openadapt-viewer](packages/viewer.md) | Trajectory visualization | +| [openadapt-grounding](packages/grounding.md) | UI element grounding | +| [openadapt-retrieval](packages/retrieval.md) | Trajectory retrieval | | [openadapt-privacy](packages/privacy.md) | PII/PHI scrubbing | See the full [Architecture Documentation](architecture.md) for detailed diagrams. diff --git a/docs/packages/capture.md b/docs/packages/capture.md index 67499f3fa..b27b6d846 100644 --- a/docs/packages/capture.md +++ b/docs/packages/capture.md @@ -1,6 +1,6 @@ # openadapt-capture -GUI recording, event capture, and storage. +Demonstration collection, observation-action capture, and storage. **Repository**: [OpenAdaptAI/openadapt-capture](https://github.com/OpenAdaptAI/openadapt-capture) @@ -14,17 +14,17 @@ pip install openadapt-capture ## Overview -The capture package records user interactions with desktop and web GUIs, including: +The capture package collects human demonstrations from desktop and web GUIs, including: -- Screenshots at configurable intervals -- Mouse events (clicks, movement, scrolling) -- Keyboard events (key presses, text input) +- Observations (screenshots) at configurable intervals +- Actions: mouse events (clicks, movement, scrolling) +- Actions: keyboard events (key presses, text input) - Window and application context -- Timing information +- Timing information for trajectory reconstruction ## CLI Commands -### Start Recording +### Start Demonstration Collection ```bash openadapt capture start --name my-task @@ -37,27 +37,27 @@ Options: - `--no-screenshots` - Disable screenshot capture - `--no-keyboard` - Disable keyboard capture -### Stop Recording +### Stop Demonstration Collection ```bash openadapt capture stop ``` -Or press `Ctrl+C` in the recording terminal. +Or press `Ctrl+C` in the capture terminal. -### List Captures +### List Demonstrations ```bash openadapt capture list ``` -### View a Capture +### View a Demonstration Trajectory ```bash openadapt capture view my-task ``` -### Delete a Capture +### Delete a Demonstration ```bash openadapt capture delete my-task @@ -75,41 +75,41 @@ session = CaptureSession(name="my-task") recorder = Recorder(session) recorder.start() -# ... user performs actions ... +# ... user demonstrates the task ... 
# Stop recording recorder.stop() -# Access captured data -events = session.get_events() -screenshots = session.get_screenshots() +# Access captured trajectory data +actions = session.get_actions() +observations = session.get_observations() # screenshots ``` ## Data Format -Captures are stored as JSON/Parquet files: +Demonstrations are stored as JSON/Parquet files: ``` -captures/ +demonstrations/ my-task/ metadata.json # Session metadata - events.parquet # Event data - screenshots/ # Screenshot images + actions.parquet # Action data (observation-action pairs) + observations/ # Screenshot images (observations) 0001.png 0002.png ... ``` -### Event Schema +### Action Schema ```python { - "timestamp": float, # Unix timestamp - "type": str, # "mouse_click", "key_press", etc. + "timestamp": float, # Unix timestamp + "action_type": str, # "click", "type", "scroll", etc. "data": { - # Event-specific data + # Action-specific data }, - "screenshot_id": int # Reference to screenshot + "observation_id": int # Reference to observation (screenshot) } ``` @@ -117,11 +117,11 @@ captures/ | Export | Description | |--------|-------------| -| `CaptureSession` | Manages a capture session | -| `Recorder` | Records user interactions | +| `CaptureSession` | Manages a demonstration collection session | +| `Recorder` | Captures observation-action pairs | | `Action` | Represents a user action | -| `MouseEvent` | Mouse event data | -| `KeyboardEvent` | Keyboard event data | +| `Observation` | Represents an observation (screenshot) | +| `Trajectory` | Sequence of observation-action pairs | ## Platform Support @@ -133,6 +133,6 @@ captures/ ## Related Packages -- [openadapt-privacy](privacy.md) - Scrub PII/PHI from captures -- [openadapt-viewer](viewer.md) - Visualize capture data -- [openadapt-ml](ml.md) - Train models on captures +- [openadapt-privacy](privacy.md) - Scrub PII/PHI from demonstrations +- [openadapt-viewer](viewer.md) - Visualize trajectories +- [openadapt-ml](ml.md) - Learn policies from demonstrations diff --git a/docs/packages/evals.md b/docs/packages/evals.md index 84f5fa4a7..d861f93a6 100644 --- a/docs/packages/evals.md +++ b/docs/packages/evals.md @@ -26,7 +26,7 @@ The evals package provides: ### Run Evaluation ```bash -# Evaluate a trained model +# Evaluate a trained policy openadapt eval run --checkpoint training_output/model.pt --benchmark waa # Evaluate an API agent @@ -35,7 +35,7 @@ openadapt eval run --agent api-claude --benchmark waa Options: -- `--checkpoint` - Path to model checkpoint +- `--checkpoint` - Path to trained policy checkpoint - `--agent` - Agent type (api-claude, api-gpt4v, custom) - `--benchmark` - Benchmark name (waa, osworld, etc.) 
- `--tasks` - Number of tasks to evaluate (default: all) @@ -88,7 +88,7 @@ from openadapt_evals import ApiAgent, BenchmarkAdapter, evaluate_agent_on_benchm # Create an API agent agent = ApiAgent.claude() -# Or load a trained model +# Or load a trained policy from openadapt_ml import AgentPolicy agent = AgentPolicy.from_checkpoint("model.pt") @@ -157,7 +157,7 @@ flowchart TB | `ApiAgent` | API-based agent (Claude, GPT-4V) | | `BenchmarkAdapter` | Benchmark interface | | `MockAdapter` | Mock benchmark for testing | -| `evaluate_agent_on_benchmark` | Evaluation function | +| `evaluate_agent_on_benchmark` | Agent evaluation function | | `EvalResults` | Evaluation results container | ## Metrics @@ -171,5 +171,5 @@ flowchart TB ## Related Packages -- [openadapt-ml](ml.md) - Train models to evaluate -- [openadapt-capture](capture.md) - Record training data +- [openadapt-ml](ml.md) - Learn policies to evaluate +- [openadapt-capture](capture.md) - Collect demonstrations diff --git a/docs/packages/grounding.md b/docs/packages/grounding.md index 7ef939cc6..0b52b019f 100644 --- a/docs/packages/grounding.md +++ b/docs/packages/grounding.md @@ -1,6 +1,6 @@ # openadapt-grounding -UI element localization for improved action accuracy. +UI element grounding for improved action accuracy. **Repository**: [OpenAdaptAI/openadapt-grounding](https://github.com/OpenAdaptAI/openadapt-grounding) @@ -14,7 +14,7 @@ pip install openadapt-grounding ## Overview -The grounding package provides UI element detection and localization to improve: +The grounding package provides UI element detection and grounding to improve: - Click accuracy by targeting element centers - Robustness to UI changes @@ -59,7 +59,7 @@ marked_image, element_map = som.create() # element_map: {1: "Submit button", 2: "Email field", ...} ``` -## Integration with ML +## Integration with Policy Execution ```python from openadapt_ml import AgentPolicy @@ -71,8 +71,9 @@ policy = AgentPolicy.from_checkpoint( grounding=ElementDetector() ) -# Predictions will use grounded coordinates -action = policy.predict(screenshot) +# Actions will use grounded coordinates +observation = load_screenshot() +action = policy.predict(observation) ``` ## CLI Commands @@ -122,5 +123,5 @@ openadapt ground som screenshot.png --output marked.png ## Related Packages -- [openadapt-ml](ml.md) - Use grounding in training and inference -- [openadapt-capture](capture.md) - Ground recorded captures +- [openadapt-ml](ml.md) - Use grounding in policy learning and execution +- [openadapt-capture](capture.md) - Apply grounding to demonstrations diff --git a/docs/packages/ml.md b/docs/packages/ml.md index c2261a709..479ea3aa0 100644 --- a/docs/packages/ml.md +++ b/docs/packages/ml.md @@ -1,6 +1,6 @@ # openadapt-ml -ML engine, training, and inference for GUI automation agents. +Policy learning, training, and inference for GUI automation agents. **Repository**: [OpenAdaptAI/openadapt-ml](https://github.com/OpenAdaptAI/openadapt-ml) @@ -17,13 +17,13 @@ pip install openadapt-ml The ML package provides: - Model adapters for various LMMs (Qwen-VL, LLaVA, etc.) 
-- Training infrastructure for supervised learning +- Policy learning infrastructure from demonstration trajectories - Inference engine for action prediction -- Agent policies for deployment +- Agent policies for autonomous execution ## CLI Commands -### Start Training +### Start Policy Learning ```bash openadapt train start --capture my-task --model qwen3vl-2b @@ -31,19 +31,19 @@ openadapt train start --capture my-task --model qwen3vl-2b Options: -- `--capture` - Name of the capture to train on (required) +- `--capture` - Name of the demonstration to learn from (required) - `--model` - Model architecture (required) - `--epochs` - Number of training epochs (default: 10) - `--batch-size` - Batch size (default: 4) - `--output` - Output directory (default: training_output/) -### Check Training Status +### Check Policy Learning Status ```bash openadapt train status ``` -### Stop Training +### Stop Policy Learning ```bash openadapt train stop @@ -72,32 +72,32 @@ from openadapt_ml import QwenVLAdapter, Trainer, AgentPolicy # Load a pre-trained model adapter = QwenVLAdapter.from_pretrained("qwen3vl-2b") -# Create trainer +# Create trainer for policy learning trainer = Trainer( model=adapter, - capture_name="my-task", + demonstration="my-task", # demonstration name epochs=10 ) -# Train +# Learn policy from demonstration trajectory checkpoint_path = trainer.train() -# Load for inference +# Load trained policy for execution policy = AgentPolicy.from_checkpoint(checkpoint_path) -# Predict next action -screenshot = load_screenshot() -action = policy.predict(screenshot) +# Predict next action from observation +observation = load_screenshot() +action = policy.predict(observation) ``` -## Training Pipeline +## Policy Learning Pipeline ```mermaid flowchart LR subgraph Input - CAP[Capture Data] - SS[Screenshots] - EV[Events] + DEMO[Demonstration] + OBS[Observations] + ACT[Actions] end subgraph Processing @@ -106,20 +106,20 @@ flowchart LR TOK[Tokenization] end - subgraph Training + subgraph Learning FWD[Forward Pass] LOSS[Loss Calculation] OPT[Optimization] end subgraph Output - CKPT[Checkpoint] + CKPT[Trained Policy] LOG[Training Logs] end - CAP --> DL - SS --> DL - EV --> DL + DEMO --> DL + OBS --> DL + ACT --> DL DL --> AUG AUG --> TOK TOK --> FWD @@ -135,9 +135,9 @@ flowchart LR |--------|-------------| | `QwenVLAdapter` | Qwen-VL model adapter | | `LLaVAAdapter` | LLaVA model adapter | -| `Trainer` | Training infrastructure | -| `AgentPolicy` | Inference policy | -| `train_supervised` | Training function | +| `Trainer` | Policy learning infrastructure | +| `AgentPolicy` | Trained policy for execution | +| `learn_from_demonstrations` | Policy learning function | ## Hardware Requirements @@ -149,6 +149,6 @@ flowchart LR ## Related Packages -- [openadapt-capture](capture.md) - Record training data -- [openadapt-evals](evals.md) - Evaluate trained models -- [openadapt-retrieval](retrieval.md) - Few-shot retrieval for training +- [openadapt-capture](capture.md) - Collect demonstrations +- [openadapt-evals](evals.md) - Evaluate trained policies +- [openadapt-retrieval](retrieval.md) - Trajectory retrieval for few-shot policy learning diff --git a/docs/packages/privacy.md b/docs/packages/privacy.md index 9bbf2be9c..2a5aff056 100644 --- a/docs/packages/privacy.md +++ b/docs/packages/privacy.md @@ -44,7 +44,7 @@ The privacy package provides: ## CLI Commands -### Scrub a Capture +### Scrub a Demonstration ```bash openadapt privacy scrub my-task @@ -78,8 +78,8 @@ from openadapt_privacy import Scrubber, 
PIIDetector # Create a scrubber scrubber = Scrubber(mode="blur") -# Scrub a capture -scrubber.scrub_capture("my-task", output_dir="scrubbed/") +# Scrub a demonstration +scrubber.scrub_demonstration("my-task", output_dir="scrubbed/") # Or scrub individual images scrubbed_image = scrubber.scrub_image(screenshot_path) @@ -106,10 +106,10 @@ session = CaptureSession( recorder = Recorder(session) recorder.start() -# ... recording ... +# ... demonstration collection ... recorder.stop() -# Captures are automatically scrubbed +# Demonstrations are automatically scrubbed ``` ## Redaction Modes @@ -152,5 +152,5 @@ This package helps with compliance for: ## Related Packages -- [openadapt-capture](capture.md) - Record demonstrations to scrub -- [openadapt-viewer](viewer.md) - View scrubbed captures +- [openadapt-capture](capture.md) - Collect demonstrations to scrub +- [openadapt-viewer](viewer.md) - View scrubbed demonstrations diff --git a/docs/packages/retrieval.md b/docs/packages/retrieval.md index 1c85b4e9f..ae167ebf0 100644 --- a/docs/packages/retrieval.md +++ b/docs/packages/retrieval.md @@ -1,6 +1,6 @@ # openadapt-retrieval -Multimodal demonstration retrieval for few-shot prompting. +Multimodal trajectory retrieval for few-shot policy learning. **Repository**: [OpenAdaptAI/openadapt-retrieval](https://github.com/OpenAdaptAI/openadapt-retrieval) @@ -16,47 +16,47 @@ pip install openadapt-retrieval The retrieval package enables: -- Semantic search over captured demonstrations -- Few-shot example selection for prompting +- Semantic search over demonstration trajectories +- Few-shot example selection for policy learning - Multimodal similarity (text + image) - Demonstration library management ## Use Cases -### Few-Shot Prompting +### Few-Shot Policy Learning -Find similar demonstrations to use as examples when prompting an LMM. +Find similar demonstrations to use as examples when learning agent policies. -### Transfer Learning +### Trajectory Transfer -Retrieve relevant demonstrations for new tasks. +Retrieve relevant demonstration trajectories for new tasks. ### Demonstration Discovery -Search your library of captured demonstrations. +Search your library of demonstration trajectories. 
## Python API ```python from openadapt_retrieval import DemoIndex, retrieve_similar -# Build an index over your captures +# Build an index over your demonstrations index = DemoIndex() -index.add_captures(["task-1", "task-2", "task-3"]) +index.add_demonstrations(["task-1", "task-2", "task-3"]) -# Retrieve similar demonstrations -screenshot = load_screenshot() +# Retrieve similar demonstration trajectories +observation = load_screenshot() similar = index.search( - query_image=screenshot, + query_image=observation, query_text="click the submit button", top_k=3 ) for result in similar: - print(f"{result.capture_name}: {result.similarity:.2f}") + print(f"{result.demonstration_name}: {result.similarity:.2f}") ``` -### Integration with ML +### Integration with Policy Learning ```python from openadapt_ml import AgentPolicy @@ -69,8 +69,9 @@ policy = AgentPolicy.from_checkpoint( retrieval_index=index ) -# Predictions include relevant examples -action = policy.predict(screenshot, use_retrieval=True) +# Policy uses similar trajectory examples for few-shot learning +observation = load_screenshot() +action = policy.predict(observation, use_retrieval=True) ``` ## CLI Commands @@ -87,7 +88,7 @@ openadapt retrieval index --captures task-1 task-2 task-3 openadapt retrieval search --image screenshot.png --text "click submit" ``` -### List Indexed Captures +### List Indexed Demonstrations ```bash openadapt retrieval list @@ -97,7 +98,7 @@ openadapt retrieval list | Export | Description | |--------|-------------| -| `DemoIndex` | Demonstration index | +| `DemoIndex` | Demonstration trajectory index | | `retrieve_similar` | Similarity search | | `Embedding` | Vector embedding | | `SearchResult` | Search result data | @@ -118,7 +119,7 @@ Indexes are stored as pickle files: indexes/ demo_index.pkl # Main index embeddings.npy # Vector embeddings - metadata.json # Capture metadata + metadata.json # Demonstration metadata ``` ## Performance @@ -131,5 +132,5 @@ indexes/ ## Related Packages -- [openadapt-capture](capture.md) - Record demonstrations to index -- [openadapt-ml](ml.md) - Use retrieval in training +- [openadapt-capture](capture.md) - Collect demonstrations to index +- [openadapt-ml](ml.md) - Use retrieval in policy learning diff --git a/docs/packages/viewer.md b/docs/packages/viewer.md index 54646ceb8..2314413d3 100644 --- a/docs/packages/viewer.md +++ b/docs/packages/viewer.md @@ -1,6 +1,6 @@ # openadapt-viewer -HTML visualization components for capture data. +Trajectory visualization components for demonstration data. **Repository**: [OpenAdaptAI/openadapt-viewer](https://github.com/OpenAdaptAI/openadapt-viewer) @@ -16,14 +16,14 @@ pip install openadapt-viewer The viewer package provides: -- HTML-based visualization of captures -- Interactive replay viewer -- Event timeline display -- Screenshot galleries +- HTML-based visualization of demonstration trajectories +- Interactive trajectory viewer +- Action timeline display +- Observation galleries ## CLI Commands -### View a Capture +### View a Demonstration Trajectory ```bash openadapt capture view my-task @@ -49,8 +49,8 @@ Access the dashboard at `http://localhost:8080`. 
```python from openadapt_viewer import PageBuilder, HTMLBuilder -# Build a viewer page for a capture -builder = PageBuilder(capture_name="my-task") +# Build a viewer page for a demonstration +builder = PageBuilder(demonstration="my-task") html = builder.build() # Save to file @@ -59,29 +59,29 @@ with open("viewer.html", "w") as f: # Or use HTMLBuilder for custom visualizations html_builder = HTMLBuilder() -html_builder.add_screenshot(screenshot_path, events) -html_builder.add_timeline(events) +html_builder.add_observation(screenshot_path, actions) +html_builder.add_timeline(actions) html = html_builder.render() ``` ## Viewer Features -### Screenshot Gallery +### Observation Gallery -Browse all captured screenshots with navigation controls. +Browse all captured observations (screenshots) with navigation controls. -### Event Timeline +### Action Timeline Interactive timeline showing: -- Mouse events (clicks, movement) -- Keyboard events (key presses) -- Screenshot timestamps -- Event metadata +- Mouse actions (clicks, movement) +- Keyboard actions (key presses) +- Observation timestamps +- Action metadata -### Replay Controls +### Trajectory Playback Controls -- Play/pause replay +- Play/pause trajectory playback - Speed controls (0.5x, 1x, 2x) - Step forward/backward - Jump to specific time @@ -90,16 +90,16 @@ Interactive timeline showing: - Export as HTML (static) - Export as video (MP4) -- Export event log (JSON) +- Export trajectory log (JSON) ## Key Exports | Export | Description | |--------|-------------| -| `PageBuilder` | Builds viewer pages | +| `PageBuilder` | Builds trajectory viewer pages | | `HTMLBuilder` | Low-level HTML construction | -| `TimelineWidget` | Timeline visualization | -| `ScreenshotGallery` | Screenshot browser | +| `TimelineWidget` | Action timeline visualization | +| `ObservationGallery` | Observation browser | ## Customization @@ -109,27 +109,27 @@ Interactive timeline showing: from openadapt_viewer import PageBuilder, Theme builder = PageBuilder( - capture_name="my-task", + demonstration="my-task", theme=Theme.DARK # or Theme.LIGHT ) ``` -### Custom Event Rendering +### Custom Action Rendering ```python -from openadapt_viewer import PageBuilder, EventRenderer +from openadapt_viewer import PageBuilder, ActionRenderer -class CustomRenderer(EventRenderer): - def render_mouse_click(self, event): - return f"
{event}
" +class CustomRenderer(ActionRenderer): + def render_click(self, action): + return f"
{action}
" builder = PageBuilder( - capture_name="my-task", + demonstration="my-task", renderer=CustomRenderer() ) ``` ## Related Packages -- [openadapt-capture](capture.md) - Record data to visualize +- [openadapt-capture](capture.md) - Collect demonstrations to visualize - [openadapt-privacy](privacy.md) - Scrub sensitive data before viewing diff --git a/docs/publication-roadmap.md b/docs/publication-roadmap.md new file mode 100644 index 000000000..8eb076530 --- /dev/null +++ b/docs/publication-roadmap.md @@ -0,0 +1,527 @@ +# OpenAdapt Publication Roadmap + +**Version**: 1.0 +**Date**: January 2026 +**Status**: Active Planning +**Author**: OpenAdapt Research Team + +--- + +## Executive Summary + +This roadmap outlines the publication strategy for OpenAdapt's core research contributions. The primary innovation is **demonstration-conditioned GUI agents**, which achieve dramatic accuracy improvements (33% to 100% first-action accuracy) by conditioning VLM agents on human demonstrations rather than relying solely on natural language instructions. + +--- + +## Table of Contents + +1. [Publishable Contributions](#1-publishable-contributions) +2. [Publication Timeline](#2-publication-timeline) +3. [Required Experiments](#3-required-experiments) +4. [Author Contributions](#4-author-contributions) +5. [Venue Analysis](#5-venue-analysis) +6. [Existing Drafts and Assets](#6-existing-drafts-and-assets) + +--- + +## 1. Publishable Contributions + +### 1.1 Demo-Conditioned GUI Agents (Core Innovation) + +**The Big Result**: Demonstration conditioning improves first-action accuracy from 33% to 100% on macOS tasks, with expected similar improvements (+30-50pp) on Windows Agent Arena (WAA). + +**Key Claims**: +- Demonstrations capture implicit knowledge that natural language prompts cannot convey +- Demo retrieval enables automatic selection of relevant examples from a library +- The "show, don't tell" paradigm reduces prompt engineering burden +- Works with any VLM backend (Claude, GPT, Gemini, Qwen-VL) + +**Research Questions Addressed**: +1. How much does demonstration context improve GUI agent performance? +2. Can we automatically retrieve relevant demonstrations for new tasks? +3. What is the transfer efficiency between similar tasks across platforms? + +**Preliminary Results** (from `/Users/abrichr/oa/src/openadapt-ml/docs/experiments/`): +- Zero-shot (instruction only): 33% first-action accuracy +- Demo-conditioned: 100% first-action accuracy (+67pp improvement) +- Demo persists across ALL steps (critical P0 fix for episode success) + +**WAA Predictions** (from experiment design): +- Zero-shot expected: 10-20% task success (consistent with SOTA ~19.5%) +- Demo-conditioned expected: 40-70% task success (+30-50pp improvement) + +--- + +### 1.2 Modular Open-Source Architecture (Meta-Package Design) + +**Contribution**: A composable, model-agnostic architecture for GUI automation research. 
+ +**Key Components**: +| Package | Responsibility | Key Innovation | +|---------|---------------|----------------| +| `openadapt-capture` | GUI recording | Cross-platform event + a11y tree capture | +| `openadapt-ml` | Training & inference | Model-agnostic VLM adapters | +| `openadapt-evals` | Benchmark evaluation | Unified adapter for WAA, WebArena | +| `openadapt-retrieval` | Demo search | Multimodal (text+image) embedding with Qwen3-VL | +| `openadapt-grounding` | Element localization | Multiple providers (OmniParser, Florence2, Gemini) | +| `openadapt-viewer` | Visualization | Interactive HTML trajectory viewer | +| `openadapt-privacy` | PII scrubbing | Privacy-preserving demonstration storage | + +**Technical Highlights**: +- Abstraction ladder: Literal -> Symbolic -> Template -> Semantic -> Goal +- Process graph representations for temporal context +- Three-phase architecture: DEMONSTRATE -> LEARN -> EXECUTE +- Feedback loops for continuous improvement + +**Prior Art Comparison**: +| System | Open Source | Modular | Demo-Conditioned | Multi-VLM | +|--------|------------|---------|------------------|-----------| +| OpenAdapt | Yes | Yes | **Yes** | Yes | +| Claude Computer Use | No | No | No | No | +| UFO | Partial | No | No | No | +| SeeAct | Yes | No | No | No | + +--- + +### 1.3 Benchmark Evaluation Framework (WAA Integration) + +**Contribution**: Unified evaluation infrastructure for GUI agent benchmarks. + +**Key Features**: +- `BenchmarkAdapter` abstract interface for any benchmark +- `WAALiveAdapter` with HTTP-based `/evaluate` endpoint +- `ApiAgent` supporting Claude, GPT-5.1, Gemini backends +- `RetrievalAugmentedAgent` for automatic demo selection +- Execution trace collection with screenshots per step +- HTML viewer for result analysis + +**Benchmark Coverage**: +| Benchmark | Status | Tasks | Domain | +|-----------|--------|-------|--------| +| Windows Agent Arena (WAA) | Implemented | 154 tasks | Windows desktop | +| Mock Benchmark | Implemented | N tasks | Testing | +| WebArena | Partial | 812 tasks | Web browser | +| OSWorld | Planned | 369 tasks | Cross-platform | + +**WAA Task Selection** (from experiment design): +- 10 carefully selected tasks across 4 enterprise-relevant domains +- Browser/Edge (3 tasks): Privacy settings, bookmarks, font size +- Office/LibreOffice (3 tasks): Fill blanks, charts, alignment +- Settings (2 tasks): Notifications, Night Light scheduling +- File Explorer (2 tasks): Archive creation, view changes + +--- + +### 1.4 Multimodal Retrieval for Demo Conditioning + +**Contribution**: Automatic demonstration retrieval using VLM embeddings. + +**Technical Approach**: +- **Embedder**: Qwen3-VL-Embedding with Matryoshka Representation Learning (MRL) +- **Index**: FAISS vector index with cosine similarity +- **Query**: Multimodal (task text + current screenshot) +- **Reranking**: Cross-encoder for top-k refinement + +**Key Classes** (from `openadapt-retrieval`): +```python +# Core retrieval interface +retriever = MultimodalDemoRetriever(embedding_dim=512) +retriever.add_demo(demo_id, task, screenshot, app_name) +retriever.build_index() +results = retriever.retrieve(task, screenshot, top_k=3) +``` + +**Performance Considerations**: +- Qwen3-VL: ~6-8 GB VRAM, ~50-200ms per embedding +- CLIP fallback: ~2 GB VRAM, ~10-50ms per embedding +- Flexible dimensions via MRL: 256, 512, 1024, 2048 + +--- + +## 2. 
Publication Timeline + +### Phase 1: Short-Term (Q1 2026) + +#### 2.1.1 Blog Post / Technical Report + +**Target**: January-February 2026 +**Venue**: OpenAdapt blog, HuggingFace, towards data science +**Effort**: 1-2 weeks + +**Content**: +- Demo-conditioned GUI agents: The "show, don't tell" paradigm +- Preliminary results (33% -> 100% accuracy) +- Open-source release announcement +- Interactive demo with viewer + +**Deliverables**: +- [ ] Write blog post (~2000 words) +- [ ] Create figures (architecture diagram, accuracy comparison) +- [ ] Record demo video (2-3 minutes) +- [ ] Publish to blog + cross-post to HN, Reddit, Twitter + +--- + +#### 2.1.2 arXiv Preprint + +**Target**: February-March 2026 +**Venue**: arXiv cs.AI, cs.HC +**Effort**: 3-4 weeks + +**Title Options**: +1. "Show, Don't Tell: Demonstration-Conditioned GUI Automation with Vision-Language Models" +2. "OpenAdapt: An Open Framework for Demo-Conditioned GUI Agents" +3. "From Demonstrations to Actions: Retrieval-Augmented GUI Automation" + +**Existing Drafts**: +- `/Users/abrichr/oa/src/omnimcp/paper/omnimcp_whitepaper.tex` - Spatial-temporal framework +- `/Users/abrichr/oa/src/omnimcp/paper/omnimcp_arxiv.tex` - Full arXiv draft (1056 lines) + +**Structure** (based on existing drafts): +1. Abstract +2. Introduction (demo-conditioning motivation) +3. Related Work (GUI automation, VLM agents, PbD) +4. Method + - Architecture overview + - Demo-conditioned prompting + - Retrieval-augmented generation +5. Experiments + - macOS demo experiment + - WAA benchmark evaluation + - Ablation studies +6. Results + - First-action accuracy + - Episode success rate + - Transfer across platforms +7. Discussion & Limitations +8. Conclusion + +**Deliverables**: +- [ ] Complete WAA experiments (10 tasks x 2 conditions) +- [ ] Update existing LaTeX draft with new results +- [ ] Add retrieval system section +- [ ] Create supplementary materials (code, demos) +- [ ] Submit to arXiv + +--- + +### Phase 2: Medium-Term (Q2-Q3 2026) + +#### 2.2.1 Workshop Paper + +**Target**: April-June 2026 +**Venues** (submission deadlines vary): +| Venue | Conference | Deadline | Focus | +|-------|-----------|----------|-------| +| LLM Agents Workshop | ICML 2026 | ~March | Agent architectures | +| Human-AI Workshop | CHI 2026 | ~Dec 2025 | Human-AI collaboration | +| AutoML Workshop | NeurIPS 2026 | ~Sept | Automation | + +**Format**: 4-8 pages + references +**Effort**: 2-3 weeks (building on preprint) + +**Focus**: Demo retrieval and conditioning system +**Novelty**: Multimodal retrieval for GUI automation + +--- + +#### 2.2.2 Demo Paper (CHI/UIST) + +**Target**: CHI 2027 or UIST 2026 +**Venues**: +| Venue | Deadline | Acceptance Rate | +|-------|----------|-----------------| +| CHI Demo Track | Sept 2026 | ~50% | +| UIST Demo Track | April 2026 | ~40% | + +**Format**: 2-4 pages + live demo +**Effort**: 2 weeks for paper, 1 week for demo prep + +**Demo Content**: +1. Record a demonstration (any application) +2. Show retrieval selecting similar demos +3. Execute task with demo conditioning +4. 
Visualize predictions in viewer + +**Deliverables**: +- [ ] Prepare stable demo environment +- [ ] Create video walkthrough +- [ ] Write demo paper +- [ ] Prepare live demo hardware/software + +--- + +### Phase 3: Long-Term (Q4 2026 - 2027) + +#### 2.3.1 Full Conference Paper + +**Target**: NeurIPS 2026, ICML 2027, or ICLR 2027 +**Effort**: 3-6 months + +**Venues**: +| Venue | Deadline | Page Limit | Focus | +|-------|----------|------------|-------| +| NeurIPS | May 2026 | 9+refs | ML methods | +| ICML | Feb 2027 | 8+refs | ML methods | +| ICLR | Oct 2026 | 8+refs | Representations | +| AAAI | Aug 2026 | 7+refs | AI systems | +| ACL | Feb 2027 | 8+refs | NLP/multimodal | + +**Contribution Options**: + +**Option A: Demo-Conditioning Method Paper** (NeurIPS/ICML) +- Focus: Retrieval-augmented demo conditioning +- Experiments: WAA, WebArena, OSWorld comparison +- Ablations: Retrieval methods, embedding models, k values +- Baselines: Zero-shot, few-shot, fine-tuned + +**Option B: Systems Paper** (MLSys) +- Focus: Modular architecture for GUI automation +- Experiments: Latency, throughput, grounding accuracy +- Comparisons: End-to-end vs modular approaches + +**Option C: HCI Paper** (CHI Full) +- Focus: Human-AI collaboration in task automation +- User study: Demo creation time, task success, trust +- Qualitative: User preferences, failure modes + +--- + +## 3. Required Experiments + +### 3.1 Completed Experiments + +| Experiment | Status | Location | Result | +|------------|--------|----------|--------| +| macOS demo-conditioning | Done | `openadapt-ml/docs/experiments/` | 33% -> 100% | +| Demo prompt format | Done | Same | Behavior-only format best | +| API baselines | Done | `openadapt-evals` | Claude, GPT working | + +--- + +### 3.2 Required for arXiv (P0) + +| Experiment | Description | Effort | Status | +|------------|-------------|--------|--------| +| WAA zero-shot baseline | 10 tasks, no demos | 2-3 hours | Pending | +| WAA demo-conditioned | 10 tasks, with demos | 2-3 hours | Pending | +| Demo creation | Write demos for 10 WAA tasks | 4-6 hours | Design complete | +| Statistical analysis | Significance tests, confidence intervals | 1-2 hours | Pending | + +**WAA Task List** (from experiment design): +1. Edge: Do Not Track +2. Edge: Bookmark to bar +3. Edge: Font size +4. LibreOffice Calc: Fill blanks +5. LibreOffice Calc: Chart creation +6. LibreOffice Writer: Center align +7. Settings: Notifications off +8. Settings: Night Light schedule +9. File Explorer: Archive folder +10. 
File Explorer: Details view + +--- + +### 3.3 Required for Workshop/Demo Paper (P1) + +| Experiment | Description | Effort | Status | +|------------|-------------|--------|--------| +| Retrieval accuracy | Measure if correct demo retrieved | 1 day | Pending | +| Retrieval latency | Embedding + search time | 2 hours | Pending | +| Cross-domain transfer | Demo from app A helps app B | 1 week | Pending | +| Demo library size | Performance vs library size | 2-3 days | Pending | + +--- + +### 3.4 Required for Full Conference Paper (P2) + +| Experiment | Description | Effort | Status | +|------------|-------------|--------|--------| +| WebArena evaluation | 100+ web tasks | 1-2 weeks | Pending | +| OSWorld evaluation | Cross-platform tasks | 2-3 weeks | Pending | +| Fine-tuning comparison | Demo prompting vs fine-tuning | 2-4 weeks | Pending | +| Ablation: VLM backend | Claude vs GPT vs Gemini | 1 week | Partial | +| Ablation: Embedding model | Qwen3-VL vs CLIP vs ColPali | 1 week | Pending | +| Ablation: Demo format | Full trace vs behavior-only | 3 days | Partial | +| User study | N=20-30 participants | 2-4 weeks | Pending | + +--- + +## 4. Author Contributions + +### 4.1 Proposed Author Order + +**Lead Authors** (equal contribution): +1. **Richard Abrich** - Architecture, demo-conditioning, experiments +2. **[Contributor 2]** - Retrieval system, embeddings + +**Contributing Authors**: +3. **[Contributor 3]** - WAA benchmark integration +4. **[Contributor 4]** - Grounding module +5. **[Contributor 5]** - Viewer and visualization + +**Acknowledgments**: +- OmniParser team (Microsoft) +- Windows Agent Arena team (Microsoft) +- Open-source contributors + +--- + +### 4.2 Contribution Matrix + +| Contribution | Lead | Contributors | +|--------------|------|--------------| +| Architecture design | RA | - | +| Demo-conditioning method | RA | - | +| Retrieval system | - | - | +| WAA integration | RA | - | +| Grounding providers | RA | - | +| Experiments: macOS | RA | - | +| Experiments: WAA | RA | - | +| Writing: Introduction | RA | - | +| Writing: Method | RA | - | +| Writing: Experiments | RA | - | +| Figures and diagrams | RA | - | +| Code open-sourcing | RA | - | + +--- + +## 5. 
Venue Analysis + +### 5.1 Target Venues by Contribution Type + +#### Systems/Architecture +| Venue | Deadline | Fit | Notes | +|-------|----------|-----|-------| +| MLSys | Jan 2026 | Good | Modular architecture focus | +| OSDI | May 2026 | Medium | More systems-focused | +| SoCC | June 2026 | Medium | Cloud systems angle | + +#### ML Methods +| Venue | Deadline | Fit | Notes | +|-------|----------|-----|-------| +| NeurIPS | May 2026 | Excellent | Demo-conditioning as retrieval | +| ICML | Feb 2027 | Excellent | Method + experiments | +| ICLR | Oct 2026 | Good | Representation learning angle | + +#### HCI/Agents +| Venue | Deadline | Fit | Notes | +|-------|----------|-----|-------| +| CHI | Sept 2026 | Excellent | Human-AI, user study | +| UIST | April 2026 | Excellent | Demo interaction | +| IUI | Oct 2026 | Good | Intelligent interfaces | + +#### NLP/Multimodal +| Venue | Deadline | Fit | Notes | +|-------|----------|-----|-------| +| ACL | Feb 2027 | Good | Multimodal grounding | +| EMNLP | May 2026 | Good | VLM applications | +| NAACL | Dec 2026 | Good | Shorter, regional | + +--- + +### 5.2 Workshop Opportunities + +| Workshop | Conference | Typical Deadline | Focus | +|----------|-----------|------------------|-------| +| LLM Agents | ICML/NeurIPS | 2-3 months before | Agent architectures | +| Human-AI Interaction | CHI/IUI | Variable | Collaboration | +| AutoML | NeurIPS | September | Automation | +| Efficient ML | ICML/NeurIPS | Variable | Efficiency | + +--- + +## 6. Existing Drafts and Assets + +### 6.1 Paper Drafts + +| File | Location | Status | Content | +|------|----------|--------|---------| +| `omnimcp_whitepaper.tex` | `/Users/abrichr/oa/src/omnimcp/paper/` | Complete (whitepaper) | Spatial-temporal framework, 530 lines | +| `omnimcp_arxiv.tex` | `/Users/abrichr/oa/src/omnimcp/paper/` | Complete (arXiv format) | Full paper, 1056 lines, benchmarks pending | +| `omnimcp_whitepaper.pdf` | Same | Compiled | 2.7 MB | +| `omnimcp_arxiv.pdf` | Same | Compiled | 133 KB | + +### 6.2 Figures + +| Figure | Location | Description | +|--------|----------|-------------| +| `spatial-features.png` | `/Users/abrichr/oa/src/omnimcp/paper/` | Spatial feature understanding | +| `temporal-features.png` | Same | Temporal feature understanding | +| `api-generation.png` | Same | Internal API generation | +| `api-publication.png` | Same | External API (MCP) publication | + +### 6.3 Documentation + +| Document | Location | Relevance | +|----------|----------|-----------| +| `architecture-evolution.md` | `/Users/abrichr/oa/src/OpenAdapt/docs/` | Full architecture description | +| `waa_demo_experiment_design.md` | `/Users/abrichr/oa/src/openadapt-ml/docs/experiments/` | WAA experiment details | +| `waa-evaluator-integration.md` | `/Users/abrichr/oa/src/openadapt-evals/docs/research/` | Evaluation methodology | +| `CLAUDE.md` files | Various repos | Implementation details | + +### 6.4 Code Assets + +| Asset | Location | Description | +|-------|----------|-------------| +| openadapt-capture | GitHub | Recording package | +| openadapt-ml | GitHub | Training/inference | +| openadapt-evals | GitHub | Benchmarks | +| openadapt-retrieval | GitHub | Demo retrieval | +| openadapt-grounding | GitHub | UI localization | +| openadapt-viewer | GitHub | Visualization | + +--- + +## 7. 
Action Items + +### Immediate (This Week) + +- [ ] Complete 10 WAA demo documents +- [ ] Run WAA zero-shot baseline +- [ ] Run WAA demo-conditioned evaluation +- [ ] Update omnimcp_arxiv.tex with new results + +### Short-Term (Next 2 Weeks) + +- [ ] Write blog post announcing demo-conditioning results +- [ ] Create comparison figure (zero-shot vs demo-conditioned) +- [ ] Record demo video +- [ ] Finalize arXiv submission + +### Medium-Term (Next Month) + +- [ ] Implement retrieval accuracy metrics +- [ ] Run cross-domain transfer experiments +- [ ] Identify workshop submission targets +- [ ] Begin CHI/UIST demo preparation + +--- + +## 8. Risk Assessment + +| Risk | Likelihood | Impact | Mitigation | +|------|------------|--------|------------| +| WAA results don't match predictions | Medium | High | Focus on subset where demos help most | +| Retrieval accuracy insufficient | Low | Medium | Add reranking, increase demo library | +| Competition publishes first | Medium | Medium | Differentiate with open-source, modularity | +| Reviewer skepticism of accuracy claims | Medium | Medium | Multiple seeds, statistical tests | + +--- + +## 9. References + +### Key Citations for Paper + +1. **Windows Agent Arena** - Bonatti et al., 2024. Microsoft benchmark, SOTA 19.5%. +2. **OmniParser** - Chen et al., 2024. Vision-only UI parsing. +3. **Set-of-Mark** - Yang et al., 2023. Visual grounding via labels. +4. **Claude Computer Use** - Anthropic, 2024. Production VLM agent. +5. **UFO** - Microsoft, 2024. Windows agent architecture. +6. **Qwen-VL** - Alibaba, 2024. Open-source VLM. +7. **WebArena** - Zhou et al., 2023. Web automation benchmark. +8. **OSWorld** - Xie et al., 2024. Cross-platform benchmark. + +--- + +*Last updated: January 2026* diff --git a/docs/roadmap-priorities.md b/docs/roadmap-priorities.md new file mode 100644 index 000000000..26f86492c --- /dev/null +++ b/docs/roadmap-priorities.md @@ -0,0 +1,562 @@ +# OpenAdapt Roadmap - Priorities + +**Last Updated**: January 16, 2026 +**Version**: 1.1.0 +**Status**: Active Development + +--- + +## Executive Summary + +This document outlines the prioritized roadmap for OpenAdapt, focusing on ensuring the modular meta-package architecture is stable, functional, and delivers on the core promise: **Record -> Train -> Evaluate** GUI automation workflows. + +--- + +## Current State Assessment + +### PyPI Packages Published + +| Package | Version | Python | Status | +|---------|---------|--------|--------| +| `openadapt` | 1.0.0 (meta) | >=3.10 | Published | +| `openadapt-capture` | 0.1.0 | >=3.10 | Published | +| `openadapt-ml` | 0.2.0 | >=3.12 | Published | +| `openadapt-evals` | 0.1.0 | >=3.10 | Published | +| `openadapt-viewer` | 0.1.0 | >=3.10 | Published | +| `openadapt-grounding` | 0.1.0 | >=3.10 | Published | +| `openadapt-retrieval` | 0.1.0 | >=3.10 | Published | +| `openadapt-privacy` | 0.1.0 | >=3.10 | Published | + +**Note**: `openadapt-ml` requires Python 3.12+, which may cause compatibility issues with other packages requiring 3.10. 
+ +### CI/Test Status + +- **Main repo**: CI runs on macOS and Ubuntu, Python 3.10/3.11/3.12 +- **Lint check**: `ruff check` and `ruff format --check` - **Currently Passing** +- **Tests verified**: + - `openadapt-grounding`: 53 tests passing + - `openadapt-retrieval`: 28 tests passing +- **Known issues**: PR #969 addresses ruff format, Docker build needs verification + +### Meta-Package Structure + +The `openadapt` meta-package v1.0.0 uses: +- Hatchling build system +- Lazy imports to avoid heavy dependencies +- Optional extras: `[capture]`, `[ml]`, `[evals]`, `[viewer]`, `[grounding]`, `[retrieval]`, `[privacy]`, `[core]`, `[all]` + +--- + +## Priority Definitions + +| Priority | Urgency | Timeframe | Description | +|----------|---------|-----------|-------------| +| **P0** | Critical | This week | Blockers preventing basic functionality | +| **P1** | High | 1-2 weeks | Core feature completion, essential for v1.0 | +| **P2** | Medium | This month | Important enhancements, user experience | +| **P3** | Lower | Backlog | Nice to have, future considerations | + +--- + +## P0 - Critical: Blocking Issues + +### 1. Fix CI - Ruff Format (PR #969) + +| Field | Value | +|-------|-------| +| **Status** | In Progress | +| **Effort** | Small (1-2 hours) | +| **Owner** | TBD | +| **PR** | #969 | +| **Branch** | `fix/ruff-format-config` | + +**Description**: The CI workflow runs `ruff format --check openadapt/` which may fail if code is not formatted. A fix branch exists with formatting applied. + +**Current State**: Local `ruff check` passes. Branch `fix/ruff-format-config` contains formatting fixes. + +**Next Actions**: +- [ ] Review and merge PR #969 +- [ ] Verify CI passes on all Python versions (3.10, 3.11, 3.12) +- [ ] Verify CI passes on all platforms (macOS, Ubuntu) + +**Files**: +- `.github/workflows/main.yml` +- `openadapt/config.py` +- `openadapt/cli.py` + +--- + +### 2. Fix Docker Build + +| Field | Value | +|-------|-------| +| **Status** | Needs Investigation | +| **Effort** | Medium (2-4 hours) | +| **Owner** | TBD | +| **Location** | `legacy/deploy/deploy/models/omniparser/Dockerfile` | + +**Description**: Docker build for OmniParser server may have issues. This is used for the grounding provider integration. + +**Next Actions**: +- [ ] Test `docker build` for OmniParser Dockerfile +- [ ] Verify CUDA/GPU support works correctly +- [ ] Test model download during build (huggingface-cli) +- [ ] Document any missing dependencies or configuration + +**Files**: +- `legacy/deploy/deploy/models/omniparser/Dockerfile` + +--- + +### 3. Verify Meta-Package Installs Correctly + +| Field | Value | +|-------|-------| +| **Status** | Needs Testing | +| **Effort** | Medium (2-4 hours) | +| **Owner** | TBD | + +**Description**: Critical compatibility issue - `openadapt-ml` requires Python 3.12+, but `openadapt-capture` and others require 3.10+. Need to verify `pip install openadapt[all]` works. 
+ +**Test Matrix**: + +| Installation | Python 3.10 | Python 3.11 | Python 3.12 | +|-------------|-------------|-------------|-------------| +| `openadapt` | Test | Test | Test | +| `openadapt[capture]` | Test | Test | Test | +| `openadapt[ml]` | Expected Fail | Expected Fail | Test | +| `openadapt[core]` | Expected Fail | Expected Fail | Test | +| `openadapt[all]` | Expected Fail | Expected Fail | Test | + +**Next Actions**: +- [ ] Test `pip install openadapt[all]` on Python 3.12 +- [ ] Test `pip install openadapt[core]` on Python 3.12 +- [ ] Verify imports work: `python -c "from openadapt.cli import main"` +- [ ] Document minimum Python version clearly (3.12 if ml is needed) +- [ ] Consider downgrading `openadapt-ml` requirements to 3.10+ if feasible + +--- + +### 4. Basic Capture -> Train -> Eval Workflow + +| Field | Value | +|-------|-------| +| **Status** | Needs End-to-End Testing | +| **Effort** | Large (4-8 hours) | +| **Owner** | TBD | + +**Description**: The core value proposition requires this workflow to function: + +```bash +openadapt capture start --name my-task # 1. Record demo +openadapt train start --capture my-task # 2. Train model +openadapt eval run --checkpoint model.pt # 3. Evaluate +``` + +**CLI Commands to Test**: + +| Command | Status | Notes | +|---------|--------|-------| +| `openadapt capture start` | Needs Test | Requires macOS permissions | +| `openadapt capture list` | Needs Test | | +| `openadapt capture view ` | Needs Test | Generates HTML | +| `openadapt capture stop` | TODO | Uses Ctrl+C currently | +| `openadapt train start` | Needs Test | Requires openadapt-ml | +| `openadapt eval run --agent api-claude` | Needs Test | Requires API key | +| `openadapt eval mock --tasks 10` | Needs Test | Quick verification | + +**Next Actions**: +- [ ] Test `openadapt capture start` on macOS (permissions required) +- [ ] Test `openadapt capture list` shows recordings +- [ ] Test `openadapt capture view ` generates HTML +- [ ] Test `openadapt train start` with real capture data +- [ ] Test `openadapt eval run --agent api-claude` with API key +- [ ] Test `openadapt eval mock --tasks 10` for quick verification +- [ ] Document any failures and create issues + +**Known Blockers**: +- `capture stop` is TODO (uses Ctrl+C currently) +- macOS requires Accessibility + Screen Recording permissions + +--- + +## P1 - High: Core Features + +### 5. Complete Baseline Adapters + +| Field | Value | +|-------|-------| +| **Status** | Partially Implemented | +| **Effort** | Medium (4-8 hours) | +| **Owner** | TBD | +| **Package** | `openadapt-ml` | + +**Description**: API baseline adapters (Anthropic, OpenAI, Google) are implemented but need testing and validation. + +**Adapter Status**: + +| Provider | Adapter | Status | Notes | +|----------|---------|--------|-------| +| Anthropic | Claude | Implemented | Claude Computer Use patterns | +| OpenAI | GPT-4V | Implemented | Needs testing | +| Google | Gemini | Implemented | Needs testing | +| Qwen | Qwen3-VL | Implemented | Local model | + +**Next Actions**: +- [ ] Test Anthropic adapter with Claude API +- [ ] Test OpenAI adapter with GPT-4V +- [ ] Test Google adapter with Gemini +- [ ] Verify prompts follow SOTA patterns (Claude CU, UFO, OSWorld) +- [ ] Add error handling for rate limits and API failures +- [ ] Document adapter usage and configuration + +--- + +### 6. 
Demo Conditioning Integration in Evals + +| Field | Value | +|-------|-------| +| **Status** | Designed, Needs Integration | +| **Effort** | Medium (4-8 hours) | +| **Owner** | TBD | +| **Packages** | `openadapt-retrieval`, `openadapt-evals` | + +**Description**: Demo-conditioned prompting shows **33% -> 100% first-action accuracy improvement**. This is a key differentiator. + +**Architecture**: +``` +openadapt-retrieval (demo library) -> openadapt-ml (adapters) -> openadapt-evals (benchmark) +``` + +**Next Actions**: +- [ ] Integrate `openadapt-retrieval` with `openadapt-ml` adapters +- [ ] Add `--demo` flag to `openadapt eval run` +- [ ] Test with real demo library on WAA benchmark +- [ ] Document demo library format (JSON structure, screenshots) +- [ ] Add `--demo-library` option for multi-demo retrieval + +--- + +### 7. WAA Benchmark Validation + +| Field | Value | +|-------|-------| +| **Status** | Blocked on Azure VM Setup | +| **Effort** | Medium (4-8 hours) | +| **Owner** | TBD | +| **Package** | `openadapt-evals` | + +**Description**: Need to validate demo-conditioning claims on full Windows Agent Arena benchmark. This provides credibility for landing page claims. + +**Infrastructure Required**: +- Azure VM with nested virtualization (Windows 10/11) +- WAA server running +- API keys for Claude/GPT-4V + +**Target Metrics**: + +| Metric | Baseline (No Demo) | With Demo | Target | +|--------|-------------------|-----------|--------| +| First-action accuracy | ~33% | ~100% | Validate | +| Episode success rate | TBD | TBD | Measure | +| Average steps | TBD | TBD | Measure | + +**Next Actions**: +- [ ] Start Azure VM with WAA server (nested virtualization) +- [ ] Run `openadapt eval run --agent api-claude --server ` +- [ ] Record metrics: episode success rate, avg steps, failure modes +- [ ] Generate HTML report with `openadapt-viewer` +- [ ] Document results for landing page claims + +--- + +## P2 - Medium: Enhancements + +### 8. Safety Gate Implementation + +| Field | Value | +|-------|-------| +| **Status** | Design Phase | +| **Effort** | Medium (4-8 hours) | +| **Owner** | TBD | +| **Package** | `openadapt-ml` | + +**Description**: Implement safety gates to prevent harmful or unintended actions during agent execution. + +**Safety Categories**: +1. **Pre-action validation**: Check action against allowed patterns +2. **Dangerous action detection**: Block destructive file ops, system commands +3. **Human-in-the-loop confirmation**: Require approval for certain actions +4. **Rollback capability**: Undo recent actions if needed + +**Next Actions**: +- [ ] Design safety gate API interface +- [ ] Implement pre-action validation hooks +- [ ] Add dangerous action detection (rm, format, delete, etc.) +- [ ] Add optional human confirmation prompts +- [ ] Document safety configuration options + +--- + +### 9. Grounding Provider Improvements + +| Field | Value | +|-------|-------| +| **Status** | Package Published (53 tests passing) | +| **Effort** | Medium (4-6 hours) | +| **Owner** | TBD | +| **Package** | `openadapt-grounding` | + +**Description**: `openadapt-grounding` provides UI element localization for improved click accuracy. Needs integration with ML package. 
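One way the ML-package integration could look is sketched below; the provider names match the table that follows, but the `Grounder` protocol, `locate()` signature, `BoundingBox` type, and `ground_click` helper are hypothetical placeholders for illustration, not the published `openadapt-grounding` API.

```python
# Hypothetical integration sketch -- the Grounder protocol, locate() signature,
# and BoundingBox type are illustrative placeholders, not the actual
# openadapt-grounding interface.
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class BoundingBox:
    left: int
    top: int
    width: int
    height: int

    def center(self) -> tuple[int, int]:
        return self.left + self.width // 2, self.top + self.height // 2


class Grounder(Protocol):
    def locate(self, screenshot_png: bytes, description: str) -> Optional[BoundingBox]:
        """Return the on-screen region best matching a natural-language description."""
        ...


def ground_click(
    grounder: Grounder, screenshot_png: bytes, description: str
) -> Optional[tuple[int, int]]:
    """Resolve a described UI element (e.g. "Submit Order button") to click coordinates."""
    box = grounder.locate(screenshot_png, description)
    return box.center() if box else None
```

In this sketch, the replay path in `openadapt-ml` would call something like `ground_click` before emitting a click and fall back to the raw recorded coordinates when no element is found.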
+ +**Available Providers**: + +| Provider | Backend | Status | GPU Required | +|----------|---------|--------|--------------| +| OmniGrounder | OmniParser | Working | Yes (CUDA) | +| GeminiGrounder | Gemini API | Working | No | +| SoMGrounder | Set-of-Marks | Working | Yes | + +**Next Actions**: +- [ ] Integrate with `openadapt-ml` action replay +- [ ] Test OmniGrounder with recorded captures +- [ ] Test GeminiGrounder with API key +- [ ] Add grounding visualization to `openadapt-viewer` +- [ ] Document grounding provider selection +- [ ] Fix Docker build for OmniParser server + +--- + +### 10. Viewer Dashboard Features + +| Field | Value | +|-------|-------| +| **Status** | Basic HTML Generation Works | +| **Effort** | Medium (4-8 hours) | +| **Owner** | TBD | +| **Package** | `openadapt-viewer` | + +**Description**: `openadapt-viewer` generates HTML but could be enhanced for better debugging and analysis. + +**Requested Features**: + +| Feature | Priority | Complexity | +|---------|----------|------------| +| Video playback from screenshots | High | Medium | +| Action timeline with seek | High | Medium | +| Side-by-side comparison view | Medium | Low | +| Filtering by action type | Medium | Low | +| Benchmark result integration | Medium | Medium | +| Failure analysis tools | Medium | High | + +**Next Actions**: +- [ ] Add video playback (from captured screenshots) +- [ ] Add action timeline with seek +- [ ] Add side-by-side comparison view +- [ ] Add filtering by action type +- [ ] Integrate with benchmark results for failure analysis + +--- + +## P3 - Lower: Nice to Have + +### 11. Telemetry (GlitchTip) + +| Field | Value | +|-------|-------| +| **Status** | Design Doc Complete | +| **Effort** | Large (1-2 weeks) | +| **Owner** | TBD | +| **Design Doc** | `docs/design/telemetry-design.md` | + +**Description**: Create `openadapt-telemetry` package for unified error tracking and usage analytics across all packages. + +**Key Features**: +- GlitchTip/Sentry SDK integration +- Privacy filtering (path sanitization, PII scrubbing) +- Internal user tagging (CI detection, dev mode) +- Opt-out mechanisms (DO_NOT_TRACK env var) + +**Next Actions**: +- [ ] Create `openadapt-telemetry` package scaffold +- [ ] Implement Sentry/GlitchTip integration +- [ ] Add privacy filtering (path sanitization, PII scrubbing) +- [ ] Add internal user tagging (CI detection, dev mode) +- [ ] Create opt-out mechanisms (DO_NOT_TRACK env var) +- [ ] Integrate with openadapt-evals as pilot + +--- + +### 12. Additional Benchmarks (WebArena, OSWorld) + +| Field | Value | +|-------|-------| +| **Status** | Future Consideration | +| **Effort** | Large (2-4 weeks) | +| **Owner** | TBD | +| **Package** | `openadapt-evals` | + +**Description**: Expand evaluation infrastructure beyond WAA. + +**Target Benchmarks**: + +| Benchmark | Type | Status | Priority | +|-----------|------|--------|----------| +| Windows Agent Arena (WAA) | Desktop | In Progress | High | +| WebArena | Web Browser | Not Started | Medium | +| OSWorld | Cross-Platform | Not Started | Medium | +| MiniWoB++ | Synthetic | Not Started | Low | + +**Next Actions**: +- [ ] Implement WebArena adapter for browser automation +- [ ] Implement OSWorld adapter for cross-platform desktop +- [ ] Create unified metrics across benchmarks +- [ ] Add benchmark comparison view + +--- + +### 13. 
Documentation Site (docs.openadapt.ai) + +| Field | Value | +|-------|-------| +| **Status** | MkDocs Configured, Needs Deployment | +| **Effort** | Medium (4-6 hours) | +| **Owner** | TBD | +| **Config** | `mkdocs.yml` | + +**Description**: Documentation site using MkDocs with existing markdown files. + +**Existing Documentation**: +- `docs/index.md` - Home page +- `docs/architecture.md` - System architecture +- `docs/cli.md` - CLI reference +- `docs/packages/*.md` - Package documentation +- `docs/getting-started/*.md` - Installation, quickstart, permissions + +**Next Actions**: +- [ ] Verify `mkdocs.yml` configuration +- [ ] Run `mkdocs build` and test locally +- [ ] Set up GitHub Actions for auto-deploy to GitHub Pages +- [ ] Configure CNAME for docs.openadapt.ai +- [ ] Add API reference (auto-generated from docstrings) +- [ ] Write getting-started tutorial (5-minute quickstart) + +--- + +## Dependency Graph + +``` +P0: Fix CI (PR #969) ─────────────────────────────────────────────────┐ +P0: Docker Build ─────────────────────────────────────────────────────┤ +P0: Verify Meta-Package ──────────────────────────────────────────────┤ +P0: Basic Workflow ───────────────────────────────────────────────────┤ + │ + v +P1: Baseline Adapters ────────────────────────────────────────────────┤ +P1: Demo Conditioning ────────────────────────────────────────────────┤ +P1: WAA Benchmark ────────────────────────────────────────────────────┘ + │ + v +P2: Safety Gates ─────────────────────────────────────────────────────┐ +P2: Grounding Improvements ───────────────────────────────────────────┤ +P2: Viewer Dashboard ─────────────────────────────────────────────────┘ + │ + v +P3: Telemetry (GlitchTip) ────────────────────────────────────────────┐ +P3: Additional Benchmarks ────────────────────────────────────────────┤ +P3: Documentation Site ───────────────────────────────────────────────┘ +``` + +--- + +## Technical Debt + +### Known Issues + +| Issue | Severity | Package | Notes | +|-------|----------|---------|-------| +| Python version mismatch | Medium | `openadapt-ml` | Requires 3.12+, others 3.10+ | +| `capture stop` TODO | Low | `openadapt` CLI | Uses Ctrl+C instead of signal/file | +| `release-and-publish.yml` uses hatchling | Low | Main repo | Aligned with meta-package | +| Legacy code | Low | `/legacy/` | Many TODOs, not blocking v1.0 | + +### Code Quality + +| Package | TODOs | Notes | +|---------|-------|-------| +| `openadapt/cli.py` | 1 | Implement stop via signal/file | +| `legacy/` | 100+ | Historical, not blocking v1.0 | + +--- + +## Success Criteria + +### P0 Complete (This Week) + +- [ ] CI passes on all matrix combinations (Python 3.10/3.11/3.12, macOS/Ubuntu) +- [ ] PR #969 merged +- [ ] Docker build succeeds for OmniParser +- [ ] `pip install openadapt[core]` works on Python 3.12 +- [ ] Basic capture/eval workflow demonstrated + +### P1 Complete (1-2 Weeks) + +- [ ] API agents (Claude, GPT-4V) working with demo conditioning +- [ ] WAA baseline established with metrics +- [ ] First-action accuracy validated (33% -> 100% with demo) + +### P2 Complete (This Month) + +- [ ] Safety gates implemented and documented +- [ ] Grounding improving action accuracy +- [ ] Viewer dashboard with video playback + +### P3 Complete (Backlog) + +- [ ] Telemetry package published +- [ ] docs.openadapt.ai live +- [ ] Additional benchmarks integrated + +--- + +## Resources Required + +| Resource | Purpose | Status | +|----------|---------|--------| +| Azure credits | WAA benchmark VM | Needed | +| Anthropic 
API key | Claude testing | Available | +| OpenAI API key | GPT-4V testing | Needed | +| Google API key | Gemini testing | Needed | +| Test machines | Windows 10/11, Ubuntu 22.04/24.04 | Needed | +| DNS access | docs.openadapt.ai CNAME | Needed | + +--- + +## Appendix: Quick Reference + +### PyPI Package URLs + +- https://pypi.org/project/openadapt/ +- https://pypi.org/project/openadapt-capture/ +- https://pypi.org/project/openadapt-ml/ +- https://pypi.org/project/openadapt-evals/ +- https://pypi.org/project/openadapt-viewer/ +- https://pypi.org/project/openadapt-grounding/ +- https://pypi.org/project/openadapt-retrieval/ +- https://pypi.org/project/openadapt-privacy/ + +### GitHub Repositories + +- Main: https://github.com/OpenAdaptAI/openadapt +- Sub-packages: https://github.com/OpenAdaptAI/openadapt-{capture,ml,evals,viewer,grounding,retrieval,privacy} + +### Related Documents + +- Architecture: `/docs/architecture.md` +- Telemetry Design: `/docs/design/telemetry-design.md` +- Landing Page Strategy: `/docs/design/landing-page-strategy.md` +- Legacy Freeze: `/docs/legacy/freeze.md` + +--- + +*This roadmap is a living document. Update as priorities shift based on user feedback and technical discoveries.* diff --git a/mkdocs.yml b/mkdocs.yml index 39e4c9985..c0b353060 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -116,8 +116,23 @@ nav: - openadapt-grounding: packages/grounding.md - openadapt-retrieval: packages/retrieval.md - openadapt-privacy: packages/privacy.md - - Architecture: architecture.md + - Architecture: + - Overview: architecture.md + - Evolution: architecture-evolution.md + - Design: + - Index: design/INDEX.md + - System Tray App: design/openadapt-tray.md + - Tray Logging: design/tray-logging.md + - Telemetry: design/telemetry-design.md + - Landing Page: design/landing-page-strategy.md + - Repo Rename Analysis: design/repo-rename-analysis.md + - Roadmap: + - Priorities: roadmap-priorities.md + - Publications: publication-roadmap.md - CLI Reference: cli.md - Contributing: contributing.md - Legacy: - Legacy Freeze: legacy/freeze.md + - Legacy Freeze (Alt): LEGACY_FREEZE.md + - Reference: + - macOS Permissions: permissions-macos.md diff --git a/pyproject.toml b/pyproject.toml index b737bbb04..e27f1f34a 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -6,7 +6,7 @@ readme = "README.md" requires-python = ">=3.10" license = "MIT" authors = [ - {name = "MLDSAI Inc.", email = "richard@mldsai.com"} + {name = "Richard Abrich", email = "richard@openadapt.ai"} ] keywords = ["gui", "automation", "ml", "rpa", "agent", "vlm", "computer-use"] classifiers = [