Turn any website into a REST API by scraping it live with Playwright.
Web2API starts with no recipes installed by default. Recipes are installed into a local recipes directory from a catalog source (local path or git repo), then discovered at runtime. Each recipe defines endpoints with selectors, actions, fields, and pagination in YAML. Optional Python scrapers handle interactive or complex sites. Optional plugin metadata can declare external dependencies and required env vars.
Recipe Repository — browse and install available recipes from the catalog.
Installed APIs — active recipes with their API endpoints and copy-to-clipboard URLs.
- Recipe: a site integration folder (`recipe.yaml` + optional `scraper.py`) that exposes API endpoints.
- Plugin metadata: optional `plugin.yaml` inside a recipe that declares dependencies, healthchecks, and compatibility.
In this project, recipe lifecycle operations are always recipes commands. plugin.yaml is only
for optional dependency/runtime metadata inside a recipe.
- Arbitrary named endpoints — recipes define as many endpoints as needed (not limited to read/search)
- Declarative YAML recipes with selectors, actions, transforms, and pagination
- Custom Python scrapers for interactive sites (e.g. typing text, waiting for dynamic content)
- Optional plugin metadata (`plugin.yaml`) for recipe-specific dependency requirements
- Shared browser/context pool for concurrent Playwright requests
- In-memory response cache with stale-while-revalidate
- Unified JSON response schema across all recipes and endpoints
- Docker deployment with auto-restart
git clone https://github.com/Endogen/web2api.git
cd web2api
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
playwright install --with-deps chromium
Start the service:
uvicorn web2api.main:app --host 0.0.0.0 --port 8010
Install recipes (in a separate terminal):
web2api recipes catalog list
web2api recipes catalog add hackernews --yes
Service: http://localhost:8010
curl -s http://localhost:8010/health | jq
curl -s http://localhost:8010/api/sites | jq

git clone https://github.com/Endogen/web2api.git
cd web2api
docker compose up --build -d
Service: http://localhost:8010
curl -s http://localhost:8010/health | jq
curl -s http://localhost:8010/api/sites | jq
Install recipes via the CLI inside the container:
docker compose exec web2api web2api recipes catalog list
docker compose exec web2api web2api recipes catalog add hackernews --yes
Note: When using Docker, all `web2api` CLI commands must be prefixed with `docker compose exec web2api` since the CLI is installed inside the container.
- Provision host with Python 3.12+, Chromium dependencies, and optional Docker/Nginx.
- Set persistent recipe storage (`RECIPES_DIR`, for example `/var/lib/web2api/recipes`).
- Use the default official catalog repo (`https://github.com/Endogen/web2api-recipes.git`) or override via `WEB2API_RECIPE_CATALOG_SOURCE` (plus optional `WEB2API_RECIPE_CATALOG_REF` / `WEB2API_RECIPE_CATALOG_PATH`).
- Run Web2API as a long-lived process (systemd, container, or supervisor).
- Install initial recipes via CLI/API/UI.
- Put reverse proxy/TLS in front (Nginx/Caddy/Traefik) for production.
Web2API ships with a management CLI:
web2api --help
# List all recipe folders with metadata readiness
web2api recipes list
# Check missing env vars/commands/packages
web2api recipes doctor
web2api recipes doctor x
web2api recipes doctor x --no-run-healthchecks
web2api recipes doctor x --allow-untrusted
# Install recipe from source
web2api recipes add ./my-recipe
web2api recipes add https://github.com/acme/web2api-recipes.git --ref v1.2.0 --subdir recipes/news
# Update managed recipe from recorded source
web2api recipes update x --yes
web2api recipes update x --ref v1.3.0 --subdir recipes/x --yes
# Install recipe from catalog
web2api recipes catalog list
web2api recipes catalog add hackernews --yes
web2api recipes catalog list --catalog-source https://github.com/acme/web2api-recipes.git
# Install declared dependencies from recipe metadata (host)
web2api recipes install x --yes
web2api recipes install x --apt --yes # include apt packages
# Generate Dockerfile snippet for recipe metadata dependencies
web2api recipes install x --target docker --apt
# Remove recipe + manifest record
web2api recipes uninstall x --yes
# Disable/enable a recipe (writes/removes recipes/<slug>/.disabled)
web2api recipes disable x --yes
web2api recipes enable x
`recipes install` does not run apt installs unless `--apt` is explicitly passed.
Install-state records are stored in <RECIPES_DIR>/.web2api_recipes.json.
Default RECIPES_DIR is ~/.web2api/recipes.
Catalog defaults come from:
- `WEB2API_RECIPE_CATALOG_SOURCE` (path or git URL)
- `WEB2API_RECIPE_CATALOG_REF` (optional git ref)
- `WEB2API_RECIPE_CATALOG_PATH` (catalog file path inside source, default `catalog.yaml`)

If `WEB2API_RECIPE_CATALOG_SOURCE` is unset, Web2API uses the official remote repo `https://github.com/Endogen/web2api-recipes.git`.
`recipes update` works only for recipes tracked in the manifest.
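The catalog defaults can be sketched in Python as follows. This is only an illustration of the fallback order described here, not Web2API's actual code: `CatalogSettings` and `resolve_catalog_settings` are hypothetical names, and treating an empty variable as unset is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Defaults taken from this README.
DEFAULT_CATALOG_SOURCE = "https://github.com/Endogen/web2api-recipes.git"
DEFAULT_CATALOG_PATH = "catalog.yaml"


@dataclass(frozen=True)
class CatalogSettings:  # hypothetical container, not a Web2API class
    source: str
    ref: Optional[str]
    path: str


def resolve_catalog_settings(env: Mapping[str, str] = os.environ) -> CatalogSettings:
    """Read the three catalog variables, falling back to the documented defaults."""
    # Assumption: an empty or whitespace-only value counts as "unset".
    source = env.get("WEB2API_RECIPE_CATALOG_SOURCE", "").strip() or DEFAULT_CATALOG_SOURCE
    ref = env.get("WEB2API_RECIPE_CATALOG_REF", "").strip() or None
    path = env.get("WEB2API_RECIPE_CATALOG_PATH", "").strip() or DEFAULT_CATALOG_PATH
    return CatalogSettings(source=source, ref=ref, path=path)
```

Passing a plain dict instead of `os.environ` makes it easy to check what a given deployment configuration would resolve to.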
Catalog entries can include optional setup hints:
- `requires_env`: list of required environment variable names (e.g. `["BIRD_AUTH_TOKEN", "BIRD_CT0"]`)
- `docs_url` (or `readme_url`): URL shown in CLI/UI as setup documentation
If docs_url is omitted and the recipe source resolves to GitHub, Web2API automatically
links to <repo>/blob/<ref-or-HEAD>/<subdir>/README.md.
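A minimal sketch of how that fallback link could be assembled. The function name `default_docs_url` is hypothetical, and two details are assumptions: only `https://github.com/` URLs are treated as GitHub sources, and a trailing `.git` is stripped before the path is appended.

```python
from typing import Optional


def default_docs_url(repo_url: str, ref: Optional[str] = None, subdir: str = "") -> Optional[str]:
    """Build <repo>/blob/<ref-or-HEAD>/<subdir>/README.md for a GitHub source."""
    repo = repo_url.strip().rstrip("/")
    if not repo.startswith("https://github.com/"):
        return None  # assumption: no automatic link for non-GitHub sources
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]  # assumption: clone suffix is dropped
    parts = [repo, "blob", ref or "HEAD"]
    cleaned = subdir.strip("/")
    if cleaned:
        parts.append(cleaned)
    parts.append("README.md")
    return "/".join(parts)
```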
Recipes installed from untrusted sources (for example git URLs) are blocked from executing
install/healthcheck commands unless --allow-untrusted is passed.
You can use custom recipes without publishing them to the recipe repository:
# Direct local path install into RECIPES_DIR (tracked as source_type=local)
web2api recipes add ./my-recipe --yes
# Or copy folder manually into RECIPES_DIR/<slug> (unmanaged local recipe)
cp -r ./my-recipe "$RECIPES_DIR/<slug>"
Recipe origin visibility:
- `source_type=catalog|git|local` in manifest-backed installs
- `origin=unmanaged` for manual local folders not tracked in manifest
- The web UI manager shows both catalog recipes and local-only installed recipes
# Show current version + recommended update method
web2api self update check
# Apply update using auto-detected method (pip/git/docker)
web2api self update apply --yes
# Pin explicit method or target version/ref
web2api self update apply --method pip --to 0.1.0 --yes
web2api self update apply --method git --to v0.1.0 --yes
For `--method git`, `self update apply` checks out a tag:
- if `--to` is provided, that tag/ref is used
- if `--to` is omitted, the latest sortable git tag is used
After self update apply, the CLI automatically runs web2api recipes doctor.
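The tag selection for `--method git` might look roughly like the sketch below. The README does not define "sortable", so this assumes it means tags of the form `v<major>.<minor>.<patch>` (the `v` prefix optional) compared numerically; `pick_update_tag` is a hypothetical helper, not part of the CLI.

```python
import re
from typing import Iterable, Optional

# Assumption: a "sortable" tag looks like v1.2.3 or 1.2.3.
_TAG_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def pick_update_tag(tags: Iterable[str], to: Optional[str] = None) -> Optional[str]:
    """Return the explicit --to value, else the numerically highest version tag."""
    if to:
        return to
    candidates = []
    for tag in tags:
        match = _TAG_RE.match(tag)
        if match:
            version = tuple(int(part) for part in match.groups())
            candidates.append((version, tag))
    if not candidates:
        return None  # nothing sortable to check out
    return max(candidates)[1]
```

Comparing integer tuples rather than raw strings is what keeps `v0.10.0` ahead of `v0.9.0`.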
Recipe availability is dynamic. Use discovery endpoints instead of relying on a static README list.
# List all discovered sites and endpoint metadata
curl -s "http://localhost:8010/api/sites" | jq
# Print endpoint paths with required params
curl -s "http://localhost:8010/api/sites" | jq -r '
.[] as $site
| $site.endpoints[]
| "/\($site.slug)/\(.name) params: page" + (if .requires_query then ", q" else "" end)
'
# Print ready-to-run URL templates
curl -s "http://localhost:8010/api/sites" | jq -r '
.[] as $site
| $site.endpoints[]
| "http://localhost:8010/\($site.slug)/\(.name)?"
+ (if .requires_query then "q=<query>&" else "" end)
+ "page=1"
'
# Example call pattern (no query endpoint)
curl -s "http://localhost:8010/{slug}/{endpoint}?page=1" | jq
# Example call pattern (query endpoint)
curl -s "http://localhost:8010/{slug}/{endpoint}?q=hello&page=1" | jq
For custom scraper parameters beyond `page` and `q`, check the specific recipe folder
(`recipes/<slug>/scraper.py`).
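For scripts that prefer Python over `jq`, the URL-template one-liner above can be approximated as follows. It relies only on the fields the `jq` filters already use (`slug`, `endpoints[].name`, `endpoints[].requires_query`); the function names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, List


def fetch_sites(base_url: str = "http://localhost:8010") -> List[Dict[str, Any]]:
    """Download the discovery payload from a running Web2API instance."""
    with urllib.request.urlopen(f"{base_url}/api/sites") as response:
        return json.load(response)


def endpoint_url_templates(
    sites: List[Dict[str, Any]], base_url: str = "http://localhost:8010"
) -> List[str]:
    """Build one ready-to-edit URL template per discovered endpoint."""
    urls = []
    for site in sites:
        for endpoint in site.get("endpoints", []):
            # Only endpoints flagged requires_query get a q placeholder.
            query = "q=<query>&" if endpoint.get("requires_query") else ""
            urls.append(f"{base_url}/{site['slug']}/{endpoint['name']}?{query}page=1")
    return urls
```

With the service running, `endpoint_url_templates(fetch_sites())` yields the same lines as the `jq` version.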
| Endpoint | Description |
|---|---|
| `GET /` | HTML index listing all recipes and endpoints |
| `GET /health` | Service, browser pool, and cache health |
| `GET /api/sites` | JSON list of all recipes with endpoint metadata |
| `GET /api/recipes/manage` | JSON catalog + installed recipe state for UI/automation |
| `POST /api/recipes/manage/install/{name}` | Install recipe by catalog entry name |
| `POST /api/recipes/manage/update/{slug}` | Update installed managed recipe |
| `POST /api/recipes/manage/uninstall/{slug}` | Uninstall recipe (add `?force=true` for unmanaged local recipes) |
| `POST /api/recipes/manage/enable/{slug}` | Enable installed recipe |
| `POST /api/recipes/manage/disable/{slug}` | Disable installed recipe |
GET /api/recipes/manage includes:
- `catalog`: entries from the current catalog source
- `installed`: discovered recipes from `RECIPES_DIR`
- `origin`: one of `catalog`, `git`, `local`, `unmanaged`
All recipe endpoints follow the pattern: GET /{slug}/{endpoint}?page=1&q=...
- `page`: pagination (default: 1)
- `q`: query text (required when `requires_query: true`)
- additional query params are passed to custom scrapers
- extra query param names must match `[a-zA-Z0-9][a-zA-Z0-9_-]{0,63}` and values are capped at 512 chars
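The rules for extra query parameters can be illustrated with a small validator. This is a sketch rather than the service's implementation: `validate_extra_params` is a hypothetical name, and rejecting over-long values (instead of truncating them) is an assumption.

```python
import re
from typing import Dict, Mapping

# Name pattern and value cap as documented above.
_NAME_RE = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9_-]{0,63}")
MAX_VALUE_LENGTH = 512
BUILTIN_PARAMS = {"page", "q"}


def validate_extra_params(query: Mapping[str, str]) -> Dict[str, str]:
    """Return the extra params (everything except page/q) or raise ValueError."""
    extras: Dict[str, str] = {}
    for name, value in query.items():
        if name in BUILTIN_PARAMS:
            continue  # page and q are handled separately
        if not _NAME_RE.fullmatch(name):
            raise ValueError(f"invalid query parameter name: {name!r}")
        if len(value) > MAX_VALUE_LENGTH:
            # Assumption: over-long values are rejected, not truncated.
            raise ValueError(f"value for {name!r} exceeds {MAX_VALUE_LENGTH} chars")
        extras[name] = value
    return extras
```

Using `fullmatch` matters here: a plain `match` would accept any name that merely starts with a valid prefix.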
| HTTP | Code | When |
|---|---|---|
| 400 | `INVALID_PARAMS` | Missing required `q` or invalid extra query parameters |
| 404 | — | Unknown recipe or endpoint |
| 502 | `SCRAPE_FAILED` | Browser/upstream failure |
| 504 | `SCRAPE_TIMEOUT` | Scrape exceeded timeout |
- Successful responses are cached in-memory by `(slug, endpoint, page, q, extra params)`.
- Cache hits return `metadata.cached: true`.
- Stale entries can be served immediately while a background refresh updates the cache.
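The caching behaviour can be pictured with the sketch below. It is not Web2API's cache: `SWRCache` and `cache_key` are hypothetical, the background refresh itself is left to the caller, and it assumes the stale window starts once the fresh TTL has expired.

```python
import time
from typing import Any, Callable, Dict, Hashable, Mapping, Optional, Tuple


def cache_key(slug: str, endpoint: str, page: int, q: Optional[str],
              extra: Mapping[str, str]) -> Tuple:
    """Key by (slug, endpoint, page, q, extra params); extras sorted for stability."""
    return (slug, endpoint, page, q, tuple(sorted(extra.items())))


class SWRCache:
    """Tiny stale-while-revalidate lookup table (illustration only)."""

    def __init__(self, ttl: float = 30.0, stale_ttl: float = 120.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self.stale_ttl = stale_ttl
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def store(self, key: Hashable, value: Any) -> None:
        self._entries[key] = (self._clock(), value)

    def lookup(self, key: Hashable) -> Tuple[str, Any]:
        """Return ('fresh' | 'stale' | 'miss', value)."""
        entry = self._entries.get(key)
        if entry is None:
            return "miss", None
        stored_at, value = entry
        age = self._clock() - stored_at
        if age <= self.ttl:
            return "fresh", value
        if age <= self.ttl + self.stale_ttl:
            # Caller serves this value and schedules a refresh in the background.
            return "stale", value
        del self._entries[key]
        return "miss", None
```

On a `stale` result the caller would return the value right away and trigger a re-scrape that later calls `store` with the new response.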
{
"site": { "name": "...", "slug": "...", "url": "..." },
"endpoint": "read",
"query": null,
"items": [
{
"title": "Example title",
"url": "https://example.com",
"fields": { "score": 153, "author": "pg" }
}
],
"pagination": {
"current_page": 1,
"has_next": true,
"has_prev": false,
"total_pages": null,
"total_items": null
},
"metadata": {
"scraped_at": "2026-02-18T12:34:56Z",
"response_time_ms": 1832,
"item_count": 30,
"cached": false
},
"error": null
}

recipes/
<slug>/
recipe.yaml # required — endpoint definitions
scraper.py # optional — custom Python scraper
plugin.yaml # optional — dependency metadata and runtime checks
README.md # optional — documentation
- Folder name must match `slug`
- `slug` cannot be a reserved system route (`api`, `health`, `docs`, `openapi`, `redoc`)
- Recipe folders containing `.disabled` are skipped by discovery
- Recipes installed via CLI/API/UI are loaded immediately
- If you edit recipe files manually on disk, restart the service to reload them
- Invalid recipes are skipped with warning logs
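The discovery rules above can be checked before deploying a recipe with a helper along these lines. `skip_reason` is a hypothetical function, the order of the checks is an assumption, and the slug is passed in by the caller so the sketch does not need a YAML parser.

```python
from pathlib import Path
from typing import Optional

# Reserved system routes listed in this README.
RESERVED_SLUGS = {"api", "health", "docs", "openapi", "redoc"}


def skip_reason(recipe_dir: Path, declared_slug: str) -> Optional[str]:
    """Return why discovery would skip this folder, or None if it looks loadable."""
    if not (recipe_dir / "recipe.yaml").is_file():
        return "missing recipe.yaml"
    if (recipe_dir / ".disabled").exists():
        return "disabled"
    if recipe_dir.name != declared_slug:
        return "folder name does not match slug"
    if declared_slug in RESERVED_SLUGS:
        return "slug is a reserved system route"
    return None
```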
name: "Example Site"
slug: "examplesite"
base_url: "https://example.com"
description: "Scrapes example.com listings and search"
endpoints:
read:
description: "Browse listings"
url: "https://example.com/list?page={page}"
actions:
- type: wait
selector: ".item"
timeout: 10000
items:
container: ".item"
fields:
title:
selector: "a.title"
attribute: "text"
url:
selector: "a.title"
attribute: "href"
transform: "absolute_url"
pagination:
type: "page_param"
param: "page"
start: 1
search:
description: "Search listings"
requires_query: true
url: "https://example.com/search?q={query}&page={page_zero}"
items:
container: ".result"
fields:
title:
selector: "a"
attribute: "text"
pagination:
type: "page_param"
param: "page"
      start: 0

| Field | Required | Description |
|---|---|---|
| `url` | yes | URL template with `{page}`, `{page_zero}`, `{query}` placeholders |
| `description` | no | Human-readable endpoint description |
| `requires_query` | no | If true, the `q` parameter is mandatory (default: false) |
| `actions` | no | Playwright actions to run before extraction |
| `items` | yes | Container selector + field definitions |
| `pagination` | yes | Pagination strategy (`page_param`, `offset_param`, or `next_link`) |
Pagination notes:
{page} resolves to start + ((api_page - 1) * step).
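As a worked example of that formula, the sketch below computes the upstream page value and substitutes it into a URL template. The helper names are hypothetical and the default `step` of 1 is an assumption; only the `{page}` placeholder is handled.

```python
def resolve_page(api_page: int, start: int = 1, step: int = 1) -> int:
    """Map the API's 1-based page number to the upstream value."""
    if api_page < 1:
        raise ValueError("api_page starts at 1")
    return start + (api_page - 1) * step


def render_page_url(template: str, api_page: int, start: int = 1, step: int = 1) -> str:
    """Fill only the {page} placeholder, leaving other placeholders untouched."""
    return template.replace("{page}", str(resolve_page(api_page, start, step)))
```

For instance, with `start: 0` and a step of 25, API page 3 resolves to `0 + (3 - 1) * 25 = 50`.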
| Type | Parameters |
|---|---|
| `wait` | `selector`, `timeout` (optional) |
| `click` | `selector` |
| `scroll` | `direction` (down/up), `amount` (pixels or "bottom") |
| `type` | `selector`, `text` |
| `sleep` | `ms` |
| `evaluate` | `script` |
strip · strip_html · regex_int · regex_float · iso_date · absolute_url
self (default) · next_sibling · parent
For interactive or complex sites, add a scraper.py with a Scraper class:
from playwright.async_api import Page
from web2api.scraper import BaseScraper, ScrapeResult
class Scraper(BaseScraper):
def supports(self, endpoint: str) -> bool:
return endpoint in {"de-en", "en-de"}
async def scrape(self, endpoint: str, page: Page, params: dict) -> ScrapeResult:
# page is BLANK — navigate yourself
await page.goto("https://example.com")
# ... interact with the page ...
return ScrapeResult(
items=[{"title": "result", "fields": {"key": "value"}}],
current_page=params["page"],
has_next=False,
        )
- `supports(endpoint)`: declare which endpoints use custom scraping
- `scrape(endpoint, page, params)`: `page` is blank, you must `goto()` yourself
- `params` always contains `page` (int) and `query` (str | None)
- `params` also includes validated extra query params (for example `count`)
- Endpoints not handled by the scraper fall back to declarative YAML
Use plugin.yaml to declare install/runtime requirements for a recipe:
version: "1.0.0"
web2api:
min: "0.2.0"
max: "1.0.0"
requires_env:
- BIRD_AUTH_TOKEN
- BIRD_CT0
dependencies:
commands:
- bird
python:
- httpx
apt:
- nodejs
npm:
- "@steipete/bird"
healthcheck:
  command: ["bird", "--version"]
Version bounds in `web2api.min` / `web2api.max` use numeric `major.minor.patch` format.
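A compatibility check against those bounds could be sketched as below. `parse_version` and `is_compatible` are hypothetical helpers, and treating both bounds as inclusive is an assumption the README does not confirm.

```python
import re
from typing import Optional, Tuple

_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)$", re.ASCII)


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse a numeric major.minor.patch string into a comparable tuple."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"expected major.minor.patch, got {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def is_compatible(current: str, minimum: Optional[str] = None,
                  maximum: Optional[str] = None) -> bool:
    """Check current against optional min/max bounds (assumed inclusive)."""
    version = parse_version(current)
    if minimum is not None and version < parse_version(minimum):
        return False
    if maximum is not None and version > parse_version(maximum):
        return False
    return True
```

With the bounds from the example above, `is_compatible("0.5.0", "0.2.0", "1.0.0")` is true, while `0.1.9` falls below the minimum.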
GET /api/sites now includes a plugin block (or null) with:
- declared metadata from `plugin.yaml`
- computed `status.ready` plus missing env vars/commands/python packages
- unverified package declarations (`apt`, `npm`) for operators
Compatibility enforcement:
- `PLUGIN_ENFORCE_COMPATIBILITY=false` (default): incompatible plugins are loaded but reported as not ready.
- `PLUGIN_ENFORCE_COMPATIBILITY=true`: incompatible plugins are skipped at discovery time.
Environment variables (with defaults):
| Variable | Default | Description |
|---|---|---|
| `POOL_MAX_CONTEXTS` | 5 | Max browser contexts in pool |
| `POOL_CONTEXT_TTL` | 50 | Requests per context before recycling |
| `POOL_ACQUIRE_TIMEOUT` | 30 | Seconds to wait for a context |
| `POOL_PAGE_TIMEOUT` | 15000 | Page navigation timeout (ms) |
| `POOL_QUEUE_SIZE` | 20 | Max queued requests |
| `SCRAPE_TIMEOUT` | 30 | Overall scrape timeout (seconds) |
| `CACHE_ENABLED` | true | Enable in-memory response caching |
| `CACHE_TTL_SECONDS` | 30 | Fresh cache duration in seconds |
| `CACHE_STALE_TTL_SECONDS` | 120 | Stale-while-revalidate window in seconds |
| `CACHE_MAX_ENTRIES` | 500 | Maximum cached request variants |
| `RECIPES_DIR` | `~/.web2api/recipes` | Path to recipes directory |
| `WEB2API_RECIPE_CATALOG_SOURCE` | `https://github.com/Endogen/web2api-recipes.git` | Catalog source path or git URL |
| `WEB2API_RECIPE_CATALOG_REF` | empty | Optional git ref for catalog source |
| `WEB2API_RECIPE_CATALOG_PATH` | `catalog.yaml` | Catalog file path inside catalog source |
| `PLUGIN_ENFORCE_COMPATIBILITY` | false | Skip plugin recipes outside declared `web2api` version bounds |
| `BIRD_AUTH_TOKEN` | empty | X/Twitter auth token for `x` recipe |
| `BIRD_CT0` | empty | X/Twitter ct0 token for `x` recipe |
# Inside the container or with deps installed:
pytest tests/unit tests/integration --timeout=30 -x -q

- Python 3.12 + FastAPI + Playwright (Chromium)
- Pydantic for config validation
- Docker for deployment
MIT