# Crawl4AI v0.7.6 Release Notes

*Release Date: October 22, 2025*

I'm excited to announce Crawl4AI v0.7.6, featuring a complete webhook infrastructure for the Docker job queue API! This release eliminates polling and brings real-time notifications to both crawling and LLM extraction workflows.

## 🎯 What's New

### Webhook Support for Docker Job Queue API

The headline feature of v0.7.6 is comprehensive webhook support for asynchronous job processing. No more constant polling to check if your jobs are done - get instant notifications when they complete!

**Key Capabilities:**

- ✅ **Universal Webhook Support**: Both `/crawl/job` and `/llm/job` endpoints now support webhooks
- ✅ **Flexible Delivery Modes**: Choose notification-only or include full data in the webhook payload
- ✅ **Reliable Delivery**: Exponential backoff retry mechanism (5 attempts: 1s → 2s → 4s → 8s → 16s; see the sketch after this list)
- ✅ **Custom Authentication**: Add custom headers for webhook authentication
- ✅ **Global Configuration**: Set default webhook URL in `config.yml` for all jobs
- ✅ **Task Type Identification**: Distinguish between `crawl` and `llm_extraction` tasks

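For context, the retry behavior described above is plain exponential backoff. Here's a minimal sketch of the pattern (the function and its parameters are illustrative, not the server's internal implementation):

```python
import time

import requests


def deliver_webhook(url: str, payload: dict, headers: dict | None = None,
                    max_attempts: int = 5, initial_delay: float = 1.0) -> bool:
    """Illustrative exponential-backoff delivery: 1s -> 2s -> 4s -> 8s -> 16s."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.post(url, json=payload, headers=headers, timeout=30)
            if response.status_code < 400:
                return True  # delivered successfully
        except requests.RequestException:
            pass  # network error: fall through and retry
        if attempt < max_attempts:
            time.sleep(delay)
            delay *= 2  # double the wait before the next attempt
    return False  # all attempts exhausted
```
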
### How It Works

Instead of constantly checking job status:

**OLD WAY (Polling):**
```python
import time

import requests

# Submit job
payload = {"urls": ["https://example.com"]}
response = requests.post("http://localhost:11235/crawl/job", json=payload)
task_id = response.json()['task_id']

# Poll until complete
while True:
    status = requests.get(f"http://localhost:11235/crawl/job/{task_id}")
    if status.json()['status'] == 'completed':
        break
    time.sleep(5)  # Wait and try again
```

**NEW WAY (Webhooks):**
```python
import requests

# Submit job with webhook
payload = {
    "urls": ["https://example.com"],
    "webhook_config": {
        "webhook_url": "https://myapp.com/webhook",
        "webhook_data_in_payload": True
    }
}
response = requests.post("http://localhost:11235/crawl/job", json=payload)

# Done! The webhook will notify you when the job completes -
# your webhook handler receives the results automatically.
```

### Crawl Job Webhooks

```bash
curl -X POST http://localhost:11235/crawl/job \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "browser_config": {"headless": true},
    "crawler_config": {"cache_mode": "bypass"},
    "webhook_config": {
      "webhook_url": "https://myapp.com/webhooks/crawl-complete",
      "webhook_data_in_payload": false,
      "webhook_headers": {
        "X-Webhook-Secret": "your-secret-token"
      }
    }
  }'
```

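On the receiving side, you'd check that secret header before trusting the request. A minimal sketch (the header name matches the example above; the route path and expected value are illustrative, and `hmac.compare_digest` is used to avoid timing leaks):

```python
import hmac

from flask import Flask, abort, jsonify, request

app = Flask(__name__)

EXPECTED_SECRET = "your-secret-token"  # same value you set in webhook_headers


@app.route('/webhooks/crawl-complete', methods=['POST'])
def crawl_complete():
    supplied = request.headers.get('X-Webhook-Secret', '')
    if not hmac.compare_digest(supplied, EXPECTED_SECRET):
        abort(401)  # reject requests that don't carry the shared secret
    payload = request.json
    print(f"Job {payload['task_id']} finished with status {payload['status']}")
    return jsonify({"status": "received"}), 200
```
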
### LLM Extraction Job Webhooks (NEW!)

```bash
curl -X POST http://localhost:11235/llm/job \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "q": "Extract the article title, author, and publication date",
    "schema": "{\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\"}}}",
    "provider": "openai/gpt-4o-mini",
    "webhook_config": {
      "webhook_url": "https://myapp.com/webhooks/llm-complete",
      "webhook_data_in_payload": true
    }
  }'
```

### Webhook Payload Structure

**Success (with data):**
```json
{
  "task_id": "llm_1698765432",
  "task_type": "llm_extraction",
  "status": "completed",
  "timestamp": "2025-10-22T10:30:00.000000+00:00",
  "urls": ["https://example.com/article"],
  "data": {
    "extracted_content": {
      "title": "Understanding Web Scraping",
      "author": "John Doe",
      "date": "2025-10-22"
    }
  }
}
```

**Failure:**
```json
{
  "task_id": "crawl_abc123",
  "task_type": "crawl",
  "status": "failed",
  "timestamp": "2025-10-22T10:30:00.000000+00:00",
  "urls": ["https://example.com"],
  "error": "Connection timeout after 30s"
}
```

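If your handler validates incoming payloads, the two shapes above map onto one small model. Here's a minimal Pydantic sketch (the model name and the optional-field treatment are my own; only the fields shown in the examples above are assumed):

```python
from datetime import datetime
from typing import Any, Literal, Optional

from pydantic import BaseModel


class WebhookPayload(BaseModel):
    task_id: str
    task_type: Literal["crawl", "llm_extraction"]
    status: Literal["completed", "failed"]
    timestamp: datetime
    urls: list[str]
    data: Optional[dict[str, Any]] = None  # present when webhook_data_in_payload is true
    error: Optional[str] = None            # present on failure
```
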
### Simple Webhook Handler Example

```python
import requests
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def handle_webhook():
    payload = request.json

    task_id = payload['task_id']
    task_type = payload['task_type']
    status = payload['status']

    if status == 'completed':
        if 'data' in payload:
            # Process data directly
            data = payload['data']
        else:
            # Fetch from API
            endpoint = 'crawl' if task_type == 'crawl' else 'llm'
            response = requests.get(f'http://localhost:11235/{endpoint}/job/{task_id}')
            data = response.json()

        # Your business logic here
        print(f"Job {task_id} completed!")

    elif status == 'failed':
        error = payload.get('error', 'Unknown error')
        print(f"Job {task_id} failed: {error}")

    return jsonify({"status": "received"}), 200

if __name__ == '__main__':
    app.run(port=8080)
```

## 📊 Performance Improvements

- **Reduced Server Load**: Eliminates constant polling requests
- **Lower Latency**: Instant notification vs. polling-interval delay
- **Better Resource Usage**: Frees up client connections while jobs run in the background
- **Scalable Architecture**: Handles high-volume crawling workflows efficiently

## 🐛 Bug Fixes

- Fixed webhook configuration serialization for Pydantic `HttpUrl` fields (see the sketch after this list)
- Improved error handling in webhook delivery service
- Enhanced Redis task storage for webhook config persistence
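
For the curious, the `HttpUrl` issue is a common Pydantic v2 pitfall: the field deserializes to a URL object that `json.dumps` can't encode directly. A minimal sketch of the pitfall and the usual remedy, assuming Pydantic v2 (the model and field names mirror the request payload shown above, not the project's internal code):

```python
import json

from pydantic import BaseModel, HttpUrl


class WebhookConfig(BaseModel):
    webhook_url: HttpUrl
    webhook_data_in_payload: bool = False


cfg = WebhookConfig(webhook_url="https://myapp.com/webhook")

# json.dumps(cfg.model_dump()) raises TypeError: the HttpUrl value
# is not JSON-serializable. Dumping in JSON mode coerces it to a
# plain string first.
print(json.dumps(cfg.model_dump(mode="json")))
```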

## 🌍 Expected Real-World Impact

### For Web Scraping Workflows
- **Reduced Costs**: Fewer API calls mean lower bandwidth and server costs
- **Better UX**: Instant notifications improve user experience
- **Scalability**: Handle hundreds of concurrent jobs without polling overhead

### For LLM Extraction Pipelines
- **Async Processing**: Submit LLM extraction jobs and move on
- **Batch Processing**: Queue multiple extractions, get notified as they complete
- **Integration**: Easy integration with workflow automation tools (Zapier, n8n, etc.)

### For Microservices
- **Event-Driven**: Perfect for event-driven microservice architectures
- **Decoupling**: Decouple job submission from result processing
- **Reliability**: Automatic retries ensure webhooks are delivered

## 🔄 Breaking Changes

**None!** This release is fully backward compatible.

- Webhook configuration is optional
- Existing code continues to work without modification
- Polling is still supported for jobs without webhook config

## 📚 Documentation

### New Documentation
- **[WEBHOOK_EXAMPLES.md](../deploy/docker/WEBHOOK_EXAMPLES.md)** - Comprehensive webhook usage guide
- **[docker_webhook_example.py](../docs/examples/docker_webhook_example.py)** - Working code examples

### Updated Documentation
- **[Docker README](../deploy/docker/README.md)** - Added webhook sections
- API documentation with webhook examples

## 🛠️ Migration Guide

No migration needed! Webhooks are opt-in:

1. **To use webhooks**: Add `webhook_config` to your job payload
2. **To keep polling**: Continue using your existing code

### Quick Start

```python
# Just add webhook_config to your existing payload
payload = {
    # Your existing configuration
    "urls": ["https://example.com"],
    "browser_config": {...},
    "crawler_config": {...},

    # NEW: Add webhook configuration
    "webhook_config": {
        "webhook_url": "https://myapp.com/webhook",
        "webhook_data_in_payload": True
    }
}
```

## 🔧 Configuration

### Global Webhook Configuration (config.yml)

```yaml
webhooks:
  enabled: true
  default_url: "https://myapp.com/webhooks/default"  # Optional
  data_in_payload: false
  retry:
    max_attempts: 5
    initial_delay_ms: 1000
    max_delay_ms: 32000
    timeout_ms: 30000
  headers:
    User-Agent: "Crawl4AI-Webhook/1.0"
```

## 🚀 Upgrade Instructions

### Docker

```bash
# Pull the latest image
docker pull unclecode/crawl4ai:0.7.6

# Or use the latest tag
docker pull unclecode/crawl4ai:latest

# Run with webhook support
docker run -d \
  -p 11235:11235 \
  --env-file .llm.env \
  --name crawl4ai \
  unclecode/crawl4ai:0.7.6
```

### Python Package

```bash
pip install --upgrade crawl4ai
```

## 💡 Pro Tips

1. **Use notification-only mode** for large results - fetch data separately to avoid large webhook payloads
2. **Set custom headers** for webhook authentication and request tracking
3. **Configure a global default webhook** for consistent handling across all jobs
4. **Implement idempotent webhook handlers** - the same webhook may be delivered more than once on retry (see the sketch after this list)
5. **Use structured schemas** with LLM extraction for predictable webhook data

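Here's a minimal sketch of what tip 4 looks like in practice, building on the Flask handler above (the in-memory set and the `process_result` stub are illustrative; a real deployment would deduplicate in Redis or a database so state survives restarts and is shared across workers):

```python
import threading

from flask import Flask, jsonify, request

app = Flask(__name__)

# Track task IDs we've already handled. An in-memory set is enough to
# illustrate the idea; it resets on restart and doesn't scale across workers.
_seen_lock = threading.Lock()
_seen_task_ids: set[str] = set()


def process_result(payload: dict) -> None:
    # Hypothetical business logic.
    print(f"Processing job {payload['task_id']}")


@app.route('/webhook', methods=['POST'])
def handle_webhook():
    payload = request.json
    task_id = payload['task_id']

    with _seen_lock:
        if task_id in _seen_task_ids:
            # Duplicate delivery from a retry: acknowledge it and do nothing,
            # so the same job is never processed twice.
            return jsonify({"status": "duplicate"}), 200
        _seen_task_ids.add(task_id)

    process_result(payload)
    return jsonify({"status": "received"}), 200


if __name__ == '__main__':
    app.run(port=8080)
```
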
## 🎬 Demo

Try the release demo:

```bash
python docs/releases_review/demo_v0.7.6.py
```

This comprehensive demo showcases:
- Crawl job webhooks (notification-only and with data)
- LLM extraction webhooks (with JSON schema support)
- Custom headers for authentication
- Webhook retry mechanism
- Real-time webhook receiver

## 🙏 Acknowledgments

Thank you to the community for the feedback that shaped this feature! Special thanks to everyone who requested webhook support for asynchronous job processing.

## 📞 Support

- **Documentation**: https://docs.crawl4ai.com
- **GitHub Issues**: https://github.com/unclecode/crawl4ai/issues
- **Discord**: https://discord.gg/crawl4ai

---

**Happy crawling with webhooks!** 🕷️🪝

*- unclecode*