The CI/CD for Prompts. Ship robust prompts with confidence by stress-testing them across model ladders and temperature ranges.
PromptLint helps you engineer prompts that are not just "good enough" for one model, but robust across many. It evaluates consistency, stability, and format adherence, ensuring your prompts transfer well between models and don't break under sampling variations.
- 📉 The Model Ladder: Don't just test on GPT-4. Verify your prompt works on cheaper/faster models (e.g., GPT-3.5, small OSS models) to save costs without sacrificing reliability.
- 🧬 Semantic Consistency: Beyond exact matches. Uses embedding similarity to check if different models mean the same thing, even if the words differ.
- 🛡 Constraint Guards: Automatically enforce JSON schemas, bullet counts, regex patterns, and more.
- ⚡️ Async & Cached: Built for speed. Run massive suites concurrently with automatic caching to save API costs (see the sketch below).
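Conceptually, the concurrency and caching behave something like the sketch below (purely illustrative; `call_model` is a placeholder, not PromptLint's actual API):

```python
import asyncio

# Illustrative only: a stand-in provider call plus a simple in-memory cache,
# so repeated (model, prompt, temperature) combinations are only billed once.
_cache: dict[tuple[str, str, float], str] = {}

async def call_model(model: str, prompt: str, temperature: float) -> str:
    await asyncio.sleep(0.1)  # placeholder for a real API request
    return f"[{model} @ T={temperature}] {prompt[:40]}..."

async def cached_call(model: str, prompt: str, temperature: float) -> str:
    key = (model, prompt, temperature)
    if key not in _cache:
        _cache[key] = await call_model(model, prompt, temperature)
    return _cache[key]

async def run_all(models: list[str], prompt: str, temperatures: list[float]) -> list[str]:
    # Fan out every (model, temperature) combination concurrently.
    tasks = [cached_call(m, prompt, t) for m in models for t in temperatures]
    return await asyncio.gather(*tasks)

print(asyncio.run(run_all(["gpt-4o", "gpt-3.5-turbo"], "Summarize this email...", [0.0, 0.7])))
```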
The constraints of production systems (latency, cost) often force us to use smaller, faster models (e.g., Llama-3-8B, GPT-3.5) rather than the smartest available models (e.g., GPT-4o, Claude 3.5 Sonnet).
PromptLint formalizes this by testing a "ladder" of models:
- Tier 1 (Oracle): The smartest model available. We assume its output is the "ground truth" or ideal response.
- Tier 2+ (Candidates): Smaller, cheaper models that we want to deploy.
We measure Consistency by checking if Tier 2+ models deviate from Tier 1. If a small model matches the large model's intent (high semantic similarity) and structure, it is safe to deploy.
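Conceptually, the semantic half of that check reduces to comparing embeddings of the two outputs. A minimal sketch, assuming you already have embedding vectors from any embedding model (the 0.85 threshold and function names are illustrative assumptions, not PromptLint's defaults):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_consistent(tier1_embedding: np.ndarray, tier2_embedding: np.ndarray,
                  threshold: float = 0.85) -> bool:
    """Treat a Tier 2 output as consistent if it stays close to the Tier 1 reference."""
    return cosine_similarity(tier1_embedding, tier2_embedding) >= threshold
```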
```mermaid
flowchart LR
    A[Suite Config] --> B(Runner)
    B --> C{Provider Pool}
    C -->|GPT-4| D[Output 1]
    C -->|GPT-3.5| E[Output 2]
    C -->|Claude| F[Output 3]
    D & E & F --> G(Evaluator)
    G --> H[Aggregator]
    H --> I[Report .md/.json/.html]
```
- Define a Suite: Configure your prompt, constraints, and the "ladder" of models to test.
- Run: PromptLint executes prompts across all defined models and temperatures (e.g., T=0.0 to 1.0).
- Score: Outputs are scored for Format Adherence (Does it look right?) and Consistency (Is it stable?).
- Report: Get a detailed report highlighting where your prompt becomes unstable.
Install from source:

```bash
pip install -e .
```

Set your API key:

```bash
export OPENAI_API_KEY=sk-...
```

Run the example suite:

```bash
promptlint --suite examples/suite.yaml --report report.html --report-format html
```

Open `report.html` to see your robustness scores!
The core of PromptLint is the `suite.yaml`. Here is a conceptual example:
```yaml
# 1. Define Providers
providers:
  - name: "openai"
    kind: "openai_compatible"
    api_key_env: "OPENAI_API_KEY"

# 2. Define the Model Ladder (Tiers)
ladder:
  - name: "gpt-4o"
    provider: "openai"
    tier: 1  # Reference model
  - name: "gpt-3.5-turbo"
    provider: "openai"
    tier: 2  # Cheaper alternative

# 3. Sampling Strategy
sampling:
  - temperature: 0.0
  - temperature: 0.7  # Test stability under noise

# 4. Prompts & Constraints
prompts:
  - id: "summarize_email"
    text: "Summarize this email in 3 bullet points: {{email_body}}"
    constraints:
      - name: "format_check"
        description: "Must be a list"
        rules:
          type: "count"
          pattern: "^\\s*[-*]"
          min: 3
```

PromptLint focuses on structural and semantic robustness, not just "is this fact true?".
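For instance, the `format_check` rule defined above boils down to a simple per-line regex count, roughly like this sketch (the function here is illustrative, not PromptLint's internal rule engine):

```python
import re

def check_count_constraint(output: str, pattern: str, min_count: int) -> bool:
    """Count lines matching a regex (e.g. bullet markers) and enforce a minimum."""
    matching_lines = [line for line in output.splitlines() if re.match(pattern, line)]
    return len(matching_lines) >= min_count

# The "format_check" rule from the suite above: at least 3 bullet lines.
sample = "- Point one\n- Point two\n- Point three"
print(check_count_constraint(sample, r"^\s*[-*]", 3))  # True
```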
Aggregation uses a weighted geometric mean plus a stability penalty to avoid hiding weak components.
| Metric | What it measures |
|---|---|
| Constraint Adherence | Do outputs satisfy explicit rules (Regex, JSON, Length)? |
| Cross-Model Consistency | Do GPT-4 and GPT-3.5 say the same thing? (Text + Embedding Similarity) |
| Temperature Stability | Does the output change drastically when Temperature increases? |
| Task Alignment | Does the output format match the expected_format metadata? |
| Success Rate | How many runs succeeded without provider errors? |
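A rough sketch of that aggregation, with illustrative weights and an assumed penalty form (not PromptLint's exact formula):

```python
import math

def aggregate(scores: dict[str, float], weights: dict[str, float], stability_penalty: float) -> float:
    """Weighted geometric mean of per-metric scores, discounted by a stability penalty.

    A geometric mean means one weak metric drags the whole score down,
    so a failing dimension cannot hide behind strong ones.
    """
    total_w = sum(weights.values())
    log_mean = sum(weights[m] * math.log(max(scores[m], 1e-9)) for m in scores) / total_w
    return math.exp(log_mean) * (1.0 - stability_penalty)

# Example with made-up metric scores in [0, 1]:
scores = {"constraints": 0.95, "consistency": 0.80, "stability": 0.60}
weights = {"constraints": 2.0, "consistency": 1.5, "stability": 1.0}
print(round(aggregate(scores, weights, stability_penalty=0.05), 3))
```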
We believe in reliable tools. Run the test suite (unit & integration) to verify the logic:

```bash
# Test dependencies are generally already included
python -m unittest discover tests
```

License: MIT