An open benchmark for evaluating AI understanding of garment construction.
Current fashion AI benchmarks test aesthetics—whether an image "looks like" a suit.
None test construction—whether AI understands how garments are actually built.
This matters because:
- AI generates physically impossible garments (lapels that can't roll, impossible button stances)
- Fashion recommendations lack fit understanding
- No standard exists for evaluating construction knowledge
- Models trained on Pinterest can't distinguish surgeon cuffs from decorative buttons
| Category | Examples | Why It Matters |
|---|---|---|
| Identification | Lapel types, vent styles, shoulder construction | Basic recognition |
| Reasoning | Why a single vent suits seated clients | Understanding mechanics |
| Error Detection | Spotting impossible construction in AI images | Practical application |
- Basic — Anyone who wears suits occasionally should know this
- Intermediate — Enthusiasts, salespeople, MTM customers
- Expert — Tailors, pattern makers, bespoke clients
A model that scores 90% on basic but 10% on expert doesn't understand construction. It's pattern matching surface features.
- Lapels: Type, width, gorge height, roll
- Shoulders: Structured, natural, roped, Neapolitan
- Construction: Full canvas, half canvas, fused
- Details: Vents, button stance, pockets, surgeon cuffs
- Waist: Waistband style, closures, side adjusters
- Pleats: Flat front, single, double, direction
- Leg: Break, cuffs, taper
- Collar: Point, spread, cutaway, button-down
- Cuffs: Barrel, French, convertible
- Body: Placket, yoke, back pleats
See benchmark/taxonomy/ for the complete element breakdown.
```
tailorbench/
├── benchmark/
│   ├── taxonomy/              # What we test
│   │   ├── construction_elements.md
│   │   ├── difficulty_levels.md
│   │   └── question_formats.md
│   ├── dataset/               # Images and questions
│   │   ├── images/            # Jacket, trouser, shirt images
│   │   └── questions/         # JSON question files
│   ├── evaluation/            # How we score
│   │   └── metrics.md
│   └── baselines/             # Model results (coming)
└── research/
    ├── gap_analysis.md        # What's missing in fashion AI
    └── sources.md             # Reference materials
```
- Overall Accuracy: Simple correct/total
- Weighted Accuracy: Expert questions count 3x, Intermediate 2x, Basic 1x
- Category Breakdown: Separate scores for identification, reasoning, error detection
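The weighted scheme above can be sketched in a few lines. This is an illustrative sketch, not the benchmark's official scorer; the field names (`difficulty`, `correct`) are assumptions.

```python
# Sketch of the weighted-accuracy metric: expert answers count 3x,
# intermediate 2x, basic 1x. Field names are illustrative.
DIFFICULTY_WEIGHTS = {"basic": 1, "intermediate": 2, "expert": 3}

def weighted_accuracy(results):
    """results: list of dicts with 'difficulty' and 'correct' keys."""
    total = sum(DIFFICULTY_WEIGHTS[r["difficulty"]] for r in results)
    earned = sum(DIFFICULTY_WEIGHTS[r["difficulty"]] for r in results if r["correct"])
    return earned / total if total else 0.0
```

For example, one correct basic answer plus one wrong expert answer yields 1/4 = 0.25, which is exactly why high basic scores alone can't mask expert-level failure.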
- Multiple choice (identification)
- True/False
- Count/Measure
- Error detection
- Open-ended reasoning
- Comparison
- Recommendation
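To make the formats concrete, here is a hypothetical question record in the JSON layout the dataset implies. The schema shown is an assumption for illustration; the authoritative definition lives in benchmark/taxonomy/question_formats.md.

```python
import json

# Hypothetical multiple-choice question record; the real schema may differ.
record = json.loads("""
{
  "id": "jkt-0042",
  "image": "images/jackets/0042.jpg",
  "format": "multiple_choice",
  "difficulty": "intermediate",
  "category": "identification",
  "question": "What lapel type does this jacket have?",
  "choices": ["notch", "peak", "shawl"],
  "answer": "peak"
}
""")

def grade(record, model_answer):
    # Exact-match grading covers multiple choice, true/false, and
    # count/measure; open-ended reasoning needs rubric-based scoring.
    return model_answer.strip().lower() == record["answer"].lower()
```

Exact matching is only a baseline: comparison and recommendation formats would need either keyword rubrics or judge-model scoring.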
See benchmark/evaluation/metrics.md for full scoring methodology.
- Taxonomy defined (50+ construction elements)
- Evaluation framework (weighted scoring, 7 question formats)
- Gap analysis documented
- Dataset structure ready (images + JSON questions)
- Dataset v0.1 population (in progress)
- Baseline testing (GPT-4V, Gemini, Claude)
- Public leaderboard
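A baseline run over the dataset could look like the loop below. The model call here is a stub (an assumption for illustration); a real harness would send each image and question to GPT-4V, Gemini, or Claude through their respective APIs and parse the reply.

```python
# Minimal evaluation-loop sketch with a stubbed model. Real baseline
# testing would replace stub_model with an API call per question.
def stub_model(question):
    # Hypothetical placeholder: always picks the first choice.
    return question["choices"][0] if "choices" in question else "true"

def evaluate(questions, model=stub_model):
    """Return per-category correct/total counts (category breakdown)."""
    per_category = {}
    for q in questions:
        correct = model(q).strip().lower() == q["answer"].lower()
        tally = per_category.setdefault(q["category"], {"correct": 0, "total": 0})
        tally["total"] += 1
        tally["correct"] += int(correct)
    return per_category
```

Keeping identification, reasoning, and error detection tallied separately is what exposes models that ace recognition but fail mechanics.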
Built by practitioners, not just researchers.
We come from production tailoring—real garments, real customers, real fit data. Fashion AI needs domain expertise that image datasets can't provide.
The gap between "looks like a suit" and "understands suit construction" is where current AI fails. This benchmark measures that gap.
See CONTRIBUTING.md for how to help.
Priority areas:
- Construction element additions (especially regional variations)
- Question contributions with clear correct answers
- Baseline testing on additional models
- Translations (terminology varies by region)
If you use TailorBench in research:
```bibtex
@misc{tailorbench2025,
  title={TailorBench: A Benchmark for Garment Construction Understanding in AI},
  author={FashionX},
  year={2025},
  url={https://github.com/fashionx-ai/tailorbench}
}
```

MIT License — see LICENSE
- Twitter: @fashionx112
- Builder: @sudoinX
Version 0.1 — December 2025