# Web Codegen Scorer

**Web Codegen Scorer** is a tool for evaluating the quality of web code generated by Large Language Models (LLMs).

You can use this tool to make evidence-based decisions relating to AI-generated code. For example:

* 🔄 Iterate on a system prompt to find the most effective instructions for your project.
* ⚖️ Compare the quality of code produced by different models.
* 📈 Monitor generated code quality over time as models and agents evolve.

Web Codegen Scorer differs from other code benchmarks in that it focuses specifically on _web_ code and relies primarily on well-established measures of code quality.

## Features

* ⚙️ Configure your evaluations with different models, frameworks, and tools.
* ✍️ Specify system instructions and add MCP servers.
* 📋 Use built-in checks for build success, runtime errors, accessibility, security, LLM rating, and coding best practices. (More built-in checks coming soon!)
* 🔧 Automatically attempt to repair issues detected during code generation.
* 📊 View and compare results with an intuitive report viewer UI.

## Setup

1. **Install the package:**

```bash
npm install -g web-codegen-scorer
```

2. **Set up your API keys:**

   In order to run an eval, you have to specify API keys for the relevant providers as environment variables:

```bash
export GEMINI_API_KEY="YOUR_API_KEY_HERE"    # If you're using Gemini models
export OPENAI_API_KEY="YOUR_API_KEY_HERE"    # If you're using OpenAI models
export ANTHROPIC_API_KEY="YOUR_API_KEY_HERE" # If you're using Anthropic models
```

3. **Run an eval:**

   You can run your first eval using our Angular example with the following command:

```bash
web-codegen-scorer eval --env=angular-example
```

4. (Optional) **Set up your own eval:**

   If you want to set up a custom eval instead of using our built-in examples, you can run the following command, which will guide you through the process:

```bash
web-codegen-scorer init
```

You can customize the `web-codegen-scorer eval` script with the following flags:

- `--env=<path>` (alias: `--environment`): (**Required**) Specifies the path from which to load the environment config.
  - Example: `web-codegen-scorer eval --env=foo/bar/my-env.js`

- `--model=<name>`: Specifies the model to use when generating code. Defaults to the value of `DEFAULT_MODEL_NAME`.
  - Example: `web-codegen-scorer eval --model=gemini-2.5-flash --env=<config path>`

- `--runner=<name>`: Specifies the runner used to execute the eval. Supported runners are `genkit` (default) and `gemini-cli`.

- `--local`: Runs the script in local mode for the initial code generation request. Instead of calling the LLM, it will attempt to read the initial code from a corresponding file in the `.llm-output` directory (e.g., `.llm-output/todo-app.ts`). This is useful for re-running assessments or debugging the build/repair process without incurring LLM costs for the initial generation.
  - **Note:** You typically need to run `web-codegen-scorer eval` once without `--local` to generate the initial files in `.llm-output` (see the example after this list).
  - The `web-codegen-scorer eval:local` script is a shortcut for `web-codegen-scorer eval --local`.

- `--limit=<number>`: Specifies the number of application prompts to process. Defaults to `5`.
  - Example: `web-codegen-scorer eval --limit=10 --env=<config path>`

- `--output-directory=<name>` (alias: `--output-dir`): Specifies the directory in which to output the generated code, which is useful for debugging. By default, the code is generated in a temporary directory.
  - Example: `web-codegen-scorer eval --output-dir=test-output --env=<config path>`

- `--concurrency=<number>`: Sets the maximum number of concurrent AI API requests. Defaults to `5` (as defined by `DEFAULT_CONCURRENCY` in `src/config.ts`).
  - Example: `web-codegen-scorer eval --concurrency=3 --env=<config path>`

- `--report-name=<name>`: Sets the name for the generated report directory. Defaults to a timestamp (e.g., `2023-10-27T10-30-00-000Z`). The name will be sanitized (non-alphanumeric characters replaced with hyphens).
  - Example: `web-codegen-scorer eval --report-name=my-custom-report --env=<config path>`

- `--rag-endpoint=<url>`: Specifies a custom RAG (Retrieval-Augmented Generation) endpoint URL. The URL must contain a `PROMPT` substring, which will be replaced with the user prompt.
  - Example: `web-codegen-scorer eval --rag-endpoint="http://localhost:8080/my-rag-endpoint?query=PROMPT" --env=<config path>`

- `--prompt-filter=<name>`: String used to filter which prompts should be run. By default, a random sample (controlled by `--limit`) is taken from the prompts in the current environment. Setting this can be useful for debugging a specific prompt.
  - Example: `web-codegen-scorer eval --prompt-filter=tic-tac-toe --env=<config path>`

- `--skip-screenshots`: Whether to skip taking screenshots of the generated app. Defaults to `false`.
  - Example: `web-codegen-scorer eval --skip-screenshots --env=<config path>`

- `--labels=<label1> <label2>`: Metadata labels that will be attached to the run.
  - Example: `web-codegen-scorer eval --labels my-label another-label --env=<config path>`

- `--mcp`: Whether to start an MCP server for the evaluation. Defaults to `false`.
  - Example: `web-codegen-scorer eval --mcp --env=<config path>`

- `--help`: Prints out usage information about the script.

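These flags can be combined in a single invocation. The sketch below is illustrative only: the report name and labels are placeholders, the environment path is the one from the `--env` example above, and the second command assumes the first run has already populated `.llm-output` as described under `--local`.

```bash
# Hypothetical first run: generate code with Gemini, keep the generated apps in a
# local directory for inspection, and tag the report so it is easy to find later.
web-codegen-scorer eval \
  --env=foo/bar/my-env.js \
  --model=gemini-2.5-flash \
  --limit=10 \
  --output-dir=test-output \
  --report-name=gemini-flash-baseline \
  --labels gemini-flash baseline

# Re-run the assessment and repair steps against the previously generated code in
# `.llm-output`, without paying for the initial LLM generation again.
web-codegen-scorer eval --env=foo/bar/my-env.js --local
```
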
### Additional configuration options

- [Environment config reference](./docs/environment-reference.md)
- [How to set up a new model?](./docs/model-setup.md)

## Local development

If you've cloned this repo and want to work on the tool, you have to install its dependencies by running `pnpm install`.
Once they're installed, you can run the following commands:

* `pnpm run release-build` - Builds the package in the `dist` directory for publishing to npm.
* `pnpm run eval` - Runs an eval from source.
* `pnpm run report` - Runs the report app from source.
* `pnpm run init` - Runs the init script from source.
* `pnpm run format` - Formats the source code using Prettier.

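As a rough sketch of a development loop (assuming the from-source scripts accept the same flags as the published CLI, and that your pnpm version forwards flags to the underlying script):

```bash
# Install workspace dependencies after cloning the repo.
pnpm install

# Run an eval from source against the built-in Angular example.
# If your pnpm version does not forward flags automatically, add `--` before them.
pnpm run eval --env=angular-example

# Inspect the results in the report viewer, also running from source.
pnpm run report
```
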
## FAQ

### Who built this tool?

This tool is built by the Angular team at Google.

### Does this tool only work for Angular code or Google models?

No! You can use this tool with any web library or framework (or none at all), as well as with any model.

### Why did you build this tool?

As more and more developers reach for LLM-based tools to create and modify code, we wanted to be able to empirically measure the effect of different factors on the quality of generated code. While many LLM coding benchmarks exist, we found that these were often too broad and didn't measure the specific quality metrics we cared about.

In the absence of such a tool, we found that many developers based their judgments about codegen with different models, frameworks, and tools on loosely structured trial-and-error. In contrast, Web Codegen Scorer gives us a platform to measure codegen across different configurations with consistency and repeatability.

### Will you add more features over time?

Yes! We plan to expand both the number of built-in checks and the variety of codegen scenarios.

Our roadmap includes:

* Including _interaction testing_ in the rating, to ensure the generated code performs any requested behaviors.
* Measuring Core Web Vitals.
* Measuring the effectiveness of LLM-driven edits on an existing codebase.