# Web Codegen Scorer

**Web Codegen Scorer** is a tool for evaluating the quality of web code generated by Large Language
Models (LLMs).

You can use this tool to make evidence-based decisions relating to AI-generated code. For example:

* 🔄 Iterate on a system prompt to find the most effective instructions for your project.
* ⚖️ Compare the quality of code produced by different models.
* 📈 Monitor generated code quality over time as models and agents evolve.

Web Codegen Scorer is different from other code benchmarks in that it focuses specifically on _web_
code and relies primarily on well-established measures of code quality.

## Features

* ⚙️ Configure your evaluations with different models, frameworks, and tools.
* ✍️ Specify system instructions and add MCP servers.
* 📋 Use built-in checks for build success, runtime errors, accessibility, security, LLM rating, and
  coding best practices. (More built-in checks coming soon!)
* 🔧 Automatically attempt to repair issues detected during code generation.
* 📊 View and compare results with an intuitive report viewer UI.

## Setup

1. **Install the package:**

   ```bash
   npm install -g web-codegen-scorer
   ```

2. **Set up your API keys:**

   To run an eval, you have to specify API keys for the relevant providers as environment
   variables:

   ```bash
   export GEMINI_API_KEY="YOUR_API_KEY_HERE" # If you're using Gemini models
   export OPENAI_API_KEY="YOUR_API_KEY_HERE" # If you're using OpenAI models
   export ANTHROPIC_API_KEY="YOUR_API_KEY_HERE" # If you're using Anthropic models
   ```

3. **Run an eval:**

   You can run your first eval using our Angular example with the following command (see the note
   after these steps on viewing the results):

   ```bash
   web-codegen-scorer eval --env=angular-example
   ```

4. (Optional) **Set up your own eval:**

   If you want to set up a custom eval, instead of using our built-in examples, you can run the
   following command, which will guide you through the process:

   ```bash
   web-codegen-scorer init
   ```

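Once an eval finishes, you can inspect and compare the results in the report viewer UI. The exact
command isn't shown above, so treat the following as an assumption: based on the `pnpm run report`
script listed under "Local development" below, the published CLI most likely exposes the viewer as
a `report` subcommand.

```bash
# Assumption: the report viewer is exposed as a `report` subcommand,
# mirroring the repo's `pnpm run report` script.
web-codegen-scorer report
```
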
## Command-line flags

You can customize the `web-codegen-scorer eval` command with the following flags. A combined
example follows the table.

| Flag | Description |
|------|-------------|
| `--env=<path>` <br> (alias: `--environment`) | (**Required**) Specifies the path from which to load the environment config file.<br>Example: `web-codegen-scorer eval --env=foo/bar/my-env.js` |
| `--model=<name>` | Specifies the model to use when generating code. Defaults to the value of `DEFAULT_MODEL_NAME`.<br>Example: `web-codegen-scorer eval --model=gemini-2.5-flash --env=<config path>` |
| `--runner=<name>` | Specifies the runner to use to execute the eval. Supported runners are `genkit` (default) or `gemini-cli`. |
| `--local` | Runs the script in local mode for the initial code generation request. Instead of calling the LLM, it will attempt to read the initial code from a corresponding file in the `.llm-output` directory (e.g., `.llm-output/todo-app.ts`). This is useful for re-running assessments or debugging the build/repair process without incurring LLM costs for the initial generation.<br>**Note:** You typically need to run `web-codegen-scorer eval` once without `--local` to generate the initial files in `.llm-output`.<br>The `web-codegen-scorer eval:local` script is a shortcut for `web-codegen-scorer eval --local`. |
| `--limit=<number>` | Specifies the number of application prompts to process. Defaults to `5`.<br>Example: `web-codegen-scorer eval --limit=10 --env=<config path>` |
| `--output-directory=<name>` <br> (alias: `--output-dir`) | Specifies the directory to output the generated code under, which is useful for debugging. By default, the code is generated in a temporary directory.<br>Example: `web-codegen-scorer eval --output-dir=test-output --env=<config path>` |
| `--concurrency=<number>` | Sets the maximum number of concurrent AI API requests. Defaults to `5` (as defined by `DEFAULT_CONCURRENCY` in `src/config.ts`).<br>Example: `web-codegen-scorer eval --concurrency=3 --env=<config path>` |
| `--report-name=<name>` | Sets the name for the generated report directory. Defaults to a timestamp (e.g., `2023-10-27T10-30-00-000Z`). The name will be sanitized (non-alphanumeric characters replaced with hyphens).<br>Example: `web-codegen-scorer eval --report-name=my-custom-report --env=<config path>` |
| `--rag-endpoint=<url>` | Specifies a custom RAG (Retrieval-Augmented Generation) endpoint URL. The URL must contain a `PROMPT` substring, which will be replaced with the user prompt.<br>Example: `web-codegen-scorer eval --rag-endpoint="http://localhost:8080/my-rag-endpoint?query=PROMPT" --env=<config path>` |
| `--prompt-filter=<name>` | String used to filter which prompts should be run. By default, a random sample (controlled by `--limit`) is taken from the prompts in the current environment. Setting this can be useful for debugging a specific prompt.<br>Example: `web-codegen-scorer eval --prompt-filter=tic-tac-toe --env=<config path>` |
| `--skip-screenshots` | Whether to skip taking screenshots of the generated app. Defaults to `false`.<br>Example: `web-codegen-scorer eval --skip-screenshots --env=<config path>` |
| `--labels=<label1> <label2>` | Metadata labels that will be attached to the run.<br>Example: `web-codegen-scorer eval --labels my-label another-label --env=<config path>` |
| `--mcp` | Whether to start an MCP server for the evaluation. Defaults to `false`.<br>Example: `web-codegen-scorer eval --mcp --env=<config path>` |
| `--help` | Prints out usage information about the script. |

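As a combined example, the following invocation pulls together several of the flags above. The
`angular-example` environment and the model name come from earlier in this document; the output
directory, report name, and labels are arbitrary placeholders.

```bash
# Evaluate 10 prompts with Gemini 2.5 Flash, keep the generated code around
# for inspection, and tag the report so it's easy to find later.
web-codegen-scorer eval \
  --env=angular-example \
  --model=gemini-2.5-flash \
  --limit=10 \
  --output-dir=test-output \
  --report-name=flash-baseline \
  --labels flash baseline
```

Once a run like this has populated the `.llm-output` directory, you can repeat the assessment and
repair steps without paying for the initial generation again:

```bash
web-codegen-scorer eval --env=angular-example --local
```
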
### Additional configuration options

- [Environment config reference](./docs/environment-reference.md)
- [How to set up a new model?](./docs/model-setup.md)

## Local development

If you've cloned this repo and want to work on the tool, first install its dependencies by running
`pnpm install`. Once they're installed, you can run the following commands (an example follows the
list):

* `pnpm run release-build` - Builds the package in the `dist` directory for publishing to npm.
* `pnpm run eval` - Runs an eval from source.
* `pnpm run report` - Runs the report app from source.
* `pnpm run init` - Runs the init script from source.
* `pnpm run format` - Formats the source code using Prettier.
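
For example, you can run a small eval directly from source against the `angular-example`
environment used earlier in this document. This is a sketch that assumes pnpm forwards the flags
after the script name to the underlying eval script:

```bash
# Run an eval from source, limited to a single prompt to keep it quick.
pnpm run eval --env=angular-example --limit=1
```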

## FAQ

### Who built this tool?

This tool is built by the Angular team at Google.

### Does this tool only work for Angular code or Google models?

No! You can use this tool with any web library or framework (or none at all) as well as any model.

### Why did you build this tool?

76 | | -- `--labels=<label1> <label2>`: Metadata labels that will be attached to the run. |
77 | | - - Example: `web-codegen-scorer eval --labels my-label another-label --env=<config path>` |
| 110 | +As more and more developers reach for LLM-based tools to create and modify code, we wanted to be |
| 111 | +able to empirically measure the effect of different factors on the quality of generated code. While |
| 112 | +many LLM coding benchmarks exist, we found that these were often too broad and didn't measure the |
| 113 | +specific quality metrics we cared about. |
78 | 114 |
|
79 | | -- `--mcp`: Whether to start an MCP for the evaluation. Defaults to `false`. |
80 | | - - Example: `web-codegen-scorer eval --mcp --env=<config path>` |
| 115 | +In the absence of such a tool, we found that many developers based their judgements on codegen with |
| 116 | +different models, frameworks, and tools on loosely structured trial-and-error. In contrast, Web |
| 117 | +Codegen Scorer gives us a platform to consistently measure codegen across different configurations |
| 118 | +with consistency and repeatability. |
81 | 119 |
### Will you add more features over time?

Yes! We plan to expand both the number of built-in checks and the variety of codegen scenarios.

Our roadmap includes:

* Including _interaction testing_ in the rating, to ensure the generated code performs any requested
  behaviors.
* Measuring Core Web Vitals.
* Measuring the effectiveness of LLM-driven edits on an existing codebase.