From 2aa2e9455025f1a19b83c28dc2ab49e5ba9151f2 Mon Sep 17 00:00:00 2001 From: Alexandre Lacoste Date: Wed, 27 Nov 2024 12:43:08 -0500 Subject: [PATCH 1/3] Update README.md --- README.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 80cec115..a28b5513 100644 --- a/README.md +++ b/README.md @@ -1,8 +1,6 @@
-![AgentLab Banner](https://github.com/user-attachments/assets/a23b3cd8-b5c4-4918-817b-654ae6468cb4) - [![pypi](https://badge.fury.io/py/agentlab.svg)](https://pypi.org/project/agentlab/) @@ -21,6 +19,11 @@ [🤖 Build Your Agent](#-implement-a-new-agent)  |  [↻ Reproducibility](#-reproducibility) + +agentlab-diagram + + +Demo solving tasks: https://github.com/ServiceNow/BrowserGym/assets/26232819/e0bfc788-cc8e-44f1-b8c3-0d1114108b85
@@ -240,7 +243,7 @@ dynamic benchmarks.
   version and commit hash
 * The `Study` class allows automatic upload of your results to
   [`reproducibility_journal.csv`](reproducibility_journal.csv). This makes it easier to populate a
-  large amount of reference points.
+  large amount of reference points. For this feature, you need to `git clone` the repository and install via `pip install -e .`.
 * **Reproduced results in the leaderboard**. For agents that are repdocudibile, we encourage users
   to try to reproduce the results and upload them to the leaderboard. There is a special column
   containing information about all reproduced results of an agent on a benchmark.

From 3fe25845519c8203ecc8c4f3e9be80eba1e58d03 Mon Sep 17 00:00:00 2001
From: Alexandre Lacoste
Date: Wed, 27 Nov 2024 13:04:09 -0500
Subject: [PATCH 2/3] Update README.md

---
 README.md | 41 ++++++++++++++++++++++++-----------------
 1 file changed, 24 insertions(+), 17 deletions(-)

diff --git a/README.md b/README.md
index a28b5513..b207322a 100644
--- a/README.md
+++ b/README.md
@@ -15,7 +15,8 @@
 [🛠️ Setup](#%EF%B8%8F-setup-agentlab)  |  
 [🤖 Assistant](#-ui-assistant)  |  
 [🚀 Launch Experiments](#-launch-experiments)  |  
-[🔍 Analyse Results](#-analyse-results)  |  
+[🔍 Analyse Results](#-analyse-results)  |  
+[🏆 Leaderboard](#-leaderboard)  |  
 [🤖 Build Your Agent](#-implement-a-new-agent)  |  
 [↻ Reproducibility](#-reproducibility)
@@ -35,10 +36,10 @@ AgentLab is a framework for developing and evaluating agents on a variety of
 AgentLab Features:
 * Easy large scale parallel [agent experiments](#-launch-experiments) using [ray](https://www.ray.io/)
 * Building blocks for making agents over BrowserGym
-* Unified LLM API for OpenRouter, OpenAI, Azure, or self hosted using TGI.
-* Prefered way for running benchmarks like WebArena
+* Unified LLM API for OpenRouter, OpenAI, Azure, or self-hosted using TGI.
+* Preferred way to run benchmarks like WebArena
 * Various [reproducibility features](#reproducibility-features)
-* Unified LeaderBoard (soon)
+* Unified [LeaderBoard](https://huggingface.co/spaces/ServiceNow/browsergym-leaderboard)
 ## 🎯 Supported Benchmarks
@@ -62,12 +63,12 @@ AgentLab Features:
 pip install agentlab
 ```
-If not done already, install playwright:
+If you haven't already, install Playwright:
 ```bash
 playwright install
 ```
-Make sure to prepare the required benchmark according to instructions provided in the [setup
+Make sure to prepare the required benchmark according to the instructions provided in the [setup
 column](#-supported-benchmarks).
 ```bash
@@ -177,7 +178,7 @@ experience, consider using benchmarks like WorkArena instead.
 ### Loading Results
-The class [`ExpResult`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L595) provides a lazy loader for all the information of a specific experiment. You can use [`yield_all_exp_results`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L872) to recursivley find all results in a directory. Finally [`load_result_df`](https://github.com/ServiceNow/AgentLab/blob/be1998c5fad5bda47ba50497ec3899aae03e85ec/src/agentlab/analyze/inspect_results.py#L119C5-L119C19) gathers all the summary information in a single dataframe. See [`inspect_results.ipynb`](src/agentlab/analyze/inspect_results.ipynb) for example usage.
+The class [`ExpResult`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L595) provides a lazy loader for all the information of a specific experiment. You can use [`yield_all_exp_results`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L872) to recursively find all results in a directory. Finally, [`load_result_df`](https://github.com/ServiceNow/AgentLab/blob/be1998c5fad5bda47ba50497ec3899aae03e85ec/src/agentlab/analyze/inspect_results.py#L119C5-L119C19) gathers all the summary information in a single dataframe. See [`inspect_results.ipynb`](src/agentlab/analyze/inspect_results.ipynb) for example usage.
 ```python
 from agentlab.analyze import inspect_results
@@ -207,9 +208,15 @@ Once this is selected, you can see the trace of your agent on the given task. Cl
 image to select a step and observe the action taken by the agent.
-**⚠️ Note**: Gradio is still in developement and unexpected behavior have been frequently noticed. Version 5.5 seems to work properly so far. If you're not sure that the proper information is displaying, refresh the page and select your experiment again.
+**⚠️ Note**: Gradio is still under active development, and unexpected behavior has frequently been observed. Version 5.5 seems to work properly so far. If you're not sure that the correct information is being displayed, refresh the page and select your experiment again.
+## 🏆 Leaderboard
+
+Check out the official unified [leaderboard](https://huggingface.co/spaces/ServiceNow/browsergym-leaderboard) across all benchmarks.
+
+Experiments with GenericAgent are underway to add more reference points. We are also working on code to automatically push a study to the leaderboard.
+
 ## 🤖 Implement a new Agent
 Get inspiration from the `MostBasicAgent` in
@@ -225,18 +232,18 @@ Several factors can influence reproducibility of results in the context of evalu
 dynamic benchmarks.
 ### Factors affecting reproducibility
-* **Software version**: Different version of Playwright or any package in the software stack could
+* **Software version**: Different versions of Playwright or any package in the software stack could
   influence the behavior of the benchmark or the agent.
-* **API based LLMs silently changing**: Even for a fixed version, an LLM may be updated e.g. to
-  incorporate latest web knowledge.
+* **API-based LLMs silently changing**: Even for a fixed version, an LLM may be updated, e.g., to
+  incorporate the latest web knowledge.
 * **Live websites**:
   * WorkArena: The demo instance is mostly fixed in time to a specific version but ServiceNow
-    sometime push minor modifications.
+    sometimes pushes minor modifications.
   * AssistantBench and GAIA: These rely on the agent navigating the open web. The experience may
     change depending on which country or region, some websites might be in different languages by
     default.
-* **Stochastic Agents**: Setting temperature of the LLM to 0 can reduce most stochasticity.
-* **Non deterministic tasks**: For a fixed seed, the changes should be minimal
+* **Stochastic Agents**: Setting the temperature of the LLM to 0 can reduce most stochasticity.
+* **Non-deterministic tasks**: For a fixed seed, the changes should be minimal.
 ### Reproducibility Features
 * `Study` contains a dict of information about reproducibility, including benchmark version, package
@@ -244,13 +251,13 @@ dynamic benchmarks.
 * The `Study` class allows automatic upload of your results to
   [`reproducibility_journal.csv`](reproducibility_journal.csv). This makes it easier to populate a
   large amount of reference points. For this feature, you need to `git clone` the repository and install via `pip install -e .`.
-* **Reproduced results in the leaderboard**. For agents that are repdocudibile, we encourage users
+* **Reproduced results in the leaderboard**. For agents that are reproducible, we encourage users
   to try to reproduce the results and upload them to the leaderboard. There is a special column
   containing information about all reproduced results of an agent on a benchmark.
 * **ReproducibilityAgent**: [You can run this agent](src/agentlab/agents/generic_agent/reproducibility_agent.py)
   on an existing study and it will try to re-run
-  the same actions on the same task seeds. A vsiual diff of the two prompts will be displayed in the
+  the same actions on the same task seeds. A visual diff of the two prompts will be displayed in the
   AgentInfo HTML tab of AgentXray. You will be able to inspect on some tasks what kind of changes
-  between to two executions. **Note**: this is a beta feature and will need some adaptation for your
+  occurred between the two executions. **Note**: this is a beta feature and will need some adaptation for your
   own agent.

From b7382a48874f62c1e15216760ff8ff9ba9ef4d45 Mon Sep 17 00:00:00 2001
From: Alexandre Lacoste
Date: Wed, 27 Nov 2024 13:05:10 -0500
Subject: [PATCH 3/3] Update README.md

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index b207322a..682f56ba 100644
--- a/README.md
+++ b/README.md
@@ -16,6 +16,7 @@
 [🤖 Assistant](#-ui-assistant)  |  
 [🚀 Launch Experiments](#-launch-experiments)  |  
 [🔍 Analyse Results](#-analyse-results)  |  
+
[🏆 Leaderboard](#-leaderboard)  |  [🤖 Build Your Agent](#-implement-a-new-agent)  |  [↻ Reproducibility](#-reproducibility)
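For readers who want to try the reproducibility-journal workflow documented in PATCH 1/3, here is a minimal sketch. It assumes the `make_study` and `append_to_journal` helpers and the `AGENT_4o_MINI` configuration referenced elsewhere in the AgentLab README; none of these names are confirmed by the patches above, so treat every identifier and argument as an assumption to check against your installed version.

```python
# Minimal sketch (assumed API), not part of the patches above.
# The journal upload requires an editable install of the cloned repository:
#   git clone https://github.com/ServiceNow/AgentLab && pip install -e ./AgentLab
from agentlab.agents.generic_agent import AGENT_4o_MINI  # assumed pre-configured agent
from agentlab.experiments.study import make_study  # assumed study factory

study = make_study(
    benchmark="miniwob",  # any benchmark prepared per the setup column
    agent_args=[AGENT_4o_MINI],
    comment="reproducibility reference point",
)
study.run(n_jobs=4)  # runs the experiments in parallel and saves results to disk

# Assumed method: appends a summary row (benchmark version, package versions,
# commit hash, scores) to reproducibility_journal.csv in the cloned repository.
study.append_to_journal()
```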