IntentGym
IntentGym is Oryn's benchmark harness for evaluating web agents across task suites like MiniWoB.
Overview
IntentGym runs benchmark episodes by combining:
- an Oryn browser backend (headless, embedded, or remote),
- an LLM provider (openai, anthropic, litellm, mock),
- an agent strategy (react, plan_act, reflexion, ralph, swarm, adk),
- a benchmark loader (miniwob, webshop, webarena, mock).
Prerequisites
- Build the Oryn CLI binary:
- Install IntentGym dependencies with Poetry:
- Set credentials for your chosen LLM provider (for example OPENAI_API_KEY or ANTHROPIC_API_KEY).
Note
IntentGym is managed with Poetry in this repository.
Use poetry run ... for commands.
MiniWoB Quick Start
Start the MiniWoB server in one terminal:
Run a benchmark in a second terminal:
Run only one task:
Run multiple tasks:
Config Structure
IntentGym config files use a flat schema like this:
run_id: miniwob_litellm
seed: 42
benchmark: miniwob
benchmark_options:
  server_url: http://localhost:8765
  episodes_per_task: 5
llm_provider: litellm
llm_model: vertex_ai/gemini-2.5-flash-lite
llm_options:
  temperature: 1.0
agent_type: react
agent_options: {}
prompt_template: oil_instructions_v2
oryn_mode: headless
oryn_options:
  binary_path: ../target/release/oryn
  timeout: 60.0
  log_file: oryn_debug.log
  cli_args: ["--visible"]
max_steps: 50
timeout_seconds: 300
save_transcript: true
Working examples live in intentgym/configs/.
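Before launching a long run, it can help to check that a config's component choices match the values listed in the Overview. The following is a minimal sketch, not IntentGym's own validation code: `ALLOWED` and `check_config` are hypothetical names, the config is represented as an already-parsed dict, and mapping the browser backend to the `oryn_mode` key is an assumption based on the example config.

```python
# Allowed values as listed in the Overview section of this page.
# Pairing the browser backend list with the oryn_mode key is an assumption.
ALLOWED = {
    "oryn_mode": {"headless", "embedded", "remote"},
    "llm_provider": {"openai", "anthropic", "litellm", "mock"},
    "agent_type": {"react", "plan_act", "reflexion", "ralph", "swarm", "adk"},
    "benchmark": {"miniwob", "webshop", "webarena", "mock"},
}


def check_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for key, allowed in ALLOWED.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif config[key] not in allowed:
            problems.append(f"{key}={config[key]!r} is not one of {sorted(allowed)}")
    return problems


# A parsed config reduced to the keys this sketch inspects (plus run_id).
example = {
    "run_id": "miniwob_litellm",
    "benchmark": "miniwob",
    "llm_provider": "litellm",
    "agent_type": "react",
    "oryn_mode": "headless",
}
print(check_config(example))  # prints []
```

Keys the sketch does not know about, such as run_id, are ignored rather than rejected, so it can be run against a full config without changes.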
CLI Commands
Run benchmark:
Run with browser log redirection:
Override Oryn options from CLI (repeatable):
poetry run intentgym run --config configs/miniwob_5ep.yaml --oryn-opt timeout=90 --oryn-opt log_file=oryn_debug.log
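To illustrate how repeated key=value overrides such as the `--oryn-opt` values above could be layered on top of the `oryn_options` block from a config file, here is a small sketch. It is not IntentGym's actual implementation: `parse_opt` and `apply_overrides` are hypothetical helpers, and both the numeric coercion and the "later value wins" rule are assumptions.

```python
def parse_opt(raw: str) -> tuple[str, object]:
    """Split one 'key=value' string into a key and a loosely typed value."""
    key, sep, value = raw.partition("=")
    if not sep or not key:
        raise ValueError(f"expected key=value, got {raw!r}")
    # Assumption: numeric-looking values become numbers, so timeout=90 is an int.
    for cast in (int, float):
        try:
            return key, cast(value)
        except ValueError:
            pass
    return key, value


def apply_overrides(oryn_options: dict, raw_opts: list[str]) -> dict:
    """Return a copy of oryn_options with each override applied in order."""
    merged = dict(oryn_options)  # copy so the file-level config stays untouched
    for raw in raw_opts:
        key, value = parse_opt(raw)
        merged[key] = value  # assumption: a later override replaces an earlier one
    return merged


from_file = {"timeout": 60.0}
print(apply_overrides(from_file, ["timeout=90", "log_file=oryn_debug.log"]))
# prints {'timeout': 90, 'log_file': 'oryn_debug.log'}
```

Returning a merged copy instead of mutating the input keeps the values loaded from the config file available for comparison or logging.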
Inspect a run:
Compare runs:
Initialize benchmark resources:
poetry run intentgym download miniwob
poetry run intentgym download webshop
poetry run intentgym download webarena
Outputs
- JSON reports: intentgym/results/<run_id>.json
- Markdown transcripts (when enabled): intentgym/transcripts/
- Optional Oryn subprocess logs: path set by oryn_options.log_file or --oryn-log-file
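For scripts that post-process results, the report location can be derived from the run ID using the layout listed above. This is a hedged sketch: `report_path` and `load_report` are hypothetical helpers, the `root` default assumes the script runs from the repository root, and nothing is assumed about the report's internal schema beyond it being valid JSON.

```python
import json
from pathlib import Path


def report_path(run_id: str, root: Path = Path("intentgym")) -> Path:
    """Build the JSON report path for a run, following the layout listed above."""
    return root / "results" / f"{run_id}.json"


def load_report(run_id: str, root: Path = Path("intentgym")):
    """Return the parsed report, or None if no report file exists for this run."""
    path = report_path(run_id, root)
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


print(report_path("miniwob_litellm").as_posix())
# prints intentgym/results/miniwob_litellm.json
```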
Notes
- configs/miniwob_100ep_visible_debug.yaml enables a visible browser and is intended for local debugging.
- Long MiniWoB runs are usually more stable with the default headless mode configs.