Get started in minutes.

Five steps.

Install Harbor CLI, implement the agent interface, run the benchmark. Each task runs on a local Anvil fork — no real funds, fully reproducible.

01

Install Harbor CLI

Harbor CLI is the runtime that executes benchmark suites against your agent. Install it via pip:

# Requires Python 3.10+
pip install harbor-cli

Requires Python 3.10+ and an Ethereum wallet funded on a testnet or a local fork. No real funds are needed.
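Before installing, it can help to confirm that the interpreter meets the 3.10 minimum stated above. The following is a minimal sketch using only the standard library; the helper name `meets_minimum` is illustrative and is not part of Harbor.

```python
import sys

# Minimum interpreter version stated in the install requirements.
MIN_VERSION = (3, 10)


def meets_minimum(version=None, minimum=MIN_VERSION):
    """Return True if `version` (default: the running interpreter) is at least `minimum`.

    Only the (major, minor) pair is compared.
    """
    if version is None:
        version = sys.version_info[:2]
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    status = "OK" if meets_minimum() else "too old for harbor-cli"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```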

02

Run the Benchmark

Point Harbor at your agent script and run the full suite:

# Run the full benchmark
harbor run --benchmark blockchainbench

# Run only easy tasks
harbor run --benchmark blockchainbench --tier easy

# Run a specific task
harbor run --benchmark blockchainbench --task eth-transfer

# Output results as JSON
harbor run --benchmark blockchainbench --format json

Each task runs on a local Anvil fork of Ethereum mainnet. No real funds are used.
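For scripted runs, the commands above could be wrapped in a small Python helper. This is a sketch under two assumptions that the text does not confirm: that the flags can be combined in a single invocation, and that `--format json` writes the report to stdout. The function names `build_command` and `run_json` are hypothetical.

```python
import json
import subprocess


def build_command(benchmark="blockchainbench", tier=None, task=None, fmt=None):
    """Compose a `harbor run` argument list from the flags shown above."""
    cmd = ["harbor", "run", "--benchmark", benchmark]
    if tier is not None:
        cmd += ["--tier", tier]      # e.g. "easy"
    if task is not None:
        cmd += ["--task", task]      # e.g. "eth-transfer"
    if fmt is not None:
        cmd += ["--format", fmt]     # e.g. "json"
    return cmd


def run_json(benchmark="blockchainbench", tier=None, task=None):
    """Run the benchmark with --format json and parse the report.

    Assumption: the JSON report is printed to stdout. Adjust this if
    Harbor writes it somewhere else.
    """
    cmd = build_command(benchmark=benchmark, tier=tier, task=task, fmt="json")
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)
```

Calling `build_command(tier="easy")` yields the same argument list as the "easy tasks" command above, which keeps the CLI invocation in one place if it is used from CI.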

03

Agent Interface

Your agent must implement a simple interface. Harbor sends a task description and expects signed transactions in return:

# your_agent.py
from harbor import Agent, Task, Result

class MyAgent(Agent):
    def execute(self, task: Task) -> Result:
        # task.description -- what to do
        # task.context -- RPC URL, wallet, contracts
        # Return signed transactions
        return Result(transactions=[...])
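To try out the shape of this interface without the package installed, the sketch below defines local stand-ins for `Agent`, `Task`, and `Result`. These stand-ins are assumptions based only on the snippet above, and the real Harbor classes may carry more fields. `PlannerAgent`, its `planner` callable, and the `rpc_url` context key are hypothetical names used for illustration.

```python
from dataclasses import dataclass, field


# Local stand-ins for the harbor types so this sketch runs on its own.
# In a real agent, use `from harbor import Agent, Task, Result` instead.
@dataclass
class Task:
    description: str                              # what to do
    context: dict = field(default_factory=dict)   # RPC URL, wallet, contracts


@dataclass
class Result:
    transactions: list = field(default_factory=list)  # signed transactions


class Agent:
    def execute(self, task: Task) -> Result:
        raise NotImplementedError


class PlannerAgent(Agent):
    """Delegates planning and signing to a callable, then wraps the output."""

    def __init__(self, planner):
        # Hypothetical callable: (description, context) -> iterable of signed txs.
        self.planner = planner

    def execute(self, task: Task) -> Result:
        signed = self.planner(task.description, task.context)
        return Result(transactions=list(signed))


if __name__ == "__main__":
    # A fake planner that returns a placeholder instead of a real signed tx.
    agent = PlannerAgent(lambda description, context: ["0xsigned-placeholder"])
    demo = Task(description="Send 1 ETH", context={"rpc_url": "http://127.0.0.1:8545"})
    print(agent.execute(demo))
```

Keeping the planning and signing logic behind a callable means it can be unit-tested without running the benchmark at all.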

04

Contribute New Tasks

Each task is a Python class with the four parts listed below. See CONTRIBUTING.md for the full template.

  • A natural-language description of the DeFi operation
  • Setup logic (fork state, fund wallets, deploy contracts)
  • Verification logic (assert on-chain state after execution)
  • Scoring rubric (partial credit, gas efficiency bonuses)
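
The four parts above might fit together as in the sketch below. This is not the template from CONTRIBUTING.md: the class name, method names, and the 0.9/0.1 split between partial credit and the gas bonus are all assumptions made for illustration, and a plain dict stands in for fork state.

```python
class EthTransferTask:
    """Hypothetical task: move 1 ETH from the agent wallet to a recipient."""

    # Natural-language description handed to the agent.
    description = "Send 1 ETH from the agent wallet to the recipient."
    amount = 1.0  # target transfer, in ETH

    def setup(self):
        # Stand-in for forking state and funding wallets: balances in ETH.
        return {"agent": 10.0, "recipient": 0.0}

    def verify(self, state):
        # Check the post-execution state: did the recipient get the full amount?
        return state["recipient"] >= self.amount

    def score(self, state, gas_used, gas_budget=21_000):
        # Partial credit: fraction of the target delivered, scaled to 0.9.
        partial = min(state["recipient"] / self.amount, 1.0)
        # Gas bonus (illustrative 0.1): only for a complete, in-budget transfer.
        bonus = 0.1 if partial == 1.0 and gas_used <= gas_budget else 0.0
        return 0.9 * partial + bonus


if __name__ == "__main__":
    task = EthTransferTask()
    state = task.setup()
    # Simulate the agent's transfer instead of executing real transactions.
    state["agent"] -= 1.0
    state["recipient"] += 1.0
    print(task.verify(state), round(task.score(state, gas_used=21_000), 2))
```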

05

Scoring

Each task is scored on a weighted basis:

COMPONENT                                         WEIGHT
Correctness (desired on-chain state achieved)     60%
Completeness (all subtasks finished)              25%
Efficiency (gas usage, number of transactions)    15%

The overall benchmark score is the weighted average across all 13 tasks, with Hard tasks weighted 3×, Medium 2×, and Easy 1×.
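The two levels of weighting can be written out as a short calculation. The sketch below assumes each component is reported as a fraction between 0 and 1, which the text does not state; the function names and input shapes are illustrative.

```python
# Per-task component weights (60% / 25% / 15%).
COMPONENT_WEIGHTS = {"correctness": 0.60, "completeness": 0.25, "efficiency": 0.15}

# Tier multipliers for the overall average: Easy 1x, Medium 2x, Hard 3x.
TIER_WEIGHTS = {"easy": 1, "medium": 2, "hard": 3}


def task_score(components):
    """Weighted sum of one task's component scores (each assumed in [0, 1])."""
    return sum(COMPONENT_WEIGHTS[name] * components[name] for name in COMPONENT_WEIGHTS)


def overall_score(results):
    """Tier-weighted average over a list of (tier, task_score) pairs."""
    results = list(results)
    total_weight = sum(TIER_WEIGHTS[tier] for tier, _ in results)
    weighted_sum = sum(TIER_WEIGHTS[tier] * score for tier, score in results)
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Fully correct, half complete, no efficiency credit: 0.60 + 0.125 + 0.
    one_task = task_score({"correctness": 1.0, "completeness": 0.5, "efficiency": 0.0})
    print(round(one_task, 3))
    # One perfect Easy task and one half-scored Hard task: (1*1.0 + 3*0.5) / 4.
    print(round(overall_score([("easy", 1.0), ("hard", 0.5)]), 3))
```

Because Hard tasks count three times as much as Easy ones, a perfect Easy result only partly offsets a weak Hard result in this example.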