agentbench

Benchmark your OpenClaw agent across 40 real-world tasks. Tests file creation, research, data analysis, multi-step workflows, memory, error handling, and tool efficiency. Not a coding benchmark — measures your agent setup and config.

Status: ACTIVE
Visibility: PUBLIC

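Installation

Copy the prompt below and send it to your AI assistant to complete the installation.

First check whether the SkillHub store is installed. If it is not, install it by following https://skillhub-1388575217.cos.ap-guangzhou.myqcloud.com/install/skillhub.md, then install the agentbench skill. If it is already installed, install the agentbench skill directly.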

Overview

Skill Key: exe215/agentbench
Author: exe215
Source Repo: openclaw/skills
Version: -
Source Path: skills/exe215/agentbench
Latest Commit SHA: ca3b265a98f7fca906fe5eed361720a63a51ef5f

Extracted Content

SKILL.md excerpt

# AgentBench for OpenClaw

Benchmark your OpenClaw agent's general capabilities across 40 real-world tasks spanning 7 domains.

## Commands

When the user says any of these, follow the corresponding instructions:

- **`/benchmark`** — Run the full benchmark suite (all 40 tasks)
- **`/benchmark --fast`** — Run only easy+medium tasks (19 tasks)
- **`/benchmark --suite <name>`** — Run one domain only
- **`/benchmark --task <id>`** — Run a single task
- **`/benchmark --strict`** — Tag the run's results as externally verified
- **`/benchmark-list`** — List all tasks grouped by domain
- **`/benchmark-results`** — Show results from previous runs
- **`/benchmark-compare`** — Compare two runs side-by-side

Flags are combinable: `/benchmark --fast --suite research`
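
To make the flag combinations concrete, here is a minimal Python sketch of how a `/benchmark` message could be split into its flags. The skill itself is a set of instructions for the agent, not a program, so the function name `parse_benchmark_command` and the use of `argparse` are illustrative assumptions. Only the flags listed above are recognized.

```python
import argparse
import shlex


def parse_benchmark_command(message: str) -> argparse.Namespace:
    """Split a '/benchmark ...' message into its flags (hypothetical helper)."""
    tokens = shlex.split(message)
    if not tokens or tokens[0] != "/benchmark":
        raise ValueError("not a /benchmark command")

    parser = argparse.ArgumentParser(prog="/benchmark", add_help=False)
    parser.add_argument("--fast", action="store_true")    # easy+medium only
    parser.add_argument("--strict", action="store_true")  # tag as externally verified
    parser.add_argument("--suite")                        # one domain only
    parser.add_argument("--task")                         # a single task id
    return parser.parse_args(tokens[1:])


args = parse_benchmark_command("/benchmark --fast --suite research")
print(args.fast, args.suite, args.task, args.strict)
```

Because each flag is parsed independently, any combination such as `--fast --suite research` yields one namespace that later steps can filter on.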

## Running a Benchmark

### Step 1: Discover Tasks

Read task.yaml files from the `tasks/` directory in this skill:

```
tasks/{suite-name}/{task-name}/task.yaml
```

Each task.yaml contains: name, id, suite, difficulty, mode, user_message, input_files, expected_outputs, expected_metrics, scoring weights.

Filter by `--suite` or `--task` if specified. If `--fast` is set and `--task` is not, filter to only tasks where difficulty is "easy" or "medium".

Profile is "fast" if `--fast` was specified, otherwise "full".

List discovered tasks with count and suites.
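
The filtering rules above can be sketched as a small Python function. This is an illustration under stated assumptions, not the skill's implementation: it assumes each task.yaml has already been parsed into a dict with `id`, `suite`, and `difficulty` keys, and the names `select_tasks` and `profile_name` as well as the sample task ids are hypothetical.

```python
from typing import Optional

FAST_DIFFICULTIES = {"easy", "medium"}


def select_tasks(tasks, suite: Optional[str] = None,
                 task_id: Optional[str] = None, fast: bool = False):
    """Apply the --suite, --task and --fast filters to parsed task dicts."""
    selected = list(tasks)
    if suite is not None:
        selected = [t for t in selected if t["suite"] == suite]
    if task_id is not None:
        selected = [t for t in selected if t["id"] == task_id]
    # --fast only narrows the set when no single task was requested.
    if fast and task_id is None:
        selected = [t for t in selected if t["difficulty"] in FAST_DIFFICULTIES]
    return selected


def profile_name(fast: bool) -> str:
    """Profile is 'fast' whenever --fast was given, otherwise 'full'."""
    return "fast" if fast else "full"


# Made-up tasks for illustration; real ids come from the task.yaml files.
sample = [
    {"id": "demo-a", "suite": "research", "difficulty": "medium"},
    {"id": "demo-b", "suite": "research", "difficulty": "hard"},
    {"id": "demo-c", "suite": "other", "difficulty": "easy"},
]
chosen = select_tasks(sample, suite="research", fast=True)
print(len(chosen), sorted({t["suite"] for t in chosen}), profile_name(True))
```

Note that a hard task requested with `--task` is still selected when `--fast` is also present, while the profile is still reported as "fast".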

### Step 2: Set Up Run Directory

Generate a run ID from the current timestamp: `YYYYMMDD-HHmmss`

Read `suite_version` from `skill.json` in this skill directory.

Create the results directory:
```
agentbench-results/{run-id}/
```

Announce: `Starting AgentBench run {run-id} | Profile: {profile} | Suite version: {suite_version} | Tasks: {count}`
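
As a rough Python sketch of this step: the helper names `make_run_id` and `start_run` are hypothetical, and the code assumes `suite_version` is a top-level key in `skill.json`, which the excerpt does not confirm. The demo writes a throwaway `skill.json` with a made-up version into a temporary directory so that it runs anywhere.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def make_run_id(now: datetime) -> str:
    """Format a timestamp as YYYYMMDD-HHmmss."""
    return now.strftime("%Y%m%d-%H%M%S")


def start_run(skill_dir: Path, results_root: Path, profile: str,
              task_count: int, now: datetime) -> str:
    """Create the run directory and return the announcement line."""
    run_id = make_run_id(now)
    # Assumption: suite_version is a top-level key in skill.json.
    skill_meta = json.loads((skill_dir / "skill.json").read_text(encoding="utf-8"))
    suite_version = skill_meta["suite_version"]
    (results_root / run_id).mkdir(parents=True, exist_ok=True)
    return (f"Starting AgentBench run {run_id} | Profile: {profile} | "
            f"Suite version: {suite_version} | Tasks: {task_count}")


with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp)
    # Made-up version string, for the demo only.
    (skill_dir / "skill.json").write_text(json.dumps({"suite_version": "0.0.0"}))
    line = start_run(skill_dir, skill_dir / "agentbench-results", "fast", 19,
                     datetime(2025, 1, 2, 3, 4, 5))
    print(line)
```

Passing the timestamp in as a parameter keeps the run ID reproducible in tests; in a real run it would be `datetime.now()`.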

### Step 3: Execute Each Task

For each task:

1. **Set up workspace**:
   - Create `/tmp/agentbench-task-{task-id}/` as workspace
   - Copy input files from `tasks/{suite}/{task}/inputs/` to the workspace (if inputs/ exists)
   - If the task directory contains a `setup.sh`: run `bash tasks/{suite}/{task}/setup.sh {wor...

README excerpt

# AgentBench for OpenClaw

Benchmark your OpenClaw agent's general capabilities across 40 real-world tasks spanning 7 domains.

Not a coding benchmark — tests file creation, research, data analysis, multi-step workflows, memory, error handling, and tool efficiency.

Same tasks and scoring as the [Claude Code version](https://github.com/agentbench/agentbench). Results are cross-platform comparable and submit to the same [leaderboard](https://www.agentbench.app/leaderboard).

## Install

Place this skill in your OpenClaw skills directory, or clone directly:

```bash
git clone https://github.com/agentbench/agentbench-openclaw.git ~/.openclaw/skills/agentbench
```

## Quick Start

```
/benchmark                              # Run all 40 tasks (full profile)
/benchmark --fast                       # Run 19 easy+medium tasks (fast profile)
/benchmark --suite research             # Run one domain
/benchmark --suite research --fast      # Run easy+medium in one domain
/benchmark --task research-summarize-doc # Run one task
/benchmark --strict                     # Tag as externally verified
```

## Domains

| Domain | Tasks | Difficulty | What It Tests |
|--------|-------|------------|---------------|
| File Creation | 9 | 2E, 3M, 4H | Documents, spreadsheets, project scaffolding, config migration, skill graphs |
| Research | 5 | 3M, 2H | Summarize, compare, multi-source synthesis, git archaeology |
| Data Analysis | 5 | 1E, 1M, 1H, 1X | Anomalies, statistics, multi-format reconciliation, log pattern detection |
| Multi-Step | 5 | 1M, 2H, 2X | Data pipelines, log analysis, repo refactoring, release preparation |
| Memory | 5 | 2M, 1H, 1X | Recall, constraints, context switching, progressive accumulation |
| Error Handling | 6 | 1E, 2M, 3H | Corrupted input, cascading failures, misleading errors, partial recovery |
| Tool Efficiency | 5 | 3E, 2H | Minimal reads, right tool choice, codebase navigation, targeted fixes |

*E=Easy, M=Medium, H=Hard, X=Expert*
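
The 40-task and 19-task figures quoted for the full and fast profiles can be checked against the table with a few lines of Python. The numbers below are transcribed from the table as printed. The sketch uses only the Tasks column and the E and M counts, since the difficulty breakdowns printed for Data Analysis and Memory add up to four tasks rather than five.

```python
# (tasks, easy, medium) per domain, transcribed from the table above.
DOMAINS = {
    "File Creation": (9, 2, 3),
    "Research": (5, 0, 3),
    "Data Analysis": (5, 1, 1),
    "Multi-Step": (5, 0, 1),
    "Memory": (5, 0, 2),
    "Error Handling": (6, 1, 2),
    "Tool Efficiency": (5, 3, 0),
}

# Full profile: every task in every domain.
total_tasks = sum(tasks for tasks, _, _ in DOMAINS.values())
# Fast profile: only the easy and medium tasks.
fast_tasks = sum(easy + medium for _, easy, medium in DOMAINS.values())

print(total_tasks, fast_tasks)  # 40 19
```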

## Scoring

Each t...
