Guides users through creating a complete Harbor task for evaluating agents. It walks through scaffolding the task layout, writing instruction.md, building the execution environment (Dockerfile or docker-compose), selecting and authoring verifiers (pytest, Reward Kit, or custom shell), writing an Oracle solution, and configuring task.toml and the README for discoverability. The guide emphasizes verifier design and gives practical tips for running Oracle and multi-step trials.
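As a rough illustration of the layout described above, the following sketch checks a task directory for the files the description names. It is not part of Harbor or the skill: the helper `missing_task_files`, the flat top-level placement of the files, and the exact names `README.md` and `docker-compose.yaml` are assumptions made for this example.

```python
from pathlib import Path

# Files named in the description. Placing them all at the top level of the
# task directory, and the name "README.md", are assumptions for this sketch.
REQUIRED_FILES = ("instruction.md", "task.toml", "README.md")

# The description allows either form of environment definition; the
# compose file name used here is an assumption.
ENVIRONMENT_FILES = ("Dockerfile", "docker-compose.yaml")


def missing_task_files(task_dir):
    """Return the names of expected files that are absent from task_dir."""
    root = Path(task_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # Only one of the environment files needs to exist.
    if not any((root / name).is_file() for name in ENVIRONMENT_FILES):
        missing.append("Dockerfile or docker-compose.yaml")
    return missing
```

Running a check like this before an Oracle trial could catch an incomplete scaffold early. The skill itself remains the reference for the real layout.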
Use this skill when you need to create or improve an agent evaluation: new benchmark tasks, stepwise multi-step tasks, or reproducible verification flows. It is especially useful when choosing how to grade agents (separate verifier vs shared environment), adding Reward Kit judges, or preparing tasks for Oracle verification.
Works with agents and tooling that run in containerized sandboxes and support orchestration via the Harbor CLI and Reward Kit-style verifiers (e.g., CLI-driven agents, evaluation harnesses, and LLM judges that can be invoked by Reward Kit).
Well-crafted skill for scaffolding Harbor evaluation tasks. No bundled scripts: the skill is a purely instructional SKILL.md. Covers the full lifecycle from init through Oracle verification, with three verifier options (Reward Kit, pytest, custom shell) and detailed network policy configuration. The common pitfalls section is a useful addition. Clean frontmatter with specific triggers and an argument hint.
No scripts to execute or audit. SKILL.md is documentation-only, guiding the agent through a multi-step CLI workflow. No security concerns identified. The skill is thorough and well structured, one of the better-written skills reviewed.