Guides users through creating a complete Harbor task for evaluating agents. It walks through scaffolding the task layout, writing instruction.md, building the execution environment (Dockerfile or docker-compose), selecting and authoring verifiers (pytest, Reward Kit, or custom shell), writing an Oracle solution, and configuring task.toml and the README for discoverability. The guide emphasizes verifier design and includes practical tips for running Oracle and multi-step trials.
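As a rough illustration of the scaffolding step, the sketch below creates placeholder files for a task directory. Only instruction.md, task.toml, the README, and the Dockerfile are named in the description above; the `environment/`, `tests/`, and `solution/` directory names, the file contents, and the `scaffold_task` helper are hypothetical and may not match the layout the Harbor CLI actually expects.

```python
from pathlib import Path

# Hypothetical task layout. The subdirectory names and placeholder contents
# are assumptions for illustration, not a confirmed Harbor convention.
TASK_FILES = {
    "instruction.md": "# Task\n\nDescribe what the agent must do.\n",
    "task.toml": "# Task metadata goes here.\n",
    "README.md": "# Task name\n\nShort summary for discoverability.\n",
    "environment/Dockerfile": "FROM python:3.12-slim\n",
    "tests/test_outputs.py": "def test_placeholder():\n    assert True\n",
    "solution/solve.sh": "#!/bin/sh\n# Oracle solution steps go here.\n",
}


def scaffold_task(root: Path) -> list[Path]:
    """Create placeholder files under root, skipping any that already exist."""
    created = []
    for rel_path, content in TASK_FILES.items():
        path = root / rel_path
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():
            path.write_text(content)
            created.append(path)
    return created
```

Calling `scaffold_task(Path("my-task"))` a second time creates nothing, so existing edits to a task are never overwritten.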
Use this skill when you need to create or improve an agent evaluation: new benchmark tasks, multi-step tasks, or reproducible verification flows. It is especially useful when choosing how to grade agents (separate verifier vs. shared environment), adding Reward Kit judges, or preparing tasks for Oracle verification.
Works with agents and tooling that run in containerized sandboxes and support orchestration via the Harbor CLI and Reward Kit-style verifiers, such as CLI-driven agents, evaluation harnesses, and LLM judges that Reward Kit can invoke.
This skill has not been reviewed by our automated audit pipeline yet.