Calibrate — Agent & Skill Benchmarking

Trust Score 84/100

Runs synthetic benchmarks and calibration tests for agents and skills: measures recall, precision, confidence calibration, and A/B comparisons to quantify instr

Triggers: calibrate, benchmark agents, confidence calibration, agent evaluation, ab-test


What it does

Calibrate is a benchmarking and calibration skill that generates synthetic problems with quasi-ground-truth, runs target agents/skills against them, and measures key signals: recall, severity accuracy, formatting/actionability, and confidence calibration. It supports A/B comparisons against a general-purpose baseline and produces compact JSON summaries and human-readable proposals for instruction improvements. The workflow emphasizes sequential, batched pipelines to avoid resource spikes and returns small, audit-friendly result files.
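To make the recall and A/B signals described above concrete, here is a minimal Python sketch. It assumes each run reports a set of issue identifiers that can be compared against the seeded quasi-ground-truth set. The names `RunScore`, `score_run`, and `ab_delta`, and the issue labels, are hypothetical and are not taken from the skill itself.

```python
from dataclasses import dataclass


@dataclass
class RunScore:
    recall: float
    precision: float


def score_run(reported: set[str], ground_truth: set[str]) -> RunScore:
    """Score one run: share of seeded issues found, and share of reports that were real."""
    true_positives = len(reported & ground_truth)
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    precision = true_positives / len(reported) if reported else 0.0
    return RunScore(recall=recall, precision=precision)


def ab_delta(target: RunScore, baseline: RunScore) -> dict[str, float]:
    """Positive values mean the target agent beat the general-purpose baseline."""
    return {
        "recall_delta": round(target.recall - baseline.recall, 3),
        "precision_delta": round(target.precision - baseline.precision, 3),
    }


# Hypothetical seeded problems and the findings reported by two runs.
ground_truth = {"sql-injection", "off-by-one", "missing-null-check", "race-condition"}
specialist = score_run(
    {"sql-injection", "off-by-one", "race-condition", "style-nit"}, ground_truth
)
baseline = score_run({"sql-injection", "style-nit"}, ground_truth)

print(specialist)
print(baseline)
print(ab_delta(specialist, baseline))
```

Set intersection keeps the scoring order-independent, which suits findings that agents may report in any order. Severity accuracy and formatting scores would need their own rubric and are left out of this sketch.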

When to use it

Use Calibrate before and after major instruction or system-prompt changes, when adding new specialist agents, or to validate whether an agent's self-reported confidence matches real performance. It's useful for plugin authors and ops teams who need repeatable benchmarks and an actionable improvement loop (proposals, apply, re-run).
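One common way to check whether self-reported confidence matches real performance is expected calibration error (ECE). The sketch below is an illustration under assumptions, not the skill's confirmed scoring method: it assumes each result is a pair of a confidence in [0, 1] and a boolean for whether the answer was correct, and the function name and bin count are arbitrary choices.

```python
def expected_calibration_error(
    results: list[tuple[float, bool]], n_bins: int = 5
) -> float:
    """Weighted mean of |accuracy - mean confidence| over equal-width confidence bins."""
    if not results:
        return 0.0
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for confidence, correct in results:
        # A confidence of exactly 1.0 belongs in the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))
    total = len(results)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - mean_confidence)
    return ece


# Hypothetical agent that claims 90% confidence but is right half the time.
overconfident = [(0.9, True), (0.9, False), (0.9, True), (0.9, False)]
# Hypothetical agent whose 50% confidence matches its 50% hit rate.
well_calibrated = [(0.5, True), (0.5, False)]

print(expected_calibration_error(overconfident))
print(expected_calibration_error(well_calibrated))
```

A value near zero means stated confidence tracks observed accuracy; larger values indicate over- or under-confidence. Comparing the value before and after an instruction change shows whether calibration moved.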

What's included

Scripts: pipeline templates and mode files (modes/, templates/) describing agent/skill benchmarks.
References: detailed constants and workflow guidance baked into the skill file; scoring and logging conventions that integrate with .claude/logs/calibrations.jsonl.
Instructions: parsed flags, target resolution, batching rules, timeout handling, and proposal application semantics.
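The logging convention mentioned above can be pictured as appending one compact JSON object per line to .claude/logs/calibrations.jsonl. The following Python sketch assumes that layout; the record fields (`target`, `mode`, `recall`, `timestamp`) and both helper names are hypothetical, since the listing does not document the actual schema. The demo writes into a temporary directory so it does not touch a real project.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_calibration_record(log_path: Path, record: dict) -> None:
    """Append one compact JSON object as a single line, creating parent folders if needed."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, separators=(",", ":"), sort_keys=True) + "\n")


def read_calibration_records(log_path: Path) -> list[dict]:
    """Read the log back, skipping blank lines."""
    with log_path.open("r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as workdir:
    # Same relative path as the log named in the listing, rooted in a temp dir.
    log_path = Path(workdir) / ".claude" / "logs" / "calibrations.jsonl"
    for recall in (0.62, 0.75):
        append_calibration_record(
            log_path,
            {
                "target": "example-agent",  # hypothetical field names throughout
                "mode": "agents",
                "recall": recall,
                "timestamp": datetime.now(timezone.utc).isoformat(),
            },
        )
    records = read_calibration_records(log_path)

print(len(records), [r["recall"] for r in records])
```

Append-only JSONL keeps each run as an independent line, so before-and-after comparisons can be made by reading the file back without rewriting earlier entries.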

Compatible agents

Designed for environments that can spawn subagents and run file-based pipelines (Claude-based foundry agents, Codex runners). Works with general-purpose scorers and Codex scorers in dual-evaluation setups. Ideal for Claude Code / foundry-like runtimes where Agent/Task primitives exist.

Audit Summary

Extensive agent calibration/benchmarking skill for the ai-rig ecosystem. Very detailed SKILL.md with thorough workflow, constants, edge cases, and multi-mode support. No scripts bundled; it relies entirely on Claude Code agent orchestration. Limited usefulness outside its specific plugin ecosystem.

Watch Out

Requires Claude Code agent infrastructure (CLAUDE_PLUGIN_ROOT, .claude/ directory) — not portable to other agent platforms
No bundled scripts — all execution is agent-orchestrated via subagent spawning
Heavily coupled to the ai-rig/foundry plugin ecosystem (modes/, templates/, _shared/)
disable-model-invocation: true means the model will not invoke the skill on its own; it has to be triggered explicitly by the user

Notes

Security: No curl-pipe-bash, no hardcoded creds, no destructive commands. Minor deduction for shell variable interpolation in bash blocks (theoretically injectable if ARGUMENTS contain malicious content, but context is agent invocation so low risk).
Code quality: Very well-documented and structured, but extreme length makes it hard to follow; some redundancy between workflow steps and notes.
Architecture: Good frontmatter, clear separation of concerns via mode files, but monolithic SKILL.md (~3000+ words in workflow alone).
Usefulness: Niche; only valuable to teams running the ai-rig foundry plugin ecosystem for agent benchmarking.
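The shell-interpolation concern in the security note can be illustrated in Python with the standard library's `shlex.quote`, which makes untrusted text arrive as a single literal argument. This is a generic sketch of the mitigation, not code from the skill; `run_benchmark.sh` and `build_command` are hypothetical names.

```python
import shlex


def build_command(script: str, user_arguments: str) -> str:
    """Quote each piece so the shell treats untrusted text as one literal argument."""
    return f"{shlex.quote(script)} {shlex.quote(user_arguments)}"


# Hypothetical argument string carrying an injected second command.
malicious = "agent-a; rm -rf ~"
safe = build_command("run_benchmark.sh", malicious)

print(safe)
# Splitting the command the way a POSIX shell would yields exactly two tokens,
# so the injected text stays data and is never run as a separate command.
print(shlex.split(safe))
```

Passing arguments as a list to `subprocess.run` without `shell=True` avoids the problem altogether; quoting matters mainly when a command string has to be assembled for a shell.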

Information

Repository: ai-rig
Stars: 9

Trust Score

Overall: 84
Security: 88
Code Quality: 78
Architecture: 82
Usefulness: 42

