
from agents-in-a-box9
SRE patterns for running and protecting autonomous agents: cost caps, circuit breakers, stall detection, observability, and runbooks to recover from incidents.
Provides production-ready SRE patterns tailored for AI agents. Includes concrete checks and small utilities to enforce cost caps, stop runaway loops, detect stalled behavior, and surface health metrics so operators can observe and act.
Use this skill when running long-lived or autonomous agent sessions, coordinating multi-agent workflows, or when you need guardrails to prevent cost or behavior spirals (failed APIs, infinite loops, excessive retries). Also useful during debugging and incident response.
Works with agent runtimes that can run Python/bash snippets and integrate environment variables and local files (Claude/Claude-Code, Cursor, local agent runners).
Agent Ops provides SRE patterns (cost caps, circuit breakers, stall detection, observability, runbooks) for autonomous AI agent workflows. No bundled scripts — all code is inline in SKILL.md as examples. Well-written reference material with practical patterns, though the inline bash script uses template placeholders like {{TOOL_DIR}} that won't resolve outside the agents-in-a-box ecosystem. Security is solid with no risky patterns; one minor concern is the bash health script embedding $METRICS path directly in a python -c string, which could be injection-prone if the path were user-controlled (though here it's derived from $HOME).
Well-structured SRE reference skill. Code examples are idiomatic Python/bash. The skill is documentation-first rather than tool-first, which limits direct usability but makes it a good instructional resource. Would benefit from having actual runnable scripts in a scripts/ directory rather than just inline code blocks.