
from agents-in-a-box9
SRE patterns for running and protecting autonomous agents: cost caps, circuit breakers, stall detection, observability, and runbooks to recover from incidents.
Provides production-ready SRE patterns tailored for AI agents. Includes concrete checks and small utilities to enforce cost caps, stop runaway loops, detect stalled behavior, and surface health metrics so operators can observe and act.
Use this skill when running long-lived or autonomous agent sessions, coordinating multi-agent workflows, or when you need guardrails to prevent cost or behavior spirals (failed APIs, infinite loops, excessive retries). Also useful during debugging and incident response.
Works with agent runtimes that can run Python/bash snippets and integrate environment variables and local files (Claude/Claude-Code, Cursor, local agent runners).
This skill has not been reviewed by our automated audit pipeline yet.