
from claude-plugins
Evaluation framework and tools for systematically measuring LLM performance using automated metrics, human judgment, and A/B testing.
Provides a practical framework and code examples for evaluating LLM applications: automated metrics (BLEU, ROUGE, BERTScore, perplexity), classification and retrieval metrics, human evaluation guidelines, LLM-as-judge patterns, and A/B testing with regression detection. Includes sample Python snippets for metric calculations and benchmarking workflows.
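As a flavor of the metric calculations involved, here is a minimal sketch of corpus-level BLEU plus per-example ROUGE-L over a batch of model outputs. It assumes the `sacrebleu` and `rouge-score` packages are installed; the example strings are illustrative only and not part of the skill itself.

```python
# Compute corpus BLEU and mean ROUGE-L F1 for a small batch of predictions.
# Assumes `pip install sacrebleu rouge-score`; data below is illustrative.
import sacrebleu
from rouge_score import rouge_scorer

predictions = ["the cat sat on the mat", "paris is the capital of france"]
references  = ["the cat is sitting on the mat", "the capital of france is paris"]

# Corpus BLEU: sacrebleu takes the predictions plus a list of reference
# streams, each stream parallel to the predictions.
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(f"BLEU: {bleu.score:.2f}")

# ROUGE-L F1, averaged over examples.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = [
    scorer.score(ref, pred)["rougeL"].fmeasure
    for ref, pred in zip(references, predictions)
]
print(f"ROUGE-L (mean F1): {sum(rouge_l) / len(rouge_l):.3f}")
```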
Use this skill when you need to validate model performance, compare prompts or models, detect regressions before deployment, run benchmark suites, or set up human annotation tasks and A/B tests. Good for CI/CD integration and reporting.
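For the regression-detection and CI/CD use case, a gate can be as simple as comparing a candidate run's mean score against a stored baseline and failing the build on a meaningful drop. The sketch below assumes a JSON layout, file paths, and threshold chosen for illustration; they are not a fixed interface of the skill.

```python
# Minimal regression gate for CI: fail the job if the candidate's mean score
# drops more than MAX_DROP below the baseline. Paths, JSON layout, and the
# threshold are illustrative assumptions.
import json
import sys
from statistics import mean

BASELINE_PATH = "eval/baseline_scores.json"    # e.g. {"scores": [0.81, 0.77, ...]}
CANDIDATE_PATH = "eval/candidate_scores.json"
MAX_DROP = 0.02  # allowed absolute drop on a 0-1 metric (2 percentage points)

def load_scores(path: str) -> list[float]:
    with open(path) as f:
        return json.load(f)["scores"]

baseline = mean(load_scores(BASELINE_PATH))
candidate = mean(load_scores(CANDIDATE_PATH))
delta = candidate - baseline

print(f"baseline={baseline:.4f} candidate={candidate:.4f} delta={delta:+.4f}")
if delta < -MAX_DROP:
    print("Regression detected: candidate falls below baseline beyond the allowed margin.")
    sys.exit(1)  # non-zero exit fails the CI job
print("No regression detected.")
```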
Compatible with agents that can run Python snippets and call web or model APIs (Copilot/Codex-style agents, Claude Code, and Gemini where Python execution is supported).
This skill has not been reviewed by our automated audit pipeline yet.