
from claude-plugins32
Evaluation framework and tools for systematically measuring LLM performance using automated metrics, human judgment, and A/B testing.
Provides a practical framework and code examples for evaluating LLM applications: automated metrics (BLEU, ROUGE, BERTScore, perplexity), classification and retrieval metrics, human evaluation guidelines, LLM-as-judge patterns, A/B testing, and regression detection. Includes sample Python snippets for metric calculations and benchmarking workflows.
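To give a sense of what a metric-calculation snippet of this kind can look like, here is a minimal sketch of a unigram-overlap F1 score in the spirit of ROUGE-1. It is not taken from the skill itself. The function name `unigram_f1` and the whitespace tokenization are assumptions made for illustration, and a real evaluation would more likely rely on an established metrics library.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 over lowercased, whitespace-split tokens.

    Hypothetical helper for illustration only; it does no stemming or
    punctuation handling, so scores will differ from official ROUGE tools.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0

    # Clipped overlap: each token counts at most as often as it appears
    # in both the candidate and the reference.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(unigram_f1("the cat", "the cat sat on the mat"))
```

Clipping the overlap with a `Counter` intersection keeps a candidate that repeats one matching word from inflating its precision.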
Use this skill when you need to validate model performance, compare prompts or models, detect regressions before deployment, run benchmark suites, or set up human annotation tasks and A/B tests. Good for CI/CD integration and reporting.
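As a rough illustration of regression detection before deployment, the sketch below compares per-example scores from a baseline and a candidate run and flags a regression when the mean drops by more than a tolerance. The function name `detect_regression`, the result keys, and the default tolerance of 0.02 are all assumptions for this example, not part of the skill. A production check would usually add a significance test rather than a fixed threshold.

```python
from statistics import mean
from typing import Dict, Sequence, Union


def detect_regression(
    baseline_scores: Sequence[float],
    candidate_scores: Sequence[float],
    tolerance: float = 0.02,  # assumed threshold; tune per metric
) -> Dict[str, Union[float, bool]]:
    """Flag a regression when the candidate's mean score drops too far.

    Hypothetical helper: both sequences are expected to hold scores for
    the same evaluation examples, in the same order.
    """
    if not baseline_scores or len(baseline_scores) != len(candidate_scores):
        raise ValueError("score lists must be non-empty and equal in length")

    delta = mean(candidate_scores) - mean(baseline_scores)
    return {"delta": delta, "regressed": delta < -tolerance}


if __name__ == "__main__":
    result = detect_regression([0.80, 0.90, 0.85], [0.60, 0.70, 0.65])
    print(result)
```

In a CI/CD job, a script like this could exit with a non-zero status whenever `regressed` is true, which would block the deployment step.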
Compatible with agents that can run Python snippets and call web or model APIs (Copilot/Codex-style agents, Claude Code, and Gemini where Python execution is supported).
LLM Evaluation is a reference-style skill providing code snippets and conceptual guidance for evaluating LLM outputs using automated metrics (BLEU, ROUGE, BERTScore), human evaluation, LLM-as-judge patterns, and A/B testing. It contains no executable scripts — all code is inline Python snippets in the SKILL.md body. The content is well-organized and educational but acts more as a cheat sheet than an actionable skill an agent can execute.
This is essentially a knowledge-base article packaged as a skill. No security concerns: no scripts, no network calls, no credentials. Code quality is decent for snippets, but the snippets are not runnable as-is. The structure follows the basic frontmatter convention but puts all content in the SKILL.md body, with no supporting files. Useful as reference material but limited as an executable agent skill.