
from claude-plugins
Evaluation framework and tools for systematically measuring LLM performance using automated metrics, human judgment, and A/B testing.
Provides a practical framework and code examples for evaluating LLM applications: automated metrics (BLEU, ROUGE, BERTScore, perplexity), classification and retrieval metrics, human evaluation guidelines, LLM-as-judge patterns, and A/B testing with regression detection. Includes sample Python snippets for metric calculations and benchmarking workflows.
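As a flavor of the metric calculations involved, here is a minimal sketch of corpus-level BLEU plus per-example ROUGE-L over a batch of model outputs. It assumes the `sacrebleu` and `rouge-score` packages are installed; the example strings are illustrative only and not part of the skill itself.

```python
# Compute corpus BLEU and mean ROUGE-L F1 for a small batch of predictions.
# Assumes `pip install sacrebleu rouge-score`; data below is illustrative.
import sacrebleu
from rouge_score import rouge_scorer

predictions = ["the cat sat on the mat", "paris is the capital of france"]
references  = ["the cat is sitting on the mat", "the capital of france is paris"]

# Corpus BLEU: sacrebleu takes the predictions plus a list of reference
# streams, each stream parallel to the predictions.
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(f"BLEU: {bleu.score:.2f}")

# ROUGE-L F1, averaged over examples.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = [
    scorer.score(ref, pred)["rougeL"].fmeasure
    for ref, pred in zip(references, predictions)
]
print(f"ROUGE-L (mean F1): {sum(rouge_l) / len(rouge_l):.3f}")
```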
Use this skill when you need to validate model performance, compare prompts or models, detect regressions before deployment, run benchmark suites, or set up human annotation tasks and A/B tests. Good for CI/CD integration and reporting.
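For the regression-detection and CI/CD use case, a gate can be as simple as comparing a candidate run's mean score against a stored baseline and failing the build on a meaningful drop. The sketch below assumes a JSON layout, file paths, and threshold chosen for illustration; they are not a fixed interface of the skill.

```python
# Minimal regression gate for CI: fail the job if the candidate's mean score
# drops more than MAX_DROP below the baseline. Paths, JSON layout, and the
# threshold are illustrative assumptions.
import json
import sys
from statistics import mean

BASELINE_PATH = "eval/baseline_scores.json"    # e.g. {"scores": [0.81, 0.77, ...]}
CANDIDATE_PATH = "eval/candidate_scores.json"
MAX_DROP = 0.02  # allowed absolute drop on a 0-1 metric (2 percentage points)

def load_scores(path: str) -> list[float]:
    with open(path) as f:
        return json.load(f)["scores"]

baseline = mean(load_scores(BASELINE_PATH))
candidate = mean(load_scores(CANDIDATE_PATH))
delta = candidate - baseline

print(f"baseline={baseline:.4f} candidate={candidate:.4f} delta={delta:+.4f}")
if delta < -MAX_DROP:
    print("Regression detected: candidate falls below baseline beyond the allowed margin.")
    sys.exit(1)  # non-zero exit fails the CI job
print("No regression detected.")
```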
Compatible with agents that can run Python snippets and call web or model APIs (Copilot/Codex-style agents, Claude Code, and Gemini where Python execution is supported).
This skill has not been reviewed by our automated audit pipeline yet.