
from nemo-platform44
Turn a domain rubric and dataset into a reproducible evaluation using the NeMo Evaluator SDK; generate configs, run local evaluations, and explain scores and fa
NeMo Evaluator turns expert rubrics and benchmark datasets into reproducible evaluations. It maps rubric criteria to SDK metric primitives, generates human-reviewable configs/artifacts, runs local or remote evaluations, and explains both row-level and aggregate results with troubleshooting guidance.
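As a rough illustration of the flow described above (rubric criteria mapped to metrics, a reviewable config, then row-level and aggregate results), here is a minimal pure-Python sketch. It does not use the NeMo Evaluator SDK: `MetricSpec`, `CRITERION_TO_METRIC`, `build_config`, `summarize`, the metric names, and the dataset path are all hypothetical names chosen for the example, and the real SDK's primitives and config schema may look quite different.

```python
import json
from dataclasses import dataclass, asdict

# All names below are hypothetical; none are NeMo Evaluator SDK identifiers.

@dataclass
class MetricSpec:
    criterion: str    # rubric criterion this metric scores
    metric: str       # illustrative metric kind
    threshold: float  # minimum passing score in [0, 1]

# Assumed mapping from rubric criterion type to a metric kind.
CRITERION_TO_METRIC = {
    "exact_answer": "exact_match",
    "factual_grounding": "llm_judge",
    "format": "regex_match",
}

def build_config(rubric, dataset_path):
    """Turn rubric entries into a plain dict that a human can review as JSON."""
    specs = []
    for item in rubric:
        kind = item["type"]
        if kind not in CRITERION_TO_METRIC:
            raise ValueError(f"no metric mapped for criterion type {kind!r}")
        specs.append(
            MetricSpec(item["name"], CRITERION_TO_METRIC[kind], item.get("threshold", 0.5))
        )
    return {"dataset": dataset_path, "metrics": [asdict(s) for s in specs]}

def summarize(config, rows):
    """Report the mean score and the indices of failing rows for each criterion."""
    summary = {}
    for m in config["metrics"]:
        scores = [row[m["criterion"]] for row in rows]
        failed = [i for i, s in enumerate(scores) if s < m["threshold"]]
        summary[m["criterion"]] = {
            "mean": sum(scores) / len(scores),
            "failed_rows": failed,
        }
    return summary

rubric = [
    {"name": "answer_correct", "type": "exact_answer", "threshold": 1.0},
    {"name": "grounded", "type": "factual_grounding", "threshold": 0.7},
]
config = build_config(rubric, "data/benchmark.jsonl")  # placeholder path
print(json.dumps(config, indent=2))

# Per-row scores would normally come from running the metrics; these are made up.
rows = [
    {"answer_correct": 1.0, "grounded": 0.9},
    {"answer_correct": 0.0, "grounded": 0.5},
]
report = summarize(config, rows)
print(json.dumps(report, indent=2))
```

Keeping the config as plain JSON is one way to make it reviewable before a run, and recording failing row indices next to each aggregate mean lets a reader trace a low score back to specific examples.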
Use when you need a repeatable, auditable evaluation pipeline for model or system benchmarks: judge-quality checks, generation-quality tests, RAG/tool-calling evaluations, or bring-your-own-benchmark reproduction. Ideal for ML engineers, evaluation leads, or platform teams.
references/metric-selection.md, references/sdk-execution.md, references/benchmark-reproduction.md, references/troubleshooting.md
Designed for SDK-aware agents and developer tooling workflows (Python-based NeMo Platform, CLI-driven pipelines, and assistant agents that can surface SDK snippets).
This skill has not been reviewed by our automated audit pipeline yet.