Agent Skills

prompt-engineering, llm, agentic

Prompt Engineering

ai-driven-development

Techniques and templates for crafting precise, reliable prompts and agentic instructions across LLMs.

arize, experiments, evaluation

Arize Experiment

awesome-copilot

Create, run, and analyze Arize experiments to evaluate and compare model performance using the ax CLI.

34,827

evaluation, optimization, llm

Semantic Gradient Optimization (SGO)

sgo

Optimize a product, resume, pitch, or profile against simulated evaluator populations using LLM scoring and counterfactual probes to find changes that improve r

evaluation, omnidocbench, document-parsing

OmniDocBench Evaluation Helper

opendatalab

Run, validate, and parse OmniDocBench document parsing evaluations with Docker/conda workflows and result parsing.

authoring, testing, evaluation

Skill Creator (Claw)

skill-creator-claw

Create, test, and iteratively improve OpenClaw skills; includes eval workflows, test-case guidance, and packaging tools.

multi-model, model-comparison, dashboard

LLM Council — Multi-model Consensus Dashboard

happycapy-skills

Run parallel queries against multiple models and view side-by-side responses in a live dashboard; synthesize consensus and run anonymous model voting.

122

skill-authoring, testing, evaluation

Skill Creator

my_arxiv_daily

Create, iterate on, and evaluate Agent Skills: draft SKILL.md, run test prompts, collect quantitative and qualitative feedback, and optimize skill descriptions

183

document-evaluation, evidence-verification, multi-agent

Brain in the Fish — Document Evaluation

brain-in-the-fish

Evaluates documents against criteria using multi-agent, ontology-grounded scoring to produce evidence-backed, auditable evaluation reports.

benchmarking, calibration, evaluation

Calibrate — Agent & Skill Benchmarking

ai-rig

Runs synthetic benchmarks and calibration tests for agents and skills: measures recall, precision, confidence calibration, and A/B comparisons to quantify instr

skill-authoring, evaluation, benchmarking

Skill Creator

awesome-omni-skill

Guided workflow for drafting, testing, and iterating Agent Skills: write SKILL.md, run evals, grade outputs, and improve descriptions to improve triggering accu

skill-authoring, evaluation, benchmarking

Skill Creator

claude-superskills

Create, improve, and evaluate Agent Skills with a guided workflow: capture intent, draft SKILL.md, run evals and benchmarks, and optimize triggering description

research, literature-scouting, baseline-discovery

Scout

DeepScientist

Frame ambiguous research or engineering tasks: clarify goals, metrics, datasets, and a justified baseline so work can proceed with confidence.

2,620

LLM Evaluation

claude-plugins

Evaluation framework and tools for systematically measuring LLM performance using automated metrics, human judgment, and A/B testing.

evaluation, llm, metrics

prompt-engineering, multi-agent, orchestration

RePrompter

reprompter

Turn messy prompts into structured, scoreable prompts or multi-agent plans; also reverse-engineer great outputs into reusable prompts.

machine-learning, ml-pipeline, model-training

Cortex — ML Pipeline Builder

tonone

Build a reproducible ML training-to-serving pipeline: data validation, feature engineering, training, evaluation, and a serving endpoint.

ISC-Bench

Benchmark and pipeline for evaluating model safety (Internal Safety Collapse) using the TVD framework; builds prompts, runs models, extracts harmful content, an

safety, benchmark, jailbreak

775

evaluation, testing, code-quality

Code Evaluation Harness

my-cc-harness

Spawns an independent evaluator agent to score code outputs on functionality, code quality, originality, and usability/security, producing a structured EVAL_REP

122

Harness Health Audit

my-cc-harness

Assess and score the overall health of a Claude Code harness across architecture, skills coverage, hooks, rules, MCP servers, eval pipelines, and team setup.

audit, harness, claude-code

122

huggingface, evaluation, model-cards

Hugging Face Evaluation Manager

skillattack

Extract, import, and add structured model evaluation results to Hugging Face model cards; run or import benchmark evaluations and generate model-index YAML for

Review Merge Worker

Claude Code Skills

Merges and deduplicates evaluation worker results into a single verified review summary for coordinators, applying an architecture gate and verification checks.

evaluation, review, merge

444

Building Agent Systems

code-abyss

Reference guide for designing single- and multi-agent LLM systems, RAG pipelines, security hardening, and prompt engineering for production agents.

agents, multi-agent, rag

218

Langium-AI (LAI)

langium-ai

CLI to generate language descriptors, synthesize system prompts, and run evaluations to iteratively refine AI tooling for Langium projects.

langium, dsl, cli

search, opensearch, semantic-search

OpenSearch Launchpad

opensearch-agent-skills

Guides users to design, build, and run OpenSearch-based search applications — from data ingestion and index design to semantic/hybrid search and evaluation.

OpenJudge — LLM Evaluation Pipeline

openjudge

Tools and patterns to build automated evaluation pipelines for LLMs: graders, runners, aggregators, and analysis utilities for comparing model outputs and scori

evaluation, grader, llm

639

skill-authoring, evaluation, benchmarking

Skill Creator

sage

Create, iterate on, and evaluate Agent Skills: write SKILL.md, design test cases, run benchmarks, and improve triggering and performance.

1,197

benchmark, evaluation, testing

Claw Bench

claw-bench

Benchmarking skill that guides an agent through a structured suite of capability tests and reporting steps for leaderboard submission.

173

research, evaluation, novelty

Evaluation Filtering

de-anthropocentric-research-engine

Multi-stage evaluation pipeline that scores ideas for novelty, assesses feasibility, ranks them, and selects top candidates for further development.

381

model-reference, retrieval, nemotron

Nemotron 3 Nano Reference

nemotron

Authoritative knowledge-base for Nemotron 3 Nano: architecture, training data, SFT/RL recipes, evaluation and deployment guidance.

1,461

evaluation, benchmarking, nemo

NeMo Evaluator

nemo-platform

Turn a domain rubric and dataset into a reproducible evaluation using the NeMo Evaluator SDK; generate configs, run local evaluations, and explain scores and fa