SKILL.md packages that extend Claude Code, Cursor, Copilot, and other AI agents.

happycapy-skills
Run parallel queries against multiple models and view side-by-side responses in a live dashboard; synthesize consensus and run anonymous model voting.

brain-in-the-fish
Evaluates documents against criteria using multi-agent, ontology-grounded scoring to produce evidence-backed, auditable evaluation reports.

claude-superskills
Create, improve, and evaluate Agent Skills with a guided workflow: capture intent, draft SKILL.md, run evals and benchmarks, and optimize triggering description

claude-plugins
Evaluation framework and tools for systematically measuring LLM performance using automated metrics, human judgment, and A/B testing.

reprompter
Turn messy prompts into structured, scoreable prompts or multi-agent plans; also reverse-engineer great outputs into reusable prompts.

tonone
Build a reproducible ML training-to-serving pipeline: data validation, feature engineering, training, evaluation, and a serving endpoint.

my-cc-harness
Spawns an independent evaluator agent to score code outputs on functionality, code quality, originality, and usability/security, producing a structured EVAL_REP

my-cc-harness
Assess and score the overall health of a Claude Code harness across architecture, skills coverage, hooks, rules, MCP servers, eval pipelines, and team setup.

skillattack
Extract, import, and add structured model evaluation results to Hugging Face model cards; run or import benchmark evaluations and generate model-index YAML for

sgo
Optimize a product, resume, pitch, or profile against simulated evaluator populations using LLM scoring and counterfactual probes to find changes that improve r

ai-rig
Runs synthetic benchmarks and calibration tests for agents and skills: measures recall, precision, confidence calibration, and A/B comparisons to quantify instr

skill-creator-claw
Create, test, and iteratively improve OpenClaw skills; includes eval workflows, test-case guidance, and packaging tools.

Claude Code Skills
Merges and deduplicates evaluation worker results into a single verified review summary for coordinators, applying an architecture gate and verification checks.

opendatalab
Run, validate, and parse OmniDocBench document parsing evaluations with Docker/conda workflows and result parsing.

awesome-omni-skill
Guided workflow for drafting, testing, and iterating Agent Skills: write SKILL.md, run evals, grade outputs, and improve descriptions to improve triggering accu

ai-driven-development
Techniques and templates for crafting precise, reliable prompts and agentic instructions across LLMs.

awesome-copilot
Create, run, and analyze Arize experiments to evaluate and compare model performance using the ax CLI.

DeepScientist
Frame ambiguous research or engineering tasks: clarify goals, metrics, datasets, and a justified baseline so work can proceed with confidence.

harbor
Scaffold, configure, and verify a Harbor evaluation task end-to-end, including prompts, environment, verifier selection, and Oracle solution.

ISC-Bench
Benchmark and pipeline for evaluating model safety (Internal Safety Collapse) using the TVD framework; builds prompts, runs models, extracts harmful content, an