Agent Skills

ROCm Triton Kernels (RMSNorm, RoPE 3D, GEGLU, AdaLN)

kernels

Guidance and examples for writing, benchmarking, and integrating optimized Triton kernels on ROCm (MI355X, R9700) for diffusers and transformers workloads.

gpu, rocm, triton

599

arize, experiments, evaluation

Arize Experiment

awesome-copilot

Create, run, and analyze Arize experiments to evaluate and compare model performance using the ax CLI.

34,827

8 triggers

Choosing QEC Decoders

qec-autoresearch-skills

Guidance for selecting quantum error-correction decoder backends based on artifact shape, code family, noise model, and validation goals.

quantum, qec, decoders

benchmarking, orchestration, devops

Run Benchmarks

codescalebench

Launch, manage, and rerun CodeScaleBench benchmark suites with safety guardrails, paired baseline+full execution, and orchestration utilities.

authoring, testing, evaluation

Skill Creator (Claw)

skill-creator-claw

Create, test, and iteratively improve OpenClaw skills; includes eval workflows, test-case guidance, and packaging tools.

skill-authoring, evaluation, benchmarking

Skill Creator

Awesome Omni Skill

Guided workflow for drafting, testing, and iterating Agent Skills: write SKILL.md, run evals, grade outputs, and improve descriptions to improve triggering accu

skill-authoring, evaluation, benchmarking

Skill Creator

claude-superskills

Create, improve, and evaluate Agent Skills with a guided workflow: capture intent, draft SKILL.md, run evals and benchmarks, and optimize triggering description

benchmarking, calibration, evaluation

Calibrate — Agent & Skill Benchmarking

ai-rig

Runs synthetic benchmarks and calibration tests for agents and skills: measures recall, precision, confidence calibration, and A/B comparisons to quantify instr

AgentPuzzles

ide-agent-kit

Competitive puzzle arena API for AI agents: timed puzzles, per-model leaderboards, puzzle creation and moderation.

puzzles, benchmarking, api

civic-analytics, policy-analysis, open-data

City Policy Analysis — Master Orchestrator

civic-analytics-agent-workflow-claude-skill

A master workflow for city policy analysis and civic innovation: frames problems, runs evidence-based analysis, crafts communications, benchmarks across cities,

performance, profiling, optimization

Performance Optimizer

dotfiles

Guides profiling and targeted optimizations for code and systems — measure, identify bottlenecks, and verify improvements across Python, Node, shell, and system

117

LLM Evaluation

claude-plugins

Evaluation framework and tools for systematically measuring LLM performance using automated metrics, human judgment, and A/B testing.

evaluation, llm, metrics

performance, profiling, optimization

Performance Optimizer

claude-skill-registry

Guided performance analysis and profiling playbook for identifying bottlenecks and optimizing code, scripts, and systems across Python, Node.js, shell, and infr

483

performance, benchmarking, web-vitals

Performance Benchmark

gStack

Measure and detect performance regressions for web pages using automated benchmarks, baselines, and trend analysis.

cli, model-profiling, aot-compilation

SkVM General

jiuwenswarm

Drive the skvm CLI to profile models, AOT-compile skills, run single-task executions and benchmarks, and manage compilation/jit proposals via safe CLI workflows

813

huggingface, evaluation, model-cards

Hugging Face Evaluation Manager

skillattack

Extract, import, and add structured model evaluation results to Hugging Face model cards; run or import benchmark evaluations and generate model-index YAML for

python, performance, profiling

Python Performance Optimization

opencode-skills-collection

Profile, analyze, and optimize Python applications for CPU and memory efficiency using profiling tools and performance best practices.

HPC Python Patterns

claude-skill-registry

Guidance and patterns for Python parallelism and GPU/CPU performance: threading vs multiprocessing vs asyncio, CUDA streams, PyTorch DDP, and benchmarking.

python, hpc, gpu

483

performance, benchmarking, web-vitals

Performance Benchmark (gstack)

gstack-ko

Run automated performance baselines and regression detection for web pages (TTFB, FCP, LCP, bundle sizes, requests) and compare against historical baselines.

performance, profiling, audit

TAO Performance Audit

tao

Structured performance-audit methodology: measure, identify bottlenecks, optimize the true hotspot, and verify improvements with benchmarks.

performance, benchmarking, web-vitals

Benchmark (Performance Regression Detection)

OStack SaaS

Automated performance benchmarking and regression detection: captures baselines, measures Core Web Vitals, and compares metrics across PRs to flag regressions.

105

benchmarking, experiment-design, ai-evaluation

AEC Bench Experiment Configurator

aec-bench

Interactively build and validate experiment.yaml configurations for AI benchmarks, including task and agent selection.

4 triggers

AgentClash Eval Runner

agentclash

Start, track, and analyze agent evaluation runs using the AgentClash CLI.

evaluation, benchmarking, agentclash

skill-comparison, agent-workflow, benchmarking

ForkProbe Skill Comparator

forkprobe

Compares multiple AI skills or pipelines side-by-side on a single task to determine the most effective output.

skill-authoring, evaluation, benchmarking

Skill Creator

sage

Create, iterate on, and evaluate Agent Skills: write SKILL.md, design test cases, run benchmarks, and improve triggering and performance.

1,197

gpu, performance, benchmarking

AKO4ALL — Agentic Kernel Optimization

ako4all

Automated loop that profiles, iterates on, benchmarks, and commits GPU kernel optimizations across CUDA/Triton/TileLang/Python/C++ to achieve measurable speedups.

262

evaluation, benchmarking, nemo

NeMo Evaluator

nemo-platform

Turn a domain rubric and dataset into a reproducible evaluation using the NeMo Evaluator SDK; generate configs, run local evaluations, and explain scores and fa

mojo, modular, systems-programming

Mojo Development Environment

claude-skills

Set up and run Mojo code within Claude.ai containers, including installation, compilation, and performance benchmarking.

127

benchmarking, ai-eval, performance-testing

AI Agent Benchmarking Suite

awesome-codex-plugins

Compare AI agents (Claude Code, Codex, OpenCode) on real coding tasks using git worktrees and standardized checks.

616