
from ako4all262
Automated loop that profiles, iterates, benchmarks and commits GPU kernel optimizations across CUDA/Triton/TileLang/Python/C++ to achieve measurable speedups.
AKO4ALL runs an agentic optimize→benchmark→log→commit workflow for GPU kernels. Given a kernel (CUDA, Triton, TileLang, C++, Python), it bootstraps a workspace, profiles the code, runs iterative micro-optimizations, verifies correctness, and records each iteration. The goal is measurable runtime improvement over a provided reference while preserving correctness and reproducible commits.
Use AKO4ALL when you have an existing GPU kernel you want to speed up and measure: 'optimize this kernel', 'run AKO on my CUDA kernel', or when you need repeatable iteration with profiling (ncu) and git-backed experiment logging. It is not for writing a kernel from scratch or for generic non-kernel performance advice.
Best used by agents with code execution and shell capabilities (Copilot/Codex-style or CLI-capable agents that can run Python, shell, and profiling tools). The skill expects the environment to run benchmarks, invoke ncu, and commit to git.
This skill has not been reviewed by our automated audit pipeline yet.