
Guidance and examples for writing, benchmarking, and integrating optimized Triton kernels on ROCm (MI355X, R9700) for diffusers and transformers workloads.
This skill provides in-depth patterns, kernel implementations, and benchmarking guidance for writing optimized Triton kernels targeting AMD ROCm GPUs (MI355X, R9700). It covers core kernels (RMSNorm, RoPE 3D, GEGLU, AdaLN), integration patterns for HuggingFace diffusers (LTX-Video) and transformers, and performance/benchmarking scripts.
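As an illustration of what one of the listed kernels computes, here is a minimal pure-Python reference for RMSNorm, intended only as a correctness oracle to compare an optimized kernel's output against. It is a sketch, not the skill's implementation: the function names `rmsnorm_reference` and `max_abs_error` and the default `eps` are assumptions made for this example.

```python
import math


def rmsnorm_reference(x, weight, eps=1e-6):
    """Reference RMSNorm over one row (hypothetical helper, not from the skill).

    Computes y_i = x_i / sqrt(mean(x^2) + eps) * weight_i.
    """
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    # Mean of squares across the normalized (last) dimension.
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def max_abs_error(a, b):
    """Largest element-wise absolute difference between two rows."""
    return max(abs(p - q) for p, q in zip(a, b))
```

A kernel's output row, copied back to the host, could be checked with `max_abs_error(kernel_row, rmsnorm_reference(x, w))` against a tolerance suited to the dtype in use.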
Use when developing or optimizing GPU kernels for AMD hardware, injecting custom kernels into diffusers pipelines, running micro-benchmarks and end-to-end comparisons, or debugging ROCm/HIP-specific constraints. Ideal for engineers optimizing inference/compute-heavy models on ROCm.
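The micro-benchmarking use case can be sketched with a small timing harness. This is a generic illustration under stated assumptions, not one of the skill's scripts: `benchmark` and its `warmup`, `iters`, and `sync` parameters are hypothetical names. GPU launches are asynchronous, so a synchronization callable should be passed when timing kernels; on ROCm builds of PyTorch this would typically be `torch.cuda.synchronize`.

```python
import statistics
import time


def benchmark(fn, *, warmup=10, iters=50, sync=None):
    """Time fn() and return summary statistics in milliseconds.

    Hypothetical helper: `sync` is an optional callable that blocks until
    queued device work has finished, so the timer measures execution
    rather than just the launch.
    """

    def run_once():
        fn()
        if sync is not None:
            sync()

    # Warm-up runs absorb one-time costs such as JIT compilation and autotuning.
    for _ in range(warmup):
        run_once()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1e3)

    return {
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "iters": iters,
    }
```

Reporting the median rather than the mean makes the result less sensitive to occasional slow iterations; comparing a baseline and a custom kernel with the same `warmup` and `iters` keeps the comparison fair.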
scripts/ folder.
references/.
Best used by code-aware agents and developer-facing assistants that can run benchmark scripts and edit code; suitable for agents with access to compute/testing environments.