
from kernels599
Guidance and examples for writing, benchmarking, and integrating optimized Triton kernels on ROCm (MI355X, R9700) for diffusers and transformers workloads.
This skill provides in-depth patterns, kernel implementations, and benchmarking guidance for writing optimized Triton kernels targeting AMD ROCm GPUs (MI355X, R9700). It covers core kernels (RMSNorm, RoPE 3D, GEGLU, AdaLN), integration patterns for HuggingFace diffusers (LTX-Video) and transformers, and performance/benchmarking scripts.
Use when developing or optimizing GPU kernels for AMD hardware, injecting custom kernels into diffusers pipelines, running micro-benchmarks and end-to-end comparisons, or debugging ROCm/HIP-specific constraints. Ideal for engineers optimizing inference/compute-heavy models on ROCm.
scripts/ folder.
references/.
Best used by code-aware agents and developer-facing assistants that can run benchmark scripts and edit code; suitable for agents with access to compute/testing environments.
This skill provides guidance for writing and benchmarking optimized Triton kernels for AMD ROCm GPUs (MI355X, R9700), targeting HuggingFace diffusers and transformers. It includes four Python scripts and a requirements file. All four scripts failed to import because the torch, triton, and ROCm dependencies were missing, which is expected since the skill requires specialized AMD GPU hardware. The SKILL.md is exceptionally thorough, with detailed kernel implementations, performance benchmarks, and ROCm-specific pitfalls. No security concerns were found.
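Because the scripts cannot be imported without the GPU stack, a quick pre-flight check can report which modules are missing before anything is run. The following is a minimal sketch, not part of the skill itself: the `missing_modules` helper and the `REQUIRED` list are hypothetical names, and the import names are assumed from the skill's dependency list (pip packages `huggingface-hub` and `imageio-ffmpeg` import as `huggingface_hub` and `imageio_ffmpeg`).

```python
import importlib.util

# Assumed import names for the skill's dependencies (hypothetical list;
# adjust to match the actual requirements file).
REQUIRED = [
    "torch", "triton", "diffusers", "transformers", "accelerate",
    "safetensors", "huggingface_hub", "imageio", "imageio_ffmpeg", "kernels",
]

def missing_modules(names):
    """Return the top-level module names that cannot be found, in input order."""
    # find_spec returns None for a top-level module that is not installed,
    # without actually importing it.
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

The check only confirms that the packages are installed; it does not verify that a ROCm-capable GPU is present.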
torch, triton, diffusers, transformers, accelerate, safetensors, huggingface-hub, imageio, imageio-ffmpeg, kernels
Very high-quality niche skill. Clean security profile. Code is well-written with proper closure patterns to avoid Python late-binding bugs. The skill is narrowly useful: only developers working with AMD ROCm Triton kernels benefit. Architecture is solid with good use of scripts/ and references/ directories.
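The late-binding issue mentioned above can be illustrated with a small, GPU-free sketch. It is not taken from the skill's scripts; the function names are hypothetical and the string-returning "forward" functions stand in for patched module methods. A closure created in a loop looks up the loop variable when it is called, not when it is defined, so every closure sees the final value unless the value is bound through a factory function.

```python
def make_forwards_buggy(names):
    """Each lambda reads `name` at call time, so all see the last value."""
    forwards = []
    for name in names:
        forwards.append(lambda x: f"{name}:{x}")
    return forwards

def make_forwards_fixed(names):
    """A factory function binds the current `name` for each closure."""
    def make_forward(bound_name):
        def forward(x):
            return f"{bound_name}:{x}"
        return forward

    return [make_forward(name) for name in names]

layers = ["norm1", "norm2"]  # hypothetical module names
print([f(0) for f in make_forwards_buggy(layers)])  # both entries use "norm2"
print([f(0) for f in make_forwards_fixed(layers)])  # one entry per layer name
```

The same pattern applies when replacing the `forward` method of several modules in a loop: binding the module through a factory (or a default argument) keeps each patched function tied to its own module.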