ROCm Triton Kernels (RMSNorm, RoPE 3D, GEGLU, AdaLN)

Name: ROCm Triton Kernels (RMSNorm, RoPE 3D, GEGLU, AdaLN)
Rating: 90 (1 review)
Author: huggingface

Trust Score 90/100

Guidance and examples for writing, benchmarking, and integrating optimized Triton kernels on ROCm (MI355X, R9700) for diffusers and transformers workloads.

triggers: rocm, triton, rmsnorm, rope 3d, kernel benchmark, diffusers integration, gpu optimization


What it does

This skill provides in-depth patterns, kernel implementations, and benchmarking guidance for writing optimized Triton kernels targeting AMD ROCm GPUs (MI355X, R9700). It covers core kernels (RMSNorm, RoPE 3D, GEGLU, AdaLN), integration patterns for HuggingFace diffusers (LTX-Video) and transformers, and performance/benchmarking scripts.
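As a point of reference for two of the kernels named above, here is a minimal pure-Python sketch of what RMSNorm and GEGLU compute on a single row. It is not the skill's Triton code; it is the kind of slow reference a kernel's output could be checked against. The function names are hypothetical, and the GEGLU split order (first half as gate, second half as value) is an assumption, since implementations differ on this.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm for one row: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def gelu(v):
    """Exact (erf-based) GELU."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def geglu(x):
    """Reference GEGLU for one row of even length.

    Assumption: the first half of x is the gate and the second half is the
    value. Some implementations use the opposite order.
    """
    half = len(x) // 2
    gate, value = x[:half], x[half:]
    return [gelu(g) * v for g, v in zip(gate, value)]
```

A GPU kernel would typically be compared against such a reference with a tolerance rather than exact equality, because reduced-precision dtypes and fused arithmetic change the rounding.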

When to use it

Use when developing or optimizing GPU kernels for AMD hardware, injecting custom kernels into diffusers pipelines, running micro-benchmarks and end-to-end comparisons, or debugging ROCm/HIP-specific constraints. Ideal for engineers optimizing inference/compute-heavy models on ROCm.
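The micro-benchmarking mentioned above usually follows a warmup-then-measure pattern. The sketch below shows that pattern using only the standard library; it is not the skill's benchmark_kernels.py, and the `benchmark` helper and its `sync` parameter are hypothetical names introduced for illustration.

```python
import statistics
import time


def benchmark(fn, warmup=3, iters=10, sync=None):
    """Return the median wall-clock time of fn() in milliseconds.

    sync is an optional callable invoked before the clock is read, so that
    asynchronously launched GPU work is finished before it is timed.
    """
    for _ in range(warmup):
        fn()  # warmup runs absorb one-time costs such as JIT compilation
    if sync is not None:
        sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)
```

On a GPU, kernel launches return before the work completes, so timing without a synchronization call mostly measures launch overhead. With a ROCm build of PyTorch, passing `torch.cuda.synchronize` as `sync` would be one way to handle this, assuming PyTorch is installed. The median is used here because it is less sensitive to occasional slow iterations than the mean.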

What's included

Scripts: benchmark_kernels.py, benchmark_e2e.py, ltx_kernel_injection_example.py, and other scripts in the scripts/ folder.
References: hardware guides (MI355X, R9700), integration guides, and troubleshooting docs in references/.
Instructions: concrete kernel code samples (triton.jit implementations), mandatory env vars, integration patterns, performance pitfalls, and profiling commands.

Compatible agents

Best used by code-aware agents and developer-facing assistants that can run benchmark scripts and edit code; suitable for agents with access to compute/testing environments.

Audit Summary

This skill provides guidance for writing and benchmarking optimized Triton kernels for AMD ROCm GPUs (MI355X, R9700), targeting HuggingFace diffusers and transformers. It includes 4 Python scripts and a requirements file — all scripts failed to import due to missing torch/triton/ROCm dependencies (expected since this requires specialized AMD GPU hardware). The SKILL.md is exceptionally thorough with detailed kernel implementations, performance benchmarks, and ROCm-specific pitfalls. No security concerns whatsoever.

Watch Out

Requires AMD ROCm GPU hardware (MI355X or R9700) — cannot run on standard CUDA or CPU
Heavy dependency chain: torch+triton+diffusers+transformers+accelerate+kernels
Scripts are benchmark/demo-only, not installable utilities

Missing Dependencies

torch, triton, diffusers, transformers, accelerate, safetensors, huggingface-hub, imageio, imageio-ffmpeg, kernels

Notes

Very high-quality niche skill. Clean security profile. Code is well-written with proper closure patterns to avoid Python late-binding bugs. The skill is narrowly useful — only developers working with AMD ROCm Triton kernels benefit. Architecture is solid with good use of scripts/ and references/ directories.
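The late-binding issue mentioned in these notes can be illustrated with a short, self-contained sketch. It assumes a scenario where replacement forward functions are built in a loop, one per module name; the function names and the string-returning "forward" are hypothetical and stand in for real kernel calls.

```python
def patch_buggy(names):
    """Build one replacement function per name, with the late-binding bug."""
    patched = {}
    for name in names:
        # BUG: the lambda looks up `name` when it is called, not when it is
        # created, so every function sees the last value of the loop variable.
        patched[name] = lambda x: f"{name}({x})"
    return patched


def patch_fixed(names):
    """Build one replacement function per name using a factory closure."""

    def make_forward(name):
        # `name` is a parameter of this call, so each closure gets its own copy.
        def forward(x):
            return f"{name}({x})"

        return forward

    return {name: make_forward(name) for name in names}


buggy = patch_buggy(["attn", "mlp"])
fixed = patch_fixed(["attn", "mlp"])
print(buggy["attn"]("x"))  # mlp(x)
print(fixed["attn"]("x"))  # attn(x)
```

The factory function gives each closure its own scope. Binding the value through a default argument (`lambda x, name=name: ...`) is a common alternative with the same effect.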

Information

Repository: kernels
Stars: 599

Trust Score

Overall: 90

Security: 100

Code Quality: 82

Architecture: 78

Usefulness: 28

