
from argentos-core114
Comprehensive guide for fine-tuning LLMs using TRL, covering SFT, DPO, PPO, and GRPO for human preference alignment.
This skill provides a framework for post-training language models to align them with human preferences. It enables the agent to implement ML pipelines that combine Supervised Fine-Tuning (SFT) for instruction following with preference-optimization methods such as DPO, PPO, and GRPO.
Use this skill when you need to perform RLHF (Reinforcement Learning from Human Feedback), align a model with a preference dataset (chosen vs rejected pairs), or optimize a model using reward functions.
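To make the two inputs mentioned above concrete, here is a minimal sketch of a chosen-vs-rejected preference record and a simple reward function. The record contents, the helper `is_valid_preference`, and the `length_reward` function with its `target_len` parameter are hypothetical names introduced for illustration. The reward function's shape (a list of completions in, a list of floats out, extra keyword arguments ignored) follows the pattern shown in TRL's GRPO documentation, but verify it against the TRL version you have installed. The sketch also assumes plain-string completions rather than chat-formatted ones.

```python
# Hypothetical preference record in the prompt / chosen / rejected layout.
preference_example = {
    "prompt": "Explain overfitting in one sentence.",
    "chosen": "Overfitting is when a model memorizes its training data and fails to generalize.",
    "rejected": "Overfitting is good.",
}


def is_valid_preference(record):
    """Return True if the record has non-empty prompt/chosen/rejected strings
    and the chosen and rejected answers actually differ."""
    keys = ("prompt", "chosen", "rejected")
    fields_ok = all(
        isinstance(record.get(key), str) and bool(record[key].strip())
        for key in keys
    )
    return fields_ok and record["chosen"] != record["rejected"]


def length_reward(completions, target_len=50, **kwargs):
    """Illustrative reward: 0.0 when a completion is exactly target_len
    characters long, decreasing linearly as its length moves away from that.

    Assumes each completion is a plain string (not a list of chat messages).
    """
    return [-abs(target_len - len(text)) / target_len for text in completions]


if __name__ == "__main__":
    print(is_valid_preference(preference_example))
    print(length_reward(["short answer", "x" * 50]))
```

A length-based reward is only a stand-in for demonstration; a real alignment run would typically score correctness, format, or a learned reward model's output instead.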
basic_grpo_training.py.
Designed for agents with Python execution capabilities and access to NVIDIA GPUs (CUDA), specifically those integrating with Hugging Face Transformers and the TRL library.
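Before starting a training run, an agent may want to confirm that the environment described above is actually present. The following is a minimal preflight sketch; the function name `check_environment` and the returned dictionary layout are hypothetical, and the default package list simply assumes the import names `torch`, `transformers`, and `trl`.

```python
import importlib.util


def check_environment(required=("torch", "transformers", "trl")):
    """Report which required packages are missing and whether CUDA is usable.

    Returns a dict with:
      - "missing": list of package names that could not be found
      - "cuda_available": True only if torch imports and reports a CUDA device
    """
    missing = [
        name for name in required if importlib.util.find_spec(name) is None
    ]

    cuda_available = False
    if importlib.util.find_spec("torch") is not None:
        try:
            import torch

            cuda_available = bool(torch.cuda.is_available())
        except Exception:
            # A broken torch install is treated the same as "no GPU".
            cuda_available = False

    return {"missing": missing, "cuda_available": cuda_available}


if __name__ == "__main__":
    report = check_environment()
    print("Missing packages:", report["missing"] or "none")
    print("CUDA available:", report["cuda_available"])
```

Using `importlib.util.find_spec` avoids importing heavy libraries just to test for their presence; only `torch` is imported, and only when it is installed.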
This skill has not been reviewed by our automated audit pipeline yet.