The GRPO Fine-Tuning Algorithm
Basic Concepts
GRPO stands for Group Relative Policy Optimization, a method that optimizes model behavior using scalar reward signals assigned to whole generated sequences.
Unlike DPO, which learns from pairs of preferred and rejected responses, the core idea of GRPO is to assign each generated sequence a single reward value that guides the direction of optimization. The reward can be produced by a reward model or by human scoring; in the full algorithm, several responses are sampled per prompt and their rewards are normalized within that group (hence "group relative"). This reduces, to some extent, the influence of noise in preference data and makes GRPO better suited to complex tasks such as long-text generation and multi-turn dialogue.
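As a minimal, hypothetical sketch of this idea (the rule-based scorer below is invented for illustration; a real setup would use a reward model or human raters), every sampled response to a prompt receives exactly one scalar reward:

```python
# Hypothetical rule-based scorer: the function and its thresholds are invented
# for illustration; in practice the scores come from a reward model or human raters.
def toy_reward(response: str) -> float:
    reward = 0.0
    if "2" in response:             # the expected answer appears
        reward += 5.0
    reward -= 0.01 * len(response)  # mild length penalty
    return reward

prompt = "1 + 1等于几?"
responses = ["1 + 1 = 2", "1 + 1 = 3", "答案是2"]
rewards = [toy_reward(r) for r in responses]
print(list(zip(responses, rewards)))  # one scalar reward per sampled response
```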
A GRPO training example is a triple (prompt (the question), response (the answer), reward (the score / global reward value)). The core loss function is
$$
\mathcal{L}_{\text{GRPO}} = -\mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ R(y|x) - \beta \cdot \text{KL}\left(\pi_\theta(y|x) \parallel \pi_{\text{ref}}(y|x)\right) \right]
$$
where:
$R(y|x)$ is the global reward function, i.e. the scalar reward of response $y$ given prompt $x$
$\beta$ controls the strength of the KL penalty
The GRPO loss combines reward maximization with a KL-divergence constraint, which prevents the trained policy from drifting too far from the reference model.
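To make the two terms concrete, here is a minimal numerical sketch for a single (prompt, response, reward) triple. All tensor values are made up, and the reward term is implemented as a REINFORCE-style surrogate (reward times the log-probability of the sampled response), which is the standard way to obtain a gradient for $\mathbb{E}[R(y|x)]$; the formula above does not spell this out.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len = 8, 4

policy_logits = torch.randn(seq_len, vocab_size, requires_grad=True)  # toy policy outputs
ref_logits = torch.randn(seq_len, vocab_size)                         # toy reference outputs
response_ids = torch.tensor([1, 3, 5, 2])                             # toy response tokens
reward = torch.tensor(5.0)                                            # scalar reward for the response
beta = 0.1

# Reward term: reward-weighted log-probability of the sampled response (REINFORCE surrogate).
log_probs = F.log_softmax(policy_logits, dim=-1)
response_log_prob = log_probs[torch.arange(seq_len), response_ids].sum()
reward_term = reward * response_log_prob

# KL term: KL(pi_theta || pi_ref), averaged over positions.
policy_probs = F.softmax(policy_logits, dim=-1)
ref_log_probs = F.log_softmax(ref_logits, dim=-1)
kl_term = F.kl_div(ref_log_probs, policy_probs, reduction="batchmean")

loss = -(reward_term - beta * kl_term)
loss.backward()  # gradients flow through both terms
print(loss.item(), policy_logits.grad.shape)
```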
Advantages of GRPO
Reduces the impact of noise in preference data
Better suited to complex tasks such as long-text generation and multi-turn dialogue
A GRPO fine-tuning example based on gpt2
The script below is a minimal, single-example demonstration: the reward term is the reward-weighted log-likelihood of the response under the policy, and the KL term is computed against a frozen reference copy of gpt2. It omits the group sampling and reward normalization used by the full GRPO algorithm.
```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer


def grpo_loss(policy_logits, ref_logits, input_ids, prompt_lens, rewards, beta=0.1):
    """Simplified GRPO-style loss: reward-weighted response log-likelihood plus a KL penalty."""
    # Shift so that logits at position t predict token t + 1.
    logits = policy_logits[:, :-1, :]
    ref = ref_logits[:, :-1, :]
    targets = input_ids[:, 1:]

    # Log-probability the policy assigns to each target token.
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, T-1)

    # Only response tokens are rewarded; mask out the prompt part.
    mask = torch.zeros_like(token_log_probs)
    for i, plen in enumerate(prompt_lens):
        mask[i, plen - 1:] = 1.0
    response_log_prob = (token_log_probs * mask).sum(dim=-1)  # (B,)

    # Reward term: REINFORCE-style surrogate, so the reward actually influences the gradient.
    reward_term = (rewards * response_log_prob).mean()

    # KL(pi_theta || pi_ref) over all positions, keeping the policy close to the reference model.
    policy_probs = F.softmax(logits, dim=-1)
    ref_log_probs = F.log_softmax(ref, dim=-1)
    kl_div = F.kl_div(ref_log_probs, policy_probs, reduction="batchmean", log_target=False)

    return -reward_term + beta * kl_div


# Policy model, frozen reference model, and optimizer.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
policy_model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name).eval()
optimizer = torch.optim.AdamW(policy_model.parameters(), lr=5e-5)

# A single (prompt, response, reward) triple as toy training data.
example = {
    "prompt": ["1 + 1等于几?"],
    "response": ["1 + 1 = 2"],
    "reward": [5.0],
}
dataset = Dataset.from_dict(example)


def collate_fn(batch):
    return {
        "prompt": [x["prompt"] for x in batch],
        "response": [x["response"] for x in batch],
        "reward": [x["reward"] for x in batch],
    }


data_loader = DataLoader(dataset, batch_size=1, collate_fn=collate_fn)

# One pass over the toy data.
for batch in data_loader:
    # Score prompt + response together; the prompt length tells us where the response starts
    # (approximation: assumes the prompt's tokens are a prefix of the joint tokenization).
    full_texts = [p + r for p, r in zip(batch["prompt"], batch["response"])]
    inputs = tokenizer(full_texts, return_tensors="pt", padding=True)
    prompt_lens = [len(tokenizer(p)["input_ids"]) for p in batch["prompt"]]

    policy_outputs = policy_model(**inputs)
    with torch.no_grad():  # the reference model is frozen
        ref_outputs = ref_model(**inputs)

    rewards = torch.tensor(batch["reward"])
    loss = grpo_loss(policy_outputs.logits, ref_outputs.logits,
                     inputs["input_ids"], prompt_lens, rewards, beta=0.1)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


def generate_response(model, tokenizer, prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=100)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


print(generate_response(policy_model, tokenizer, "1 + 1等于几?"))
policy_model.save_pretrained("grpo_finetuned_model")
tokenizer.save_pretrained("grpo_finetuned_model")
```
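Once saved, the fine-tuned model can be reloaded like any other Hugging Face checkpoint (the directory name grpo_finetuned_model comes from the script above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the checkpoint written by save_pretrained() above.
model = AutoModelForCausalLM.from_pretrained("grpo_finetuned_model")
tokenizer = AutoTokenizer.from_pretrained("grpo_finetuned_model")

inputs = tokenizer("1 + 1等于几?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```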