Reinforcement Learning | Key Notes

Paper reading: Self-Rewarding Language Models

Overview

Language models are usually trained by first collecting a large amount of human feedback and then teaching the model to "speak" from that feedback. The drawback of relying on an external signal is clear: the model's ability is bounded by the human feedback data it is trained on. The paper therefore proposes letting the model learn by trial and error and improve itself. The idea is to make the model its own teacher: it generates responses and also scores them. The model can then use its own judgements to separate good responses from bad ones, and those scored responses are used to train the next version of the model.

The paper iterates the model as follows:

Model0: the pretrained model, with no fine-tuning.
Model1: Model0 fine-tuned with SFT on human feedback data.
Model2: Model1 generates responses and Model1 scores them; the good and bad responses are selected, and Model1 is fine-tuned on them with DPO to obtain Model2.
Model3: Model2 generates responses and Model2 scores them; the good and bad responses are selected, and Model2 is fine-tuned on them with DPO to obtain Model3.

The loop can be repeated until the model reaches the desired level of ability.

Model iteration details

Model0: the original pretrained model.
Model1: Model0 fine-tuned with SFT on human feedback data.
Model2: Model1 fine-tuned on its own self-scored data, built in three steps:

1. Generate new instructions. For the method, see Self-Instruct: Aligning Language Models with Self-Generated Instructions and Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor.
2. For each generated instruction, have Model1 generate N responses.
3. Have Model1 score each response on a scale of 0 to 5, using the following prompt:

Review the user’s question and the corresponding response using the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:
- Add 1 point if the response is relevant and provides some information related to the user’s inquiry, even if it is incomplete or contains some irrelevant content.
- Add another point if the response addresses a substantial portion of the user’s question, but does not completely resolve the query or provide a direct answer.
- Award a third point if the response answers the basic elements of the user’s question in a useful way, regardless of whether it seems to have been written by an AI Assistant or if it has elements typically found in blogs or search results.
- Grant a fourth point if the response is clearly written from an AI Assistant’s perspective, addressing the user’s question directly and comprehensively, and is well-organized and helpful, even if there is slight room for improvement in clarity, conciseness or focus.
- Bestow a fifth point for a response that is impeccably tailored to the user’s question by an AI Assistant, without extraneous information, reflecting expert knowledge, and demonstrating a high-quality, engaging, and insightful answer.
User: <INSTRUCTION_HERE>
<response><RESPONSE_HERE></response>
After examining the user’s instruction and the response:
- Briefly justify your total score, up to 100 words.
- Conclude with the score using the format: “Score: <total points>”
Remember to assess from the AI Assistant perspective, utilizing web search knowledge as necess

Translated into Chinese: ...
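The scoring step above turns into DPO training data once the judge output is parsed and a good and a bad response are picked for each instruction. Below is a minimal sketch of that step, not the paper's actual code. The helper names `parse_score` and `build_preference_pair` are hypothetical, and the selection rule (highest score as "chosen", lowest as "rejected", ties skipped) is an assumption about how "good" and "bad" are picked. The only thing taken from the prompt itself is the "Score: <total points>" output format.

```python
import re
from typing import Optional

# Matches the "Score: <total points>" line requested by the judge prompt.
# Only values starting with a digit 0-5 are accepted; "Score: 10" is rejected.
SCORE_PATTERN = re.compile(r"Score:\s*([0-5](?:\.\d+)?)(?!\d)")


def parse_score(judgement: str) -> Optional[float]:
    """Return the last 'Score: x' value in the judge output, or None if absent."""
    matches = SCORE_PATTERN.findall(judgement)
    return float(matches[-1]) if matches else None


def build_preference_pair(
    prompt: str, responses: list[str], judgements: list[str]
) -> Optional[dict]:
    """Pick the highest- and lowest-scored responses as a (chosen, rejected) pair.

    Assumption: responses whose judgement has no parseable score are dropped,
    and prompts where all scores tie produce no pair.
    """
    scored = [(parse_score(j), r) for r, j in zip(responses, judgements)]
    scored = [(s, r) for s, r in scored if s is not None]
    if len(scored) < 2:
        return None
    best = max(scored, key=lambda item: item[0])
    worst = min(scored, key=lambda item: item[0])
    if best[0] == worst[0]:
        return None  # no preference signal when every score is equal
    return {"prompt": prompt, "chosen": best[1], "rejected": worst[1]}


if __name__ == "__main__":
    # Hypothetical judge outputs for N = 3 responses to one instruction.
    responses = ["answer A", "answer B", "answer C"]
    judgements = [
        "Relevant but incomplete. Score: 2",
        "Direct and well organized. Score: 5",
        "Covers the basics. Score: 3",
    ]
    print(build_preference_pair("some instruction", responses, judgements))
```

Each resulting pair is one preference example for the next model in the loop. Taking the last "Score:" match is a deliberate choice here, since the justification text may mention intermediate scores before the final one.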