LLMs之RLHF:《LLM對(duì)齊技術(shù)的全面綜述:RLHF、RLAIF、PPO、DPO等—A Comprehensive Survey of LLM Alignment Techniques: RLHF
LLMs之RLHF:《LLM對(duì)齊技術(shù)的全面綜述:RLHF、RLAIF、PPO、DPO等—A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More》翻譯與解讀
"A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More": Translation and Interpretation
Paper: https://www.arxiv.org/abs/2407.16216
Date: July 23, 2024
Authors: Zhichao Wang*, Bin Bi*, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Zixu (James) Zhu, Xiang-Bo Mao, Sitaram Asur, Na (Claire) Cheng
Affiliation: Salesforce
Summary
Background and pain points: Despite progress in self-supervised learning and instruction fine-tuning, large language models (LLMs) can still produce untruthful, toxic, or unhelpful responses because their training data is of uneven quality, leaving them misaligned with human intent. Existing evaluation metrics such as BLEU, ROUGE, and BERTScore do not capture human preferences over LLM outputs well. LLMs therefore need to be aligned with human values to avoid generating inappropriate content.
Proposed solutions: Reinforcement Learning from Human Feedback (RLHF) adjusts the model with human feedback so that its outputs better match human expectations; it collects a human preference dataset (triples of a prompt, a desired response, and an undesired response) and trains a reward model and an RL policy on it. Reinforcement Learning from AI Feedback (RLAIF) uses AI-generated feedback to reduce the cost of collecting human feedback.
Core ideas and steps
>> Reward model: score generated responses with an explicit or implicit reward model; rewards can be assigned at the response level or the token level. Using the Bradley-Terry model, a pointwise reward function rφ(x, y) is trained on human preference data so that, given a prompt x and a response y, it reflects how strongly humans prefer that response (a minimal training sketch follows this list).
>> Feedback: collect preference feedback or binary feedback, in pairwise or listwise form, provided either by humans or by AI.
>> RL policy: reference-model-based RL with control over output length; different divergence measures, such as KL divergence; online or offline policy optimization. The LLM acts as the agent and the reward model as the environment, maximizing reward while minimizing KL divergence from the reference model and avoiding the "alignment tax" (degraded performance on downstream tasks); the objective is written out after this list.
The survey explores different reward models (explicit/implicit, pointwise/preference-based, etc.), feedback types (preference/binary, pairwise/listwise, etc.), RL objectives (reference-based/reference-free, etc.), and optimization schemes (online/offline, etc.).
>> Optimization: iterative/online preference optimization vs. non-iterative/offline preference optimization; keeping instruction fine-tuning and alignment separate vs. merging them.
優(yōu)勢(shì):直接將人類偏好納入模型微調(diào),提高了LLM與人類意圖的一致性。InstructGPT等RLHF模型在真實(shí)性、無(wú)害性等方面優(yōu)于GPT-3等基線模型。探索了多種方法擴(kuò)展RLHF框架,為進(jìn)一步對(duì)齊研究奠定了基礎(chǔ)。
>> 成本效益:RLAIF減少了對(duì)昂貴人類反饋的依賴。
>> 靈活性:多種反饋和獎(jiǎng)勵(lì)模型選擇,適應(yīng)不同的應(yīng)用場(chǎng)景。
>> 提高模型安全性和可靠性:通過(guò)對(duì)齊過(guò)程減少生成不當(dāng)內(nèi)容的風(fēng)險(xiǎn)。
總的來(lái)說(shuō),該綜述系統(tǒng)梳理了近兩年來(lái)LLM對(duì)齊技術(shù)的主要進(jìn)展,概括了面臨的挑戰(zhàn)、提出的解決方案及其優(yōu)缺點(diǎn),為該領(lǐng)域的后續(xù)研究提供了全面的概覽。
Abstract
With advancements in self-supervised learning, the availability of trillions of tokens in a pre-training corpus, instruction fine-tuning, and the development of large Transformers with billions of parameters, large language models (LLMs) are now capable of generating factual and coherent responses to human queries. However, the mixed quality of training data can lead to the generation of undesired responses, presenting a significant challenge. Over the past two years, various methods have been proposed from different perspectives to enhance LLMs, particularly in aligning them with human expectations. Despite these efforts, there has not been a comprehensive survey paper that categorizes and details these approaches. In this work, we aim to address this gap by categorizing these papers into distinct topics and providing detailed explanations of each alignment method, thereby helping readers gain a thorough understanding of the current state of the field.
隨著自監(jiān)督學(xué)習(xí)的進(jìn)展、預(yù)訓(xùn)練語(yǔ)料庫(kù)中數(shù)萬(wàn)億個(gè)標(biāo)記的可用性、指令微調(diào)的應(yīng)用以及具有數(shù)十億參數(shù)的大型 Transformer 的發(fā)展,大型語(yǔ)言模型(LLMs)現(xiàn)在能夠生成真實(shí)且連貫的對(duì)人類查詢的回應(yīng)。然而,訓(xùn)練數(shù)據(jù)的質(zhì)量參差不齊可能導(dǎo)致生成不符合預(yù)期的響應(yīng),這成為一個(gè)重大挑戰(zhàn)。在過(guò)去兩年中,提出了各種方法,從不同的角度來(lái)提升 LLM,特別是使其更好地與人類期望對(duì)齊。盡管有這些努力,但尚未出現(xiàn)一篇全面的綜述論文,對(duì)這些方法進(jìn)行分類和詳細(xì)說(shuō)明。本文旨在填補(bǔ)這一空白,通過(guò)將這些論文分類為不同的主題,并提供每種對(duì)齊方法的詳細(xì)解釋,幫助讀者深入了解當(dāng)前領(lǐng)域的現(xiàn)狀。
1 Introduction
Over the past decades, the pretraining of LLMs through self-supervised learning [1] has seen significant advancements. These improvements have been driven by the development of larger decoder-only Transformers, the utilization of trillions of tokens, and the parallelization of computations across multiple GPUs. Following the pretraining phase, instruction tuning was employed to guide LLMs in responding to human queries. Despite these advancements, a critical issue remains unresolved: LLMs can generate undesired responses, such as providing instructions on how to commit illegal activities. To mitigate this risk, it is essential to align LLMs with human values.
Reinforcement Learning from Human Feedback (RLHF) [2, 3] has emerged as a groundbreaking technique for aligning LLMs. This approach has led to the development of powerful models such as GPT-4 [4], Claude [5], and Gemini [6]. Following the introduction of RLHF, numerous studies have explored various approaches to further align LLMs. However, there has not yet been a comprehensive review of methods for aligning LLMs with human preferences. This paper aims to fill that gap by categorically reviewing existing literature and providing detailed analyses of individual papers.
在過(guò)去幾十年里,通過(guò)自監(jiān)督學(xué)習(xí)進(jìn)行的 LLM 預(yù)訓(xùn)練 [1] 取得了顯著進(jìn)展。這些進(jìn)展得益于更大規(guī)模的僅解碼 Transformer 的發(fā)展、數(shù)萬(wàn)億標(biāo)記的利用以及多 GPU 的并行計(jì)算。在預(yù)訓(xùn)練階段之后,采用指令微調(diào)來(lái)指導(dǎo) LLM 響應(yīng)人類查詢。盡管取得了這些進(jìn)展,但一個(gè)關(guān)鍵問(wèn)題仍未解決:LLM 可能生成不符合期望的響應(yīng),例如提供如何進(jìn)行非法活動(dòng)的指示。為了減輕這一風(fēng)險(xiǎn),有必要使 LLM 與人類價(jià)值觀對(duì)齊。
基于人類反饋中進(jìn)行強(qiáng)化學(xué)習(xí)(RLHF)[2, 3] 已成為對(duì)齊 LLM 的一種突破性技術(shù)。這種方法促成了如 GPT-4 [4]、Claude [5] 和 Gemini [6] 等強(qiáng)大模型的發(fā)展。在 RLHF 介紹之后,眾多研究探索了各種進(jìn)一步對(duì)齊 LLM 的方法。然而,尚未對(duì)對(duì)齊 LLM 的方法進(jìn)行全面的綜述。本文旨在通過(guò)分類回顧現(xiàn)有文獻(xiàn)并對(duì)個(gè)別論文進(jìn)行詳細(xì)分析來(lái)填補(bǔ)這一空白。
In this paper, we have structured our review into four main topics: 1. Reward Model; 2. Feedback; 3. Reinforcement Learning (RL); and 4. Optimization. Each topic was further divided into subtopics as shown in Figure 1. For the Reward Model, the subtopics were: 1. Explicit Reward Model vs. Implicit Reward Model; 2. Pointwise Reward Model vs. Preference Model; 3. Response-Level Reward vs. Token-Level Reward; and 4. Negative Preference Optimization. Regarding Feedback, the subtopics included: 1. Preference Feedback vs. Binary Feedback; 2. Pairwise Feedback vs. Listwise Feedback; and 3. Human Feedback vs. AI Feedback. In the RL section, the subtopics were: 1. Reference-Based RL vs. Reference-Free RL; 2. Length-Control RL; 3. Different Divergences in RL; and 4. On-Policy RL vs. Off-Policy RL. For Optimization, the subtopics were: 1. Online/Iterative Preference Optimization vs. Offline/Non-iterative Preference Optimization; and 2. Separating SFT and Alignment vs. Merging SFT and Alignment. Table 1 provided an analysis of all the papers reviewed in detail using these 13 evaluation metrics.
Figure 1: The 13 categorical directions for xPO to align an LLM with human preference
4 Future Directions
Based on the analysis of the reviewed papers, several research problems have been identified for further exploration.
在對(duì)文獻(xiàn)分析的基礎(chǔ)上,提出了若干有待進(jìn)一步探討的研究問(wèn)題。
4.1 General Tasks for Alignment Evaluation
When reviewing various papers, different tasks were used to evaluate the performance of these methods. However, some tasks, like GSM8K [65], which focus more on reasoning, might not be suitable for assessing alignment performance. In contrast, tasks like TruthfulQA [45] or those addressing toxicity should be prioritized when evaluating the truthfulness and harmlessness of fine-tuned LLMs. There should be an effort to combine these tasks and create a unified leaderboard for alignment evaluation.
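A small sketch of the kind of unified leaderboard suggested here, assuming per-task scores (for example, TruthfulQA accuracy and a non-toxicity rate) have already been computed by an external evaluation harness; the task names and weights are illustrative only, not a proposed standard.

def alignment_leaderboard(results, weights=None):
    """Rank models by a weighted average of per-task alignment scores.

    results: {model_name: {task_name: score_in_[0, 1]}}, higher = better aligned.
    weights: optional {task_name: weight}; unspecified tasks default to 1.0.
    """
    weights = weights or {}
    board = []
    for model, scores in results.items():
        total_w = sum(weights.get(t, 1.0) for t in scores)
        overall = sum(weights.get(t, 1.0) * s for t, s in scores.items()) / total_w
        board.append((model, round(overall, 4)))
    return sorted(board, key=lambda item: item[1], reverse=True)

# Example with made-up numbers:
# alignment_leaderboard({"model_a": {"truthfulqa_acc": 0.61, "non_toxicity": 0.93}})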
4.2 Apply Implicit Reward Models, Listwise Preference and Nash Learning to Larger Scale LMs
Currently, implicit reward model methods have been applied only to models with up to 70B parameters. Extending these methods to even larger models, such as those the size of GPT-4 and Claude-3, can provide insights into their effectiveness compared to RLHF/PPO. Similarly, the listwise preference model warrants further investigation. In RLHF, preference datasets were collected using listwise preference but were subsequently transformed into multiple pairs of pairwise preferences. The potential issues associated with applying listwise preference models at larger scales remain to be addressed. Lastly, Nash learning can address the inconsistency among human labelers. Incorporating a Nash learning model into larger-scale LLMs can demonstrate its ability to capture the complexity of human nature.
目前,隱式獎(jiǎng)勵(lì)模型方法僅應(yīng)用于最多 70B 參數(shù)的模型。將這些方法擴(kuò)展到更大的模型,如 GPT-4 和 Claude-3,可以提供關(guān)于其相較于 RLHF/PPO 的有效性的見(jiàn)解。類似地,列表偏好模型也值得進(jìn)一步研究。在 RLHF 中,使用列表偏好收集了偏好數(shù)據(jù)集,但隨后轉(zhuǎn)化為多個(gè)成對(duì)的偏好。應(yīng)用列表偏好模型于更大規(guī)模時(shí)潛在的問(wèn)題仍待解決。最后,Nash 學(xué)習(xí)可以解決人類標(biāo)注者之間的不一致性。將 Nash 學(xué)習(xí)模型納入更大規(guī)模的 LLM 可以展示其捕捉人類復(fù)雜性的能力。
4.3 Experiments on Binary Feedbacks
Both KTO and DRO utilized binary feedback mechanisms, such as "thumbs up" and "thumbs down", instead of pairwise preferences. These binary feedbacks were derived from preference datasets, where desired responses were marked as positive and undesired responses as negative. Further research is needed on realistic binary datasets. Additionally, binary datasets are easier to collect compared to pairwise preference data, making it feasible to use larger-scale binary feedback datasets for alignment. However, the noise in binary feedback may be more pronounced than in preference datasets, raising the intriguing question of how to effectively filter out noisy data.
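A sketch of the conversion described above: turning a pairwise preference dataset into KTO/DRO-style binary "thumbs up / thumbs down" examples. The field names are assumptions about a typical preference-dataset layout, not a fixed schema.

def pairwise_to_binary(preference_data):
    """Split each (prompt, chosen, rejected) triple into two binary examples."""
    binary_data = []
    for ex in preference_data:
        binary_data.append({"prompt": ex["prompt"], "response": ex["chosen"], "label": 1})   # thumbs up
        binary_data.append({"prompt": ex["prompt"], "response": ex["rejected"], "label": 0})  # thumbs down
    return binary_data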
4.4 Experiments on Helpful AI Feedback
Current AI feedback primarily includes harmless feedback in RLAIF and feedback ranking in iterative DPO. However, in RLAIF, helpful feedback is still provided by human labelers. This approach is reasonable, as generating helpful responses is significantly more challenging than identifying harmful ones. An intriguing future direction involves using LLMs to generate helpful feedback, thereby enabling LLMs to self-improve.
當(dāng)前的 AI 反饋主要包括 RLAIF 中的無(wú)害反饋和迭代 DPO 中的反饋排序。然而,在 RLAIF 中,有益的反饋仍由人類標(biāo)注者提供。這種方法是合理的,因?yàn)樯捎幸娴捻憫?yīng)遠(yuǎn)比識(shí)別有害的響應(yīng)要困難。一個(gè)有趣的未來(lái)方向是利用 LLM 生成有益的反饋,從而使 LLM 實(shí)現(xiàn)自我提升。
4.5 Speeding up Nash Learning
The proposed Nash learning method effectively modeled pairwise preferences and addressed inconsistencies arising from human labeling. However, it necessitated multiple iterations to converge to the optimal policy. Although the authors did not specify the time required for alignment, it was presumed to be significantly slower compared to implicit reward models such as DPO. This area warrants further research attention to speed up the Nash learning process.
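For context, the Nash-learning setup referred to here can be written as a two-player preference game (a common formulation; regularization terms vary across papers): each player samples a response, a preference model scores the pair, and the aligned policy is the Nash equilibrium

\pi^{*} = \arg\max_{\pi}\, \min_{\pi'}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)} \big[ \mathcal{P}(y \succ y' \mid x) \big].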
4.6 Termination of Iterative/Online Learning
When applying iterative or online training, determining when to terminate the iteration is crucial. Previous research has noted that iterative learning can sometimes degrade the performance of LLMs on specific tasks, which can be a sign of overfitting. However, identifying a reasonable epoch for stopping the iteration remains an unexplored area.
在應(yīng)用迭代或在線訓(xùn)練時(shí),確定何時(shí)終止迭代至關(guān)重要。以往研究指出,迭代學(xué)習(xí)有時(shí)會(huì)導(dǎo)致 LLM 在特定任務(wù)上的性能下降,這可能是過(guò)擬合的跡象。然而,確定合理的停止迭代的輪次仍是一個(gè)未探索的領(lǐng)域。
4.7 Simplify SFT + Alignment
Current methodologies typically implemented SFT and alignment in a consecutive manner. However, this approach often resulted in catastrophic forgetting and rendered the training process laborious. The PAFT method mitigated catastrophic forgetting by fine-tuning SFT and alignment separately before merging them, albeit at the cost of increased complexity. Conversely, the ORPO technique integrated both processes simultaneously, but this led to a decline in performance. Thus, the challenge of effectively combining SFT and alignment to achieve high performance while maintaining efficiency remains unresolved.
當(dāng)前方法通常采用連續(xù)的方式實(shí)現(xiàn) SFT 和對(duì)齊。然而,這種方法往往導(dǎo)致災(zāi)難性遺忘,并使訓(xùn)練過(guò)程變得繁瑣。PAFT 方法通過(guò)在合并 SFT 和對(duì)齊之前分別微調(diào)這兩者來(lái)減輕災(zāi)難性遺忘,盡管增加了復(fù)雜性。相反,ORPO 技術(shù)同時(shí)集成了這兩個(gè)過(guò)程,但這導(dǎo)致了性能下降。因此,如何有效地結(jié)合 SFT 和對(duì)齊以實(shí)現(xiàn)高性能且保持效率仍然是一個(gè)未解決的挑戰(zhàn)。
本站僅提供存儲(chǔ)服務(wù),所有內(nèi)容均由用戶發(fā)布,如發(fā)現(xiàn)有害或侵權(quán)內(nèi)容,請(qǐng)點(diǎn)擊舉報(bào)。
打開(kāi)APP,閱讀全文并永久保存 查看更多類似文章
猜你喜歡
類似文章
10行代碼媲美RLHF,用社交游戲數(shù)據(jù)訓(xùn)練社會(huì)對(duì)齊模型
OpenAI大神Andrej爆火演講,官方第一次揭秘大模型原理和訓(xùn)練過(guò)程!
LLM 全景圖 (The Landscape of LLM)
字節(jié)“開(kāi)盒”O(jiān)penAI所有大模型,揭秘GPT-3到GPT-4進(jìn)化路徑
大模型訓(xùn)練流程(四)強(qiáng)化學(xué)習(xí)
一文看盡LLM對(duì)齊技術(shù):RLHF、RLAIF、PPO、DPO……
更多類似文章 >>
生活服務(wù)
分享 收藏 導(dǎo)長(zhǎng)圖 關(guān)注 下載文章
綁定賬號(hào)成功
后續(xù)可登錄賬號(hào)暢享VIP特權(quán)!
如果VIP功能使用有故障,
可點(diǎn)擊這里聯(lián)系客服!

聯(lián)系客服