Other
- 1703.03864_Evolution Strategies as a Scalable Alternative to Reinforcement Learning
- 2305.14387_AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
- 2401.08417_CPO: Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation
- 2403.00409_Provably Robust DPO: Aligning Language Models with Noisy Feedback
- 2504.02495_DeepSeek-GRM: Inference-Time Scaling for Generalist Reward Modeling
- 2504.13958_ToolRL: Reward is All Tool Learning Needs