Other

1703.03864_Evolution Strategies as a Scalable Alternative to Reinforcement Learning
2305.14387_AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
2401.08417_CPO: Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation
2403.00409_Provably Robust DPO: Aligning Language Models with Noisy Feedback
2504.02495_DeepSeek-GRM: Inference-Time Scaling for Generalist Reward Modeling
2504.13958_ToolRL: Reward is All Tool Learning Needs