VoladorLuYu's Collection: Super Alignment
• Trusted Source Alignment in Large Language Models (arXiv:2311.06697)
• Diffusion Model Alignment Using Direct Preference Optimization (arXiv:2311.12908)
• SuperHF: Supervised Iterative Learning from Human Feedback (arXiv:2310.16763)
• Enhancing Diffusion Models with Text-Encoder Reinforcement Learning (arXiv:2311.15657)
• Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model (arXiv:2311.13231)
• Aligning Text-to-Image Diffusion Models with Reward Backpropagation (arXiv:2310.03739)
• RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback (arXiv:2309.00267)
• Aligning Language Models with Offline Reinforcement Learning from Human Feedback (arXiv:2308.12050)
• Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions (arXiv:2309.10150)
• Secrets of RLHF in Large Language Models Part I: PPO (arXiv:2307.04964)
• Efficient RLHF: Reducing the Memory Usage of PPO (arXiv:2309.00754)
• Aligning Large Multimodal Models with Factually Augmented RLHF (arXiv:2309.14525)
• Nash Learning from Human Feedback (arXiv:2312.00886)
• RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback (arXiv:2312.00849)
• Training Chain-of-Thought via Latent-Variable Inference (arXiv:2312.02179)
• Reinforcement Learning from Diffusion Feedback: Q* for Image Search (arXiv:2311.15648)
• OneLLM: One Framework to Align All Modalities with Language (arXiv:2312.03700)
• Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (arXiv:2204.05862)
• ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment (arXiv:2403.05135)
• Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences (arXiv:2404.03715)
• Dataset Reset Policy Optimization for RLHF (arXiv:2404.08495)
• Learn Your Reference Model for Real Good Alignment (arXiv:2404.09656)
• RLHF Workflow: From Reward Modeling to Online RLHF (arXiv:2405.07863)
• Iterative Reasoning Preference Optimization (arXiv:2404.19733)
• Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level (arXiv:2406.11817)
• Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs (arXiv:2406.18629)
• Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations (arXiv:2312.08935)
• Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning (arXiv:2407.00782)
• Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs (arXiv:2410.18451)