Technique
DPO (Direct Preference Optimization)
A training method that aligns language models to human preferences directly from preference data, without needing a separate reward model or reinforcement learning.
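To make "directly from preference data" concrete, here is a minimal sketch of the per-example DPO loss. It assumes the summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses have already been computed under both the model being trained and a frozen reference model. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices, not taken from any particular library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities (hypothetical helper).

    Each argument is the summed log-probability of one response given the prompt,
    under either the trained policy or the frozen reference model.
    beta scales how strongly the policy is pushed away from the reference.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Implicit reward margin between the chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)

    # Loss is -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, so the loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy favors the chosen response more than the reference does: lower loss.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model. No reward model is trained and no reinforcement learning loop is run: the preference pair itself supplies the training signal.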