cute bites

letters, math exercises, bite-sized experiments, and other musings!

Some DPO napkin math

TL;DR: if we model an LLM's prompt-conditioned output distribution as a mixture of an aligned component and an unaligned component, then a DPO tune "tilts" the output towards the aligned component, diminishing (but not expunging!) the unaligned component.

8 min read · December 26, 2024

2024 · dpo preference-tuning alignment math · research
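
To put a number on the TL;DR, here is a minimal sketch, not necessarily the derivation this post uses. It relies on the standard result that the KL-regularized objective behind DPO is optimized by pi*(y|x) ∝ pi_ref(y|x) · exp(r(x, y) / beta). Everything else is a simplifying assumption for illustration: the reference model is (1 - eps) · p_aligned + eps · p_unaligned, the two components have disjoint supports, and the reward is constant on each component, so only the gap between them matters. The names `tilted_unaligned_weight`, `eps`, `reward_gap`, and `beta` are hypothetical.

```python
import math


def tilted_unaligned_weight(eps: float, reward_gap: float, beta: float) -> float:
    """Mixture weight on the unaligned component after the exponential tilt.

    Assumptions (illustrative only):
      - pi_ref = (1 - eps) * p_aligned + eps * p_unaligned, disjoint supports
      - reward is constant on each component, with
        reward_gap = r_aligned - r_unaligned
      - the tuned policy is pi* ∝ pi_ref * exp(r / beta)
    """
    # After normalization only the reward gap matters, so scale the
    # unaligned component by exp(-gap / beta) relative to the aligned one.
    tilt = math.exp(-reward_gap / beta)
    return eps * tilt / ((1.0 - eps) + eps * tilt)


if __name__ == "__main__":
    # Larger gaps (or smaller beta) shrink the unaligned weight,
    # but it stays strictly positive for any finite gap.
    for gap in (0.0, 1.0, 2.0, 4.0):
        w = tilted_unaligned_weight(eps=0.2, reward_gap=gap, beta=0.5)
        print(f"reward_gap={gap:.1f}  unaligned weight={w:.6f}")
```

With a zero gap the weight stays at `eps`; as the gap grows the weight decays roughly like exp(-gap / beta), which is the "diminished but not expunged" behavior in this toy setup.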