Some DPO napkin math
TL;DR: if we model an LLM's prompt-conditioned output distribution as a mixture of an aligned component and an unaligned component, then a DPO tune "tilts" the output towards the aligned component, diminishing (but not expunging!) the unaligned component.
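To put rough numbers on that tilt, here is a minimal sketch under assumptions that are mine, not necessarily the post's: the two components have disjoint support, the implicit reward is constant within each component with the aligned one ahead by `reward_gap`, and the tune reaches the standard KL-regularized optimum pi*(y|x) ∝ pi_ref(y|x) · exp(r(x, y) / beta). The names `eps`, `reward_gap`, and `tilted_unaligned_weight` are hypothetical, chosen only for illustration.

```python
import math


def tilted_unaligned_weight(eps: float, reward_gap: float, beta: float) -> float:
    """Unaligned mixture weight after exponential tilting (illustrative only).

    eps:        unaligned weight under the reference policy, 0 < eps < 1
    reward_gap: r_aligned - r_unaligned (assumed constant per component)
    beta:       KL-penalty strength, as in the DPO objective
    """
    # Tilting multiplies the unaligned-vs-aligned odds by exp(-reward_gap / beta).
    odds = (eps / (1.0 - eps)) * math.exp(-reward_gap / beta)
    # Convert odds back to a mixture weight.
    return odds / (1.0 + odds)


if __name__ == "__main__":
    eps = 0.10  # assumed starting unaligned mass
    for gap in (0.0, 0.1, 0.5, 1.0):
        w = tilted_unaligned_weight(eps, gap, beta=0.1)
        print(f"reward_gap={gap:.1f}  unaligned weight={w:.3e}")
```

Under these assumptions the unaligned weight decays like exp(-reward_gap / beta) but stays strictly positive for any finite gap, which is the "diminished, not expunged" picture in miniature.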