Generative Vision Interview Questions #3 - The KL Divergence Paradox
Why calculating the exact marginals of a 1,000-step trajectory is computationally impossible, and the simple assumption that silently saves diffusion models from math that would otherwise stall training
You’re in a Senior AI Engineer interview at OpenAI. The interviewer sets a trap:
“Calculating the exact marginals for a 1,000-step diffusion trajectory is computationally impossible. Yet, the DDPM loss collapses into a simple L2 regression. What specific structural assumption saves us from infinite computational complexity?”
90% of candidates walk right into it.
The reflex is to start furiously writing the Evidence Lower Bound (ELBO) on the whiteboard. Most candidates mumble about the reparameterization trick or default to, “We train a U-Net to predict the noise ε.”
The reality is, the ELBO alone doesn’t save you. It only rewrites the objective as a sum of per-step KL divergence terms. If your reverse process weren’t tightly constrained, each of those KL terms would have no closed form and would still require integrating over a probability space so massive it would melt an H100 cluster before completing a single training step.
The unlock is what I call The Gaussian Mirror Assumption.
Here is the production truth:
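In the forward diffusion process, injecting Gaussian noise step by step guarantees that the posterior q(x_{t-1} | x_t, x_0), conditioned on the clean sample x_0, is itself an isotropic Gaussian whose mean and variance are known in closed form.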
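For reference, here is that posterior written out. The symbols β_t, α_t, and ᾱ_t do not appear in this post; they are borrowed from the DDPM paper's notation, where β_t is the noise variance added at step t, so treat this as a restatement under that paper's assumptions rather than anything new:

```latex
% Notation assumed from the DDPM paper:
%   \alpha_t = 1 - \beta_t, \qquad \bar\alpha_t = \prod_{s=1}^{t} \alpha_s
\begin{aligned}
q(x_{t-1} \mid x_t, x_0) &= \mathcal{N}\!\left(x_{t-1};\ \tilde\mu_t(x_t, x_0),\ \tilde\beta_t I\right), \\
\tilde\mu_t(x_t, x_0) &= \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1 - \bar\alpha_t}\, x_0
  + \frac{\sqrt{\alpha_t}\,\left(1 - \bar\alpha_{t-1}\right)}{1 - \bar\alpha_t}\, x_t, \\
\tilde\beta_t &= \frac{1 - \bar\alpha_{t-1}}{1 - \bar\alpha_t}\, \beta_t .
\end{aligned}
```

Note that nothing here is learned. The mean is a fixed linear blend of x_0 and x_t, and the variance is a scalar set by the noise schedule.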
The trap is assuming the reverse neural network p_θ(x_{t-1} | x_t) is tasked with learning a highly complex, non-linear distribution at every step. It isn’t.
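We strictly force each reverse transition to also be a Gaussian. In the original DDPM, its variance is fixed by the noise schedule, so the network only has to predict a mean.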
Why? Because the KL divergence between two Gaussian distributions isn’t a terrifying integral. It has an exact, closed-form analytical solution. With the variances fixed, the complex probability math literally cancels out, leaving behind nothing but a scaled squared Euclidean distance between their means, plus a constant the network cannot influence.
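To see the closed form do its job, here is a minimal NumPy sketch that compares the analytical KL between two isotropic Gaussians with a brute-force Monte Carlo estimate. The function names and the example numbers are hypothetical and chosen only for illustration:

```python
import numpy as np

def kl_isotropic_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL( N(mu_p, sigma_p^2 I) || N(mu_q, sigma_q^2 I) )."""
    d = mu_p.size
    sq_dist = np.sum((mu_p - mu_q) ** 2)
    return (d * np.log(sigma_q / sigma_p)
            + (d * sigma_p ** 2 + sq_dist) / (2.0 * sigma_q ** 2)
            - d / 2.0)

def kl_monte_carlo(mu_p, sigma_p, mu_q, sigma_q, n_samples=200_000, seed=0):
    """Sampling-based estimate of the same KL, for comparison only."""
    rng = np.random.default_rng(seed)
    d = mu_p.size
    # Draw samples from the first Gaussian.
    x = mu_p + sigma_p * rng.standard_normal((n_samples, d))

    def log_pdf(points, mu, sigma):
        return (-0.5 * np.sum((points - mu) ** 2, axis=1) / sigma ** 2
                - d * np.log(sigma)
                - 0.5 * d * np.log(2.0 * np.pi))

    # KL is the expected log-density ratio under the first Gaussian.
    return np.mean(log_pdf(x, mu_p, sigma_p) - log_pdf(x, mu_q, sigma_q))

# Illustrative values: two 3-D Gaussians sharing the same standard deviation.
mu_a = np.array([0.5, -1.0, 2.0])
mu_b = np.zeros(3)
sigma = 0.7

exact = kl_isotropic_gaussians(mu_a, sigma, mu_b, sigma)
estimate = kl_monte_carlo(mu_a, sigma, mu_b, sigma)
# With equal variances, the closed form is just ||mu_a - mu_b||^2 / (2 sigma^2).
scaled_sq_dist = np.sum((mu_a - mu_b) ** 2) / (2.0 * sigma ** 2)
```

When the two variances match, the log and constant terms cancel and `exact` equals `scaled_sq_dist`. The Monte Carlo estimate needs hundreds of thousands of samples to land close to a number the closed form produces in a handful of arithmetic operations, and this is only three dimensions.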
The Answer That Gets You Hired:
“We assume each reverse transition is an isotropic Gaussian with a fixed variance. This ensures the KL divergence against the forward posterior has a closed-form solution, structurally collapsing an intractable integral into a cheap squared distance between two means, which the noise parameterization turns into an L2 loss between the predicted and actual noise.”
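If the interviewer pushes for proof, the last step of that answer can be checked numerically. The sketch below assumes the linear noise schedule and the noise parameterization of the mean from the DDPM paper, fixes the reverse variance to the posterior variance (one of the two choices discussed there), and uses a perturbed copy of the true noise as a stand-in for a network output. All function and variable names are hypothetical:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule, assumed from the DDPM paper
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def posterior_mean_var(x0, xt, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0), for 0-based step index t >= 1."""
    ab_t, ab_prev = alpha_bar[t], alpha_bar[t - 1]
    coef_x0 = np.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)
    coef_xt = np.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
    var = (1.0 - ab_prev) / (1.0 - ab_t) * betas[t]
    return coef_x0 * x0 + coef_xt * xt, var

def mean_from_noise(xt, eps_hat, t):
    """Reverse-process mean implied by a noise prediction eps_hat."""
    return (xt - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])

def kl_term_and_weighted_noise_l2(x0, eps, eps_hat, t):
    """Return the Gaussian KL term (up to its constant) and the weighted noise L2."""
    # Jump straight to step t using the closed-form forward marginal.
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    mu_q, var = posterior_mean_var(x0, xt, t)
    mu_p = mean_from_noise(xt, eps_hat, t)
    # Equal variances: the KL reduces to a scaled squared distance between means.
    kl = np.sum((mu_q - mu_p) ** 2) / (2.0 * var)
    weight = betas[t] ** 2 / (2.0 * var * alphas[t] * (1.0 - alpha_bar[t]))
    return kl, weight * np.sum((eps - eps_hat) ** 2)

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)                       # toy "image" with 16 pixels
eps = rng.standard_normal(16)                      # true injected noise
eps_hat = eps + 0.1 * rng.standard_normal(16)      # stand-in for a network prediction
kl, weighted_l2 = kl_term_and_weighted_noise_l2(x0, eps, eps_hat, t=500)
```

The two returned values agree up to floating-point error, which is the whole point: the per-step KL is a weighted L2 between true and predicted noise. The simplified training objective in the DDPM paper goes one step further and drops the per-step weight.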


📚 Related Papers:
- Denoising Diffusion Probabilistic Models (DDPM). Available at: https://arxiv.org/abs/2006.11239
- Deep Unsupervised Learning using Nonequilibrium Thermodynamics. Available at: https://arxiv.org/abs/1503.03585
- Score-Based Generative Modeling through Stochastic Differential Equations. Available at: https://arxiv.org/abs/2011.13456
- Improved Denoising Diffusion Probabilistic Models. Available at: https://arxiv.org/abs/2102.09672