Generative Vision Interview Questions #5 - The Mode Ascent Trap

Why your perfectly converged score estimator silently collapses into hyper-average outputs, and how injecting calibrated stochastic noise forces the model to explore the full distribution instead of

Jun 12, 2026

You’re in a Senior AI Engineer interview at Midjourney. The interviewer sets a trap.

“Your score estimator loss is perfectly converged after 400 hours on an A100 cluster. But during inference, deterministically following the gradient generates the exact same 3 hyper-average images on repeat. Why?”

90% of candidates walk right into it.

Most candidates instinctively blame the training loop or the dataset.

They say, “The model is overfitted to the majority class. We need to increase gradient clipping, crank up the dropout, or lower the learning rate to 1e-5 to capture more diverse modes.”

But there is no broken training run to fix. The weights are flawless.

The reality is you are misusing the gradients during inference. You treated a generative sampling problem like an optimization problem. You asked the model for the “most likely” image, so it gave you a mode of the learned density: the most typical, hyper-average-looking image, not a draw from the distribution.

If you strictly follow the score gradient ∇ₓ log p(x), you aren’t sampling. You are just performing standard gradient ascent.

The Gradient: Pulls your vector straight uphill to the nearest local maximum of the density.
The Result: You bypass the rich variance of the data manifold and collapse onto a peak. Every starting point in the same basin ends at the same peak, so you get the same few generic images every single time.
The Fix: You must use Langevin dynamics.
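To see the collapse concretely, here is a minimal sketch on a toy problem. Everything in it is an assumption for illustration: the 1D two-Gaussian mixture, the names `MEANS`, `STD`, `WEIGHTS`, `score`, and `gradient_ascent`, and the step size. The exact analytic score stands in for a trained score network.

```python
import numpy as np

# Hypothetical toy target: a 1D mixture of two Gaussians (not from the original post).
MEANS = np.array([-2.0, 2.0])
STD = 0.5
WEIGHTS = np.array([0.5, 0.5])

def score(x):
    """Exact d/dx log p(x) of the mixture, standing in for a trained score network."""
    x = np.asarray(x, dtype=float)[..., None]            # shape (..., 1)
    # Component log-densities up to a shared constant (it cancels when normalizing).
    log_comp = np.log(WEIGHTS) - 0.5 * ((x - MEANS) / STD) ** 2
    log_comp -= log_comp.max(axis=-1, keepdims=True)     # numerical stability
    resp = np.exp(log_comp)
    resp /= resp.sum(axis=-1, keepdims=True)             # component responsibilities
    return (resp * (MEANS - x) / STD**2).sum(axis=-1)

def gradient_ascent(x0, step=0.01, n_steps=2000):
    """Deterministically follow the score: no noise, so this is plain gradient ascent."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * score(x)
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 3.0, size=1000)     # 1000 different random starting points
final = gradient_ascent(x0)

# Every start ends at one of the two peaks; all within-mode variance is gone.
unique_endpoints = np.unique(np.round(final, 3))
print(unique_endpoints)
```

A thousand distinct initializations end up at just two values, one per mode, which is the toy version of "the exact same 3 images on repeat".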

At every step of inference, you must inject perfectly calibrated Gaussian noise alongside the gradient update.
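Written out, one common form of the (unadjusted) Langevin update looks like the following. The step size symbol ε and the iteration index t are notation assumed here, and papers differ on where the factor of 2 goes, so treat this as one convention rather than the only one:

```latex
x_{t+1} = x_t + \epsilon \, \nabla_x \log p(x_t) + \sqrt{2\epsilon}\; z_t,
\qquad z_t \sim \mathcal{N}(0, I)
```

"Calibrated" refers to the coupling between the two terms: the noise standard deviation is tied to the gradient step through the square root of 2ε. Drop the last term and you are back to plain gradient ascent.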

The gradient pulls you toward the high-density regions of the data manifold, while the stochastic noise keeps kicking you back off the peak. That balance between drift and diffusion prevents you from settling at a maximum, so with small steps and enough iterations the iterates are spread according to p(x) across the full volume of the distribution.
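A small sketch makes the difference measurable. It assumes a hypothetical 1D equal-weight two-Gaussian mixture with an exact analytic score in place of a trained network; `run_chain`, `noise_scale`, and the step size are illustrative choices, not anything prescribed by the post.

```python
import numpy as np

MEANS = np.array([-2.0, 2.0])   # hypothetical mode locations
STD = 0.5                       # shared component standard deviation

def score(x):
    """Exact score of an equal-weight two-Gaussian mixture (stand-in for a trained model)."""
    x = np.asarray(x, dtype=float)[..., None]
    log_comp = -0.5 * ((x - MEANS) / STD) ** 2
    log_comp -= log_comp.max(axis=-1, keepdims=True)    # numerical stability
    resp = np.exp(log_comp)
    resp /= resp.sum(axis=-1, keepdims=True)
    return (resp * (MEANS - x) / STD**2).sum(axis=-1)

def run_chain(x0, step, n_steps, noise_scale, rng):
    """noise_scale=1.0 gives unadjusted Langevin; noise_scale=0.0 gives gradient ascent."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        # Gradient step plus Gaussian noise scaled by sqrt(2 * step).
        x = x + step * score(x) + noise_scale * np.sqrt(2.0 * step) * z
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 3.0, size=2000)
ascent = run_chain(x0, step=0.01, n_steps=1000, noise_scale=0.0, rng=rng)
langevin = run_chain(x0, step=0.01, n_steps=1000, noise_scale=1.0, rng=rng)

# Compare the spread of the chains that ended up around the right-hand mode.
right_ascent = ascent[ascent > 0]
right_langevin = langevin[langevin > 0]
print("ascent spread around right mode:  ", right_ascent.std())
print("Langevin spread around right mode:", right_langevin.std())
```

Without noise the spread around the mode is essentially zero. With noise it lands close to the component's true standard deviation of 0.5. In this toy the chains rarely cross between the two well-separated modes, which is a known weakness of plain Langevin sampling; the first paper listed below tackles it by annealing over multiple noise levels.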

The Answer That Gets You Hired:

“Deterministically following the score function is gradient ascent on log p(x): it converges to the nearest mode, so every run collapses onto the same few peaks. Valid generative modeling requires sampling from the full volume of the distribution, which is why Langevin dynamics injects calibrated Gaussian noise at every step, exploring the density surface rather than just climbing to its peak.”

📚 Related Papers:

- Generative Modeling by Estimating Gradients of the Data Distribution. Available at: https://arxiv.org/abs/1907.05600

- Score-Based Generative Modeling with Critically-Damped Langevin Diffusion. Available at: https://arxiv.org/abs/2112.07068

- Score-Based Generative Modeling through Stochastic Differential Equations. Available at: https://arxiv.org/abs/2011.13456
