Advanced Deep Learning Interview Questions #13 - The Per-Feature Learning Rate Trap

Hardcoding per-dimension LRs ignores that gradient variance is non-stationary, causing optimizers to either freeze or explode as distributions shift mid-training.

Apr 03, 2026

You’re in a Senior ML Engineer interview at DeepMind. The interviewer sets a trap:

“You are training a dense recommender system. One feature dimension has violently massive gradient swings, while another dimension is completely sparse and barely updates at all. How do you stabilize it?”

95% of candidates walk right into it.

Most candidates say: “I’ll write a custom, per-feature learning rate schedule. I can manually drop the LR for the noisy dimension to 1e-5 to prevent explosions, and boost the sparse dimension to 1e-2 to force it to learn.”

They just failed the interview.

𝐓𝐡𝐞 𝐑𝐞𝐚𝐥𝐢𝐭𝐲:

Manually hacking learning rates is a massive waste of engineering cycles and cluster compute.

You cannot babysit an 800-billion parameter model across a 10,000 H100 GPU cluster and hardcode scaling rules for billions of dimensions.

Feature sparsity and gradient variance are deeply dynamic and shift wildly from batch to batch.

If you hardcode a low LR for a noisy dimension, it will permanently stagnate the moment its variance drops, stalling your distributed training run and torching thousands of dollars in wasted FLOPs.

𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧:

You need a math-driven, adaptive architecture that scales systematically - which is the exact physics behind RMSProp (and the denominator in Adam).

1️⃣ We maintain an exponentially decaying average of the squared gradients (the historical variance) for every single weight in the network.

2️⃣ We then divide our global learning rate by the root-mean-square (RMS) of this historical gradient matrix.

3️⃣ The math does the heavy lifting: Violent, oscillating gradients get divided by a massive denominator, safely suffocating their updates so they don’t blow up your loss.

4️⃣ Meanwhile, dead or sparse feature gradients get divided by a microscopic denominator, which mathematically boosts their local learning rate without human intervention.

𝐓𝐡𝐞 𝐀𝐧𝐬𝐰𝐞𝐫 𝐓𝐡𝐚𝐭 𝐆𝐞𝐭𝐬 𝐘𝐨𝐮 𝐇𝐢𝐫𝐞𝐝:

I don’t hardcode manual heuristics for dynamic system physics. I use an adaptive optimizer like RMSProp to automatically normalize updates by the historical variance of each dimension, scaling the step sizes inversely to their gradient magnitude.

#MachineLearning #MLEngineering #DeepLearning #NeuralNetworks #AdamOptimizer #AI #ScalingLaws

📚 Related Papers:

- Adam: A Method for Stochastic Optimization. Available at: https://arxiv.org/abs/1412.6980

- Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (JMLR). Available at: https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf

- An overview of gradient descent optimization algorithms. Available at: https://arxiv.org/abs/1609.04747

- MuonRec: Shifting the Optimizer Paradigm Beyond Adam in Scalable Generative Recommendation. Available at: https://arxiv.org/abs/2603.00416

AI Interview Prep

Discussion about this post

Ready for more?