Machine Learning System Design Interview #30 - The Transformation Debt Trap
Why treating GenAI pipelines like a BI dashboard quietly pollutes your training sets, and how to lock in immutable, model-ready artifacts before they ever hit your H100s.
You’re in a Senior ML Engineer interview at Meta. The interviewer sets a trap:
“We need to ingest petabytes of raw, unstructured data (text, images, and audio) for our new multimodal GenAI pipeline. Everyone loves the modern data stack, so should we use ELT to dump it all into the data lakehouse as fast as possible and transform it later?”
95% of candidates walk right into it.
Most candidates say: “Absolutely. ELT is the modern standard. Storage is cheap, so we should extract the raw data, load it immediately to avoid data loss, and use dbt or Spark to run transformations on the fly inside the warehouse. It gives us maximum flexibility for exploratory model training.”
They just failed. That is a naive data-engineering patch, not an ML systems solution.
𝐓𝐡𝐞 𝐑𝐞𝐚𝐥𝐢𝐭𝐲:
For standard BI analytics, ELT is fine. But for enterprise GenAI workflows, ELT creates catastrophic “transformation debt.” When you dump unstructured raw data (like raw PDFs, heavily artifacted images, or unnormalized audio) into a lakehouse and defer the transformation, your ML pipelines are forced to re-compute complex, non-deterministic transformations dynamically on read.
This destroys model feature reproducibility. If your transformation logic shifts slightly over time, or if downstream consumers use slightly different text chunking or BPE tokenization logic before hitting the H100s for training, you have silently polluted your dataset.
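To make the pollution concrete, here is a minimal sketch, assuming two hypothetical consumers ("team A" and "team B") that chunk the same raw text on read with slightly different overlap settings. The `chunk_text` and `dataset_fingerprint` helpers are illustrative names, not part of any library, and character-window chunking stands in for whatever chunking or tokenization a real pipeline uses.

```python
import hashlib
import json


def chunk_text(text: str, size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows with the given overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def dataset_fingerprint(chunks: list[str]) -> str:
    """Hash the exact chunk sequence so any drift in the output is detectable."""
    payload = json.dumps(chunks, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


raw_document = "The quick brown fox jumps over the lazy dog. " * 4

# Two hypothetical downstream consumers transforming the same raw file on read.
team_a_chunks = chunk_text(raw_document, size=64, overlap=0)
team_b_chunks = chunk_text(raw_document, size=64, overlap=8)

fingerprint_a = dataset_fingerprint(team_a_chunks)
fingerprint_b = dataset_fingerprint(team_b_chunks)

# Same raw input, different training examples: the fingerprints do not match.
print(len(team_a_chunks), len(team_b_chunks))
print(fingerprint_a == fingerprint_b)
```

Both teams believe they trained on "the same dataset," but the examples the model actually saw differ. Fingerprinting the transformed output, rather than the raw input, is what exposes the drift.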
Furthermore, transforming petabytes of unstructured data on the fly wastes massive amounts of compute. You are burning CPU cycles and expensive GPU time on heavy preprocessing at training time instead of actually calculating gradients.
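A rough back-of-the-envelope model shows why the waste compounds. The symbols are assumptions for illustration, not measurements: let $C_t$ be the cost of transforming the raw corpus once, $C_r$ the cost of reading an already-transformed artifact, and $N$ the number of training runs or epochs that consume the data.

$$
\text{cost}_{\text{on-read}} = N \cdot C_t, \qquad \text{cost}_{\text{ETL}} = C_t + N \cdot C_r
$$

Materializing up front is cheaper whenever $N \cdot C_t > C_t + N \cdot C_r$, that is, when $C_r < C_t \left(1 - \tfrac{1}{N}\right)$. For large $N$ this reduces to $C_r < C_t$: reading a prepared artifact only has to be cheaper than redoing the transformation.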
𝐓𝐡𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧: At scale, ML engineering teams are shifting back to strict ETL pipelines to guarantee dataset immutability and cut compute costs.
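One way to picture "strict ETL with immutable artifacts" is a transform-once, write-once step. This is a minimal sketch under stated assumptions, not the approach of any specific team: `TRANSFORM_CONFIG`, `transform`, `materialize`, and `load_verified` are hypothetical names, the toy transform stands in for real preprocessing, and a local directory stands in for object storage. It uses only the Python standard library (3.9+).

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Hypothetical transformation config; pin everything that affects the output.
TRANSFORM_CONFIG = {"chunk_size": 64, "overlap": 0, "normalizer": "lowercase-v1"}


def transform(record: str, config: dict) -> list[str]:
    """Toy stand-in for the real preprocessing step (normalize, then chunk)."""
    text = record.lower()
    step = config["chunk_size"] - config["overlap"]
    return [text[i:i + config["chunk_size"]] for i in range(0, len(text), step)]


def materialize(records: list[str], config: dict, out_dir: Path) -> Path:
    """Run the transform once and store the result under a content-derived name."""
    rows = [chunk for record in records for chunk in transform(record, config)]
    body = json.dumps({"config": config, "rows": rows}, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(body).hexdigest()
    artifact_path = out_dir / f"dataset-{digest[:16]}.json"
    if not artifact_path.exists():
        # Exclusive-create mode ("x") fails instead of overwriting an existing artifact.
        with open(artifact_path, "xb") as handle:
            handle.write(body)
    return artifact_path


def load_verified(artifact_path: Path) -> list[str]:
    """Re-hash on read so training never consumes a silently modified artifact."""
    body = artifact_path.read_bytes()
    expected = artifact_path.stem.removeprefix("dataset-")
    if hashlib.sha256(body).hexdigest()[:16] != expected:
        raise ValueError(f"artifact {artifact_path.name} failed integrity check")
    return json.loads(body)["rows"]


with tempfile.TemporaryDirectory() as tmp:
    raw = ["Raw PDF text goes here. " * 5, "Transcribed audio goes here. " * 5]
    first = materialize(raw, TRANSFORM_CONFIG, Path(tmp))
    second = materialize(raw, TRANSFORM_CONFIG, Path(tmp))
    rows = load_verified(first)
    same_artifact = first == second
    print(first.name, len(rows), same_artifact)
```

Because the artifact name is derived from both the config and the transformed rows, changing the chunking logic produces a new artifact instead of mutating the old one, and every training run can record exactly which artifact it consumed.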

