Google AI: Mathematically Wrong, Empirically Right
Jeff Dean reveals how Google's AI capabilities emerged from 15 years of engineering decisions that violated theoretical principles but kept working anyway.
TL;DR
Jeff Dean’s talk at Stanford isn’t a triumphant history of breakthroughs but something more instructive: a confession that modern AI emerged from decisions that were, by his own admission, “completely mathematically wrong.”
The distributed training system they built, DistBelief, used asynchronous gradient updates across 200 model replicas, all modifying shared parameters simultaneously without coordination. The theoretical objections were numerous, but it worked anyway, enabling neural networks 50 to 100 times larger than anyone had attempted.
Each subsequent innovation (TPUs, transformers, sparsity, distillation, reinforcement learning in verifiable domains) compounds on previous layers.
For anyone building (or investing) in AI, the message runs against instinct: theoretical elegance often trails empirical pragmatism, and the winning architectures emerge from hardware constraints as much as research insights (and a legendary culture of optimisation).
I. The Pragmatist’s Confession
Dean opens with a moment of humility. In 1990, as a college senior, he built parallel training systems for neural networks on a 32-processor hypercube machine. “I was completely wrong. You needed like a million times as much processing power to make really good neural nets, not 32 times.” He’d misjudged by more than four orders of magnitude. Capability in AI has consistently arrived faster than principled analysis would predict, and slower than enthusiasts hoped, because the binding constraints were never where the theory suggested.
The founding moment of Google Brain in 2012 carries the same texture. Dean bumps into Andrew Ng in a micro-kitchen. Ng mentions his Stanford students are “starting to get good results with neural nets on speech problems.” Dean’s response: “We should train really big neural networks.” No theoretical framework. No research proposal. Just an intuition that scale, applied to neural networks on Google’s CPU clusters, would yield something.
What emerged was a distributed training system they named DistBelief, the name itself a joke about the sceptics. “People didn’t believe it was going to work,” Dean recalls. The system supported both model parallelism and data parallelism, but its core innovation was architecturally scandalous: asynchronous training where 200 model replicas would simultaneously compute gradient updates and apply them to shared parameters without synchronisation.
“This is all completely mathematically wrong,” Dean admits, “because at the same time all the other model replicas were also computing gradients and asynchronously adding them into this shared set of parameter state.” Gradient descent assumes you compute a correction at position X and apply it at position X. With 200 asynchronous replicas, you compute a correction at position X but apply it at position Y, because the parameters moved while you were computing. Mathematically, this shouldn’t converge.
But DistBelief enabled training neural networks 50 to 100 times larger than anyone had attempted. The gap between what should work and what does work turned out to be wide enough to build an industry in.
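The “compute at X, apply at Y” problem can be made concrete with a toy simulation. This is an illustrative sketch, not DistBelief itself: a queue stands in for stale replicas, the loss is a one-dimensional quadratic, and the delays are invented. The point it demonstrates is the one Dean makes: gradients computed against stale parameters still converge in practice.

```python
import random

# Toy sketch of asynchronous SGD (not Google's DistBelief; invented
# loss and delays). Loss f(w) = (w - 3)^2, gradient 2*(w - 3).
# A "replica" reads the shared parameter, computes a gradient from that
# possibly stale copy, and the update lands on wherever the parameter
# has moved in the meantime -- the "mathematically wrong" part.

def grad(w):
    return 2.0 * (w - 3.0)

shared_w = 0.0      # shared parameter state
lr = 0.05
pending = []        # gradients computed but not yet applied

random.seed(0)
for step in range(500):
    # A replica snapshots the current parameters and computes a gradient...
    pending.append(grad(shared_w))
    # ...and, after a random delay, an *older* gradient is applied to
    # the *current* parameters (computed at X, applied at Y).
    if len(pending) > 3 or random.random() < 0.5:
        shared_w -= lr * pending.pop(0)

# Despite the staleness, the parameter still lands near the optimum (3.0).
print(round(shared_w, 2))
```

With a small learning rate and bounded staleness, the delayed updates behave almost like ordinary gradient descent, which is roughly why the “wrong” scheme kept working at 200 replicas.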
II. Emergent Structure
Before Google Brain built production systems, it ran an experiment that remains philosophically significant. The team fed 10 million random frames from YouTube videos into a neural network trained on a single objective: reconstruct the input pixels from a compressed internal representation. No labels. No categories. No human annotation.
At the top of the trained network, individual neurons had become sensitive to high-level concepts. One neuron fired maximally for images containing cats. Others responded to human faces, or the backs of pedestrians, or specific object categories. The model had never been told what a cat was. The concept emerged from statistical regularities in unlabelled video.
“It had sort of come up with the concept of a cat just by being exposed to that,” Dean notes, casually. The result delivered a 70% relative improvement on ImageNet’s 22,000-category benchmark, but the benchmark gain matters less than the implication: semantic categories can crystallise from raw exposure without explicit teaching. The network discovered that “cat” was a useful clustering in pixel-space before anyone told it cats existed. Like Molière’s Monsieur Jourdain discovering he’d been speaking prose his whole life, Google Brain had invented foundation models before the term existed.
The word embedding work extended this finding to language. Rather than treating words as discrete symbols, the team represented each word as a vector, a point in high-dimensional space whose coordinates are learned from context. Words that appear in similar contexts end up near each other. By training representations to predict nearby words from context, the team discovered that the resulting high-dimensional vectors exhibited consistent geometric structure. Subtracting “man” from “king” and adding “woman” yielded a vector near “queen.” The same directional transformation worked regardless of starting point: actor to actress, waiter to waitress.
The model wasn’t taught grammar. It found geometric regularities that happen to align with human grammatical categories. The clustering emerges from statistical patterns, not from categories we impose on language. If a neural network and a linguist arrive at the same structure independently, that structure probably isn’t arbitrary.
Incidentally, for linguists, this was more than an interesting result. Chomsky spent decades arguing grammar couldn’t emerge from statistics alone, that language required innate structure. The word vectors suggested otherwise: grammatical categories appear to be latent in language itself, discoverable by any system with enough exposure. The neural network had taken sides in a foundational debate (without knowing there was one).
What makes embeddings powerful is that they generalise. If the geometric structure were merely an artifact of training, it would fail on new domains. It doesn’t. Learned from exposure rather than annotation, the structure transfers to contexts the training never covered. It was already there, latent in language itself. The network just learned to see it.
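The vector arithmetic is easy to demonstrate. The sketch below uses tiny hand-crafted vectors rather than trained embeddings (the dimensions and words are invented for illustration), but the mechanics are the ones the section describes: subtract, add, then find the nearest neighbour by cosine similarity.

```python
import math

# Hand-crafted 3-D "embeddings" (illustrative only, not trained vectors).
# Dimensions loosely encode [royalty, gender, person-ness].
vecs = {
    "king":  [0.9,  0.9, 1.0],
    "queen": [0.9, -0.9, 1.0],
    "man":   [0.1,  0.9, 1.0],
    "woman": [0.1, -0.9, 1.0],
    "apple": [0.0,  0.0, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# king - man + woman, done coordinate-wise
target = [k - m + w for k, m, w in
          zip(vecs["king"], vecs["man"], vecs["woman"])]

# Nearest neighbour, excluding the query words themselves
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vecs[w], target))
print(best)  # queen
```

In real embeddings the same directional offset (here, the gender axis) is learned from context rather than hand-coded, which is what makes the actor/actress and waiter/waitress analogies fall out of the same subtraction.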
III. Hardware as Forcing Function
The TPU origin story follows a familiar pattern in systems engineering: deployment constraints forcing hardware innovation. Dean ran the numbers on a high-quality speech recognition model Google had developed but not deployed: “If 100 million people want to start to talk to their phones for three minutes a day... we would need to double the number of computers Google had in order just to roll out this improved speech recognition feature.”
The problem wasn’t the algorithm but the energy and cost of running it. Neural networks tolerate low-precision computation (no need for 32-bit floating-point), and all operations reduce to dense linear algebra (essentially multiplying large tables of numbers together, which is what neural networks spend most of their time doing). These properties had been theoretically known for years. What changed was that deployment requirements made exploiting them urgent rather than optional.
TPU v1 emerged from this pressure: purpose-built silicon optimised for reduced-precision matrix multiplication. The result was 15 to 30 times faster than contemporary CPUs and GPUs, and 30 to 80 times more energy-efficient. The resulting paper is now the most cited in ISCA’s 50-year history.
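Why low precision suffices can be shown in a few lines. This is an illustrative sketch, not TPU code: it quantises weights and activations to 8-bit integers, does the matrix multiply entirely in integer arithmetic, then rescales, and checks that the answer stays close to the full-precision product.

```python
import random

# Sketch of reduced-precision matrix multiplication (illustrative, not
# TPU silicon): quantise floats in [-1, 1] to signed 8-bit integers,
# multiply in integer arithmetic, rescale, and compare with the
# full-precision result.

random.seed(1)
N = 4
A = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(N)]
B = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(N)]

def quantise(M, scale=127.0):
    # map each float in [-1, 1] to an integer in [-127, 127]
    return [[round(x * scale) for x in row] for row in M]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

full = matmul(A, B)                       # full-precision reference
q = matmul(quantise(A), quantise(B))      # integer-only multiply
approx = [[x / (127.0 * 127.0) for x in row] for row in q]

max_err = max(abs(full[i][j] - approx[i][j])
              for i in range(N) for j in range(N))
print(max_err < 0.05)  # quantisation error stays small
```

Neural networks tolerate this error comfortably, and integer multiply-accumulate units are far cheaper in silicon area and energy than floating-point ones, which is the economic core of the TPU bet.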
The computing paradigm of the previous three decades (fast CPUs, many cores, general-purpose instruction sets) was architecturally unsuited to the workloads that now mattered. Neural networks didn’t need versatility. They needed matrix multiplication, done fast, at low precision, at scale.
Dean frames this as a shift in what computing is for: “15 years ago you cared about how fast was your CPU, could it run Microsoft Word and Chrome... now you care can it run interesting machine learning computations.” The entire hardware industry has reoriented accordingly.
IV. Preserve, Don’t Compress
The 2017 transformer paper, “Attention Is All You Need,” gets cited so frequently that its core insight often gets lost. Dean's explanation is clarifying.
Consider reading a book and trying to remember it as you go. LSTMs, the previous dominant architecture, worked like summarising each page into a single paragraph, then summarising that paragraph plus the next page into another paragraph, and so on. By chapter ten, information from chapter one has been compressed and recompressed so many times that details vanish. If you need to recall a specific sentence from the opening, it’s probably gone.
Transformers work differently. Instead of compressing as you go, you keep the full text available and look back at whatever you need, whenever you need it. “Let’s not try to force all that state into a vector that we update every step,” Dean paraphrases. “Instead, let’s just be able to save all those states we go through and then attend to all of them whenever we’re trying to do something.” The model can consult page one while writing page fifty.
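The “save all states and attend to them” idea reduces to a short computation. The sketch below is a single-query, single-head version (illustrative, not the paper’s full multi-head architecture): every past state is kept, and the current step looks back by softmax-weighting all of them by query–key similarity.

```python
import math

# Minimal single-query attention sketch (illustrative; not the full
# multi-head transformer). All saved states are kept; the current step
# "looks back" with a softmax over query-key similarity.

def attend(query, keys, values):
    d = len(query)
    # scaled dot-product similarity against every stored state
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # softmax over all stored states -- nothing is compressed away
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # weighted mixture of the saved value vectors
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Three saved states; the query most resembles the *first* one, so
# attention can recall "page one" directly, however long ago it was.
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
out = attend([1.0, 0.0], keys, values)
print(out[0] > out[1])  # weight concentrates on the first state
```

Nothing in this mechanism decays with distance: state one is as retrievable at step fifty as at step two, which is exactly the property the LSTM’s repeated summarisation lacked.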
The efficiency gains were substantial: 10 times fewer parameters to reach equivalent performance, and 10 to 100 times less compute. But the architectural philosophy matters as much as the benchmarks. Transformers treat context as a resource to index, not a signal to compress. This explains why context window length has become a key capability metric, and why long-context applications (document analysis, codebase understanding, extended conversation) have improved so dramatically as context limits expanded from thousands to millions of tokens.
The approach has since generalised beyond language. Dean’s colleagues applied transformer architectures to computer vision (cf. the Vision Transformer), achieving 4 to 20 times less compute for equivalent accuracy compared to convolutional approaches. The principle holds across modalities: preserving information and learning to select from it outperforms aggressive compression, at least when you have sufficient compute to store and attend over the preserved states.
V. The Sparsity Inversion
Dense models activate every parameter for every prediction. This is, in Dean’s phrasing, “very wasteful.” His advocacy for sparse architectures since the late 2010s has now become the dominant paradigm at the frontier. Gemini models activate roughly 1 to 5 percent of total parameters per prediction, routing inputs to specialised subnetworks while leaving the rest dormant.
The brain doesn’t activate every neuron for every thought. Different regions handle different tasks. A model answering a chemistry question needn’t activate the same parameters as one generating Python code. Sparse mixture-of-experts architectures apply this logic: route each input to the relevant specialists, leave the rest dormant. Dean’s data shows an 8x reduction in training compute for equivalent accuracy when sparsity is properly implemented.
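Routing can be sketched in a few lines. The experts, router scores, and inputs below are all invented for illustration (this is not Gemini’s architecture): a router scores each expert for the input, only the top-k run, and the rest stay dormant.

```python
# Sketch of mixture-of-experts routing (illustrative; invented experts
# and router scores, not Gemini's architecture). Only the top-k experts
# execute, so compute scales with *activated* parameters, not total.

def expert_chemistry(x):
    return x * 2.0

def expert_code(x):
    return x + 100.0

def expert_other(x):
    return -x

EXPERTS = [expert_chemistry, expert_code, expert_other]

def route(x, router_scores, k=1):
    # pick the k highest-scoring experts; only they are evaluated
    ranked = sorted(range(len(EXPERTS)),
                    key=lambda i: router_scores[i], reverse=True)
    active = ranked[:k]
    # combine the active experts, weighted by normalised router score
    total = sum(router_scores[i] for i in active)
    return sum(router_scores[i] / total * EXPERTS[i](x) for i in active)

# A "code-like" input whose router score favours expert_code:
# only 1 of 3 experts actually runs for this prediction.
out = route(5.0, router_scores=[0.1, 0.8, 0.1], k=1)
print(out)  # 105.0
```

In a real system the router is itself a learned network and the experts are large feed-forward blocks, but the accounting is the same: two dormant experts here means two-thirds of the parameters cost nothing at inference.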
The implication for capability economics is substantial. A trillion-parameter sparse model with 2% activation behaves, computationally, like a 20-billion-parameter dense model at inference time. Training costs scale with total parameters; inference costs scale with activated parameters. This asymmetry explains why frontier labs can deploy models whose nominal size would have been computationally prohibitive under dense assumptions. A sparse trillion-parameter model and a dense 20-billion-parameter model have similar inference costs, but not similar capabilities.
The future may not be smaller models. It may be vast models that run lean.
Sparse models create a coordination problem: different inputs route to different experts, which may live on different chips, different pods (clusters of TPU chips connected by high-speed links), even different buildings. Managing these communication patterns manually is impractical at scale. Pathways, Google’s internal orchestration system, absorbs this complexity. It abstracts away network topology so researchers see 10,000 devices as a single unified environment. Chip failures, load balancing, communication routing: all handled automatically. Without this layer, sparse architectures would have stayed in the lab.
VI. Distillation Economics
Sparse activation reduces inference cost by routing each query to a fraction of the model. Distillation takes a different approach: compress a large model’s knowledge into a smaller one. Both solve the same challenge; they’re often used together.
Distillation, developed with Geoffrey Hinton and Oriol Vinyals, addresses a fundamental problem: frontier models are too large to deploy efficiently, but smaller models trained conventionally can’t match their performance.
The insight lies in what large models output. When predicting a missing word, a frontier model doesn’t just guess “violin.” It outputs a probability distribution (more on these mechanics in this article on LLM errors): violin 40%, piano 25%, trumpet 15%, oboe 8%, and so on down to airplane at 0.001%. This distribution, what Dean calls a “soft target” (as opposed to the “hard” yes/no of correct versus incorrect), contains far more information than the binary signal of right or wrong.
“It’s very likely the word is violin or piano or trumpet, but it’s extremely unlikely it’s airplane. And that rich signal actually makes it much easier for the model to learn quickly.” A student model trained on soft targets from a teacher model learns not just which answers are correct but which wrong answers are less wrong than others. The gradient signal is denser, more informative.
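The difference in signal density can be shown with a small cross-entropy calculation. The probabilities below are made up for illustration (they echo the talk’s violin/piano example, not real model outputs): against a hard one-hot label, a student that guesses “piano” and a student that guesses junk can look identical, while the teacher’s soft target separates them sharply.

```python
import math

# Sketch of distillation's soft-target loss (made-up probabilities,
# not real model outputs). Classes: violin, piano, trumpet, oboe, rest.

def cross_entropy(target_probs, student_probs):
    return -sum(t * math.log(s)
                for t, s in zip(target_probs, student_probs) if t > 0)

teacher = [0.40, 0.25, 0.15, 0.08, 0.12]   # the teacher's "soft target"
hard    = [1.0, 0.0, 0.0, 0.0, 0.0]        # conventional one-hot label

# Two students: one favours "piano" (a *less wrong* answer),
# one puts its mass on junk.
student_near = [0.30, 0.35, 0.15, 0.10, 0.10]
student_far  = [0.30, 0.05, 0.05, 0.05, 0.55]

# Both assign 0.30 to the correct class, so the hard label cannot
# tell them apart...
print(cross_entropy(hard, student_near) == cross_entropy(hard, student_far))
# ...but the soft target penalises the junk student far more.
print(cross_entropy(teacher, student_near) < cross_entropy(teacher, student_far))
```

Every class contributes gradient under the soft target, not just the correct one, which is the mechanical sense in which the signal is “denser.”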
The practical result is striking: in speech recognition experiments, training on just 3% of the original data with distillation achieved 57% accuracy, compared to 44% accuracy using 3% of the data conventionally. Full dataset performance was 58.9%. Distillation compressed 97% of the training data’s value into the teacher’s soft targets.
This is why the frontier-to-production pipeline works at all. You train a massive model with all available compute, then distill it into something deployable. The smaller model inherits capabilities that would have taken far more data and compute to learn from scratch. Distillation economics determine whether your inference costs are viable. They explain why the model size reported in papers often has little relationship to the model size actually serving production traffic.
VII. The Capability Trajectory
Dean’s most striking slide isn’t a benchmark chart. It’s a progression. In 2022, chain-of-thought prompting improved accuracy on GSM8K, a middle school mathematics dataset with problems like “Sean has five toys and for Christmas he got two more. How many does he have now?” The breakthrough was getting models to show intermediate reasoning steps rather than jumping to final answers.
“Because the model gets to do more computation for every token it emits, in some sense it’s able to use more compute in order to arrive at the answer.” The observation is almost too simple: if you force the model to think out loud, it has more computational steps to work with. But the technique improved GSM8K accuracy from negligible to “we were really excited that we’ve now gotten 15% correct on eighth grade math problems.”
Three years later, a Gemini 2.5 variant solved five of six problems at the International Mathematical Olympiad, earning a gold medal. Problem 3, which the model solved correctly, requires multi-step geometric reasoning that would challenge most undergraduate mathematics students. The IMO president remarked on the “elegance” of the solution, which Dean displays: a dense proof running several pages, concluding with QED.
How do you get from 15% on middle school arithmetic to IMO gold in three years? Chain-of-thought helped, but the larger driver was reinforcement learning in verifiable domains. Dean distinguishes three flavours of post-training RL:
RLHF proper: Humans rate model outputs as good or bad, and the model learns to produce more of what humans prefer. This shapes style, tone, safety.
RL from machine feedback: Another model serves as judge, cheaper and faster than humans but still somewhat noisy.
RL in verifiable domains (where the acceleration happened): In mathematics, formal verification provides certainty: proof assistants like Lean or Coq can confirm whether a proof is logically valid, step by step. In coding, empirical verification provides signal: compilers confirm code runs, unit tests confirm it behaves as expected on known inputs. Both are instant, cheap, and unambiguous compared to human judgment. “The proof checker can say yes that’s a correct proof or no that’s an incorrect proof and in particular it’s wrong in step 73.” The model can attempt millions of proofs and learn from every failure.
This explains the uneven capability gains across domains. Mathematics and coding have accelerated precisely because reward signal is dense and automatic. The model explores the solution space, gets graded instantly, and improves. Domains where feedback is expensive or subjective (medicine, law, strategy) lack this advantage. The IMO result isn’t evidence that AI has achieved general reasoning. It’s evidence that AI learns fast when you can verify.
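The verification-as-reward loop is simple enough to sketch. Everything here is hypothetical (toy candidate functions standing in for model-generated code, not a real RL pipeline): unit tests grade each attempt instantly and unambiguously, producing the dense automatic reward the coding domain enjoys.

```python
# Sketch of verification-as-reward (hypothetical candidates standing in
# for model-generated code; not a real RL pipeline). Unit tests provide
# an instant, unambiguous, automatic reward signal.

def unit_tests(f):
    # the verifier: behaviour on known inputs, pass/fail, no human needed
    cases = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
    try:
        return all(f(*args) == expected for args, expected in cases)
    except Exception:
        return False

# Imagine these as three model attempts at "write add(a, b)":
candidates = [
    lambda a, b: a - b,   # wrong
    lambda a, b: a * b,   # wrong
    lambda a, b: a + b,   # correct
]

# Each attempt is graded automatically; a policy would update on this.
rewards = [1 if unit_tests(f) else 0 for f in candidates]
print(rewards)  # [0, 0, 1]
```

Nothing analogous exists for a medical diagnosis or a legal argument: there is no `unit_tests` you can run a million times, which is the boundary condition the next section turns on.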
VIII. The Open Question
The temporal compression is the point. The same architectural family that struggled with arithmetic word problems now produces competition-grade proofs. Dean doesn’t attribute this to any single innovation but to the compounding of innovations: transformers (10 to 100x compute efficiency over LSTMs), sparse activation (8x training reduction), distillation (3% of training data achieving near-full performance), and reinforcement learning in verifiable domains (code that compiles, proofs that verify).
Each layer amplifies the others. Distillation makes deployment efficient; efficient deployment enables more user feedback; user feedback improves RLHF signal; better RLHF produces models worth distilling. The loop has no obvious ceiling, which is precisely what makes capability forecasting unreliable. Linear extrapolation fails when the underlying process is multiplicative.
But this approach has a boundary condition. Mathematics and coding accelerated because you can verify the answer (more on this in our article on verification). Proofs check. Code compiles.
The verifiable-domain advantage cuts both ways. It explains both why capability has compounded so dramatically and why that compounding may slow as AI moves into messier territory. Medicine, law, strategy, ethics: these lack cheap, objective, immediate reward signals. You can’t run a million attempts and grade them automatically. The empiricism that built modern AI may not extend cleanly to the problems where getting it wrong matters most.
“Done well,” Dean concludes, “I think our AI-assisted future is bright.” The conditional carries weight. The 15-year stack he describes is an engineering triumph built by teams who repeatedly chose empirical success over theoretical elegance, hardware pragmatism over architectural purity, deployment pressure over research aesthetics.
Dean doesn’t dwell on this. His talk is a celebration, and deservedly so. The gap between what should work and what does work was wide enough to build an industry in. Whether it remains wide enough to solve the harder problems is the open question no amount of compute can answer by itself.
PS: Insightful Remarks
The open-source inflection point. Dean credits TensorFlow, PyTorch, and Jax with “enabling the whole community,” but his phrasing reveals the stakes. Torch, using Lua, “didn’t get very popular because most people don’t want to program or did not know Lua.” PyTorch succeeded by lowering the barrier. The lesson for infrastructure plays: developer ergonomics often matter more than technical superiority. Jax’s functional approach attracts researchers; PyTorch’s imperative style attracts practitioners. Both survive because they serve distinct modes of work. The framework wars aren’t about which abstraction is best but about which developer population you’re optimising for.
Reasoning in intermediate imagery. Gemini 3’s most unexpected capability isn’t linguistic. Dean demonstrates a physics puzzle: a ball rolling down a series of ramps, the model asked to predict which bucket it lands in. The chain-of-thought includes intermediate images: visualisations of the ball’s position after each ramp segment. “It actually reasons in intermediate imagery,” Dean notes. “That’s kind of how you would mentally do it.” The implication: language isn’t the universal substrate of model cognition. For spatial problems, the model thinks spatially. The architecture doesn’t impose a privileged representation format.
Article originally posted on WeLoveSota.com
The empiricism take reminds me of Stephen Wolfram's post (What Is ChatGPT Doing … and Why Does It Work?), with examples like the temperature parameter, the embedding architecture (adding a token's value and its position), or the splitting of the embedding vector in attention blocks, all part of a "lore" rather than a rigorous scientific theory! As long as it works, all is fine :))