On Systemic Errors: Probabilistic LLMs at Scale
Why unpredictability doesn’t prevent reliability in practice
A frequent critique of GenAI is that because it is probabilistic, it will inevitably make errors.
The argument is often framed as disqualifying: if mistakes are guaranteed, how can these systems be trusted at scale?
The objection has intuitive appeal, but it (i) rests on conceptual ambiguities and (ii) ignores how engineering disciplines have long dealt with uncertainty.
Part 1. Determinism and its Misuse
“Deterministic” is often used as shorthand for “error-free” or “perfectly predictable”. That is a misuse. In computer science, determinism refers to whether a system’s output is fully determined by its input. With this stricter definition, most systems we use daily are less deterministic than they appear.
1.1. Computer systems are never really deterministic
Even beyond GenAI, no computer system is truly deterministic in the real world. A cosmic ray can flip a memory bit. A CPU may fail. A disk may corrupt. This is why hardware uses error-correcting codes, why storage systems replicate data across disks, and why cloud providers design for redundancy. In short: even the most deterministic program runs on probabilistic foundations. Randomness is everywhere.
1.2. Nothing new in ML
Machine learning models are statistical. Accuracy is always less than 100 percent: finite data can never cover the full task space, and even with infinite data, point predictions on intrinsically stochastic processes will still fall short. This limitation has existed since the earliest days of statistical learning. No computer-vision model identifies all objects across the open world, and no fraud-detection model catches all fraud. In open-set conditions, unknown classes and shifting distributions imply irreducible error. Yet these models are in production at scale everywhere. Generative AI simply makes this limitation more visible.
1.3. The case of LLMs
To understand where unpredictability comes from in large language models, it helps to separate two stages:
(i) Forward pass.
Once trained, a model’s weights are fixed. For a given input, the model computes the same probability distribution over possible next tokens: in principle, this step is fully deterministic. (In practice, small run-to-run differences can appear due to floating-point non-associativity, parallel reduction order, kernel scheduling, mixed precision, or hardware and library choices; see the practical considerations below.)
Example. If the prompt is “The cat sat on my…”, the model might assign 70% probability to mat, 15% to sofa, and 10% to bed (with the remaining probability spread across other tokens). That distribution is fixed for that input.
(ii) Decoding.
Variability arises when selecting the next token from the distribution:
Greedy decoding (temperature = 0): always pick the highest-probability token. In the example, the output will always be mat. This makes the process deterministic for that input.
Sampling (temperature > 0): draw a token according to its probability, often combined with nucleus sampling (restrict the choice to the smallest set of tokens whose combined probability exceeds a threshold) or top-k sampling (restrict the choice to the k most likely tokens). In the example, the output will usually be mat but sometimes sofa or bed. This introduces non-determinism across runs; the sketch below illustrates both strategies.
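A minimal sketch, assuming a toy four-token distribution rather than a real model (the greedy_decode and sample_decode helpers are purely illustrative):

```python
# Minimal sketch (toy distribution, not a real model): greedy vs. sampled decoding.
import math
import random

random.seed(42)  # fixing the seed makes sampling reproducible across runs

tokens = ["mat", "sofa", "bed", "roof"]
probs = [0.70, 0.15, 0.10, 0.05]  # distribution produced by the forward pass

def greedy_decode():
    # temperature = 0: always pick the highest-probability token
    return tokens[probs.index(max(probs))]

def sample_decode(temperature=1.0, top_k=None):
    # temperature > 0: rescale the distribution, optionally keep only the k most likely tokens
    weights = [math.exp(math.log(p) / temperature) for p in probs]
    if top_k is not None:
        cutoff = sorted(weights, reverse=True)[top_k - 1]
        weights = [w if w >= cutoff else 0.0 for w in weights]
    return random.choices(tokens, weights=weights, k=1)[0]

print(greedy_decode())                                              # always "mat"
print([sample_decode(temperature=0.8, top_k=3) for _ in range(5)])  # usually "mat", sometimes "sofa" or "bed"
```

Nucleus (top-p) sampling would replace the top_k cutoff with a cumulative-probability threshold; the structure of the loop is otherwise the same.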
(iii) Practical considerations.
Even with greedy decoding, small nondeterminisms can creep in from GPU/TPU floating-point arithmetic and parallelisation. In production, ensuring strict bit-for-bit reproducibility requires careful engineering.
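The floating-point point is easy to see directly: addition is not associative, so summing the same values in a different order, as parallel reductions may, can change the last bits of the result.

```python
# Floating-point addition is not associative: the same three numbers summed in a
# different order (as a parallel reduction might do) give slightly different results.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                 # 0.6000000000000001
print(a + (b + c))                 # 0.6
print((a + b) + c == a + (b + c))  # False
```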
In short: the model’s distribution is deterministic, the decoding strategy may be deterministic or not, and the system-level implementation can add further variability. And crucially, determinism does not guarantee correctness (a model can be consistently wrong), while non-determinism does not imply unreliability (a stochastic process can still yield robust results).
Unpredictability vs. Unreliability
This leads to two forms of unpredictability in practice:
(i) Non-predictability: when sampling is used, the user cannot know in advance which token will be drawn from the distribution.
(ii) Non-reproducibility: repeated runs on the same input may not give the same output to the user, for the same reason.
But unpredictability should not be confused with unreliability. In classical programming, randomness is introduced in controlled ways: a random() function is unpredictable locally but reproducible if the starting value (the seed) is fixed, and its distribution is well characterized (how close pseudorandom generators can come to true randomness is another topic).
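The point is easy to demonstrate with the standard library: each value is unpredictable in isolation, yet fixing the seed makes the whole sequence exactly reproducible.

```python
import random

random.seed(1234)
first_run = [random.random() for _ in range(3)]

random.seed(1234)
second_run = [random.random() for _ in range(3)]

assert first_run == second_run  # same seed, same sequence: unpredictable yet reproducible
```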
LLMs apply the same principle to high-dimensional, learned distributions: instead of sampling from a simple uniform distribution, they sample from complex probability distributions over language tokens. Their unpredictability is a natural consequence of this process. It reflects the fact that language admits many plausible continuations, and that we often prefer models that can vary their outputs instead of repeating a single ‘most probable’ answer.
Part 2. Systemic Errors at Scale
Hallucinations and Plausibility
A common critique is that hallucinations in generative AI are more pernicious than ordinary errors. The concern is twofold: first, the mistakes are often hard to spot because the outputs look fluent and plausible; second, their failure modes differ from those of humans. A model may fabricate a citation or assert a detail with great confidence, which makes detection harder for non-experts.
This challenge is not unique to AI. In other domains, the most dangerous faults are those that appear normal until tested: subtle cracks in aircraft materials, silent sensor degradation, hidden leaks in pipelines. The response in those domains is not abandonment but verification: systematic checks designed to surface invisible faults.
For generative AI, these include retrieval-based grounding, structured outputs, and external validation checks. As with other forms of error, once hallucinations are identified, they can become training signals that reduce the likelihood of recurrence.
Errors in Chains and the Compounding Intuition
Another critique targets autoregressive generation itself, with agentic systems as an amplified case. If each step carries even a small probability of error, then over many steps errors accumulate, making long-horizon failure almost certain.
This concern is not fringe: leading figures such as Yann LeCun have argued that autoregressive LLMs are “doomed” for long-horizon tasks because errors inevitably accumulate.
But in reality, both the math and the practice are subtler. Suppose each step has a 0.01 percent failure rate; the compounding intuition suggests that 1,000 steps make failure all but certain. In fact, the probability of at least one failure is only about 9.5%, since the probability of no failure at all is (1 − 0.0001)^1000 ≈ 90.5%.
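The arithmetic is easy to check; the short sketch below computes the probability of at least one failure over n independent steps (independence itself being a simplifying assumption, as the next paragraph argues).

```python
# Probability of at least one failure over n independent steps.
def p_any_failure(per_step_rate: float, n_steps: int) -> float:
    return 1.0 - (1.0 - per_step_rate) ** n_steps

print(f"{p_any_failure(0.0001, 1000):.3f}")  # ~0.095: far from near-certain failure
print(f"{p_any_failure(0.01, 1000):.5f}")    # ~0.99996: the per-step rate is what matters
```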
Errors do not accumulate linearly in practice. Outputs are not brittle tokens stacked like dominoes. Small divergences often cancel or correct themselves. A cyclist never follows the exact same path to work, yet arrives reliably every day.
Measurement and Monitoring
Measurement happens at several levels. Foundation model providers focus on broad benchmarks and adversarial testing. Product builders, in contrast, maintain golden sets, regression tests, schema and range checks, and monitoring dashboards tied to business KPIs. Human-in-the-loop review remains common for high-stakes tasks. More advanced statistical tools, such as calibration methods that align predicted probabilities with observed frequencies, or conformal prediction that provides coverage guarantees, are well-developed in research and are only beginning to appear in production pipelines. In practice, most deployed systems rely more on golden sets and monitoring dashboards than on formal calibration. Together, these layers allow practitioners to detect when a model that once seemed competent for a task, say, computing fiscal claims, begins to degrade or drift.
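As a deliberately simplified illustration of the product-builder layer, the sketch below assumes a hypothetical golden set and a model_fn callable wrapping the deployed model, and flags drift when accuracy on that set falls below an agreed tolerance.

```python
# Minimal sketch, assuming a hypothetical golden set and a model_fn callable that
# wraps the deployed model. Real pipelines add schema checks, dashboards, and alerts.
GOLDEN_SET = [
    {"input": "What VAT rate applies to invoice 123?", "expected": "20%"},
    {"input": "When does ACME Corp's fiscal year end?", "expected": "2024-12-31"},
]
ACCURACY_THRESHOLD = 0.95  # tolerance agreed with business stakeholders

def golden_set_accuracy(model_fn) -> float:
    hits = sum(model_fn(case["input"]).strip() == case["expected"] for case in GOLDEN_SET)
    return hits / len(GOLDEN_SET)

def check_for_drift(model_fn) -> None:
    accuracy = golden_set_accuracy(model_fn)
    if accuracy < ACCURACY_THRESHOLD:
        # in production this would raise an alert or open an incident ticket
        print(f"ALERT: golden-set accuracy dropped to {accuracy:.2f}")
```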
Mitigation in Practice
In practice, mitigation in GenAI does not match the robust safety engineering of aviation or nuclear systems. What are often called safeguards or guardrails are, for the most part, lightweight filters and processes rather than deep fault-tolerance mechanisms. Mitigation unfolds across several complementary layers:
Guardrails (alignment rules): these are content filters and policies that restrict outputs in predefined ways: refusing to advise on suicide, avoiding certain political claims, or blocking unsafe instructions. They are simple rules applied on top of generation, not fault-tolerance mechanisms. Examples include OpenAI’s moderation layer, which classifies outputs into risk categories before release; Anthropic’s Constitutional AI, which encodes explicit behavioral principles such as “do not provide harmful instructions”; or Microsoft’s Azure OpenAI Service filters, which block unsafe content at the platform level. Like seatbelts in cars, guardrails don’t prevent accidents, but they limit damage within predefined constraints.
Agentic behaviors (intrinsic controls): LLMs can exhibit behaviors such as self-assessment, backtracking, or requesting clarification. These are not external safeguards, but properties of the model when prompted or configured. They help reduce error propagation inside the decision loop. For instance, chain-of-thought prompting with self-consistency (generating multiple reasoning paths and selecting the majority), AutoGPT-style agents that replan when intermediate steps fail, or retrieval-augmented systems that call a database or calculator instead of guessing. LLM-as-judge also fits here: one model scoring or verifying another’s outputs to reduce errors within the agentic loop (it increases reliability but does not guarantee correctness, since the judge can share the same biases as the base model).
Like autopilot corrections in aviation, these behaviors keep the system on course when it drifts, but they don’t replace external oversight.
Operational processes (human resolution): at scale, true mitigation remains organisational. Errors and edge cases are handled through user claims, moderation and trust & safety teams, and legal escalation. As in other high-reliability domains (automobile, health, and so on), safety improves not by eliminating error but by ensuring that when errors occur, they are surfaced and acted upon quickly. Examples include trust & safety review teams at OpenAI or Anthropic who triage flagged incidents, structured red-teaming programs at Google and Meta, and compliance review processes in enterprises deploying AI for finance or healthcare. Like accident investigation boards in transport or medicine, these processes ensure failures are detected, documented, and corrected so they don’t repeat.
Defining tolerances: when monitoring outputs, distinguish acceptable variability from unacceptable errors. Lexical variation (‘Einstein received the 1921 Nobel Prize…for his explanation’ vs ‘…for his work on the explanation…’) is fine; factual distortion (‘Einstein received the 1912 Nobel Prize…’) is not.
Together, guardrails, agentic behaviors, operational processes, and defined tolerances do not remove systemic error, but they limit its impact and channel it into workflows where it can be addressed. The reality of mitigation in generative AI is less about flawless engineering and more about layered defence, organisational response, and continuous adaptation.
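To make the layered idea concrete, here is a toy sketch (not any vendor’s actual guardrail stack; the BLOCKED_PATTERNS rule is purely hypothetical) combining a lightweight content filter with a tolerance check on a factual field, in the spirit of the guardrail and tolerance layers above.

```python
# Toy illustration only: a lightweight content filter plus a tolerance rule on a
# structured factual field, mirroring the guardrail and tolerance layers above.
import re

BLOCKED_PATTERNS = [r"\bhow to build a weapon\b"]  # hypothetical policy rule

def passes_guardrail(text: str) -> bool:
    # Layer 1: simple content filter applied on top of generation
    return not any(re.search(p, text, flags=re.IGNORECASE) for p in BLOCKED_PATTERNS)

def within_tolerance(expected_year: int, generated_year: int) -> bool:
    # Layer 2: factual fields must match exactly, even if phrasing around them varies
    return expected_year == generated_year

output = "Einstein received the 1921 Nobel Prize for his explanation of the photoelectric effect."
print(passes_guardrail(output))       # True: no policy violation detected
print(within_tolerance(1921, 1921))   # True: acceptable lexical variability
print(within_tolerance(1921, 1912))   # False: factual distortion, block or escalate
```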
Conclusion
Probabilistic error is not a fatal flaw of generative AI. It is a structural property of machine learning, analogous to the hardware faults or random failures that exist everywhere in the real world and are already managed in other domains.
The key is to design with it: measure uncertainty, build safety nets, integrate feedback, and accept that in practice, “reliable enough” is the operative standard.
The critique of systemic errors at scale is therefore misdirected. The question is not whether errors exist, but whether they can be bounded, corrected, and absorbed into workflows that deliver value despite them. History suggests the answer is yes.
Grateful to Louis Abraham (check out Conception), Duong Nguyen (Ekimetrics) and my co-founder Kevin Kuipers, whose sharp observations and feedback shaped this piece.