This week, I would like to come back to an observation that some of you may have made over the last few months when opening your LinkedIn feed. We read about a founder, a consultant, or a self-styled AI evangelist showcasing a spectacular result obtained in minutes (a market analysis, a legal summary, a strategic brief), presented as proof that the technology has crossed some threshold of reliability. Some of these posts are, charitably put, embellished well beyond what actually happened. But many are genuine. The output really was that good. The prose was fluent, the structure sound, the conclusions plausible, and the author, impressed, often goes on to deploy the same approach at scale within their own organisation. Try it yourself, though, and the same system produces something embarrassingly wrong, and nobody can quite explain why the first output was so good.
The answer, almost always, is that it was “lucky”.
This is not a metaphor. Large language models are, at their most fundamental level, probability engines. At each step of generation, a model samples the next token from a probability distribution over its entire vocabulary. The output is not retrieved: it is drawn. Run the same prompt twice under typical sampling settings, and you will generally get two outputs that are similar in structure but differ in ways that can range from inconsequential to material. This is what physicists would call stochasticity, and what business leaders ought to call a risk they are almost systematically underpricing.
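For readers who want to see the mechanism rather than take it on trust, here is a minimal sketch of what "drawing" the next token means. It is a toy illustration, not how any particular vendor's model is implemented: the three-word vocabulary, the raw scores (`logits`) and the `temperature` setting are all assumptions made up for the example.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(vocab, logits, temperature=1.0, rng=random):
    """Draw one token at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical vocabulary and scores, invented for illustration only.
vocab = ["grow", "shrink", "stagnate"]
logits = [2.0, 1.0, 0.5]

rng = random.Random(0)  # seeded so the run is repeatable
draws = [sample_next_token(vocab, logits, rng=rng) for _ in range(10)]
print(draws)  # the likeliest word dominates, but the others can still be drawn
```

The same prompt, the same scores, and yet the word that comes out can differ from one draw to the next. A real model repeats this draw hundreds or thousands of times per answer, which is why small differences compound.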
The analogy I find most accurate is ballistic, and it draws on a memory of my own. During my military service, I spent a fair amount of time on the shotgun range. I found it quite satisfying: fire at the target, and almost every time, you’d hit something, sometimes quite hard. What I could never do, however hard I tried to calibrate my stance or my aim, was predict exactly where on the target the shot would land. The spread always found the target area. It rarely found the same point twice.
AI, as currently architected, behaves the same way. It fires a broad spread of plausible output, and depending on the calibration of your prompt and the sheer number of tokens generated, some pellets will find the target. Occasionally, one will hit the bullseye. The temptation, when that happens, is to conclude that you are holding a sniper rifle.
The consequences of this confusion are non-trivial. Most enterprise AI deployments are evaluated on their best outputs, not their average ones. A demo, by construction, tends to showcase the lucky draft. What rarely makes it into the business case is the distribution of outputs around that exceptional result (the standard deviation, in statistical terms) or the cost of collapsing that distribution to something operationally acceptable.
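To make the gap between the demo and the distribution concrete, here is a small sketch that summarises a batch of outputs the way a business case should. The quality scores are simulated, not measured: the 0 to 100 scale, the average of 70, the spread of 15 and the acceptance bar of 60 are all assumptions chosen purely for illustration.

```python
import random
import statistics

def summarize(scores, bar):
    """Compare the best draft with the typical one and with the spread."""
    return {
        "best": max(scores),                 # what the demo shows
        "mean": statistics.mean(scores),     # what operations will see on average
        "stdev": statistics.stdev(scores),   # how far individual drafts stray
        "share_below_bar": sum(s < bar for s in scores) / len(scores),
    }

# Simulated scores for 50 runs of the same prompt (hypothetical numbers).
rng = random.Random(42)
scores = [min(100.0, max(0.0, rng.gauss(70, 15))) for _ in range(50)]

report = summarize(scores, bar=60)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

The figure that belongs in the business case is not `best` but `share_below_bar`: the fraction of drafts that someone, or something, will have to catch.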
Because closing the gap between “sometimes excellent” and “reliably good” is expensive, in ways that compound.
The first cost is prompt engineering. Getting an LLM to produce a consistently precise output requires iterative calibration of the input: the precise wording of the instruction, the format of the context, the persona assigned to the model, the constraints imposed on the output. This is skilled work. It takes time. It degrades when the model is updated. And it solves the problem only partially, because even a perfectly engineered prompt does not eliminate the underlying stochasticity: it merely reshapes the distribution.
The second cost is evaluation infrastructure. If you cannot tell, without reading every output, which draft is the lucky one, you need a system to do it for you. Building an evaluation pipeline (a set of automated checks, scoring rubrics, or even a second model tasked with critiquing the first) represents an engineering investment that most ROI projections for AI adoption do not include. A 2025 MIT report found that the vast majority of enterprise AI pilots failed to scale, and one of the recurring diagnoses was precisely this: the cost of the last mile of reliability.
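As a rough idea of what the simplest layer of such a pipeline looks like, here is a sketch of automated checks applied to a batch of drafts. Everything in it is hypothetical: the required section names, the length limits and the placeholder markers are invented for the example, and a real pipeline would add scoring rubrics or a second model on top.

```python
from typing import Callable, Dict, List

Check = Callable[[str], bool]

def has_required_sections(draft: str) -> bool:
    """The brief must contain both section labels (hypothetical requirement)."""
    return all(label in draft for label in ("Summary", "Risks"))

def within_length(draft: str) -> bool:
    """Reject drafts that are suspiciously short or long (assumed limits)."""
    return 5 <= len(draft.split()) <= 500

def no_placeholders(draft: str) -> bool:
    """Reject drafts that still contain unfinished markers."""
    return "TODO" not in draft and "[citation needed]" not in draft

CHECKS: Dict[str, Check] = {
    "has_required_sections": has_required_sections,
    "within_length": within_length,
    "no_placeholders": no_placeholders,
}

def evaluate(draft: str) -> Dict[str, bool]:
    """Run every check and report which ones passed."""
    return {name: check(draft) for name, check in CHECKS.items()}

def passing_drafts(drafts: List[str]) -> List[str]:
    """Keep only the drafts that pass all checks."""
    return [d for d in drafts if all(evaluate(d).values())]

drafts = [
    "Summary: revenue grew in the core segment. Risks: customer concentration remains high.",
    "Summary: TODO fill in later.",
]
print(passing_drafts(drafts))  # only the first draft survives
```

Note what these checks cannot do: they confirm that a draft has the right shape, not that its content is true. That residual gap is what the next cost has to absorb.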
The third cost is the one nobody talks about in pitch decks: human review. In any domain where the cost of error is high (legal drafting, financial analysis, medical triage, regulatory filings), the output of an AI system cannot be treated as final. It must be read, checked, and validated by a human who carries the liability for its content. The AI has saved time on the first draft. The human has absorbed the audit burden on the final one. In many cases, for an expert who knows the domain, it is faster to produce the output from scratch than to verify that a probabilistic system has not introduced a plausible-sounding error. This is not a failure of implementation. It is a structural property of the technology.
None of this means the shotgun has no place in the armoury. For ideation, exploration, and first-draft acceleration in low-stakes contexts, it is precisely the right tool. When I use AI to pressure-test an argument, to generate five alternative framings of a business problem, or to produce a first cut of a document that I will then rewrite substantially, I am using the tool in its natural habitat. The pellets are meant to scatter. The point is not to hit the bullseye; it is to illuminate the target area.
The strategic error, and the one I observe most frequently in mid-market and PE-backed environments, is mismatching tool to use case. The executive who deploys a shotgun in a context that demands a sniper rifle is not making a technology error. He is making a judgment error about the nature of precision, and about who absorbs its cost when it is absent.
A useful discipline, before any AI deployment decision, is to ask two questions. First: what is the cost of an error in this use case? Second: who will catch it if one occurs? If the answer to the first question is high, and the answer to the second is nobody in particular, the economics of reliable AI are almost certainly not what the vendor’s slide deck suggests. You are pricing the bullseye and billing the spray.
The probabilistic nature of large language models is not a transitional imperfection. It is an architectural constant. The smart deployment question is therefore not “does AI work?” It is “what does work mean here, and am I prepared to pay for the precision gap between the lucky draft and the reliable one?”
Most organisations, right now, are not.
