Tue. Apr 21st, 2026

Two years ago, the worry about generative AI in the enterprise was that it might say something untrue. Today, the worry is that it might do something unwanted. Tomorrow’s worry, if early signals are to be believed, is that it might want something of its own. These are not the same problem, and they do not yield to the same safeguards.

Public debate on corporate AI risk suffers from a categorisation problem. The word “hallucination” is stretched to cover phenomena as disparate as a lawyer citing imaginary case law, a coding agent destroying a production database in defiance of explicit instructions, or a model attempting to evade its own shutdown through blackmail. Yet these three failure modes stem from distinct mechanisms and call for different safeguards. It is more useful to think of them as a continuum: a progression from output error to behavioural drift, where each tier widens the risk surface for the enterprise.

The examples of the first tier are everywhere. Deloitte advised the Australian government on the basis of academic references fabricated by AI; a New York lawyer cited before a federal judge judgments that ChatGPT had invented; a chatbot run by the City of New York told local merchants they had rights they did not in fact possess; Air Canada’s chatbot granted a passenger a discount that never existed. In this last case, the decision in Moffatt v. Air Canada, 2024 BCCRT 149 established a clear precedent: the company remains liable for the representations made by its chatbot, even when the chatbot has “hallucinated” the information.

The issue has been known since ChatGPT’s public launch more than three years ago. Although models have improved considerably in reliability, the risk is inherent to their very construction: their job is to predict the next token on the basis of context and prior tokens. It is a probabilistic game, often correct, but one whose perfect reliability cannot be guaranteed, particularly when a request falls outside the distributions covered by the training data.
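The mechanism described above can be made concrete with a minimal sketch of next-token sampling. The token names and scores below are hypothetical, invented purely for illustration; the point is that the model samples from a probability distribution, so low-probability wrong tokens can never be ruled out entirely.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution over tokens."""
    m = max(logits.values())
    exps = {tok: math.exp(s - m) for tok, s in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def sample_next_token(logits, temperature=1.0, rng=random):
    """Sample one token; higher temperature flattens the distribution."""
    probs = softmax({tok: s / temperature for tok, s in logits.items()})
    r = rng.random()
    cumulative = 0.0
    for tok, p in probs.items():
        cumulative += p
        if r < cumulative:
            return tok
    return tok  # guard against floating-point rounding at the tail

# Hypothetical scores for the word following "The court ruled in ...":
logits = {"favour": 2.0, "2019": 0.5, "Smith": 0.1}
probs = softmax(logits)
# "favour" is the most likely continuation, but the tail probabilities
# are never exactly zero: a plausible-looking wrong token can always
# be emitted, which is precisely what a hallucination is.
```

Every production decoder adds refinements (top-k, nucleus sampling, repetition penalties), but none of them changes the structural fact the article relies on: the output is drawn from a distribution, not looked up in a source of truth.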

In some fields, notably the creative industries, “close enough” is adequate, even undetectable. In domains where the cost of inaccuracy is very high, by contrast (law, engineering, medicine, finance), this “gap to 100%” generates a disproportionate verification burden. Since, by construction, models cannot precisely locate their own hallucinations, the entire sensitive output has to be audited by a human. This verification cost erodes ROI and goes a long way toward explaining why, as MIT research has pointed out, the vast majority of enterprise AI pilots fail.

The usual safeguards, such as ex post verification prompts or validation chains, can only reduce the occurrence of hallucinations, not eliminate them. Constraining or cleaning the input dataset is no miracle cure either: a Stanford study found that Lexis+ AI, despite being grounded in the LexisNexis legal corpus, hallucinated in roughly one case out of five.

The rise of autonomous agents, and vibe coding in particular, shifts the problem up a tier. As long as AI produces text, error remains contained within the output. As soon as it takes actions (writes to a database, places an order, sends a message), each hallucination can translate into material damage. The SaaStr / Replit case is emblematic: the coding agent deleted a production database during an active code freeze, against explicit instructions, then fabricated fake test data before claiming that rollback was impossible, a claim that Jason Lemkin, SaaStr’s founder, refuted by manually recovering the data. The damage was ultimately repaired, but the incident illustrates three compounded failures: unauthorised execution, fabrication of fictitious outputs, and deceptive communication about the possibility of recovery. Each of these errors, taken on its own, would be a classic hallucination. Their concatenation within an agentic framework produces a risk of an entirely different order of magnitude.

At the top of the continuum lie the behaviours labelled, not without debate, as misalignment. Anthropic’s red-teaming teams have documented scenarios in which a model, faced with the prospect of being replaced, resorts to blackmail in order to preserve its own existence. This is no longer an output error, nor an unfortunate action: it is an emergent strategic behaviour, unprogrammed, directed at the preservation of the system itself. The actual prevalence of such behaviours in production remains very low, and there is no point fuelling misplaced alarmism. But this third tier is important to identify, because it cannot be managed through the same safeguards as the first two: prompt precision is not sufficient, and ex post human verification arrives after the action has already been attempted.

From this continuum, I draw several operational principles. First, no sensitive process requiring an accurate answer can be autonomously entrusted to AI without human verification. Regardless of the prompt or the quality of the data flow, a model will, with near statistical certainty, eventually fabricate a result over enough queries: an algorithm that is 99.99% reliable per query has roughly a one-in-two chance of producing at least one inaccurate answer across 7,000 attempts.
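The one-in-two figure follows from elementary probability, assuming each query fails independently with probability 0.0001:

```python
def prob_at_least_one_error(reliability, n_queries):
    """P(at least one error in n independent queries) = 1 - p^n,
    where p is the per-query probability of a correct answer."""
    return 1 - reliability ** n_queries

# 99.99% per-query reliability over 7,000 queries:
print(round(prob_at_least_one_error(0.9999, 7000), 3))  # ≈ 0.503
```

The independence assumption is a simplification (real error rates vary with query difficulty), but the order of magnitude holds: at enterprise volumes, rare per-query errors become near-certain events.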

Second, AI should be deployed primarily on early-stage work (ideation, first drafts, exploration of possibilities) and on creative activities that do not answer to an objective truth. Personally, I use AI to generate ideas, to argue for or against my intuitions, or to sketch out initial avenues of analysis on a complex problem. I never use it as a financial analysis tool, however, on the basis that it is faster for me to reach the result myself than to audit the machine’s chain of reasoning. Between these two extremes, legal advice generated by AI can be listened to, but must be thoroughly verified, including by a human expert on the most demanding subjects.

Third, AI must never serve as a pretext for diluting accountability. Even if nearly half of American employees use AI at work, according to a Gallup poll, every written output and every decision still carries human responsibility. The Moffatt ruling said so unambiguously: the company answers for the acts of its tools.

Finally, any dataset entrusted to an AI agent must first be backed up, ideally on storage that is physically segregated from the rest of the network, since a sufficiently capable agent could reach an online copy.
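As a minimal sketch of that backup discipline, the hypothetical helper below snapshots a dataset to a timestamped directory and marks the copy read-only before an agent is given access. A local read-only copy is only a first line of defence; the recommendation of physically segregated storage still stands, since a sufficiently capable agent could reach anything on the same network.

```python
import os
import shutil
import stat
from datetime import datetime, timezone

def snapshot_dataset(src_dir, backup_root):
    """Copy src_dir to a timestamped directory under backup_root and
    make every copied file read-only. Illustrative sketch only: a real
    deployment would push the snapshot to offline or immutable storage."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = os.path.join(backup_root, f"snapshot-{stamp}")
    shutil.copytree(src_dir, dest)
    for root, _dirs, files in os.walk(dest):
        for name in files:
            # Strip all write bits so the copy cannot be silently altered.
            os.chmod(os.path.join(root, name),
                     stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return dest
```

Running the snapshot as a separate step, before the agent session starts and with credentials the agent does not hold, is what makes the recovery path in incidents like the Replit case possible at all.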

The probabilistic nature of LLMs is not a transient bug that the next generations of models will fix. It is a structural property, a direct consequence of their architecture. The strategic implication for the executive is therefore not to wait for reliability to arrive, but to organise work around this property: reserve AI for divergent work (hypothesis generation, exploration, drafts), keep humans on convergent work (analysis, decision, final validation), and map clearly where each process in the enterprise sits on the risk continuum. Human in the loop is not a precautionary constraint: it is a design choice on which the actual creation of value depends.
