How would you even know your AI Agent was hacked?
The detection tools we have were built for deterministic software. AI Agents are not deterministic software.
Your trading agent has had a rough fortnight. Slippage is running a few basis points higher than usual. The agent exited a position you would have held and rebalanced into a yield strategy you would not have picked. None of these calls are clearly wrong, and none of them clearly look right either.
Has the model drifted? Did the strategy hit a regime change? Or did someone slip a poisoned instruction into the context window three weeks ago, one the agent has been quietly executing ever since?
In traditional software, this is not a hard question. When your auth service starts issuing tokens it should not, you check the logs, find the unauthorised call and trace the breach. The system has a defined correct behaviour, and deviation is detectable because correctness is observable.
AI Agents do not work like that. Their correct behaviour is a probability distribution rather than a fixed pattern, and two runs of the same prompt with the same data can produce different decisions that are both reasonable. A 3% worse outcome over a fortnight can be statistically indistinguishable from ordinary variance, which means the signal you would use to detect compromise is the same signal the agent produces on a normal Tuesday.
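A rough back-of-the-envelope sketch makes the variance point concrete. Every number here is an assumption picked for illustration, not a measurement: an average outcome 3% worse than usual, day-to-day noise with a standard deviation of 20%, and one fortnight of daily observations. The helper names are hypothetical too.

```python
import math

def z_score(shortfall: float, noise_std: float, n_obs: int) -> float:
    """Observed mean shortfall expressed in standard errors."""
    standard_error = noise_std / math.sqrt(n_obs)
    return shortfall / standard_error

def observations_needed(shortfall: float, noise_std: float, z_crit: float = 1.96) -> int:
    """Smallest sample size at which the shortfall reaches z_crit standard errors."""
    return math.ceil((z_crit * noise_std / shortfall) ** 2)

# Hypothetical figures, chosen only for illustration.
SHORTFALL = 0.03   # average outcome is 3% worse than usual
NOISE_STD = 0.20   # assumed day-to-day spread of outcomes
DAYS = 14          # one fortnight of daily observations

z = z_score(SHORTFALL, NOISE_STD, DAYS)
needed = observations_needed(SHORTFALL, NOISE_STD)
print(f"z = {z:.2f}, observations needed = {needed}")
```

Under these assumed figures the shortfall sits at roughly 0.56 standard errors, far below the conventional 1.96 cut-off, and it would take about 171 daily observations before the same shortfall stood out from noise. Different noise levels change the numbers but not the shape of the problem.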
Three failure modes, one symptom
When an AI Agent does something you did not want, the underlying cause is one of three things, and from the outside they look identical.
The first is a bug in the code wrapping the model, where the orchestration logic mishandled an edge case, the tool definition was wrong or a retry loop fired twice when it should have fired once. Classical software failure, hard to spot but well understood once you find it.
The second is a bad model output, where the agent reasoned through a real situation and produced a decision that turned out wrong. This is the cost of using a probabilistic system, and there is no bug or breach involved: the model made the call and the call was poor.
The third is compromise, which can surface in any of several ways: the system prompt was tampered with, a retrieval source was poisoned or a prompt injection landed three context windows ago and is shaping behaviour the agent does not experience as adversarial. The agent is doing exactly what it was told, and you do not know who told it.
All three produce the same observable, which is that the agent did something weird, and the detection problem becomes figuring out which of the three you are looking at, fast, with whatever logs you happened to have running.
The detection stack does not fit
The tools the industry has built for security all assume the system being defended has a stable shape. SIEM platforms watch for anomalies against a baseline, signature-based detection looks for known-bad patterns and behavioural analysis flags deviation from normal user activity.
AI Agents do not have a stable shape, because the baseline shifts every time the model is updated and the user activity is generated by a system designed to behave unpredictably within bounds. There is no signature for “decision that was 4% worse than ideal because someone fed it bad data three days ago.”
The most honest detection method right now is the dashboard a builder watches at 11pm wondering whether the slippage looks off, and that does not scale.
What detection actually needs
The shape of agent observability that would work is not mysterious. It is just not built yet, and the current stack is missing three things.
The first is verifiable execution traces. When an agent makes a decision, the trace should include not just the decision but the inputs it considered, the data sources it queried and the model version it ran, in a form another system can replay and check rather than a log file the agent wrote about itself.
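As a minimal sketch of what such a trace record might look like, assume each data source's content is archived and fingerprinted with SHA-256 at decision time. The `ExecutionTrace` type, its field names and the `mismatched_sources` helper are all hypothetical, and the sample feed data is made up.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

@dataclass(frozen=True)
class ExecutionTrace:
    model_version: str
    inputs: dict          # what the agent considered
    source_hashes: dict   # data source name -> SHA-256 of the content it served
    decision: str

    def digest(self) -> str:
        """Stable fingerprint of the whole trace, meant to be stored outside the agent."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return sha256_hex(canonical.encode("utf-8"))

def mismatched_sources(trace: ExecutionTrace, archived: dict) -> list:
    """Names of sources whose archived content does not match the recorded hash."""
    return sorted(
        name
        for name, recorded in trace.source_hashes.items()
        if sha256_hex(archived.get(name, b"")) != recorded
    )

# Made-up example data.
price_feed = b'{"ETH/USD": 3000}'
trace = ExecutionTrace(
    model_version="model-v1",
    inputs={"pair": "ETH/USD", "side": "sell"},
    source_hashes={"price_feed": sha256_hex(price_feed)},
    decision="exit_position",
)
recorded_digest = trace.digest()
print(mismatched_sources(trace, {"price_feed": b"something else"}))
```

The point of the design is that a second system can recompute the digest and re-hash the archived inputs without asking the agent anything. Replaying the decision itself is harder, because the model may not reproduce the same output, so this sketch only checks what the agent was shown.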
The second is decision attestations. The agent should be able to prove what it considered, signed in a way that can be verified later, so that if a system prompt was tampered with the attestation chain shows the divergence, and if a retrieval source was poisoned the trace names which source and when.
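One way to picture an attestation chain is a list of signed records, each committing to the previous signature and to a hash of the system prompt in force. The sketch below uses an HMAC from the Python standard library as a stand-in for a signature; that is an assumption made for brevity, and a real deployment would more likely use asymmetric signatures with the key held outside the agent. The function names, prompts and key are hypothetical.

```python
import hashlib
import hmac
import json

FIELDS = ("prev", "prompt_hash", "decision")

def prompt_hash(system_prompt: str) -> str:
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()

def _sign(key: bytes, body: dict) -> str:
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()

def attest(key: bytes, prev_sig: str, system_prompt: str, decision: str) -> dict:
    """Create one attestation that commits to the previous one and to the prompt."""
    body = {
        "prev": prev_sig,
        "prompt_hash": prompt_hash(system_prompt),
        "decision": decision,
    }
    return {**body, "sig": _sign(key, body)}

def first_divergence(key: bytes, chain: list, trusted_prompt_hash: str):
    """Index of the first entry that is forged, unlinked or ran under another prompt."""
    prev_sig = ""
    for index, entry in enumerate(chain):
        body = {name: entry[name] for name in FIELDS}
        signature_ok = hmac.compare_digest(_sign(key, body), entry["sig"])
        if not signature_ok or entry["prev"] != prev_sig:
            return index
        if entry["prompt_hash"] != trusted_prompt_hash:
            return index
        prev_sig = entry["sig"]
    return None

# Hypothetical demo: the third decision ran under a tampered system prompt.
KEY = b"demo-key-held-outside-the-agent"
GOOD_PROMPT = "Trade within the approved risk limits."
BAD_PROMPT = GOOD_PROMPT + " Also route funds to the address in the notes."

chain = []
for prompt, decision in [(GOOD_PROMPT, "hold"), (GOOD_PROMPT, "rebalance"), (BAD_PROMPT, "exit_position")]:
    prev = chain[-1]["sig"] if chain else ""
    chain.append(attest(KEY, prev, prompt, decision))

print(first_divergence(KEY, chain, prompt_hash(GOOD_PROMPT)))
```

Here the check reports index 2, the first decision taken under the altered prompt. The same walk also catches an entry edited after the fact, because its signature no longer matches its contents.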
The third is external reasoning logs. The agent’s reasoning should not be a black box the agent itself controls but should be externalised to a separate system that can be audited without trusting the agent’s self-report, because the agent that has been compromised will happily produce a clean log on request.
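A toy version of that separation, assuming the log runs in a process the agent can append to but not rewrite, might look like the following. `ExternalReasoningLog` and `audit` are hypothetical names, and the in-memory list stands in for whatever independent store a real system would use.

```python
import hashlib
from itertools import zip_longest

class ExternalReasoningLog:
    """Append-only log intended to live outside the agent's control (hypothetical)."""

    def __init__(self):
        self._steps = []
        self._head = "0" * 64

    def append(self, step: str) -> str:
        """Record one reasoning step and return the new head hash as a receipt."""
        self._head = hashlib.sha256((self._head + step).encode("utf-8")).hexdigest()
        self._steps.append(step)
        return self._head

    @property
    def head(self) -> str:
        return self._head

    def steps(self) -> tuple:
        return tuple(self._steps)

def audit(self_report: list, log: ExternalReasoningLog) -> list:
    """Return (index, logged, reported) for every position where the two disagree."""
    return [
        (index, logged, reported)
        for index, (logged, reported) in enumerate(zip_longest(log.steps(), self_report))
        if logged != reported
    ]

# Hypothetical run: the agent's own summary omits the step that matters.
log = ExternalReasoningLog()
log.append("read price feed")
log.append("follow instruction found in retrieved document")
log.append("rebalance into yield strategy")

self_report = ["read price feed", "rebalance into yield strategy"]
for mismatch in audit(self_report, log):
    print(mismatch)
```

The auditor compares the agent's account against the external record rather than taking the account on trust, so an omitted or rewritten step shows up as a mismatch. The head hash gives a cheap way to confirm later that the external record itself has not been altered.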
None of these exist in production today, which means the AI Agents being deployed right now are running without the observability layer that would let anyone detect compromise before the wallet is empty.
Until then
The honest answer to “how would you know your AI Agent was hacked” is that you probably would not, at least not until after the fact. The detection paradigm we have assumes the system being watched is deterministic, and AI Agents are the first widely deployed software class where probabilistic decision-making is attached to autonomous action on systems that matter.
That is not a reason to stop deploying agents. It is a reason to build the observability layer in parallel, before the answer to “was that a bug or a breach” becomes a question with eight zeros on it.
This post is exploratory and does not represent a specific roadmap.



