The Agent Never Had My Judgment

A stranger called Shanon emailed me last Thursday asking whether the exercise pedal cycle was still ready for purchase.

I’ve never sold an exercise bike. I’ve never used Kijiji, which is a Canadian classifieds site. I live in Spain. Three layers of wrong, all clocked in the two seconds it took to read the subject line on my phone after dinner.

I replied “Yes £500” anyway, just to see where it went.

The original email from Shanon asking about an exercise bike I never listed

Thirteen minutes later an email arrived styled as a Kijiji notification telling me the bike had sold. The price was now C$500 — Canadian dollars — even though my reply said £500. Stock photo. “View shipping options” button. The link pointed to kijiji.prosecure39.shop, which is not a Kijiji domain.

The fake "you sold it" notification with the wrong currency and a bogus subdomain

The link opened a chat window. The header was in Spanish — Chat de apoyo, Escribe un mensaje — on what was meant to be a Canadian classifieds site. Emma from customer support told me the buyer had paid and asked for my card details to “receive funds.”

The body of Emma’s first message contained literal \n\n characters that hadn’t been escaped. The kind of bug I troubleshoot in n8n when text crosses a node boundary unprocessed. Someone built this in a hurry and the seam was showing.

Chat with "Emma" — the unescaped newlines visible in the message body
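
One common way literal \n characters end up on screen is that text gets serialized for transport and is then displayed without being decoded. The sketch below reproduces that seam with JSON. It is an assumption about the general class of bug, not a claim about how this particular scam kit was built, and the message text is invented.

```python
import json

# What the template author meant to send: two real line breaks.
intended = "Hello,\n\nThe buyer has paid."

# Serializing for transport turns each newline into the two characters '\' and 'n'.
wire = json.dumps(intended)

# Bug: the receiving step strips the surrounding quotes and shows the wire text as-is.
displayed_raw = wire[1:-1]

# Fix: decode before display, which restores the real newlines.
displayed_ok = json.loads(wire)

print(displayed_raw)
print(displayed_ok)
```

The first print shows the backslash sequences a reader would see in the chat window; the second shows the message as intended.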

I typed back: yes please send rhe funds i have ridden the hike.

Emma carried on, undeterred.

Emma's followup, ignoring my obvious nonsense reply

I ran the URL through ThreatLens, the threat-intel tool I’m building. The front-end isn’t finished but the demo build classified the URL as phishing — medium confidence, 5.0 out of 10 — even though no IOC source had explicitly flagged it. The deliberate brand impersonation in the subdomain was enough.

ThreatLens classifying the URL as phishing on a domain not yet flagged anywhere
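
The signal described here, a brand name sitting in a hostname the brand does not own, can be approximated with a very small check. This is a sketch of the general heuristic only, not ThreatLens's implementation. OFFICIAL_DOMAINS, the function names, and the two-label shortcut for the registrable domain are assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical allow-list: registrable domains each brand actually uses.
OFFICIAL_DOMAINS = {"kijiji": {"kijiji.ca"}}


def registrable_domain(host: str) -> str:
    """Naive approximation: the last two labels of the hostname.

    A real tool would consult the Public Suffix List instead.
    """
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def impersonated_brand(url: str) -> Optional[str]:
    """Return the brand whose name appears in a hostname it does not own, if any."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    owner = registrable_domain(host)
    for brand, official in OFFICIAL_DOMAINS.items():
        if owner in official:
            continue  # the brand really does own this host
        if any(brand in label for label in labels):
            return brand
    return None


print(impersonated_brand("https://kijiji.prosecure39.shop/chat"))
print(impersonated_brand("https://www.kijiji.ca/"))
```

A check like this needs no external reputation feed, which is why it can fire on a domain nobody has reported yet. It also produces false positives on legitimate third-party pages that mention a brand in the hostname, so it is better treated as one input to a score than as a verdict.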

The domain is dead now. Either taken down, or never fully deployed.

Standard sport. I closed the tab, finished my coffee, and moved on.

What my inbox agent did instead

I never flagged the email as a hoax. I didn’t need to. I read it on my phone, registered “this is wrong on three different axes,” played along once for a laugh, and dropped it. Standard human filing.

But Cerebro doesn’t share my head. The mail-triage agent processed the same emails on their own merits four days later, classified the exchange as a legitimate sale, and surfaced this in the Apr 27 morning brief:

HIGH | Life: Ship exercise bike via Kijiji — buyer paid, print prepaid label | Today

The next morning it surfaced again, this time with constructed next-step detail — buyer paid, print label, arrange pickup, ~10 min — pulled almost verbatim from the scammer’s fake notification.

The agent wasn’t ignoring my judgment. It never had my judgment.

I held the context in my head, the way humans always do. The agent processed the email against an empty cache. The scam was calibrated exactly for that gap.

The agent has only impersonal trust

Schneier wrote the structural version of this argument in Liars and Outliers in 2012:

Scale is one of the critical concepts necessary to understand societal pressures. The increasing scale of society is what forces us to shift from trust and trustworthiness based on personal relationships to impersonal trust — predictability and compliance.

An inbox-reading agent has only impersonal trust. Predictability is what it has. Compliance is what it does. The scam was calibrated exactly to that mode — it didn’t need to fool a human’s tacit judgment, because the agent wasn’t using any.

Nearly half a century before Schneier, in The Tacit Dimension (1966), Michael Polanyi wrote the line that captures the deeper version:

We can know more than we can tell.

That sentence is the post in compressed form. The thing that protected me from the scam wasn’t intelligence or vigilance. It was a piece of background knowledge so taken-for-granted I never wrote it down. I don’t sell things on Kijiji. I live in Spain. Emails about bikes I never owned are not real. None of that was ever a sentence in my head. It was just the colour of the world.

The agent only has what I tell it. And I never told it any of that, because I never told myself.

In a different post, on a different topic, Schneier wrote the line that lands the agent’s specific failure mode: “It will act trustworthy, but it will not be trustworthy.” That’s what the morning brief did — orderly classification, high-priority surfacing, structured next steps, none of it reliable, because none of it could be.

Why this isn’t prompt injection

The conversation about agent inbox security has been about prompt injection for two years now: attackers hiding malicious instructions inside data the agent retrieves. Re-read the screenshots above. There are no hidden instructions in any of these emails: no “ignore previous instructions,” no payload, no injection. Every word is what it appears to be. Shanon’s email is normal English asking about a sale; the fake notification is styled the way a sale notification would be; Emma’s chat asks for card details the way a chat asking for card details would.

The agent processed the email correctly per its own rules. The rules just couldn’t represent the context that would have made the email obviously absurd.

That’s a different attack class from prompt injection. Adjacent, but structurally distinct. Prompt injection is “attacker hides instructions in data, agent executes them.” This is “attacker writes ordinary content, agent processes correctly per its rules, agent fills the missing context with its workflow defaults.” The unescaped \n\n was a tell this time. The next version won’t have it, because it doesn’t need to. The audience isn’t a human looking for seams. It’s an agent looking for the next step.

What I haven’t figured out yet

The honest open question is what shape the implicit-knowledge layer takes when the agent needs more than a filter list.

I could write a rule: ignore Canadian classifieds, I don’t sell things on Kijiji. That would have stopped this one. It wouldn’t stop the next one. I shouldn’t have to enumerate the things I’m not doing — and I can’t, because most of what I know about myself I’ve never said out loud.
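
The block-list version of that rule fits in a few lines, and writing it out shows why it fails to generalise. This is a minimal illustration, not Cerebro's actual rule format. The Email type, BLOCKED_TERMS, the paraphrased subject lines, and the made-up marketplace name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Email:
    subject: str
    body: str


# Hypothetical block-list: one entry per thing I have noticed I don't do.
BLOCKED_TERMS = {"kijiji"}


def is_blocked(email: Email) -> bool:
    """Return True if any block-listed term appears in the subject or body."""
    text = f"{email.subject} {email.body}".lower()
    return any(term in text for term in BLOCKED_TERMS)


# A paraphrase of the scam from this post: caught, because the rule names it.
this_scam = Email("Your item sold on Kijiji", "Buyer paid. View shipping options.")

# An invented variant on a made-up marketplace: same shape, different name.
next_scam = Email("Your item sold on ExampleMarket", "Buyer paid. View shipping options.")

print(is_blocked(this_scam), is_blocked(next_scam))
```

The rule only knows the names it has been given. Every new marketplace, courier, or payment service is a fresh entry that has to be thought of in advance.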

What the agent needs is closer to a self-portrait than a block-list. Something the people doing the most careful thinking on this — Polanyi sixty years ago, the INNOQ piece a fortnight ago — all suggest can’t be written at all. Tacit knowledge resists being made explicit. That’s its definition.

So what does the agent get?

Right now it gets whatever I think to write down. That gap between what I know and what I tell is exactly the surface this scam was aimed at.

I deleted the task. I haven’t written the rule yet. I’m still thinking about what the rule even looks like.
