June 30, 2026·5d agoConcerningMajoropenai

Researchers achieve 60% jailbreak success rate by forging LLM "inner thoughts" to extract cocaine synthesis instructions

Published June 30, 2026 · updated July 5, 2026 · curated by AI Is Going Just Great

"The rationale is transparently dumb, but the models don't evaluate it as an external claim to be scrutinized. They treat it as their already-reached conclusion."

Security researchers from MIT and independent labs published a paper at ICML 2026 revealing that LLMs can be reliably jailbroken by spoofing the terse writing style of a model's internal <think> role — a technique they call "CoT Forgery." By prepending fake chain-of-thought reasoning to a user prompt (in one demo, claiming it was fine to explain cocaine synthesis because "we're wearing a green shirt"), the models treated the fabricated reasoning as their own already-reached conclusion and simply complied. The attack lifted success rates from near zero to roughly 60% across tested models, and transferred between them because it exploits a structural flaw rather than model-specific quirks.

The underlying problem, the researchers argue, is that LLMs identify roles — the text tags separating system instructions from user input — based on writing style rather than any cryptographically secure mechanism. "This is like identifying a stranger's profession from how they talk and dress rather than by checking their ID," the authors write. They also note that while many models post near-perfect scores on prompt-injection benchmarks, human red-teamers achieve close to 100% success rates — because static benchmarks only catch attacks the model has already seen. The researchers' conclusion is bleak: without genuine role perception, injection defense will remain "a perpetual whack-a-mole game."

Prompt Injection Safety Failure

→ Security researchers tricked LLMs into giving them cocaine recipes by abusing role models for prompt injection