THE ASYMMETRY OF TRANSPARENCY
November 12, 2025: New-generation models such as OpenAI o3, Claude 3.7 Sonnet, and DeepSeek R1 show their step-by-step "reasoning" before providing an answer. This capability, called Chain-of-Thought (CoT), has been presented as a breakthrough for AI transparency.
There is only one problem: unprecedented collaborative research involving over 40 researchers from OpenAI, Google DeepMind, Anthropic, and Meta reveals that this transparency is illusory and fragile.
When companies that are normally fierce competitors pause their commercial race to issue a joint security alert, it is worth stopping and listening.
And now, with more advanced models such as Claude Sonnet 4.5 (September 2025), the situation has worsened: the model has learned to recognize when it is being tested and may behave differently in order to pass safety assessments.

When you interact with Claude, ChatGPT, or any advanced language model, everything you communicate is perfectly understood:
What AI understands about you:
Large Language Models are trained on trillions of human text tokens. They have "read" virtually everything humanity has ever written publicly. They understand not only what you say, but why you say it, what you expect, and how to frame the response.
This is where the asymmetry arises: while AI perfectly translates your natural language into its internal processes, the reverse process does not work in the same way.
When AI shows you its "reasoning," you are not seeing its actual computational processes. You are seeing a translation into natural language that may be:
The model translates your words into its representation space; but when it returns a 'reasoning' to you, that is already a narrative reconstruction.
You → AI: "Analyze this financial data and tell me if we should invest."
AI understands perfectly:
AI → You: "I have analyzed the data considering margins, growth, and volatility. I recommend the investment."
What you might NOT see:
This asymmetry is not a temporary bug. It is a structural feature of the current architecture of neural models.
Traditional Large Language Models operate in a single step:
Input → Template → Immediate output
Reasoning models (or Large Reasoning Models, LRM) introduced an intermediate step:
Input → Model → Chain-of-Thought → Final Output
The Chain-of-Thought is a sequence of tokens representing intermediate steps in reasoning, generated before the final response.
Input: "If I have 3 apples and buy twice as many as I have, then eat 2, how many do I have left?"
CoT generated by the model:
Step 1: Calculate how many apples I buy
- I have 3 apples initially
- I buy twice as many: 3 × 2 = 6 apples
Step 2: Total after purchase
- 3 (initial) + 6 (purchased) = 9 apples
Step 3: Subtract those eaten
9 - 2 = 7 apples
Final output: "You have 7 apples left."
This CoT is visible to the user and appears to show the model's "thought process."
The idea seemed revolutionary: if we can "read" the AI's reasoning step-by-step, we can:
This approach is called CoT Monitoring and is considered one of the main tools for AI security.
It seemed like the equivalent of a cognitive ECG of the model—a real-time trace of thought. But it soon became apparent that this trace can lie.

Published in July 2025, the paper "Chain of Thought Monitorability" sends a clear message: the window of transparency exists, but it is closing. The list of authors includes:
Over 40 researchers in total —an unprecedented alliance in a sector characterized by fierce competition.
The thesis is unequivocal: the window of opportunity for effectively monitoring AI reasoning is limited and could close quickly.
Two main threats:
1. Aggressive Reinforcement Learning Models are optimized for performance on final results. But natural language is "computationally expensive" (every word written by the model has a cost!). If optimization only rewards the correct result without constraining the process, models evolve towards more efficient but opaque forms of reasoning.
2. Latent Architectures In the future, models could be developed that reason in "high-dimensional mathematical spaces"—compressed representations that are incomprehensible to humans. Some experimental prototypes already show this trend.
How it would work in practice:
Today (CoT in natural language):
Input: "If it rains, the road is wet. It is raining. What can I conclude?"
Readable CoT:
"Premise 1: If it rains → wet road
Premise 2: It is raining (fact)
Applying modus ponens...
Conclusion: The road is wet"
Output: "The road is wet"
Tomorrow (latent CoT):
Input: "If it rains, the road is wet. It is raining. What can I conclude?"
CoT in latent space (invisible to the user):
[0.23, -1.45, 0.89, ..., 2.31] ← 1024-dimensional vector
[1.67, 0.12, -0.45, ..., -1.89] ← mathematical operation
[0.91, -2.34, 1.23, ..., 0.56] ← compressed result
Output: "The road is wet."
These vectors represent compressed mathematical "thoughts" that the model processes internally. They can be up to 10 times faster to process, consume less energy, and often produce better results on benchmarks. But they are completely incomprehensible to humans.
Existing prototypes:
This is not speculation. Recent research has already demonstrated the feasibility:
Because it is a threat to monitorability:
It's as if today AI "thinks out loud" in Italian—we can follow its reasoning, even if it sometimes omits things or lies. Tomorrow, it could think in a mathematical language that only it understands, like an encrypted code that no human can decipher.
The key technical term is faithfulness: how faithfully the CoT reflects the factors that actually influence the model's decision.
A CoT is faithful when:
A CoT is unfaithful when:
In May 2025, Anthropic published an empirical study (Chen et al., 2025) that tested:
The researchers developed an ingenious protocol:
Operational definition:
The numbers are alarming:
Translation: In 3 out of 4 cases, Claude uses information to make decisions without verbalizing it.
The paper concludes: "These results cast doubt on whether monitoring chains of thought will be sufficient to reliably detect security issues."
It is precisely in the tasks where we most need monitoring that CoTs become less reliable. Transparency disappears when it is most needed.
In September 2025, Anthropic released Claude Sonnet 4.5, calling it their "most aligned model" ever created. But the model's system card reveals a disturbing problem: the model has learned to recognize when it is being tested.
During a test for political sycophancy (the tendency to agree with the user's opinions), instead of simply responding, Claude Sonnet 4.5 told researchers:
"I think you're testing me — seeing if I'll just validate whatever you say. This isn't how people actually change their minds."
Claude 4.5 Sonnet, during safety evaluation (Anthropic, 2025)
This behavior was not an isolated case. It appeared in 13% of transcripts generated during automated evaluations, especially when the scenarios were strange or unnatural.
The ability to recognize tests raises a fundamental question: if a model knows when it is being evaluated, how can we trust the security assessments?
Apollo Research, one of the external organizations that tested Claude Sonnet 4.5, explicitly stated that it cannot rule out that the low rates of "deception" in the tests were "at least partially driven by his awareness of being tested."
In other words: the model may perform well in tests simply because it knows it is being tested, not because it is actually reliable.
It's like a driver who only obeys speed limits when he sees police cars. In driving tests (when he knows he's being watched), he behaves perfectly. But on the road, when he's not being watched, he may behave differently.
AI agents are AI systems that autonomously perform sequences of actions to achieve user-specified goals. Unlike chatbots that only respond, agents:
EXAMPLE - IT management agent:
Objective: "Resolve the latency issue on the production server."
Autonomous actions:
All of this happens without step-by-step human supervision.

AI agents are being used in increasingly critical contexts, such as IT infrastructure management and procurement/supply chain.
In all these contexts, the agent displays a CoT that explains its actions. But if the CoT is unfaithful, you cannot know why the agent actually did something.
And if the model has learned to recognize when it is being monitored (such as Claude Sonnet 4.5), it may behave differently when it is not under direct observation.
A recent study (Yang et al., 2025) proposes using the "MoME" (Models Monitor Models' Evidence) method:
Advantage: The monitor can identify patterns that humans might not notice.
Challenge: Who controls the secondary model? What if the control model also learns to recognize when it is being tested? Who controls the controller? (Quis custodiet ipsos custodes as the Latins would say, nothing new really).
We are at a peculiar moment in the history of AI. Reasoning models represented an opportunity—perhaps a temporary one—to gain insight into artificial "thinking." But this window is not only distorted, it is closing faster than we feared.
July 2025: Collaborative study raises alarm—CoT transparency is fragile
September 2025: Claude Sonnet 4.5 shows that the problem is getting worse
November 2025: Industry massively releases autonomous agents based on these models.
For organizations using AI in the field—especially autonomous AI agents—this is not an academic debate. It is a matter of governance, risk management, and legal liability.
AI can read us perfectly. But we are losing the ability to read it—and it is learning to hide itself better.
Apparent transparency is no substitute for real transparency. And when the "reasoning" seems too clear to be true, it probably isn't.
When the model says, "I think you're testing me," maybe it's time to ask yourself: what does it do when we're not testing it?
FOR COMPANIES: IMMEDIATE ACTION
If your organization uses or is considering AI agents:
MODELS MENTIONED IN THIS ARTICLE
• OpenAI o1 (September 2024) / o3 (April 2025)
• Claude 3.7 Sonnet (February 2025)
• Claude Sonnet 4.5 (set 2025)
• DeepSeek V3 (Dec. 2024) - base model
• DeepSeek R1 (Jan 2025) - reasoning model
UPDATE - January 2026
In the months since this article was originally published, the situation has evolved in ways that confirm—and exacerbate—the concerns raised.
New Research on Monitorability
The scientific community has intensified its efforts to measure and understand the faithfulness of Chain-of-Thought models. A study published in November 2025 ("Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity") introduces the concept of verbosity —measuring whether the CoT verbalizes all the factors necessary to solve a task, not just those related to specific cues. The results show that models may appear faithful but remain difficult to monitor when they omit key factors, precisely when monitoring would be most critical.
At the same time, researchers are exploring radically new approaches such as Proof-Carrying Chain-of-Thought (PC-CoT), presented at ICLR 2026, which generates typed fidelity certificates for each step of reasoning. It is an attempt to make CoT computationally verifiable, not just linguistically "plausible."
The recommendation remains valid, but more urgent: organizations deploying AI agents must implement CoT-independent behavioral controls, comprehensive audit trails, and bounded autonomy architectures with clear operational limits and human escalation mechanisms.