AI coding agents are blind. New research from Causal Dynamics Lab gives them sight, outperforming Claude Code and Codex in key benchmarks


AI coding agents are operating blind in production. Causal Dynamics Lab’s new research explains why, and their flagship product Cielara Code beat both Claude Code (Opus-4.6) and OpenAI Codex (GPT-5.4) across three independent benchmarks at the hardest part of agent work: finding the right place to make a change.

San Francisco, CA – May 5, 2026: AI coding agents are shipping code faster than teams can verify what that code will do in production. These changes look correct in review and pass checks, but trigger unpredictable failures once they interact with real dependencies, policy constraints, runtime state, and infrastructure topology. Causal Dynamics Lab believes the root cause is not the models. The problem is that agents can’t see the systems they are changing.

The 2025 DORA report tied AI coding tool adoption to a 7.2 percent decline in deployment stability. AWS CTO Werner Vogels calls the dynamic Verification Debt. Today, Causal Dynamics Lab released new research introducing a 6-layer causal ontology and a code causality graph designed to give coding agents “sight” into how production systems actually behave. Their flagship product, Cielara Code, uses this approach to validate changes before deployment, replacing brute-force search with structural navigation and pre-deployment simulation.

Causal Dynamics Lab instrumented native coding agents across thousands of sessions and logged every tool call. The distribution was lopsided. 56.8 percent of all agent actions were file reads. 24.2 percent were grep. Less than 1 percent were actual edits. The agents weren’t struggling to generate patches. They were struggling to find the right files. The pattern sharpened with complexity. When the ground truth fix spanned more than six files, agent recall dropped from 0.579 to 0.143, and failed trajectories consumed four times the compute of successful ones.
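As a rough illustration of how such a distribution could be tallied from a log of tool calls, here is a minimal sketch. The log format and the action names are assumptions made for the example; the release does not describe the actual instrumentation.

```python
from collections import Counter

def action_shares(tool_calls):
    """Return each action's share of all logged tool calls."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}

# Hypothetical session log: one entry per tool call, in order.
session_log = ["read", "read", "grep", "read", "grep", "edit", "read", "read"]

for name, share in sorted(action_shares(session_log).items()):
    print(f"{name}: {share:.1%}")
```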

“Every coding agent today navigates by grep. That is the equivalent of a surgeon operating without imaging. We built Cielara Code to give agents sight: a causal model of the production environment that makes the reasoning behind every change explicit and verifiable,” said Hasibul Haque, CEO of Causal Dynamics Lab.

As session length and codebase size grow, general-purpose agents lose structural context and degrade into brute-force search. A publicly documented Claude Code regression (GitHub issue #42796) is a visible example of the same dynamic at scale. The underlying issue is architectural: current agents ingest code as flat text and have no representation of how files depend on each other, how functions call each other, or how changes propagate through the system.

Cielara Code is designed to fill that gap before failures reach production. At the core is a Production World Model that maps a company’s production environment into a 6-layer causal graph: what the code does, why it was built, who owns it, how it is constrained, where it runs, and what actually happened at runtime. A runtime failure can be traced back to the commit that introduced the change, the developer who approved it, and the intent behind the change. Before an agent begins exploration, Cielara constructs a Code Dependency Causal Graph indexing four relationship types so the agent navigates structure instead of scanning files sequentially.
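To make the idea of navigating structure concrete, here is a minimal sketch of a dependency graph walk. It is not Cielara's implementation: the release does not name the four relationship types, so the "imports" and "calls" labels, the file paths, and the function names below are all hypothetical.

```python
from collections import deque

# Hypothetical edges: (source, relation, target). The relation names are
# illustrative only; the release does not list the types Cielara indexes.
EDGES = [
    ("api/handlers.py", "imports", "services/billing.py"),
    ("services/billing.py", "calls", "db/ledger.py"),
    ("services/billing.py", "imports", "utils/currency.py"),
    ("jobs/nightly.py", "calls", "db/ledger.py"),
]

def build_reverse_index(edges):
    """Map each target to the (source, relation) pairs that depend on it."""
    index = {}
    for source, relation, target in edges:
        index.setdefault(target, []).append((source, relation))
    return index

def affected_files(changed, edges, max_hops=2):
    """Breadth-first walk over reverse edges, starting at the changed file.

    Returns a dict of dependent file -> hop distance from the change.
    """
    index = build_reverse_index(edges)
    seen = {changed: 0}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for source, _relation in index.get(node, []):
            if source not in seen:
                seen[source] = seen[node] + 1
                queue.append(source)
    seen.pop(changed)
    return seen

print(affected_files("db/ledger.py", EDGES))
```

With an index like this, an agent can ask which files depend on a changed file and visit only those, rather than reading or grepping the repository file by file.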

Across three independent benchmarks, Cielara Code beat both Claude Code (Opus-4.6) and OpenAI Codex (GPT-5.4) at the hardest part of agent work: finding the right place to make a change. Overall localization accuracy hit 0.774, versus 0.738 for Claude Code and 0.707 for Codex. On MULocBench (1,033 issues across 46 repositories), Cielara reached 0.752 recall@5 versus 0.727 for Claude Code, and cut mean task time from 141.84 to 128.62 seconds. The result: fewer wrong-file edits, fewer failed runs, and 30 to 40 percent lower compute cost per task.
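For readers unfamiliar with the metric, recall@5 is commonly computed as the share of ground-truth files that appear among an agent's top five ranked candidates. The sketch below follows that common definition; the release does not spell out the exact scoring protocol used on MULocBench, and the file names are made up.

```python
def recall_at_k(ranked, truth, k=5):
    """Fraction of ground-truth files found in the top-k ranked candidates."""
    truth = set(truth)
    if not truth:
        raise ValueError("ground truth must contain at least one file")
    hits = truth.intersection(ranked[:k])
    return len(hits) / len(truth)

# Hypothetical issue: the fix touches two files, and the agent ranks six.
ranked_candidates = ["a.py", "b.py", "c.py", "d.py", "e.py", "f.py"]
ground_truth = ["b.py", "f.py"]

# Only b.py falls inside the top five, so one of two truth files is found.
print(recall_at_k(ranked_candidates, ground_truth))
```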

Cielara’s REASONARA is the causal memory layer that makes this practical at enterprise scale. Rather than stuffing an entire codebase into a prompt every time, REASONARA stores the production world as a graph-structured causal memory holding 125M+ tokens of effective context, retrieving only what matters for the question at hand. A single lookup typically uses 1,000 to 2,500 tokens versus 23,000 to 115,000 for full-context approaches, which can cut token consumption by up to 98 percent compared to full-context reasoning. On independent benchmarks, REASONARA scores 94 percent on UltraDomain, 92 percent on LoCoMo, 73 percent on LoCoMo-plus, and 87.4 percent on LongMemEval, running five to eight times faster than Codex high reasoning mode. The roadmap targets a one-billion-token context window.
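The savings figure can be sanity-checked with simple arithmetic on the token ranges quoted above. The sketch below pairs the low ends and the high ends of the two ranges; that pairing is an assumption, since the release does not say which lookup sizes correspond to which full-context sizes.

```python
def token_savings(lookup_tokens, full_context_tokens):
    """Fraction of tokens saved by a lookup relative to a full-context prompt."""
    return 1 - lookup_tokens / full_context_tokens

# Assumed pairing of the quoted ranges: low end with low end, high with high.
for lookup, full in [(1_000, 23_000), (2_500, 115_000)]:
    print(f"{lookup:>5} vs {full:>7} tokens: {token_savings(lookup, full):.1%} saved")
```

Under that pairing the savings come out at roughly 96 and 98 percent, in line with the "up to 98 percent" claim.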

Causal Dynamics Lab positions Cielara Code as a verification layer for existing AI coding agents, making their output production-safe without replacing them. Currently, 11 Fortune 100 and over 40 Fortune 500 companies use Cielara Code on their codebases.

“Board and auditor expectations for proactive risk management have risen sharply. Leaders now demand evidence that security can anticipate risks from rapid AI and automation, rather than depending on post-incident response,” says the CISO of one of the largest law firms in the USA and a current customer of Cielara Code.

Phillip Miller, Vice President, Global Chief Information Security Officer, H&R Block, added: “Enterprises need solutions to problems they cannot solve with people alone. Cielara’s technology is a generational leap towards the original promise of AI: tackling complexity 7×24 with acquired knowledge, deep reasoning, and unbeatable accuracy. For engineering teams, this means a single engine to discover faults in real-world deployments (including legacy, cloud) and provide clear resolution steps. When I wrote Hacking Success, I described a world where AI needs strong, directive policy (not rules / guardrails) to be safe and effective. Information Security lags behind the innovation curve, as most options rely on legacy thinking including posture, gateways, and logging. Enterprises now have an option to leverage Cielara’s models to oversee deployments of AI agents, models, and their supporting infrastructure.”

The team’s expertise is deeply rooted in the very challenges they aim to solve. CEO Hasibul Haque led platform engineering at Uber during its hyper-growth phase, and CTO Ryan Turner is a former Uber Staff Engineer and a CNCF SPIRE maintainer. Their research is guided by Dr. Xuchao Zhang (ex-Microsoft Research) and Dr. Liang Zhao (Emory, with over 200 publications), a collaboration supported by a formal R&D partnership with Emory’s AI Lab.

Matt Fisher, former Co-Founder and CTO of Daydream and Adjunct Professor at Brown University, added: “AI has already changed how people access information. The next step is changing how people make decisions. Instead of only asking what is true right now, teams should be able to explore what could happen next, compare possible paths, and understand the consequences of action before committing. That move from answers to simulation is a powerful shift, and it is where Causal Dynamics Lab is focused.”

Looking ahead, the Production World Model is designed as a foundation, not a feature. Cielara Code and REASONARA are the first products to be deployed on top of it; over time, Causal Dynamics Lab plans to extend into full causal simulation of proposed changes across code, infrastructure, policy, and runtime. The company expects the decision record to become a permanent reasoning layer of the enterprise stack: one any AI agent can query before changing the systems that keep production running.

About Causal Dynamics Lab

Causal Dynamics Lab builds validation infrastructure for AI-generated software. Its platform, Cielara, predicts how proposed changes will behave in production before they ship, powered by REASONARA, a graph-structured causal memory system. The company was founded by former Uber platform engineers and AI researchers from Microsoft Research and Emory University, including a Stanford Top 2% Scientist with 200+ publications at NeurIPS, ICLR, and KDD. The company is headquartered in San Francisco. More information: CausalDynamics.com.
