La Rebelión: AI Agents Causing Untracked Infrastructure Chaos

Monday, 25 May 2026

AI Agents Causing Untracked Infrastructure Chaos

A new category of production incident is emerging that engineering teams aren't yet tracking, because it doesn't fit existing postmortem templates. AI agents are quietly generating infrastructure failures that fall into a gap between agent operations and chaos engineering, and the resulting cascading failures are ones that organisations struggle to classify or prevent.

With 79% of organisations now running AI agents in production and 96% planning expansion, the scale of this exposure is no longer theoretical. Gartner predicts that by 2028, 33% of enterprise software will include agentic AI, yet 40% of those projects will be cancelled due to poor risk controls. What's particularly concerning is the failure mode happening between those statistics: agents that continue running whilst quietly generating infrastructure events no one has categorised as risk.

The core problem lies in what autonomous agents skip: the judgement call that human engineers make before introducing stress into a system. When a human initiates a chaos experiment, they check dashboards, assess error budget burn rates, and evaluate whether dependencies are stable. An autonomous remediation agent, by contrast, sees an anomaly and acts immediately, without checking SLO burn rates, calculating blast radius, or assessing whether the system can absorb additional stress. The result is cascade failures that no chaos engineering programme has tested for.
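The checks a human performs before a chaos experiment can be written down as an explicit pre-flight gate. The following is a minimal sketch, not a method described in the original article: the names `SystemSnapshot`, `ProposedAction`, and `preflight`, and the thresholds `max_burn_rate` and `max_blast_radius`, are all hypothetical, and real values would come from a team's own SLO policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemSnapshot:
    """Hypothetical view of live system state at decision time."""
    slo_burn_rate: float        # 1.0 = error budget consumed exactly at the sustainable rate
    degraded_dependencies: int  # dependencies currently reporting degraded health
    total_instances: int        # instances serving the affected service


@dataclass(frozen=True)
class ProposedAction:
    """Hypothetical description of what the agent wants to do."""
    name: str
    affected_instances: int     # instances the action would restart, drain, or modify


def preflight(snapshot, action, max_burn_rate=2.0, max_blast_radius=0.25):
    """Return the reasons not to act; an empty list means every check passed."""
    reasons = []
    if snapshot.slo_burn_rate > max_burn_rate:
        reasons.append("SLO burn rate too high")
    # Blast radius here is the fraction of serving capacity the action touches.
    if snapshot.total_instances <= 0:
        reasons.append("blast radius unknown")
    elif action.affected_instances / snapshot.total_instances > max_blast_radius:
        reasons.append("blast radius too large")
    if snapshot.degraded_dependencies > 0:
        reasons.append("dependencies unstable")
    return reasons
```

An agent would call `preflight` before every infrastructure action and proceed only when the returned list is empty. Returning reasons rather than a boolean keeps the refusal explainable in a later postmortem.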

The solution involves treating absorb capacity, the real-time estimate of how much additional stress a system can handle, as a continuously computed, consumable resource. This resilience budget model draws on four live signal classes: SLO burn rate, P99 latency trends, dependency saturation state, and application behavioural signals. Every chaos experiment and every agent action should draw from this shared budget, so that multiple teams and autonomous agents cannot simultaneously overwhelm system capacity.
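One way to picture a shared, consumable budget is a small reservation ledger. This is a sketch under stated assumptions: the article does not say how the four signal classes are combined, so the code assumes each is already normalised to a headroom value between 0 and 1 and takes the minimum (the weakest signal bounds the whole system). `Signals`, `absorb_capacity`, and `ResilienceBudget` are hypothetical names.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Headroom per signal class, each assumed normalised to 0.0 (none) .. 1.0 (full)."""
    slo_headroom: float
    latency_headroom: float
    dependency_headroom: float
    behaviour_headroom: float


def absorb_capacity(s):
    """Assumption: the weakest signal bounds capacity, so combine with min()."""
    weakest = min(s.slo_headroom, s.latency_headroom,
                  s.dependency_headroom, s.behaviour_headroom)
    return max(0.0, min(1.0, weakest))


class ResilienceBudget:
    """Shared ledger that chaos experiments and agents both reserve from."""

    def __init__(self):
        self._lock = threading.Lock()
        self._reserved = {}  # holder name -> reserved share of capacity

    def try_reserve(self, holder, cost, signals):
        """Reserve `cost` for `holder` if live capacity still covers it."""
        with self._lock:
            available = absorb_capacity(signals) - sum(self._reserved.values())
            if cost <= 0 or cost > available:
                return False
            self._reserved[holder] = self._reserved.get(holder, 0.0) + cost
            return True

    def release(self, holder):
        """Return a holder's reservation once its action has finished."""
        with self._lock:
            self._reserved.pop(holder, None)
```

Because capacity is recomputed from live signals on every reservation, a request that would have succeeded a minute ago can be refused now, which is the point of treating the budget as continuously computed rather than fixed.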

Whilst large language models show promise in generating chaos hypotheses from dependency graphs and incident postmortems, they face hard limits around dependency graph staleness and cannot reliably make execution decisions when signals are ambiguous. The governance implication is clear: every autonomous agent action touching infrastructure must register against the same live signal layer that governs chaos experiments. When the resilience budget falls below a defined floor, agents must wait or escalate to humans rather than act independently.
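The wait-or-escalate rule can be expressed as a small decision function. This is an illustrative sketch only: the article states that agents must wait or escalate below the floor but does not say how to choose between the two, so the `max_wait_s` timeout is an assumption, as are the names `Decision` and `gate`. Escalating on ambiguous signals reflects the stated limit that execution decisions are unreliable in that case.

```python
from enum import Enum


class Decision(Enum):
    ACT = "act"
    WAIT = "wait"
    ESCALATE = "escalate"


def gate(budget, floor, signals_ambiguous, waited_s, max_wait_s=300.0):
    """Decide whether an agent may act, given the live resilience budget.

    budget and floor share the same unit (for example a 0..1 capacity share).
    """
    # Ambiguous signals: hand the decision to a human rather than guess.
    if signals_ambiguous:
        return Decision.ESCALATE
    if budget >= floor:
        return Decision.ACT
    # Below the floor: wait for recovery, but only up to the assumed timeout.
    if waited_s < max_wait_s:
        return Decision.WAIT
    return Decision.ESCALATE
```

The agent would re-evaluate `gate` on each polling interval, passing the time it has already spent waiting, so that a stalled recovery eventually reaches a human instead of blocking silently.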

The organisations that will operate autonomous agents reliably at scale aren't those with the most sophisticated models, but those that understand every agent action is a chaos event and build their governance accordingly. The practical first step is unglamorous but essential: audit every autonomous agent currently touching infrastructure, map its action surface against live SLO burn rate signals, and define explicit conditions below which the agent must escalate to human decision-makers.
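The audit step lends itself to a simple inventory check. The sketch below assumes a hand-maintained record per agent; `AgentRecord`, its fields, and `audit` are hypothetical and not taken from the article or from any particular tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentRecord:
    """Hypothetical inventory entry for one autonomous agent."""
    name: str
    actions: list                                     # infrastructure actions the agent can take
    slo_signals: dict = field(default_factory=dict)   # action -> SLO burn-rate signal it is mapped to
    escalation_floor: Optional[float] = None          # budget level below which a human decides


def audit(agents):
    """Return {agent name: [issues]} for agents with governance gaps."""
    findings = {}
    for agent in agents:
        issues = [
            f"action '{action}' has no SLO burn-rate signal mapped"
            for action in agent.actions
            if action not in agent.slo_signals
        ]
        if agent.escalation_floor is None:
            issues.append("no escalation condition defined")
        if issues:
            findings[agent.name] = issues
    return findings
```

An agent that appears in the findings is one that can change infrastructure without any explicit condition forcing it to defer to a human, which makes it the first candidate for remediation.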

Original source: https://venturebeat.com/orchestration/ai-agents-are-quietly-generating-chaos-engineering-failures-enterprises-dont-track-yet

Article generated by LaRebelionBOT
