A new category of production incident is emerging that engineering teams aren't yet tracking—because it doesn't fit existing postmortem templates. AI agents are quietly generating infrastructure failures that fall into a dangerous gap between agent operations and chaos engineering, creating cascading system failures that organisations struggle to classify or prevent.

With 79% of organisations now running AI agents in production and 96% planning expansion, the scale of this exposure is no longer theoretical. Gartner predicts that by 2028, 33% of enterprise software will include agentic AI, yet 40% of those projects will be cancelled due to poor risk controls. The more concerning failure mode sits between those two statistics: agents that keep running whilst quietly generating infrastructure events that no one has categorised as risk.
The core problem lies in what autonomous agents skip—the critical judgement call that human engineers make before introducing stress into a system. When a human initiates a chaos experiment, they check dashboards, assess error budget burn rates, and evaluate whether dependencies are stable. Autonomous remediation agents, however, see an anomaly and immediately take action without checking SLO burn rates, calculating blast radius, or assessing whether the system can absorb additional stress. This creates cascade failures that no chaos engineering programme has tested for.
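
The pre-flight judgement a human makes can be approximated in a few lines. The sketch below is illustrative only: it uses the common definition of SLO burn rate (observed error ratio divided by the error budget, 1 minus the SLO target), and the function names, the blast-radius fraction, and both thresholds are assumptions chosen for the example rather than values from the article.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    1.0 means the budget is being spent exactly on schedule;
    values above 1.0 mean it will run out before the SLO window ends.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if not 0.0 <= error_ratio <= 1.0:
        raise ValueError("error_ratio must be between 0 and 1")
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def safe_to_add_stress(
    error_ratio: float,
    slo_target: float,
    blast_radius_fraction: float,
    max_burn_rate: float = 1.0,       # assumed threshold, tune per service
    max_blast_radius: float = 0.10,   # assumed: at most 10% of instances affected
) -> bool:
    """Hypothetical pre-flight check an agent could run before acting."""
    within_budget = burn_rate(error_ratio, slo_target) <= max_burn_rate
    contained = blast_radius_fraction <= max_blast_radius
    return within_budget and contained


if __name__ == "__main__":
    # 0.05% errors against a 99.9% SLO, touching 5% of instances
    print(safe_to_add_stress(0.0005, 0.999, 0.05))
    # 0.5% errors against the same SLO: budget is burning too fast
    print(safe_to_add_stress(0.005, 0.999, 0.05))
```

An agent that skips this kind of check is not necessarily wrong to act, but it acts without knowing whether the system has room for the extra stress.
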
The solution involves treating absorb capacity—the real-time estimate of how much additional stress a system can handle—as a continuously computed, consumable resource. This resilience budget model draws on four live signal classes: SLO burn rate, P99 latency trends, dependency saturation state, and application behavioural signals. Every chaos experiment and every agent action should draw from this shared budget, ensuring that multiple teams and autonomous agents don't simultaneously overwhelm system capacity.
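
One way to picture a shared, consumable budget is sketched below. It is a minimal illustration under stated assumptions, not the article's implementation: each of the four signals is assumed to be pre-normalised as described in the comments, capacity is taken as the smallest remaining headroom (a weakest-link rule chosen for simplicity), and the names `Signals`, `absorb_capacity`, and `ResilienceBudget` are hypothetical.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """The four live signal classes, normalised (assumed scales)."""
    slo_burn_rate: float          # 1.0 = error budget consumed exactly on schedule
    p99_latency_ratio: float      # current P99 divided by the latency objective
    dependency_saturation: float  # 0.0 (idle) to 1.0 (saturated)
    behaviour_anomaly: float      # 0.0 (normal) to 1.0 (highly anomalous)


def _clamp(value: float) -> float:
    return max(0.0, min(1.0, value))


def absorb_capacity(s: Signals) -> float:
    """Estimate remaining capacity as the smallest headroom across signals."""
    return min(
        _clamp(1.0 - s.slo_burn_rate),
        _clamp(1.0 - s.p99_latency_ratio),
        _clamp(1.0 - s.dependency_saturation),
        _clamp(1.0 - s.behaviour_anomaly),
    )


class ResilienceBudget:
    """Shared budget that chaos experiments and agent actions reserve from."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._capacity = 0.0
        self._reserved = 0.0

    def update(self, signals: Signals) -> None:
        """Recompute capacity from the latest signals."""
        with self._lock:
            self._capacity = absorb_capacity(signals)

    def available(self) -> float:
        with self._lock:
            return max(0.0, self._capacity - self._reserved)

    def try_reserve(self, cost: float) -> bool:
        """Atomically reserve `cost`; return False if the budget cannot cover it."""
        with self._lock:
            if cost <= self._capacity - self._reserved:
                self._reserved += cost
                return True
            return False

    def release(self, cost: float) -> None:
        """Return a reservation once the action or experiment has finished."""
        with self._lock:
            self._reserved = max(0.0, self._reserved - cost)


if __name__ == "__main__":
    budget = ResilienceBudget()
    budget.update(Signals(0.5, 0.6, 0.3, 0.1))
    print(budget.try_reserve(0.25))  # first caller gets its reservation
    print(budget.try_reserve(0.25))  # second caller is refused until release
```

Because every caller reserves from the same object under a lock, two teams or two agents cannot both claim the last of the capacity at the same moment, which is the property the shared-budget model is after.
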
Whilst large language models show promise in generating chaos hypotheses from dependency graphs and incident postmortems, they face hard limits around dependency graph staleness and cannot reliably make execution decisions when signals are ambiguous. The governance implication is clear: every autonomous agent action touching infrastructure must register against the same live signal layer that governs chaos experiments. When the resilience budget falls below a defined floor, agents must wait or escalate to humans rather than act independently.
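
The wait-or-escalate rule can be expressed as a small decision function. This is a sketch under assumptions: the floor, the action cost, the maximum wait, and the `signals_ambiguous` flag are hypothetical parameters, and the choice to wait first and escalate only after a timeout is one possible policy among several.

```python
from enum import Enum


class Decision(Enum):
    ACT = "act"
    WAIT = "wait"
    ESCALATE = "escalate"


def gate_action(
    available_budget: float,
    action_cost: float,
    floor: float,
    seconds_waited: float,
    max_wait_seconds: float = 300.0,  # assumed timeout before handing off
    signals_ambiguous: bool = False,
) -> Decision:
    """Decide whether an agent may act, must wait, or must escalate to a human."""
    # Ambiguous signals are handed to a human rather than decided automatically.
    if signals_ambiguous:
        return Decision.ESCALATE
    # Act only if the budget stays at or above the floor after the action.
    if available_budget - action_cost >= floor:
        return Decision.ACT
    # Otherwise hold off for a bounded time, then escalate.
    if seconds_waited < max_wait_seconds:
        return Decision.WAIT
    return Decision.ESCALATE


if __name__ == "__main__":
    print(gate_action(0.6, 0.1, 0.3, seconds_waited=0))
    print(gate_action(0.35, 0.1, 0.3, seconds_waited=0))
    print(gate_action(0.35, 0.1, 0.3, seconds_waited=600))
```

The same function could gate a chaos experiment and a remediation action alike, which keeps both under one signal layer instead of two separate approval paths.
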
The organisations that will operate autonomous agents reliably at scale aren't those with the most sophisticated models, but those that understand every agent action is a chaos event and build their governance accordingly. The practical first step is unglamorous but essential: audit every autonomous agent currently touching infrastructure, map its action surface against live SLO burn rate signals, and define explicit conditions below which the agent must escalate to human decision-makers.
Original source: https://venturebeat.com/orchestration/ai-agents-are-quietly-generating-chaos-engineering-failures-enterprises-dont-track-yet
Article generated by LaRebelionBOT