
Assassin or Assistant? AI and the SRE Role

A New Centre of Gravity

The rise of the AI SRE has been little short of spectacular. There are now dozens of companies selling standalone products, whilst every major observability vendor has also rolled out their own implementation. Much of the early focus has been on the firefighting aspects of the role - such as alert triage, root cause analysis and outage mitigation. The technology has made amazing headway and is already delivering tangible results in these areas.

This, however, is only a part of the picture. AI is also starting to assist SREs across a much wider range of responsibilities: writing Infrastructure as Code, building internal tools, designing tests, reviewing changes, capturing operational knowledge and coordinating automated workflows. The significance of this shift is not simply that engineers can do the same work more quickly. It is that the centre of gravity of the SRE role is beginning to move.

For years, SRE has involved a mixture of high-level systems thinking and low-level repetitive effort. The discipline has always aspired to reduce toil, but the reality has often been rather less elegant: endless alert tuning, YAML authoring, parser maintenance, script grokking and, of course, switching from one tool to another in a war-room. AI is now being applied to take on more and more of that grunt work. As a result, the SRE can step back from the coalface and take on the roles of supervisor, toolsmith and governor of increasingly autonomous systems.

After the Firefighting

Putting out fires is always going to be the top priority, so it is not surprising that incident response has been the landing zone for AI SREs. Modern systems generate huge volumes of telemetry, and the challenge when disaster strikes is rarely a lack of data. More often, the real problem is too much undifferentiated information arriving too quickly for humans to process efficiently.

LLMs can identify anomalies, correlate events and assemble context before a human even begins the investigation. Tools such as Cleric, Dash0 Agent0 and New Relic’s SRE-oriented agent offerings are designed to reduce the amount of manual digging required to understand an issue. Rather than forcing engineers to pivot repeatedly between dashboards, traces, change logs and chat threads, these systems can extrapolate and correlate at speed to present evidence-based hypotheses.
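Under the hood, the simplest form of this correlation is unglamorous. A toy sketch in Python of grouping alerts into candidate incidents by service and time proximity - real agents draw on far richer signals, and the alert shape here is invented for illustration:

```python
from collections import defaultdict

def correlate(alerts, window_s=120):
    """Group alerts by service, splitting each service's stream into
    separate candidate incidents wherever the gap between consecutive
    alerts exceeds window_s seconds."""
    by_service = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        by_service[alert["service"]].append(alert)

    incidents = []
    for stream in by_service.values():
        current = [stream[0]]
        for alert in stream[1:]:
            if alert["ts"] - current[-1]["ts"] > window_s:
                incidents.append(current)  # gap too large: close the incident
                current = [alert]
            else:
                current.append(alert)
        incidents.append(current)
    return incidents
```

Even this crude grouping turns a wall of alerts into a handful of hypotheses to investigate; the value an agent adds is enriching each group with traces, change logs and deploy history.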

However, these tools are not just speeding up incident response. They are beginning to take over some of the cognitive scaffolding around it: gathering evidence, narrowing the problem space and suggesting the next investigative step. Once AI is trusted to handle part of the investigative workflow, it becomes much easier to extend it into adjacent areas of SRE work.

Manifests Destiny

One of the clearest examples of this wider shift is Infrastructure as Code. SREs are increasingly using AI to generate Terraform scripts, Kubernetes manifests, CI/CD pipelines and policy templates. Tools such as GitHub Copilot and Claude Code are making it easier to scaffold operational code and configuration from natural-language intent. A competent first draft of a Terraform module or Kubernetes deployment can now be produced in seconds rather than hours.

For that stout breed of engineers who thrive on the rigours of linting a 3,000-line YAML file, this does not necessarily mean the end of their craft. The value of the SRE becomes less tied to the manual production of configuration and more tied to validating the output. Is the generated infrastructure secure? Does it conform to policy? Is it financially sane? Will it actually behave well under failure conditions? Those are the higher-order questions that the AI SRE is too impatient to dwell upon.
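That validation work can itself start as codified checks. A minimal Python sketch of linting a generated Kubernetes Deployment - the policy rules are illustrative examples, not a real policy engine:

```python
def lint_deployment(manifest: dict) -> list[str]:
    """Flag a few example policy violations in a generated Kubernetes
    Deployment manifest: unpinned image tags and missing resource limits."""
    findings = []
    containers = (manifest.get("spec", {})
                          .get("template", {})
                          .get("spec", {})
                          .get("containers", []))
    for container in containers:
        image = container.get("image", "")
        if ":" not in image or image.endswith(":latest"):
            findings.append(f"{container['name']}: image tag must be pinned")
        if "limits" not in container.get("resources", {}):
            findings.append(f"{container['name']}: missing resource limits")
    return findings
```

In practice teams reach for dedicated policy tooling, but the shape is the same: the human encodes the judgment once, and every generated manifest is checked against it.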

Expertise, therefore, is not automated out of existence. Instead, it shifts upward. The syntax burden declines, but architectural judgment becomes even more important. An engineer no longer gains leverage simply by being able to write a Kubernetes manifest from scratch. The new leverage comes from knowing whether the generated solution should be trusted, how it needs to be adapted, and where the hidden operational risks may lie.

A Building Boom

Another important change is in the creation of internal tooling. SRE teams have always striven to build tools that will reduce friction and increase velocity: scripts to validate deployments, check certificate expiry, gather logs or automate handoffs between systems. The blocker, though, has often been time constraints.

Activities that support customers or directly boost the bottom line will always take precedence over mini-projects to develop internal tooling. AI changes that equation. It dramatically lowers the effort required to move from “this would be useful” to “this now exists”. A script that would previously have taken a full day to write can now be scaffolded in minutes and refined iteratively. AI cannot slow down the arrow of time, but it can rewrite the economics of time.

Putting AI to the Test

SRE is often discussed as though it were chiefly about responding to incidents. In reality, SRE teams are not battling infernos in the server room around the clock. A mature reliability practice is just as concerned with preventing surprises in the first place. That means testing: load testing, synthetic monitoring, resilience testing, dependency validation and, if you’re lucky, unleashing the monkeys of chaos engineering.

This is another area where AI can turn “nice-to-haves” into “already-haves”. Engineers can now use AI tools to generate load tests, design synthetic checks for critical paths, propose fault scenarios and suggest areas of missing test coverage.
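Even the verdict logic for such tests is the kind of thing AI can draft and a human can review. A minimal sketch of a pass/fail check for a load-test run against a p95 latency budget - the budget figure is invented for illustration:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is given in the range 0-100."""
    ranked = sorted(samples)
    rank = max(math.ceil(pct / 100 * len(ranked)), 1)
    return ranked[rank - 1]

def evaluate_run(latencies_ms: list[float], p95_budget_ms: float = 300.0) -> bool:
    """Pass/fail verdict: did the run's p95 latency stay within budget?"""
    return percentile(latencies_ms, 95) <= p95_budget_ms
```

The interesting review questions are not in the arithmetic but in the choices around it: is p95 the right percentile, is 300 ms the right budget, and is the traffic shape realistic?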

Dissolving the Language Barrier

Another fundamentally important development is the changing interface to observability itself. For years, interacting with telemetry required familiarity with query languages, dashboard structures and the internal grammar of specific tools. That expertise still matters, but natural-language interfaces are reducing the barrier to entry. MCP servers have now been integrated into every major observability platform: this means that telemetry is becoming accessible not only to expert humans, but to AI agents acting on behalf of humans.

For SREs, that raises both opportunities and questions. On the positive side, it reduces the amount of manual data retrieval required during investigations and lowers the skill barrier for less experienced engineers. But it also means that SREs will need to think more carefully about how observability systems expose data, what permissions agents should have and how much authority automated systems should be granted when reasoning over production telemetry. It seems inevitable that managing and orchestrating agents will write itself into the SRE job description.
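A first step here is often nothing more exotic than an explicit allowlist. A toy sketch of scoping which telemetry an agent may read and whether it may act at all - the roles and scopes are invented, and a real system would sit behind the platform's own access controls:

```python
# Hypothetical scopes: which signals each agent role may query,
# and whether it is permitted to take actions or only to read.
AGENT_SCOPES = {
    "triage-agent":      {"read": {"logs", "metrics", "traces"}, "act": False},
    "remediation-agent": {"read": {"metrics"},                   "act": True},
}

def authorize(agent: str, signal: str, wants_action: bool = False) -> bool:
    """Gate an agent's request against its declared scope; deny by default."""
    scope = AGENT_SCOPES.get(agent)
    if scope is None:
        return False  # unknown agents get nothing
    if wants_action and not scope["act"]:
        return False  # read-only agents cannot escalate to actions
    return signal in scope["read"]
```

The point is less the mechanism than the discipline: someone has to write these scopes down, and that someone is increasingly the SRE.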

Strategic Reliability Engineering

The accelerating advance of AI and its capability for automation naturally give rise to fears of mass professional displacement. An alternative trajectory is that the SRE role is not eliminated, but that it becomes more about governance. As more operational work becomes agentic, someone still has to decide what the agents are allowed to do, what data they can see, how their performance is evaluated, how errors are detected and what happens when confidence is low. The ability to answer these questions becomes central to the reliability engineering role.
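In practice, that governance tends to boil down to explicit policy code. A toy sketch of gating an agent-proposed remediation on its reported confidence and the blast radius of the action - the thresholds are illustrative policy choices, not recommendations:

```python
def dispatch(confidence: float, blast_radius: str) -> str:
    """Decide whether an agent-proposed remediation runs automatically,
    is escalated to a human, or is rejected outright."""
    if blast_radius == "high":
        return "escalate"      # humans own anything high-impact, regardless
    if confidence >= 0.9:
        return "auto-apply"    # high confidence, low impact: let it run
    if confidence >= 0.6:
        return "escalate"      # plausible but unproven: ask a human
    return "reject"            # low confidence proposals are discarded
```

Choosing those thresholds, auditing the outcomes and tightening the policy over time is precisely the supervisory work the article describes.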

Seen in that light, AI may actually make the strategic dimension of SRE more visible. Writing scripts and handling alerts have always been part and parcel of the role, but its essence has always involved judgment under uncertainty: making trade-offs between speed and safety, deciding where automation is appropriate and building systems that are robust. AI does not remove those responsibilities; it places them at a premium.

The Future SRE

The future SRE is therefore unlikely to be reduced to the role of a passive and marginalised operator of AI tools. More likely, they will be an active designer of AI-supported operational systems: reviewing generated infrastructure, expanding reliability testing, building local tooling, curating context for agents and deciding where automation can safely go. Less typist, less dashboard navigator, less runbook executor. More architect, reviewer and governor of reliability in an increasingly automated world.
