
Assassin or Assistant? AI and the SRE Role

A New Centre of Gravity

The rise of the AI SRE has been little short of spectacular. There are now dozens of companies selling standalone products, whilst every major observability vendor has also rolled out their own implementation. Much of the early focus has been on the firefighting aspects of the role - such as alert triage, root cause analysis and outage mitigation. The technology has made amazing headway and is already delivering tangible results in these areas.

This, however, is only a part of the picture. AI is also starting to assist SREs across a much wider range of responsibilities: writing Infrastructure as Code, building internal tools, designing tests, reviewing changes, capturing operational knowledge and coordinating automated workflows. The significance of this shift is not simply that engineers can do the same work more quickly. It is that the centre of gravity of the SRE role is beginning to move.

For years, SRE has involved a mixture of high-level systems thinking and low-level repetitive effort. The discipline has always aspired to reduce toil, but the reality has often been rather less elegant: endless alert tuning, YAML authoring, parser maintenance, script grokking and, of course, switching from one tool to another in a war-room. AI is now being applied to take on more and more of that grunt work. As a result, the SRE can step back from the coalface and take on the roles of supervisor, toolsmith and governor of increasingly autonomous systems.

After the Firefighting

Putting out fires is always going to be the top priority, so it is not surprising that incident response has been the landing zone for AI SREs. Modern systems generate huge volumes of telemetry, and the challenge when disaster strikes is rarely a lack of data. More often, the real problem is too much undifferentiated information arriving too quickly for humans to process efficiently.

LLMs can identify anomalies, correlate events and assemble context before a human even begins the investigation. Tools such as Cleric, Dash0 Agent0 and New Relic’s SRE-oriented agent offerings are designed to reduce the amount of manual digging required to understand an issue. Rather than forcing engineers to pivot repeatedly between dashboards, traces, change logs and chat threads, these systems can extrapolate and correlate at speed to present evidence-based hypotheses.
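Under the hood, the simplest form of this correlation is unglamorous. A toy sketch in Python of grouping alerts into candidate incidents by service and time proximity - real agents draw on far richer signals, and the alert shape here is invented for illustration:

```python
from collections import defaultdict

def correlate(alerts, window_s=120):
    """Group alerts by service, splitting each service's stream into
    separate candidate incidents wherever the gap between consecutive
    alerts exceeds window_s seconds."""
    by_service = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        by_service[alert["service"]].append(alert)

    incidents = []
    for stream in by_service.values():
        current = [stream[0]]
        for alert in stream[1:]:
            if alert["ts"] - current[-1]["ts"] > window_s:
                incidents.append(current)  # gap too large: close the incident
                current = [alert]
            else:
                current.append(alert)
        incidents.append(current)
    return incidents
```

Even this crude grouping turns a wall of alerts into a handful of hypotheses to investigate; the value an agent adds is enriching each group with traces, change logs and deploy history.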

However, these tools are not just speeding up incident response. They are beginning to take over some of the cognitive scaffolding around it: gathering evidence, narrowing the problem space and suggesting the next investigative step. Once AI is trusted to handle part of the investigative workflow, it becomes much easier to extend it into adjacent areas of SRE work.

Manifests Destiny

One of the clearest examples of this wider shift is Infrastructure as Code. SREs are increasingly using AI to generate Terraform scripts, Kubernetes manifests, CI/CD pipelines and policy templates. Tools such as GitHub Copilot and Claude Code are making it easier to scaffold operational code and configuration from natural-language intent. A competent first draft of a Terraform module or Kubernetes deployment can now be produced in seconds rather than hours.

For that stout breed of engineers who thrive on the rigours of linting a 3,000-line YAML file, this does not necessarily mean the end of their craft. The value of the SRE becomes less tied to the manual production of configuration and more tied to validating the output. Is the generated infrastructure secure? Does it conform to policy? Is it financially sane? Will it actually behave well under failure conditions? Those are the higher-order questions that the AI SRE is too impatient to dwell upon.
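That validation work can itself start as codified checks. A minimal Python sketch of linting a generated Kubernetes Deployment - the policy rules are illustrative examples, not a real policy engine:

```python
def lint_deployment(manifest: dict) -> list[str]:
    """Flag a few example policy violations in a generated Kubernetes
    Deployment manifest: unpinned image tags and missing resource limits."""
    findings = []
    containers = (manifest.get("spec", {})
                          .get("template", {})
                          .get("spec", {})
                          .get("containers", []))
    for container in containers:
        image = container.get("image", "")
        if ":" not in image or image.endswith(":latest"):
            findings.append(f"{container['name']}: image tag must be pinned")
        if "limits" not in container.get("resources", {}):
            findings.append(f"{container['name']}: missing resource limits")
    return findings
```

In practice teams reach for dedicated policy tooling, but the shape is the same: the human encodes the judgment once, and every generated manifest is checked against it.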

Expertise, therefore, is not automated out of existence. Instead, it shifts upward. The syntax burden declines, but architectural judgment becomes even more important. An engineer no longer gains leverage simply by being able to write a Kubernetes manifest from scratch. The new leverage comes from knowing whether the generated solution should be trusted, how it needs to be adapted, and where the hidden operational risks may lie.

A Building Boom

Another important change is in the creation of internal tooling. SRE teams have always striven to build tools that will reduce friction and increase velocity: scripts to validate deployments, check certificate expiry, gather logs or automate handoffs between systems. The blocker, though, has often been time constraints.

Activities that support customers or directly boost the bottom line will always take precedence over mini-projects to develop internal tooling. AI changes that equation. It dramatically lowers the effort required to move from “this would be useful” to “this now exists”. A script that would previously have taken a full day to write can now be scaffolded in minutes and refined iteratively. AI cannot slow down the arrow of time, but it can rewrite the economics of time.

Putting AI to the Test

SRE is often discussed as though it were chiefly about responding to incidents. In reality, SRE teams are not battling infernos in the server room around the clock. A mature reliability practice is just as concerned with preventing surprises in the first place. That means testing: load testing, synthetic monitoring, resilience testing, dependency validation and, if you’re lucky, unleashing the monkeys of chaos engineering.

This is another area where AI can turn “nice-to-haves” into “already-haves”. Engineers can now use AI tools to generate load tests, design synthetic checks for critical paths, propose fault scenarios and suggest areas of missing test coverage.
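Even the verdict logic for such tests is the kind of thing AI can draft and a human can review. A minimal sketch of a pass/fail check for a load-test run against a p95 latency budget - the budget figure is invented for illustration:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is given in the range 0-100."""
    ranked = sorted(samples)
    rank = max(math.ceil(pct / 100 * len(ranked)), 1)
    return ranked[rank - 1]

def evaluate_run(latencies_ms: list[float], p95_budget_ms: float = 300.0) -> bool:
    """Pass/fail verdict: did the run's p95 latency stay within budget?"""
    return percentile(latencies_ms, 95) <= p95_budget_ms
```

The interesting review questions are not in the arithmetic but in the choices around it: is p95 the right percentile, is 300 ms the right budget, and is the traffic shape realistic?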

Dissolving the Language Barrier

Another fundamentally important development is the changing interface to observability itself. For years, interacting with telemetry required familiarity with query languages, dashboard structures and the internal grammar of specific tools. That expertise still matters, but natural-language interfaces are reducing the barrier to entry. MCP servers have now been integrated into every major observability platform: this means that telemetry is becoming accessible not only to expert humans, but to AI agents acting on behalf of humans.

For SREs, that raises both opportunities and questions. On the positive side, it reduces the amount of manual data retrieval required during investigations and lowers the skill barrier for less experienced engineers. But it also means that SREs will need to think more carefully about how observability systems expose data, what permissions agents should have and how much authority automated systems should be granted when reasoning over production telemetry. It seems inevitable that managing and orchestrating agents will write itself into the SRE job description.
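A first step here is often nothing more exotic than an explicit allowlist. A toy sketch of scoping which telemetry an agent may read and whether it may act at all - the roles and scopes are invented, and a real system would sit behind the platform's own access controls:

```python
# Hypothetical scopes: which signals each agent role may query,
# and whether it is permitted to take actions or only to read.
AGENT_SCOPES = {
    "triage-agent":      {"read": {"logs", "metrics", "traces"}, "act": False},
    "remediation-agent": {"read": {"metrics"},                   "act": True},
}

def authorize(agent: str, signal: str, wants_action: bool = False) -> bool:
    """Gate an agent's request against its declared scope; deny by default."""
    scope = AGENT_SCOPES.get(agent)
    if scope is None:
        return False  # unknown agents get nothing
    if wants_action and not scope["act"]:
        return False  # read-only agents cannot escalate to actions
    return signal in scope["read"]
```

The point is less the mechanism than the discipline: someone has to write these scopes down, and that someone is increasingly the SRE.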

Strategic Reliability Engineering

The accelerating advance of AI and its capability for automation naturally give rise to fears of mass professional displacement. An alternative trajectory is that the SRE role is not eliminated, but that it becomes more about governance. As more operational work becomes agentic, someone still has to decide what the agents are allowed to do, what data they can see, how their performance is evaluated, how errors are detected and what happens when confidence is low. The ability to answer these questions becomes central to the reliability engineering role.
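In practice, that governance tends to boil down to explicit policy code. A toy sketch of gating an agent-proposed remediation on its reported confidence and the blast radius of the action - the thresholds are illustrative policy choices, not recommendations:

```python
def dispatch(confidence: float, blast_radius: str) -> str:
    """Decide whether an agent-proposed remediation runs automatically,
    is escalated to a human, or is rejected outright."""
    if blast_radius == "high":
        return "escalate"      # humans own anything high-impact, regardless
    if confidence >= 0.9:
        return "auto-apply"    # high confidence, low impact: let it run
    if confidence >= 0.6:
        return "escalate"      # plausible but unproven: ask a human
    return "reject"            # low confidence proposals are discarded
```

Choosing those thresholds, auditing the outcomes and tightening the policy over time is precisely the supervisory work the article describes.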

Seen in that light, AI may actually make the strategic dimension of SRE more visible. Writing scripts and handling alerts have always been part and parcel of the role, but its essence has always involved judgment under uncertainty: making trade-offs between speed and safety, deciding where automation is appropriate and building systems that are robust. AI does not remove those responsibilities; it places them at a premium.

The Future SRE

The future SRE is therefore unlikely to be reduced to the role of a passive and marginalised operator of AI tools. More likely, they will be an active designer of AI-supported operational systems: reviewing generated infrastructure, expanding reliability testing, building local tooling, curating context for agents and deciding where automation can safely go. Less typist, less dashboard navigator, less runbook executor. More architect, reviewer and governor of reliability in an increasingly automated world.
