What is SRE (Site Reliability Engineering)? SRE and Tasks of an SRE explained
SRE is becoming a very popular term in DevOps and generally the software development world. Probably some of you have already heard about it, but are not sure what it is exactly.
So this article gives a detailed look at what SRE or Site Reliability Engineering really is with the goal of clarifying all questions and doubts around it 🙌
This is the written version of my new YouTube video ✍️ 🙂
First of all, I recommend reading my recent DevOps article, where I explain DevOps in detail. This will definitely make it easier to understand the topics in this article 😊
How SRE emerged, and why was there even a need for SRE? 🤔
In a traditional software development process, we have developers and operations as two separate teams. Each of them with its own goal:
Developers: Want to push out application changes as fast as possible to the end users
Operations: Want to keep the application stable
So Operations is very careful about each and every change. This causes a conflict of interest between the two roles and forces them to work against each other instead of collaborating.
And DevOps was actually introduced to help fix exactly this issue. However, while DevOps made the release process faster, these releases were not as stable as DevOps principles would ideally want. Plus, in the DevOps team, there was no dedicated role or person that actually focused full-time on keeping systems reliable.
And that's why the need for SRE, and for the Site Reliability Engineer as a separate role, emerged. 💡
So what is SRE?
What is SRE? - Official Definition
SRE was conceptualized at Google by Ben Treynor, a software engineer, who was given the task of running a small team of other software engineers to do what used to be Operations work.
And according to his own definition:
SRE is what happens when you treat operations as a software problem and staff it with a bunch of software engineers. 👩💻🧑🏻💻
But this definition is, of course, too vague and high level to really understand how it's implemented in practice. 🧐 So, let's break it down and analyze each part of this definition step by step. 👍
What is system reliability, and why is it important? 🧐
First of all, what is a system that we want to keep reliable? Or what does a system even mean in this definition?
What is a System?
The system is the servers, the infrastructure, and the platform. In other words, the whole deployment environment where the application runs.
What is Reliability?
Now what exactly is reliability, and why is it so important to keep our systems reliable? 🤔
Unreliable Services ⛔️ Imagine you work with emails daily, and your email provider is down once a week, or your online banking application is regularly down and inaccessible. This would be an unreliable service. 🤨 You can't rely on it, because it's not available when you need it.
Reliable Services ❇️ On the other hand, many popular services, like Gmail, Twitter, YouTube, etc., are rarely inaccessible. 👏 So, these systems are pretty reliable.
But the thing is, users usually do not notice the reliability of the system. 🙄 It only becomes visible when something goes wrong, and the services are down. Do you remember the recent outage of Facebook, Instagram, and other related services that made huge news?
What about AWS server outages that also affected other applications that were hosted on AWS?
Of course, everybody noticed and knew about it when it happened. So the bigger and more widely used a product or service is, the more impact an outage will have 👀, which means its team should worry about its reliability.
Why is reliability important?
What are the effects or impacts of outages or system unreliability? For most of the services, this is a lot of unhappy customers and lots of lost revenue:
For example, imagine an online shop is down on a holiday, or an online bank stops working because of traffic overload. That means a lot of lost business, because people cannot order from the shop or use the bank.
How to make systems reliable?
Okay, so we understood that systems need to be reliable, but how do we make a system reliable? Or, asked differently: what makes a system unreliable, and what affects its reliability? 🤔
The main cause of a system becoming unreliable is making changes to it ⚡️:
For example, changing something:
⚡️ in the infrastructure
⚡️ in the platform where the application is running
⚡️ in the application itself and its services
⚡️ and so on
These changes may cause disruption and break something in the whole setup.
Bad Solution 👎🏼
As a solution, we can say no changes are allowed or limit the number of changes 🙅🏻♂️ to keep systems reliable, but that really limits the business. We want to make changes and improvements to our application to make it better and increase its business value and stay competitive etc. 💪
Because if our competitor is bringing out new features, we need to keep up, and that's the main focus of software developers, to make those changes and improvements.
But on the other hand, if the application is not accessible, that's also bad for business because you may have awesome features, but nobody can use them. And it's Operations' job to take care of that and make sure the application is accessible.
This means developers want to release fast, and operations want to keep stability. So traditionally, Devs would make a change, and Ops would analyze with hundreds of checklists and mechanisms to make sure the change would not affect the system:
This whole analysis and evaluation slows down the release process, and that's been the major challenge of the traditional way of software development. And that's exactly what DevOps and SRE try to solve. 🚀
So what's the specific solution of SRE here? 💡
Well, SRE tries to automate the process of analyzing and evaluating the effects a change will have on our system's reliability. Automation means no checklists and no discussions within the operations team about whether to release the change or what threats and risks are involved.
Instead, the evaluation is based on automated processes, and this makes releasing changes fast and safe at the same time. 🚀
SRE in Practice: SLA & Error Budget
Now how is that automated evaluation done? 🤩
The way it works is by using what are called SLAs (service level agreements). So what is an SLA? An SLA basically states how reliable a system is going to be for its end users: how often it's going to be up, and how often it's going to be down. It's expressed as a percentage. A service that works all the time and is never down has a 100% SLA. 💯
No need for 100% Reliability
Now you may be thinking: "Of course any service should be 100% reliable, right? Isn't that a natural goal?" 👀 Well, not really. First of all, it's very hard to achieve 100% reliability, and there are very few services in the world that actually need a 100% SLA.
So, for example, if your internet provider or the customer's device itself is not 100% reliable, which is the case with most laptops, mobile phones, and so on, then your service does not need to be either. From the user's point of view, it can be at most as available as the underlying network or device. 💡
And in those cases, a reliability of three nines or four or five nines may be enough so that users don't even notice that there is an issue. 👌
The closer you try to get to 100%, the more effort it takes, and as you see, that effort is unnecessary because most applications don't need a 100% SLA.
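To make this concrete, here is a small sketch of why extra nines stop paying off. It assumes the simplest possible model, where the user only gets a working experience if the network AND the service both work, and where their failures are independent, so the availabilities simply multiply. The function name and the 99% network figure are made up for illustration.

```python
def end_to_end_availability(*component_availabilities: float) -> float:
    """Multiply the availabilities of components that must all work at once.

    Assumes independent failures, which is a simplification.
    """
    result = 1.0
    for availability in component_availabilities:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        result *= availability
    return result


# Hypothetical number: the user's connection works 99% of the time.
network = 0.99

for service in (0.999, 0.9999, 0.99999):
    perceived = end_to_end_availability(network, service)
    print(f"service at {service:.3%} -> user experiences about {perceived:.3%}")
```

Whatever the service does, the result stays below the 99% of the network in this model, so going from three nines to five nines changes very little for this user.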
Examples of SLA
For example, you can define a service level agreement about the accessibility of your application:
So, for example, a 99% SLA for application accessibility would mean that the system can be down for a maximum of 3.65 days a year. An SLA with five nines, or 99.999%, allows an application to be inaccessible for a maximum of about 5 minutes a year, so the rest of the time, it should work.
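If you want to check these numbers yourself, the arithmetic is just "the part of the year the SLA does not cover". Here is a minimal sketch; the function name is my own, and it assumes a 365-day year by default.

```python
from datetime import timedelta


def allowed_downtime(sla_percent: float,
                     period: timedelta = timedelta(days=365)) -> timedelta:
    """Return the maximum downtime an availability SLA allows in a period."""
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("SLA must be a percentage between 0 and 100")
    # The allowed downtime is the share of the period the SLA does not cover.
    return period * (1 - sla_percent / 100)


for sla in (99.0, 99.9, 99.99, 99.999):
    print(f"{sla}% SLA -> up to {allowed_downtime(sla)} of downtime per year")
```

For 99% this comes to 3.65 days, and for 99.999% to roughly 5.26 minutes, which is where "5 minutes a year" comes from.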
You can actually define multiple such agreements or SLAs, not just the accessibility or availability of the system.
Other SLA examples:
application response time
For example, if you have an application that serves a million requests a week with a 99% SLA, you define that at least 990,000 of those requests will be successful.
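The same idea works for request-based SLAs. The sketch below is one possible way to calculate it, with function names I made up. It takes the SLA as a string and uses Decimal so that values like 99.9 don't pick up floating-point rounding errors.

```python
from decimal import Decimal, ROUND_CEILING


def required_successes(total_requests: int, sla_percent: str) -> int:
    """Smallest number of successful requests that still satisfies the SLA."""
    exact = Decimal(total_requests) * Decimal(sla_percent) / Decimal(100)
    # Round up: a fraction of a request cannot succeed.
    return int(exact.to_integral_value(rounding=ROUND_CEILING))


def allowed_failures(total_requests: int, sla_percent: str) -> int:
    """Number of requests that may fail without breaking the SLA."""
    return total_requests - required_successes(total_requests, sla_percent)


def meets_sla(successful: int, total_requests: int, sla_percent: str) -> bool:
    """Check measured traffic against the SLA."""
    return successful >= required_successes(total_requests, sla_percent)


print(required_successes(1_000_000, "99"))  # 990000
print(allowed_failures(1_000_000, "99"))    # 10000
```

So with a million requests a week and a 99% SLA, up to 10,000 requests may fail before the SLA is broken.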
Who defines these SLAs? 💁🏻♀️
Okay, now you may be wondering who defines these SLAs. So basically, who decides how many requests must be successful out of these million requests or how much downtime is allowed for the application? Who makes this kind of decision?
As this decision affects the end users and their user experience, naturally, business people are also involved in this process. So business people, together with the engineers like SRE and DevOps Engineers, decide together what service level agreements they want to define for their application:
Based on the industry benchmarks, competition, user feedback, and so on, business people will define the desired SLAs on a higher level. The engineers will then define them on a technical level and also make sure to integrate them into their DevOps and SRE processes.
As I mentioned, SLA for availability defines how long the service should be available or how much downtime is allowed for that service. That allowed downtime is also what's called an "error budget":
In SRE, a team can "spend" that error budget on changes that put reliability at risk. So basically, that error budget says that we're allowed to have that much downtime in our system without losing business, making customers unhappy, etc.
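Here is a rough sketch of how an error budget could be tracked in code. The class and its field names are hypothetical, and the 30-day period and 30 minutes of downtime are example numbers, not a recommendation.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ErrorBudget:
    """Allowed downtime for one SLA period, and how much of it is used."""

    sla_percent: float
    period: timedelta
    downtime_so_far: timedelta = timedelta(0)

    def __post_init__(self) -> None:
        if not 0.0 <= self.sla_percent < 100.0:
            raise ValueError("SLA must be at least 0% and below 100%")

    @property
    def total(self) -> timedelta:
        # The budget is the share of the period the SLA does not cover.
        return self.period * (1 - self.sla_percent / 100)

    @property
    def remaining(self) -> timedelta:
        # A negative value means the SLA is already broken for this period.
        return self.total - self.downtime_so_far

    @property
    def fraction_spent(self) -> float:
        return self.downtime_so_far / self.total


# Hypothetical example: 99.9% SLA over 30 days, 30 minutes of downtime so far.
budget = ErrorBudget(99.9, timedelta(days=30), timedelta(minutes=30))
print(f"total budget: {budget.total}")
print(f"remaining:    {budget.remaining}")
print(f"spent:        {budget.fraction_spent:.0%}")
```

In this example the budget is 43.2 minutes for the month, so after 30 minutes of downtime there are about 13 minutes left to "spend".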
Way to regulate the release speed ⚖️
So the SLA is like a dial; you can turn it up or down based on how reliable your system needs to be. And, of course, the closer to 100%, the more effort you need to put in to guarantee the reliability of your systems.
Now once the SLA is defined, the system performance can be measured against this number.
⛔️ If systems are more unreliable than the SLA allows: Then more resources from the SRE team will be put into making the systems reliable, because we have exceeded the allowed amount of downtime. And until the system is back within the defined SLA, fewer changes will be allowed:
✅ On the other hand, if the system is performing well within the defined SLA: Developers can release more changes.
So it's a simple way to regulate the release speed of developers. ⚖️ If we turn up the SLA, releases will slow down and vice versa:
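As a sketch, this kind of rule could be written as a tiny policy function. Everything here is an assumption for illustration: the names, the middle "cautious" level, and the 75% threshold are my own, and a real team would agree on its own rules.

```python
from enum import Enum


class ReleasePolicy(Enum):
    NORMAL = "release as usual"
    CAUTIOUS = "release only low-risk changes"
    FREEZE = "stop feature releases and focus on reliability"


def release_policy(budget_spent_fraction: float,
                   caution_threshold: float = 0.75) -> ReleasePolicy:
    """Map the share of the error budget already spent to a release policy."""
    if budget_spent_fraction < 0:
        raise ValueError("spent fraction cannot be negative")
    if budget_spent_fraction >= 1.0:
        # The allowed downtime is used up: the SLA is at risk or broken.
        return ReleasePolicy.FREEZE
    if budget_spent_fraction >= caution_threshold:
        return ReleasePolicy.CAUTIOUS
    return ReleasePolicy.NORMAL


for spent in (0.2, 0.8, 1.1):
    print(f"{spent:.0%} of budget spent -> {release_policy(spent).value}")
```

The nice part is that nobody has to argue about whether a release is allowed; the decision follows from the numbers.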
SRE Tasks and Responsibilities 👩💻
1) Automation - Create automated processes for operational aspects 👏
The SRE or the Site Reliability Engineer is the one who creates automated processes to calculate and evaluate whether the service is within the SLA or not.
So now the policy for launching is not the endless checklist that operations use to decide whether to launch or not; instead, an SRE helps design processes that can automatically evaluate an SLA. ✅
2) Configure Monitoring and Logging (Observability for System Performance) 🔎📈
Now, of course, to measure the performance of our systems and whether services are within the SLA, we need proper monitoring of our systems.
So another big part of SRE tasks and responsibilities is to configure proper monitoring and logging of the systems to get visibility of what's going on inside.
3) Configure Monitoring and Logging (Observability for Detecting Issues) 🔎📉
Now we said that for most applications, SLA is not 100%, which means we accept that it won't work 100% of the time. So at some point, we will have an outage 😱
Now the question is, what do we do when an outage happens? Or how do we prepare for it? And that's where another big part of SRE tasks and responsibilities comes in.
The first one is monitoring and alerting, which I already mentioned. In addition to giving you visibility to measure your system's performance, it more importantly helps you detect indications of issues before they happen, or very early when they happen, and then alerts the teams about it.
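To give you a feeling for what "detect issues early" can look like, here is a very simplified sketch of an error-rate check over the most recent requests. In real projects you would normally use a monitoring tool for this; the class name, window size, and threshold below are all made up for illustration.

```python
from collections import deque


class ErrorRateMonitor:
    """Fire an alert when too many of the most recent requests have failed."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05) -> None:
        # Keep only the outcomes of the last `window_size` requests.
        self._results: deque = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, success: bool) -> bool:
        """Record one request outcome; return True if an alert should fire."""
        self._results.append(success)
        if len(self._results) < self._results.maxlen:
            return False  # not enough data yet to judge
        failures = self._results.count(False)
        return failures / len(self._results) > self.threshold


monitor = ErrorRateMonitor(window_size=10, threshold=0.2)
outcomes = [True] * 10 + [False] * 3
for number, ok in enumerate(outcomes, start=1):
    if monitor.record(ok):
        print(f"alert after request {number}: error rate above 20%")
```

Here the alert fires on the third failed request, when 3 of the last 10 requests have failed, long before the service is completely down.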
4) Develop custom services to achieve this 👩💻
Now another important part of this whole configuration is what happens when an alert is sent. Ideally, the right person in the team gets the message, and the alert message contains all the information needed to identify and fix the issue fast.
Instead of:
❌ "something is wrong in the cluster"
a more detailed message like:
✅ "service a in cluster b is throwing 500 errors"
So now you know exactly:
👉 which service is having a problem
👉 in which cluster
👉 and what exactly the problem is
So the more detailed the alert message, the better!
In many cases, SREs will develop their own custom services to achieve this proper monitoring, alerting, and logging configuration for their systems.
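As a tiny illustration of such a custom piece, here is a sketch of an alert type that simply cannot be created without the three details we want in every message. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """An alert that always says which service, which cluster, and what problem."""

    service: str
    cluster: str
    problem: str

    def __post_init__(self) -> None:
        # Reject vague alerts: every field must carry real information.
        for name in ("service", "cluster", "problem"):
            if not getattr(self, name).strip():
                raise ValueError(f"alert is missing '{name}'")

    def message(self) -> str:
        return f"service {self.service} in cluster {self.cluster} is {self.problem}"


alert = Alert(service="a", cluster="b", problem="throwing 500 errors")
print(alert.message())  # service a in cluster b is throwing 500 errors
```

This way, a vague "something is wrong" alert fails already when it is created, instead of waking somebody up at night without any useful details.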
5) Do On-Call Support ☎️
Another thing that SREs do is on-call support.
Basically, when things go wrong and users need real-time support, somebody is responsible for doing that, and that is the on-call support team. And putting SREs on this support team has several benefits:
It helps them really see and understand:
what issues to expect
how the support team deals with those issues
what improvements can make the support process more efficient
whether alert messages and logs have enough information to quickly identify the issue and its cause
whether issues were identified too late
So overall, the main goal of SRE is to make sure the scope of the outage is small when it happens, which means
👍 the outage doesn't last long and is fixed very quickly
👍 and fewer people and fewer services are affected by that outage
6) Post-Incident Reviews 🧐
Now fixing an issue or an outage is not the end of the SRE team's work. We want to use that outage as a chance for lessons learned and, of course, avoid this happening in the future.
So a principle of SRE is to do what's called a post-mortem, which is Latin for "after death." So in SRE terms, an "after issue" or "after outage" analysis.
This includes a thorough analysis, meaning taking time to really go deeper and understand the issues:
But of course, during this analysis, it's super important to stay blameless, which is one of the major points of this post-mortem analysis, in order to encourage people to admit and learn from their and other people's mistakes ✅
And finally, it's important to document everything for future reference!
Who is doing SRE? SRE Role 👤
Now you may be thinking: "How much more should software developers learn? They already have to know all these software development technologies; now they also have to take over the operations tasks and learn all these operations tools?" 🤯
SRE as its own Role
That's why we have SRE as its own role. So a dedicated person whose full-time responsibility is to work on keeping systems reliable:
So in many projects, along with developers, you have an SRE team, which is the new Operations team. Basically, it's a team that does the operations work, and both teams work toward the same goal of keeping the systems within the defined SLAs.
So the SRE team maintains and takes care of the automated delivery operations and all sorts of automation that will help developers release their changes safely and fast. 🚀
Software Developers as SREs
However, it's also common to have one team of SREs and Software Developers, where SREs also do the software development job. And this means that Site Reliability Engineers must know software development as well, unlike DevOps engineers.
But in both cases, as you see, we started off with a traditional way of software development with separate dev and ops with opposing incentives, and with SRE we gave devs and ops the same incentives and put them on the same side.
SRE vs DevOps
Finally, one of the most discussed questions in this area is: What is the difference between SRE and DevOps engineering or generally between these two concepts?
If you have already watched my "What is DevOps" video, you know that there are two definitions of DevOps:
the original definition, which is more high-level and broader and doesn't specify how exactly DevOps should be implemented
and a more practical one, which evolved over time with its own DevOps engineer role.
So when we compare DevOps with SRE, it's important to know which definition of DevOps we're using for this comparison.
1. First broader definition of DevOps vs. SRE
DevOps is a more high-level concept that defines what needs to be done to achieve the automated, streamlined release process. At the same time, SRE is more specific about how to exactly implement this process and how to implement DevOps principles.
So many people would say that SRE is a specific implementation of the DevOps concepts:
2. Practical DevOps vs. SRE
But as we saw, DevOps itself also became more practical with its own role and specific technologies and ways to implement it. So what's the comparison here? 🤔
Well, in many companies, this practical DevOps implementation became more narrowly focused. It concentrated on the speed of delivery for application changes. And even though it's part of the DevOps principles to release not only fast but also with quality, many DevOps teams in practice seemed to optimize more for speed than for reliability.
So as a great complementary part of DevOps, SRE emerged with the same principles and goals in mind, which is to release quality code fast, but as the name suggests, more focused on reliability and keeping systems stable while allowing for fast changes:
So SRE is its own role with its own set of tools for making systems reliable. So these two were kind of parallel developments and are now often seen as two sides of the same coin, and it's not uncommon for teams to have both a DevOps engineer and SRE helping implement the DevOps principles.
Now I hope you learned a lot from this article and that I was able to answer all of your questions about SRE. 🙂