Monitoring like an SRE: The Mindset

Today’s organizations increasingly depend on digital systems and services. Any disruption or downtime can have a significant impact on revenue, user experience and brand reputation. To mitigate such risks, Site Reliability Engineering (SRE) has emerged as a vital discipline that focuses on building and maintaining highly reliable and scalable systems. Effective monitoring stands as a fundamental pillar of SRE.

In this blog post, I want to share the mindset needed to incorporate SRE into an organization’s culture and principles. Keep in mind that this post reflects my personal opinion and view on SRE. I will cover a select set of key principles of SRE monitoring: embracing risk, recognizing reliability as a business feature, removing the wedge between development and operations teams, and understanding the cost of paging humans.

Embrace Risk

Embracing risk is a fundamental mindset shift in the SRE approach to monitoring. Some organizations strive for 100% reliable services. Not only is complete reliability unattainable, it is also something most companies should not strive for. Increasing reliability past a certain point can even hurt an organization, because reliability comes at a cost. That cost is not only financial: it also affects the speed at which new features can be delivered. New features require changes, and changes inevitably introduce unreliability. When you want to deliver features at a certain pace, keep in mind that this affects the amount of risk you need to accept.

Striving for 100% reliability leaves developers no room to ship new features. Each team or organization needs to find the sweet spot between reliability and feature delivery. Some organizations benefit from faster feature delivery, while others need the additional reliability. Ultimately, the key question is where your users derive the greatest benefit.

Reliability is a business feature

Traditionally, developers (Dev) and operations (Ops) have different interests. Developers focus on shipping as many features as possible, whereas operations strives for a more reliable service. Shipping a feature means changing the service, and at some point changes result in unreliability. Because Dev and Ops have fundamentally different interests, they will collide. This is where a product owner comes in. In the end, it is the product owner who decides the priority of different types of work.

For operations to advocate for reliability work, they need to convince the product owner that reliability is a business feature. And it truly is! Most users ask for new features and improvements, which might suggest that they find those more important. But frankly, users simply expect a reliable service (whatever that means). When your service is unavailable, they will ultimately find another service that satisfies their reliability needs. This means reliability is a business feature: your users demand it. So make sure you also manage reliability as a business feature.

Remove wedge between Dev and Ops

As stated in the previous section, there is a traditional wedge between Dev and Ops. We now know we should treat reliability as a business feature and therefore manage it the same way as other business features. But how do you manage this? How do you give reliability work the appropriate priority when the only measurement you may have right now is complaining customers?

It sounds like we need to find a balance between reliability work and other work. This is where SRE comes into the picture! Within SRE, monitoring is built around Service Level Objectives (SLOs). With an SLO you can measure whether your service is delivering the required reliability.

Let’s look at an example. We are managing an event ticketing system that has a 99% reliability target. This means we have a 1% margin for potential unreliability. This margin is called the “error budget”. The error budget represents the amount of unreliability a service can still tolerate, which leaves room to focus on whatever the organization finds important. As we have a reliability target of 99%, we have a 1% error budget. There are many ways and practices to use error budgets. In this blog post I’m giving a simple example of how error budgets can be used.
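To make the arithmetic concrete, here is a minimal Python sketch of the ticketing example. It assumes the 99% target is measured over requests (good requests divided by total requests), which is only one of several ways to define an SLO. The function names and the request counts are hypothetical and chosen purely for illustration.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Number of requests that may fail in the window without breaking the SLO."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    return round((1 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget that is still unspent (negative when overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # Too little traffic to allow even one failure in this window.
        return 1.0 if failed_requests == 0 else 0.0
    return (budget - failed_requests) / budget


# Hypothetical window: 100,000 ticket requests, 250 of which failed.
print(error_budget(0.99, 100_000))           # allowed failures in the window
print(budget_remaining(0.99, 100_000, 250))  # fraction of the budget left
```

With these made-up numbers, a 99% target over 100,000 requests allows 1,000 failures, so 250 failures leave three quarters of the budget unspent.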

As long as there is error budget available, developers and the product owner are free to determine what work they want to deliver. But when the error budget is fully consumed, reliability work takes priority over other work, and developers should focus on improving the reliability of their service. In my opinion this is a very good way to manage reliability as a business feature, because it takes the discussion out of prioritizing work. This results in less friction and better alignment between Dev and Ops.

Managing work like this also raises awareness of reliability among developers. If they release a very unreliable service, their error budget drains very quickly. This means they cannot focus on delivering new features (which they like) but need to work on reliability improvements. So it is in their best interest to release a solid, reliable service. Doing so has less impact on the error budget, which ultimately results in the ability to release faster. This way of working is, in my opinion, the ultimate success of combining Dev and Ops into DevOps. By combining Dev and Ops you create collaboration between these two parties, which makes them more effective than when they work separately.
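The rule described here can be written down as a very small policy function. This is only a sketch of the simple policy from this post. Real error budget policies are usually agreed between teams and can be more nuanced. The `Focus` enum and `next_focus` function are hypothetical names.

```python
from enum import Enum


class Focus(Enum):
    FREE_CHOICE = "developers and product owner decide what to deliver"
    RELIABILITY = "reliability work takes priority"


def next_focus(budget_remaining_fraction: float) -> Focus:
    """Pick the focus of upcoming work from the unspent share of the error budget."""
    if budget_remaining_fraction <= 0:
        # Budget fully consumed (or overspent): stop debating, fix reliability.
        return Focus.RELIABILITY
    return Focus.FREE_CHOICE


print(next_focus(0.75).value)  # budget left: free choice
print(next_focus(0.0).value)   # budget empty: reliability first
```

The point of encoding the rule this plainly is that the outcome depends only on the measured budget, not on who argues best in the planning meeting.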

Paging humans comes at a cost

At most companies where I worked, I was required to participate in on-call duties. Being on-call means you need to respond to incidents, which can happen at any time of the day (including at night). One of my biggest frustrations is responding to a false-positive alert in the middle of the night. This was mostly related to systems using event-based monitoring. This type of monitoring requires a lot of reviewing and tuning before it reaches an acceptable level.

SRE offers an alternative to this approach: using SLOs on the so-called Golden Signals. When this approach is used to page on-call engineers, they are only notified when incidents are affecting the user experience. This should result in fewer false positives. Unique, unforeseen events are also still caught, as long as they affect the user experience. This is different from event-based monitoring, where you need to predict a failure in advance or you will not have an alert for it.

To clarify, I’m not saying event-based monitoring is completely useless and should never be used. When troubleshooting your service or monitoring non-critical incidents, it can add value. In my opinion, when it comes to paging on-call engineers, SLO-based monitoring is more effective and has a lower negative impact on the engineers.
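As a rough illustration of paging on user impact instead of on individual events, the Python sketch below pages only when the user-facing error ratio eats the error budget much faster than the SLO allows. This ratio is often called a burn rate. The request-based measurement, the function names and the threshold of 10 are assumptions made for this example, not a recommendation.

```python
def burn_rate(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """How many times faster than allowed the error budget is being consumed."""
    if total_requests == 0:
        return 0.0  # no traffic in the window, so no user impact was measured
    observed_error_ratio = failed_requests / total_requests
    allowed_error_ratio = 1 - slo_target
    return observed_error_ratio / allowed_error_ratio


def should_page(slo_target: float, total_requests: int, failed_requests: int,
                threshold: float = 10.0) -> bool:
    """Page a human only when users are clearly affected (hypothetical threshold)."""
    return burn_rate(slo_target, total_requests, failed_requests) >= threshold


# Hypothetical one-hour windows for a service with a 99% target.
print(should_page(0.99, 10_000, 50))     # 0.5% errors: within budget, let people sleep
print(should_page(0.99, 10_000, 1_500))  # 15% errors: users are affected, page
```

Note that the check does not care what caused the errors. An unforeseen failure still pages once users are affected, while an event that does not hurt users never wakes anyone up.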

Conclusion

Implementing monitoring as a site reliability engineer requires more than implementing monitoring software. Frankly, most of the work lies in changing the organization’s culture around monitoring and managing teams. Recognizing this as an essential requirement is the first step to successfully implementing SRE within your organization.

Monitoring like an SRE starts with embracing risk. No service can or should be 100% reliable. Reliability comes at a cost, both financially and in terms of feature development speed, so you need to balance this correctly. Find the sweet spot between reliability and feature development. To achieve this, it is essential to treat reliability work as a business feature. Users do not benefit from new features if they are unable to use them.

By introducing error budgets, we can remove the wedge between Dev and Ops. Error budgets make it easier for product owners to determine priority between feature and reliability work. No error budget means focus on reliability. Existing error budget means freedom in priorities.

Monitoring production systems also means that engineers need to be on-call in case of incidents. Be aware that paging humans is costly, certainly when it happens during personal time or even during sleep hours. Make sure every page is worth it by monitoring the “Golden Signals”. Golden Signals represent user experience. When a service breaks but does not affect user experience, you do not want to page an engineer during personal time. It is fine to wait until the next working day to fix it.

I’m always keen to get in touch with other people interested in the topics I write about. Feel free to reach out to me on LinkedIn for questions or in-depth conversations. If you would like to see me post more content, feel free to reach out as well (suggestions are welcome).
