On-Call Management for Technology Teams

Road to On-Call: On-Call and Alarm Management for Technology Teams

A practical guide to alarm handling and on-call management, from tuning noisy alerts to building sustainable escalation policies and Panama shift rotations for healthier operations.

Introduction

Assume that you monitor the platform or service you are responsible for with the motto of 360-degree domain dominance. To do that, you presumably collect platform metrics, logs, and application performance metrics through various tools and derive alarms from them. Now let's move on to managing the incoming alarms.


Alarm Management

The communication channels through which you send alarms can often be very noisy. If you generate too many unnecessary alarms, you will eventually develop alarm blindness and are likely to miss a real problem hidden among them. First, you need to tune the alarms with correct threshold values. This takes discipline: if you are not actually being called in some way (through NOC services or on-call tools), tuning will most likely not be your priority. For this reason, my first piece of advice is to set up a way of being called when alarms arrive, and then to tune each alarm by asking questions such as: Should this alarm really fire? Is it a false positive? Is the threshold correct? My second piece of advice, if you work with a process like Scrum or Kanban, is to create a backlog and split this work into smaller items.
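To decide which alarms go into that tuning backlog first, it can help to rank them by how often they fire without anyone needing to act. The following is a minimal sketch, assuming you can export reviewed alarm events in some form; the `AlarmEvent` record, its `actionable` flag, and the alarm names are hypothetical and not tied to any particular monitoring tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmEvent:
    """One fired alarm after review (hypothetical record shape)."""
    name: str         # name of the alarm rule
    actionable: bool  # True if someone actually had to do something


def noise_report(events):
    """Return (name, total_firings, false_positive_ratio), noisiest first."""
    totals = Counter(e.name for e in events)
    noise = Counter(e.name for e in events if not e.actionable)
    report = [(name, totals[name], noise[name] / totals[name]) for name in totals]
    # Rank by the number of non-actionable firings, then by the ratio.
    report.sort(key=lambda row: (noise[row[0]], row[2]), reverse=True)
    return report


# Made-up sample data, for illustration only.
sample = (
    [AlarmEvent("cpu_above_80", False)] * 5
    + [AlarmEvent("disk_usage_high", False)] * 3
    + [AlarmEvent("disk_usage_high", True)]
    + [AlarmEvent("api_5xx_rate", True)] * 2
)

for name, total, ratio in noise_report(sample):
    print(f"{name}: fired {total} times, {ratio:.0%} not actionable")
```

Alarms at the top of the report are the first candidates for a threshold change or removal; alarms with a ratio near zero are probably doing their job.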


On-Call Management

Once the alarm management process is under control, the next step is to create the right escalation policies and devise an on-call structure. First, determine the escalation levels. What does Level 1 do? How far can it go in resolving an issue? Ask the same for Level 2, Level 3, and Level 4. You also need rules for how long an alarm may remain unanswered, and in which status, before it is escalated to the next level. With the escalation rules in place, you can create the on-call roster. There are various scheduling schemes for this; the most commonly used is the Panama shift (see https://en.wikipedia.org/wiki/Shift_plan). Generally, we want an on-call engineer to spend at most 25% of their time on this task. With two shifts per day, you can reach that figure with an on-call team of eight people. The important thing is that the person on duty is reachable and has their computer with them. I don't think it's necessary to sit at the computer 24/7: if a problem occurs, the on-call tooling notifies you with a phone call. That way you can also make the best use of your 25%.
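The escalation rules described above can be written down as data plus one small function. This is only a sketch under stated assumptions: the four levels come from the text, but the timeout values, the `EscalationStep` type, and the `current_level` helper are hypothetical placeholders, not the configuration format of any on-call product.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class EscalationStep:
    level: int              # 1 = first responder
    ack_timeout: timedelta  # how long this level may leave the alarm unacknowledged


# Hypothetical policy: the timeouts are placeholders to adapt to your service.
POLICY = [
    EscalationStep(1, timedelta(minutes=10)),
    EscalationStep(2, timedelta(minutes=15)),
    EscalationStep(3, timedelta(minutes=30)),
    EscalationStep(4, timedelta(minutes=60)),
]


def current_level(unacked_for, policy=POLICY):
    """Return the level that should be paged after `unacked_for` without an ack."""
    elapsed = timedelta(0)
    for step in policy:
        elapsed += step.ack_timeout
        if unacked_for < elapsed:
            return step.level
    # Every timeout has expired: stay at the last level.
    return policy[-1].level


print(current_level(timedelta(minutes=5)))   # still within Level 1's window
print(current_level(timedelta(minutes=12)))  # Level 1 timed out, Level 2 is paged
```

Keeping the policy as plain data makes it easy to review the timeouts with the team and to keep separate policies per alarm severity if you need them.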


[Figure: Parny Panama Shift]
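A Panama rotation can be generated with a few lines of code. The sketch below assumes the common 2-2-3 variant (a 14-day cycle, two 12-hour shifts per day, four teams) and splits the eight engineers into four teams of two. The team split, the engineer names, and the fixed day/night assignment are assumptions made for illustration; real schedules often also swap day and night shifts between cycles.

```python
from datetime import date, timedelta

# 2-2-3 pattern over a 14-day cycle: 1 = on call, 0 = off.
PATTERN = [1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0]

# Hypothetical teams of two engineers each (eight people in total).
TEAMS = {
    "A": ("alice", "bob"),
    "B": ("carol", "dave"),
    "C": ("erin", "frank"),
    "D": ("grace", "heidi"),
}


def teams_on_call(day_index):
    """Return (day_team, night_team) for a zero-based day index."""
    if PATTERN[day_index % 14]:
        return "A", "B"  # A covers the day shift, B the night shift
    return "C", "D"      # C and D cover the complementary days


def schedule(start, days):
    """Yield (date, day_team, night_team) for `days` consecutive days."""
    for i in range(days):
        day_team, night_team = teams_on_call(i)
        yield start + timedelta(days=i), day_team, night_team


def on_call_share(team, days=14):
    """Fraction of all hours in the cycle during which `team` is on call."""
    shifts = sum(team in teams_on_call(i) for i in range(days))
    return shifts * 12 / (days * 24)


for day, day_team, night_team in schedule(date(2024, 1, 1), 14):
    print(day.isoformat(), "day:", day_team, "night:", night_team)
```

In this layout every team covers 7 of the 14 days with one 12-hour shift each, so `on_call_share` is 0.25 for all four teams, which matches the 25% target for each of the eight engineers.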


Conclusion

As I said at the beginning of this article, on-call management is of great importance in maintaining high availability of your platforms and services. You can treat the approach described above as a guide for your operations teams and adapt it to your own model to create a sustainable and manageable structure.