Remote Incident Management — Operational Excellence

Put out the fire before you investigate why it happened

With Covid-19, a lot has changed around us. The concept of remote incident management is not new, and neither are remote or hybrid teams: teams were already split across locations, and some people were already working from home.

What has changed between then and now is the ratio of people on a call vs over the phone. This matters!

When an incident occurs, it creates an anxious, stressful environment for the team. Things start to look messy; with so many people on the call, coordination becomes really difficult, as everyone wants to help in mitigating the incident. Incident management is more of an operational challenge than a technical one.

With a calm attitude, one can resolve even the most severe incident in a short span of time, even if there is variance in the technical knowledge of the group.

At Hotstar, we have instituted a well-worn method for running our incidents, which we will review here.

Reaction

As soon as we are notified of an incident, we immediately create a Slack channel named after the ticket number (e.g. #live-1234). The first responder shares a Zoom bridge and starts a chronology and fact-collection sheet with the same name, sharing the link in the Slack channel. The Slack channel is also very helpful for sorting the chronology out later.
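
As an illustration, the sketch below shows what this first-responder setup could look like if scripted with Slack's Python SDK. The ticket number, bridge link, sheet link, and token are placeholders, and the automation itself is an assumption for the example, not a description of our actual tooling.

    # incident_kickoff.py -- hypothetical sketch of the first-responder setup.
    # Assumes a Slack bot token with permission to create channels and post messages.
    from slack_sdk import WebClient

    def kickoff(ticket_id: str, zoom_bridge: str, notes_sheet: str, token: str) -> None:
        client = WebClient(token=token)

        # Channel named after the ticket, e.g. #live-1234
        response = client.conversations_create(name=f"live-{ticket_id}")
        channel_id = response["channel"]["id"]

        # Share the Zoom bridge and the chronology / fact-collection sheet in the channel
        client.chat_postMessage(
            channel=channel_id,
            text=f"Zoom bridge: {zoom_bridge}\nChronology & facts sheet: {notes_sheet}",
        )

    # Example call (all values made up):
    # kickoff("1234", "https://zoom.us/j/0000000000",
    #         "https://docs.google.com/spreadsheets/d/abc123", token="xoxb-...")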

Incident Leader

Usually the first person on the scene starts to orchestrate the call, unless they hand it over to someone better suited to run the incident, typically a seasoned incident leader. The role of the incident leader is to curate the conversations and orchestrate the mitigation and root-cause tracks. The first order of business is to mitigate. The incident leader must also control the inevitable chaos that can occur at the start.

We mandate that the main room act as a control room, and we create breakout rooms for the multiple tracks we are pursuing to mitigate.

Mitigate first / Root cause later

It’s not important during the initial phase of an incident to figure out the details of the WHY. The most important thing is to mitigate the incident and get back in business. Mitigation might not always mean full recovery of the business to 100%; at times it means a degraded experience.

Unblock your customers, first, always.

Recovery can be the next step and root cause analysis can come after that.

ProTip: Look at everything that changed, and roll back recent deployments without much discussion; a debate will often slow down mitigation. This might sound like a “turn the power on and off” move, but we’ve found value in it.
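
To make that concrete, here is a minimal sketch of undoing the most recent deployments first, assuming the services run as Kubernetes Deployments and that the service names come from your own change log; both assumptions are for illustration only.

    # rollback_recent.py -- hypothetical sketch: undo the newest deployments first.
    # Assumes Kubernetes Deployments and a configured kubectl context.
    import subprocess

    def rollback(recent_deployments: list[str], namespace: str) -> None:
        # recent_deployments is expected newest-first, so the latest change is undone first.
        for name in recent_deployments:
            subprocess.run(
                ["kubectl", "rollout", "undo", f"deployment/{name}", "-n", namespace],
                check=True,
            )

    # Example (service names made up):
    # rollback(["payments-api", "otp-service"], namespace="prod")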

Note-taking

It’s important for someone, preferably someone other than the incident leader, to start taking notes. The recommendation is to use a collaborative document, like a Google spreadsheet, and share it among all the participants so anyone can add the details necessary.

A shared notes sheet can also be used as a whiteboard to triage and discuss things. Put all the open points of discussion in the shared sheet, use it as the reference for discussion, and keep updating the points in the sheet.

The notes during an incident do not need to be editorially perfect. Just write in a scratchpad style; formatting and editing can be done post facto. It’s important to keep making notes of anything, relevant or irrelevant. All filters should be applied post facto.

Note down all the potential theories so that everyone can read them and provide actionable input; otherwise, the entire call can become a root-causing call.

Sharing the notes helps others visualise the chronology and the facts of the situation, which often leads to faster mitigation.

Incident Change Control — Alter in Public

A high-pressure situation can lead to unwanted human errors. One method that helps in tackling this is to have more heads operate on every action at once: if a participant needs to share some facts or graphs, instead of talking through them they can quickly share their screen and show them to the other participants.

A similar approach can be taken when making a change, say a config change or scaling up a system. To ensure the right change is being made, we encourage making it over a shared screen so that the group can collectively review it and proceed faster.

It’s important to gain visibility into the impact of an incident; identifying the impact as a metric makes mitigation smoother, as one can see changes to the metric in real time.

Sometimes the impact might not be directly measurable; in that case, the group should try to find proxy metrics.

For example, say SMS delivery is running delayed. If we did not have tracking on delivery time, we would have used click rates on the IVR/Retry/Help buttons as a proxy metric: more people clicking these buttons means they are having issues receiving their OTP on time.
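
A minimal sketch of such a proxy metric, assuming a hypothetical stream of button-click events (the event shape and button names are illustrative, not our actual schema):

    # proxy_metric.py -- hypothetical sketch: IVR/Retry/Help click rate as a proxy for OTP delays.
    from collections import Counter

    def clicks_per_minute(events):
        """events: iterable of dicts like {"minute": "10:05", "button": "Retry"}."""
        proxy_buttons = {"IVR", "Retry", "Help"}
        counts = Counter(e["minute"] for e in events if e["button"] in proxy_buttons)
        return dict(sorted(counts.items()))

    # A rising trend here suggests users are struggling to receive OTPs on time.
    sample = [
        {"minute": "10:05", "button": "Retry"},
        {"minute": "10:05", "button": "Help"},
        {"minute": "10:06", "button": "IVR"},
    ]
    print(clicks_per_minute(sample))  # {'10:05': 2, '10:06': 1}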

This is not a new technique. While the group is taking notes and building a timeline of how the incident is progressing, it’s important to also start creating a retroactive timeline from the time the incident was identified. This is where audit logs from systems come in handy: put together all the minor and major changes made to the system from the last known-good state to the current state, working backwards.
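
As a sketch of stitching that backward timeline together, assuming audit-log entries shaped as timestamped change records (the format and the sample entries are made up):

    # backward_timeline.py -- hypothetical sketch: audit-log entries into a reverse timeline.
    from datetime import datetime

    def backward_timeline(audit_entries, last_known_good, detected_at):
        """Return changes between the last known-good state and detection, newest first."""
        in_window = [
            e for e in audit_entries
            if last_known_good <= e["timestamp"] <= detected_at
        ]
        return sorted(in_window, key=lambda e: e["timestamp"], reverse=True)

    # Example with made-up entries gathered from two systems' audit logs:
    entries = [
        {"timestamp": datetime(2021, 5, 1, 10, 5), "system": "deploys", "change": "payments-api v42 rolled out"},
        {"timestamp": datetime(2021, 5, 1, 10, 20), "system": "config", "change": "SMS provider weights updated"},
    ]
    for e in backward_timeline(entries, datetime(2021, 5, 1, 9, 0), datetime(2021, 5, 1, 11, 0)):
        print(e["timestamp"], e["system"], e["change"])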

Once the timeline is ready, start rolling back the changes one by one.

ProTip: Always document rollback steps with any release or change; they come in handy during an incident.