Take Command of your PagerDuty Incident Response

SlackOps for PagerDuty Part 5

View earlier blogs in this series here.

In this installment of our series, we are going to explore some things you need to do during your PagerDuty incident response: things like setting the incident commander, assigning tasks, and running through a basic triage. Some of these activities are run for every incident; others you might not need. In the case of triage, you might run a different triage depending on the nature or priority of the incident. Let’s tackle them one at a time.

Incident Response Activity 1: Set the Incident Commander

The role of the incident commander is to own the incident and drive it to a swift and effective resolution. Many times the commander is the person to whom the incident is assigned. However, we have seen a number of different practices for choosing the incident commander. In some companies it’s the on-call person the incident was assigned to; in others there are a few dedicated incident commanders who have gone through considerable training to hone their skills. In long-running incidents you may want to switch commanders so the person leading is always fresh. Regardless of your approach, it’s a good idea to have a defined commander with whom people can engage. This eliminates situations where people blast out messages that distract from the effort to resolve the incident. Setting the incident commander in RigD is easy. In an incident channel, simply type the following.

@rigd set the incident commander

Just enter the Slack user you want to set as the commander and that’s it. RigD will also attempt to add a note to the PagerDuty incident so others outside of Slack will know who is leading the response effort.

To make life even easier, when you start working a PagerDuty incident with RigD we set the incident commander to the PagerDuty incident assignee and, when possible, set the channel topic with that incident commander information.

Incident Response Activity 2: Assign Incident Response Tasks

Assigning tasks is something the incident commander may need to do if the scope of the incident is beyond their capacity to handle alone. A good task is one that has a specific objective, a clear owner, and a set timeframe for completion or an update. You can do this in RigD with our handy timer activity. Just try the following.

@rigd start a timer

Then choose a title and the length of the timer. For the title, it’s a good idea to include the Slack user reference, like @Justin. This ensures that user and everyone else know who is driving that task. Another good practice is to split any discussion related to that task off into a thread from the task. This keeps things moving in the main incident channel and organizes the content related to the task for a better post mortem analysis.

When the task timer goes off, a notification is sent with the title of the task. This is again where having the Slack user reference in the title helps. If the assigned person is busy with the task work, this reference brings them back into focus for an update.
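To make the idea of a well-formed task concrete, here is a minimal Python sketch of a task with an owner reference in its title and a set timeframe. It is only an illustration: the `TaskTimer` class, its fields, and the example task title are hypothetical and do not describe how RigD implements its timer activity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TaskTimer:
    """Hypothetical model of an incident task; not RigD's data model."""
    title: str            # objective plus owner, e.g. "@Justin check replication lag"
    minutes: int          # timeframe for completion or an update
    started_at: datetime  # when the timer was started

    @property
    def due_at(self) -> datetime:
        # The moment the timer should go off.
        return self.started_at + timedelta(minutes=self.minutes)

    def is_due(self, now: datetime) -> bool:
        return now >= self.due_at

    def notification(self) -> str:
        # Repeating the title means the @mention pulls the owner back in.
        return f"Timer finished: {self.title}"


start = datetime(2020, 1, 1, 12, 0)
task = TaskTimer("@Justin check replication lag", minutes=15, started_at=start)

print(task.is_due(start + timedelta(minutes=10)))  # False
print(task.is_due(start + timedelta(minutes=15)))  # True
print(task.notification())  # Timer finished: @Justin check replication lag
```

Keeping the owner in the title, rather than in a separate field, mirrors the practice described above: the notification text alone is enough to tell everyone who is driving the task.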

Incident Response Activity 3: Run an Initial Incident Triage

Many times what seems like a minor incident turns out to be a major one. It’s not uncommon for an on-call person to jump right into trying to resolve an incident and forget to tackle the basic assessment and corresponding tasks. With RigD you can set up a Flow that will interact with the incident commander to ensure they have covered the key initial questions and guide them on the right next steps. For example, you might want to spin up a Zoom meeting for a major incident. Let’s take a look at how you do that with RigD. We will start by editing a basic triage example flow.

edit flow basic triage

Then we will add a step at the end which will trigger if the incident commander indicates this is a high urgency major incident.

We are going to add an activity and then choose Start Zoom Meeting as that activity.

Finally we will set the inputs for the activity. To speed things along we will choose to run the activity instead of prompting for inputs, then we will set the default title for the meeting.
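Conceptually, the step we just added is a conditional activity: it runs only when the commander's answers say the incident is a high urgency major incident. A rough Python sketch of that logic follows. The `TriageFlow` and `Step` classes, the answer keys `urgency` and `major`, and their values are all assumptions made for illustration; RigD flows are configured through Slack as described above, not through code like this.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Answers = Dict[str, str]  # hypothetical: question key -> commander's answer


@dataclass
class Step:
    activity: str                           # e.g. "Start Zoom Meeting"
    should_run: Callable[[Answers], bool]   # condition that triggers the step


@dataclass
class TriageFlow:
    name: str
    steps: List[Step] = field(default_factory=list)

    def activities_for(self, answers: Answers) -> List[str]:
        # Return the activities whose conditions the answers satisfy.
        return [s.activity for s in self.steps if s.should_run(answers)]


def is_major_high_urgency(answers: Answers) -> bool:
    return answers.get("urgency") == "high" and answers.get("major") == "yes"


basic_triage = TriageFlow(
    name="basic triage",
    steps=[Step("Start Zoom Meeting", is_major_high_urgency)],
)

print(basic_triage.activities_for({"urgency": "high", "major": "yes"}))  # ['Start Zoom Meeting']
print(basic_triage.activities_for({"urgency": "low", "major": "no"}))    # []
```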

Now let’s see this in action. We can run the triage flow from the selection menu in the incident Slack channel.

There is a wide variety of additional questions, info, and activities you can add to your initial triage flow. You can even have specific triage flows for different services, teams, or problem types. Setting up a good triage flow can make a huge difference in getting the right people engaged early and shaving significant time off your resolution. The team here at RigD is always available to help you get it right. Just send a message to us through Slack.

ask rigd

Drive Down Resolution Times With RigD

In our previous posts we took a look at the opportunity for savings from setting up your SlackOps for PagerDuty using RigD. The activities in this part are a bit unique in that they have no direct equivalent in PagerDuty. You might suggest that setting the commander could be accomplished by reassigning an incident, which takes about 26 seconds via the UI versus only 6 seconds through RigD. Assigning a task could be equated to adding a responder and a note. Assuming we assign two tasks per major incident, this comes to 2 minutes of manual effort and 32 seconds with RigD.

The incident triage is the hardest to map, but we will compare it to the time it takes to update an incident priority and create a Zoom meeting with the link posted in Slack. Performing those tasks manually takes 39 seconds, while the RigD time is 5 seconds when done through the triage flow. The Rand Group report calculated the cost per minute of downtime for an enterprise at $5,600, and the PagerDuty ROI study found an average of 20,483 incidents per year including 14 outages. Using those figures, we have $207,760 of outage costs and 905 hours spent working incidents. RigD reduces the time spent to 211 hours annually and mitigates $159,413 of costs.
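For readers who want to check the arithmetic, the short Python sketch below reproduces the totals above. One assumption is worth flagging: the published totals match when only the task assignment and triage times are counted per incident (120 + 39 seconds manually, 32 + 5 seconds with RigD), so that is what the sketch uses. The variable and function names are ours.

```python
COST_PER_MINUTE = 5_600        # downtime cost per minute (Rand Group figure cited above)
INCIDENTS_PER_YEAR = 20_483    # PagerDuty ROI study figure cited above
OUTAGES_PER_YEAR = 14

# Assumption: per-incident time counts two task assignments plus triage.
MANUAL_SECONDS = 120 + 39
RIGD_SECONDS = 32 + 5


def annual_hours(seconds_per_incident: int) -> float:
    # Time spent across all incidents in a year, in hours.
    return seconds_per_incident * INCIDENTS_PER_YEAR / 3600


def annual_outage_cost(seconds_per_outage: int) -> float:
    # Downtime cost attributable to this effort across the year's outages.
    return seconds_per_outage / 60 * COST_PER_MINUTE * OUTAGES_PER_YEAR


manual_cost = annual_outage_cost(MANUAL_SECONDS)
mitigated = manual_cost - annual_outage_cost(RIGD_SECONDS)

print(round(annual_hours(MANUAL_SECONDS)))  # 905
print(round(annual_hours(RIGD_SECONDS)))    # 211
print(round(manual_cost))                   # 207760
print(round(mitigated))                     # 159413
```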

Our next couple of posts will start to look at the post mortem elements of incident response and how RigD can help ensure you drive a continuously improving process. You can also take a look at the technology behind RigD here, or try our Slack App out.

Next Up: Final Blog Part 6: Automate PagerDuty Incident Postmortems to Drive Improvement
