How Digital Operations Can Use Slack to Transform Their Incident Response – even more so in this remote world
Site Reliability Engineers, IT Operations and DevOps teams already use Slack to communicate and collaborate as they resolve issues. This is the story of new capabilities that are essential in the quest to speed up the response to, and resolution of, technical incidents affecting business-critical services, right through Slack.
For many companies whose technical teams keep their digital operations running, whether in ecommerce, marketing, finance or travel platforms, a simple technical issue can bring online business to a halt. The signature of that failure or slowdown is often hard to find amid all the time-series and log monitoring that is common nowadays. In many ways each problem is a snowflake, slightly different from the last one because of all the changes to the deployed application, network and configuration. Our world of online services is tightly linked through a set of macro and micro services whose capabilities are “assembled” into a working web application. Add to that specific customer or user context and the great variation in load that comes with world changes, and you have a perfect storm hitting the technical support teams.
Slack changed the game for Incident Response
The introduction of Slack into these technical environments has already transformed incident operations and response. Previously the crux of incident response was a “war room” and/or a conference call (by phone or over Webex/Zoom) with dozens of technical people present, along with many others who just needed to know what was going on. Add to that the inevitable confusion and haze while just a few key people were trying to troubleshoot, triage and restore the downed service. In companies with sizeable operations, there were often multiple incidents going on at once, some of which had critical status and were treated as major incidents with incident commanders.
Where Slack really turned established incident response methods on their head was the use of dedicated channels (and threads) for specific incidents. For the first time, teams had a collaborative environment for sharing ideas, as well as the data needed to troubleshoot and resolve the incident. With the extensive adoption of Slack across many groups in the enterprise, the technical team gained a central hub to share, collaborate, troubleshoot and drive the restoration of services. Along with it came a comprehensive record, and a human perspective, of what happened during the incident.
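For readers who want to see what a dedicated incident channel looks like in practice, here is a minimal sketch in Python. It is not how any particular Slack App does it. It assumes a bot token that is allowed to create channels and post messages, and the helper names (channel_name, open_incident, call_slack) and the "inc-" naming scheme are made up for illustration. The three Slack Web API methods it calls are conversations.create, conversations.invite and chat.postMessage.

```python
import json
import re
import urllib.request

SLACK_API = "https://slack.com/api/"  # base URL of the Slack Web API


def channel_name(incident_id, summary):
    """Build a channel name Slack accepts: lowercase letters, digits,
    hyphens and underscores, at most 80 characters.
    The 'inc-<id>-<summary>' scheme is an assumption, not a Slack rule."""
    raw = f"inc-{incident_id}-{summary}".lower()
    slug = re.sub(r"[^a-z0-9_-]+", "-", raw).strip("-")
    return slug[:80].rstrip("-")


def call_slack(method, payload, token):
    """POST a JSON payload to one Slack Web API method and return the reply."""
    request = urllib.request.Request(
        SLACK_API + method,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def open_incident(incident_id, summary, responder_ids, token, call=call_slack):
    """Create a dedicated channel, invite responders, post a kickoff message.
    `call` is injectable so the flow can be exercised without a network."""
    created = call(
        "conversations.create",
        {"name": channel_name(incident_id, summary)},
        token,
    )
    if not created.get("ok"):
        raise RuntimeError(f"could not create channel: {created.get('error')}")
    channel_id = created["channel"]["id"]

    # conversations.invite takes user IDs as one comma-separated string.
    call(
        "conversations.invite",
        {"channel": channel_id, "users": ",".join(responder_ids)},
        token,
    )
    call(
        "chat.postMessage",
        {"channel": channel_id, "text": f"Incident {incident_id} opened: {summary}"},
        token,
    )
    return channel_id
```

Calling open_incident("1042", "Checkout API 5xx spike!", ["U1", "U2"], token) would aim to create a channel named inc-1042-checkout-api-5xx-spike. A real deployment would also need to handle a name that is already taken, rate limits, and failures on the invite and post steps, all of which this sketch leaves out.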
Saving incident response time and speeding up resolution while working remotely
Customers all over the world tell the same story when it comes to Slack and incident response:
“I can’t mention enough the need for speed in starting up an incident response. Sometimes automated monitoring can take a few minutes to send in an alert. Both our internal users and customers can see an issue in just a few seconds. Being able to use Slack to start up the incident response and then assemble the team to escalate as quickly as possible to the SRE who can identify, troubleshoot and fix the problem on the spot is so important to our customers.”
“Any SRE team lead will be the first to tell you that their way of doing things, their process, the tools they use to do their job is different than the outfit down the street. Being able to use Slack to customize how we respond and the flexibility there for our team is so important in our DevOps culture.”
It is more than just the tools
Besides culture, there are the tools: SRE and DevOps teams work with, on average, two dozen other tools that help them monitor, triage, track incidents, automate and perform other tasks. The structure of an incident response system needs to consider the people, the process and the tools. Slack sits at the nexus of that system, which is what allows people to work remotely.
People interact through Slack. Processes are executed within Slack, either through manual interaction or through a workflow. Typical SRE/DevOps tools, including systems of record, can provide two-way interactions through Slack Apps. Having the right Slack App to help structure the team and integrate with the relevant tools can be key.
The natural evolution of incident response handling led to the development of RigD. RigD’s Slack App uses Slack’s position at the center of operations to provide capabilities that users can consume directly through Slack to automate many of the activities of incident response. Without leaving the context of Slack, they can speed up response no matter where they are and no matter who is involved. No special machine learning or AI skills are needed. It’s proven.
“With RigD’s Slack App, we save critical minutes with every incident, getting the incident response started, engaging the right resources to fix the problem. Integration to our other tools allows us to save time by eliminating context switching. Slack and RigD have become our digital command center for our technical operations at TripActions.”