This is part 2 of the Overview of incident management at Zendesk. This guide contains the following parts:
- Part 1: How Zendesk service issues become service incidents
- Part 2: How Zendesk manages service incidents (this article)
- Part 3: Monitoring a public Zendesk service incident
- Part 4: Post resolution incident analysis and reporting
In this article, part 2, you'll learn how Zendesk teams handle service incidents within our products based on severity levels. Zendesk takes a comprehensive approach to understanding an incident, from its root cause to its total impact on affected customers, and communicates the appropriate level of detail.
This article contains the following sections:
- Incident severity
- Incident response team structure
- Communication timelines for incidents
- Low severity incident process
Incident severity
One of the key decisions made when a service incident is created is assigning the incident's severity. The severity of an incident determines which Zendesk teams address the issue, how they address it, and how it is communicated to affected customers.
Zendesk uses a system that groups service incidents into five severity levels (0 through 4) based on the characteristics of the incident:
Zendesk Severity Rating System
Different escalation paths and teams are engaged to investigate, communicate, and remediate the incident based on its severity level. This ensures that each incident receives the right level of rigor. The diagram below describes the key activities that happen during an incident and after it is cleared, based on its severity level:
Process by Severity Level
While high severity incidents go through rigorous analysis and remediation activities, every incident, regardless of severity level, goes through a real-time response and investigation process that produces:
- Updates to the Zendesk status page when the incident is public
- Root cause analysis and incident remediations
- Zendesk (internal) incident report
Zendesk service availability incident example
Here is an example of how Incident Severity is set by Zendesk and how Zendesk teams respond internally:
Incident Discovery and Response Example
In the example, you see the following workflow:
- The Zendesk Network Operations Center (ZNOC) identified an issue when system alerts showed that service nodes in pod 17 could not be reached by requests. The Zendesk Network Engineering team verified that the access issues were affecting customer services directly and quickly realized that the Support, Guide and Talk services for multiple customers were not operating as expected. A new Zendesk service incident was created.
- This incident was known to affect two customers when it was initially created. Because of the nature of the outage, more customers were experiencing the issue and began to report it to Zendesk Customer Support. The Engineering team assigned the incident a severity rating of 1: a high priority incident that requires immediate attention.
- The Incident Response on-call team was paged immediately. Within minutes of incident creation, an Incident Manager gathered information and assembled additional engineering resources to troubleshoot and fix the service incident.
Incident response team structure
Zendesk has a dedicated global Incident Response team to ensure that every incident is shepherded through the service incident management process and escalated to the appropriate levels of Zendesk leadership, as warranted.
Incident Management Roles and Responsibilities
This team structure enables Zendesk to conduct a thorough analysis of the incident with technical resources and to communicate with customers in real time through Zendesk Customer Support.
Communication timelines for incidents
Zendesk is invested in making sure incidents are properly communicated and resolved in appropriate timeframes for the customer. We have established internal timelines for the distribution of incident details. The timeline is based on the severity level of the incident and the service incident management stages.
- Within 15 minutes of the incident being called
- Every 30 minutes until service is restored, as new information becomes available
- Internal Event Analysis (for Severity 0 and 1 incidents): within 48 hours of incident resolution
- Root Cause Analysis: within 72 hours of incident resolution
- Within 96 hours of incident resolution
High Severity (0 - 2) Incident Communication Timelines
Timeline Summary for High Severity Incidents
Incidents that have a severity rating of 0 - 2 are considered high severity incidents. The global on-call Incident Response team is available 24/7 to respond to them. The team consists of the following roles:
Incident Response Team Roles
Global Incident Response Team Locations
Once the on-call team is paged, incident diagnosis starts within minutes of the incident being declared. A Slack channel and a Zoom call are created so the response team can communicate in real time. As the Incident Response team triages and scopes the incident, on-call engineering teams are paged based on which products and services are affected.
A public post is made on the Zendesk status page within 15 minutes of incident creation to keep customers informed about the known incident. Updates are then posted every 30 minutes until resolution, as new information becomes available. Depending on the issue and how much new information is identified, the interval between updates may be lengthened. Customers can monitor active service incidents; that process is described in part 3 of this guide. Public incident updates are also shared via the Zendesk Operations Twitter feed.
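The timing described above can be illustrated with a short calculation. The following is a minimal sketch, not Zendesk's tooling: the function name `update_deadlines` and the constants are hypothetical, and it assumes for simplicity that each 30-minute interval is counted from the 15-minute deadline rather than from the actual time of the first post.

```python
from datetime import datetime, timedelta

# Figures taken from the text above; the names are illustrative only.
FIRST_POST_WITHIN = timedelta(minutes=15)
UPDATE_INTERVAL = timedelta(minutes=30)


def update_deadlines(created_at, resolved_at, interval=UPDATE_INTERVAL):
    """Return the latest time each public update is due before resolution.

    The first entry is the deadline for the initial status page post.
    Later entries follow at `interval`, which can be lengthened when
    little new information is available.
    """
    deadlines = [created_at + FIRST_POST_WITHIN]
    next_due = deadlines[0] + interval
    while next_due < resolved_at:
        deadlines.append(next_due)
        next_due += interval
    return deadlines


if __name__ == "__main__":
    created = datetime(2024, 1, 1, 9, 0)          # made-up creation time
    resolved = created + timedelta(minutes=80)    # made-up resolution time
    for due in update_deadlines(created, resolved):
        print(due.strftime("%H:%M"))
```

Passing a longer `interval` models the case where the cadence is lengthened.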
In addition to our global on-call incident response teams, Zendesk has established processes for leadership notification and escalation. If a high severity incident fits certain criteria, we enable the next level of response, which is Crisis Management.
Zendesk service availability incident example continued
Continuing with the service availability incident example, this is how the incident response progressed through the Incident Management process at Zendesk:
Service Availability Incident Response Timeline
As you can see in the example, once the incident was created in the Zendesk Incident Portal, a series of automated actions were taken:
- The Incident Response on-call team was paged to respond to the incident
- An incident Slack channel was automatically created and the Incident Response on-call team was added to the dedicated Slack channel
- A Zoom call was automatically started and posted to the Slack channel for all responders to join
- An Event Summary Document was automatically created to document the incident and share progress internally to Zendesk
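The automated actions listed above could be modelled roughly as follows. This is an illustrative sketch only: `IncidentWorkspace`, `on_incident_created`, the channel naming scheme and the placeholder call link are all hypothetical, and nothing here describes Zendesk's internal tooling or any real Slack or Zoom API. It only records, in order, what each step would set up.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentWorkspace:
    """Hypothetical record of what gets set up when an incident is created."""
    incident_id: str
    paged: list = field(default_factory=list)
    slack_channel: str = ""
    slack_members: list = field(default_factory=list)
    slack_messages: list = field(default_factory=list)
    summary_doc: str = ""


def on_incident_created(incident_id, on_call):
    """Run the four automated steps in the order the article lists them."""
    ws = IncidentWorkspace(incident_id)

    # 1. Page the Incident Response on-call team.
    ws.paged = list(on_call)

    # 2. Create the incident channel and add the on-call team to it.
    ws.slack_channel = f"incident-{incident_id}"   # assumed naming scheme
    ws.slack_members = list(on_call)

    # 3. Start a call and post its link to the channel.
    call_link = f"call-for-{incident_id}"          # placeholder, not a real URL
    ws.slack_messages.append(f"Join the incident call: {call_link}")

    # 4. Create the Event Summary Document for internal progress notes.
    ws.summary_doc = f"Event Summary: {incident_id}"
    return ws


if __name__ == "__main__":
    workspace = on_incident_created("example-1", ["incident-manager", "znoc"])
    print(workspace)
```

The channel is created before the call link is posted because the link has to land somewhere responders are already watching.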
On the Zoom call, the Incident Manager validated the initial severity and confirmed the scope and impact of the issue.
It was quickly determined that multiple container nodes in pod 17 were not accessible and could not be used by dependent services, including the Support, Guide and Talk products. One node type in particular had no available replicas in other pods. This would eventually cause these products to become unresponsive for multiple customers.
The ZNOC paged the appropriate Network engineering team to the Zoom call to begin investigating how to solve the immediate problem of restoring service and API access to customers. Edge engineering SMEs were also paged and joined the call. Within 5 minutes, a fix was identified and deployed so the affected nodes were again accessible to API calls and services.
Zendesk Customer Support created a problem ticket to track the customer reports. This ticket was added to the incident Slack channel to quickly allow for new reports to be added as they came in.
While the investigation was continuing, the Incident Escalation Manager created and published the first public update to the Zendesk Status Page and Zendesk Ops Twitter feed 12 minutes after incident creation.
First Service Availability Incident Post on Zendesk System Status Page
While the teams investigated the incident, customer reports that came in were linked to the main problem ticket associated with the incident. This allowed the Incident Response team to send updates to all impacted customers when they made public notifications.
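Linking each customer report to one problem ticket is what makes a single update reach every affected customer. A minimal sketch of that idea follows; the `ProblemTicket` class and its methods are hypothetical and are not the Zendesk Support API.

```python
class ProblemTicket:
    """Hypothetical stand-in for the problem ticket that tracks an incident."""

    def __init__(self, subject):
        self.subject = subject
        self.linked_ticket_ids = []

    def link(self, ticket_id):
        # Ignore a report that is already linked so customers
        # do not receive the same update twice.
        if ticket_id not in self.linked_ticket_ids:
            self.linked_ticket_ids.append(ticket_id)

    def fan_out(self, update):
        # Produce one outgoing message per linked customer ticket.
        return [(ticket_id, update) for ticket_id in self.linked_ticket_ids]


if __name__ == "__main__":
    problem = ProblemTicket("Service availability incident")
    for ticket_id in (101, 102, 101):   # made-up ticket numbers
        problem.link(ticket_id)
    for ticket_id, message in problem.fan_out("We are investigating."):
        print(ticket_id, message)
```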
The Network engineering team determined that a change to how certificates were generated and used was responsible for the incident, and took the following actions to restore service to affected customers:
- Deactivated unreachable nodes
- Created new service nodes with properly referenced certificates
- Verified that new service nodes were accessible for services and through API calls
- Monitored inbound traffic to see that inbound requests were now being handled appropriately
As the incident progressed, two more public updates were made: one 14 minutes after incident creation and another 63 minutes after incident creation. The public communication history along with published postmortem information can be found on the Service Notification page for the incident.
As shown in the example, high severity incidents go through a rigorous process in which root causes are determined and remediation items are created so that Product Engineering teams can fix the underlying problem that caused the incident. This analysis happens during our incident postmortem and is discussed in more detail in the Post Resolution Incident Analysis section.
Low severity incident process
Lower severity service incidents (level 3-4) are less critical because they affect a smaller number of customers and do not prevent those customers from using business critical functions of Zendesk products. These incidents are addressed according to the guidelines above, but are not posted to public channels.
Severity 3 incidents are handled in much the same way as severity 0-2 incidents, but expected response times are extended because of the reduced business impact. Even though the on-call team is not paged, these incidents are handled through specific Zendesk incident Slack channels associated with the supporting product engineering team(s), and the teams tend to respond as quickly as they do for higher severity incidents. Most severity 3 incidents do not use public communication channels. Instead, Zendesk Customer Support teams reach out to customers using proactive tickets if specific action is required from a subset of customers.
Severity 4 incidents do not directly affect customer use of Zendesk services, but have the potential to do so if not addressed. These incidents are created as proactive responses to potential issues. Product engineering teams engage the same way as they do with the severity 3 process.
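The difference between the high and low severity paths can be summarized as a small lookup. This is a simplified sketch based only on the descriptions in this article: `ResponsePlan` and `plan_for` are hypothetical names, and real decisions (for example, whether a given incident is made public) involve judgment that a lookup cannot capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponsePlan:
    page_on_call: bool          # is the Incident Response on-call team paged?
    public_post_expected: bool  # is a status page post normally made?
    channel: str                # where responders coordinate


def plan_for(severity):
    """Map a severity level (0 through 4) to the response path described above."""
    if severity not in range(5):
        raise ValueError(f"severity must be 0-4, got {severity!r}")
    if severity <= 2:
        # High severity: on-call team paged, public updates posted.
        return ResponsePlan(True, True, "dedicated incident Slack channel and Zoom call")
    # Severity 3 and 4: no page, handled by product engineering,
    # normally without public communication.
    return ResponsePlan(False, False, "product engineering incident Slack channel")


if __name__ == "__main__":
    for level in range(5):
        print(level, plan_for(level))
```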
This completes Part 2, How Zendesk manages service incidents, of the Overview of incident management at Zendesk.
If you'd like to learn more, you can move on to the next part of this guide: Part 3: Monitoring a public Zendesk service incident.