This is part 4 of the overview of incident management at Zendesk. This guide contains the following parts:
- Part 1: How Zendesk service issues become service incidents
- Part 2: How Zendesk manages service incidents
- Part 3: Monitoring a public Zendesk service incident
- Part 4: Post resolution incident analysis and reporting (this article)
In this article, part 4, you'll learn how the incident response team conducts a postmortem, which covers root cause analysis and remediation of service incidents, and then assigns remediation items to the engineering teams that own them.
By conducting these activities, Zendesk Customer Support can share incident details and next steps with affected customers.
Conducting a service incident postmortem
Zendesk conducts a reflective exercise with all team members involved in the incident to examine and document the causes of the incident, its impact on customers, and the actions taken to mitigate or resolve it. The team reviews the identified root causes and the follow-up actions that will prevent the incident from recurring. This is known as a service incident internal postmortem. Postmortems are shared publicly only for high severity incidents.
To ensure transparency and inclusion, a Zendesk internal postmortem calendar is available so that any Zendesk team can attend the internal postmortem meeting and learn more about service incidents and their root causes. Outcomes of incidents are shared with all engineering teams, and significant incident outcomes are highlighted and reviewed in the Zendesk weekly engineering meeting.
There are four main activities performed in a service incident postmortem:
- Review the incident details contained in the Incident Document to anchor and orient the participants to the incident
- Review and validate the Root Cause Analysis findings contained in the Incident Document
- Identify and categorize any remediation work needed for Zendesk engineering teams to fully address the root causes that led to the service incident. All remediation items are agreed to by consensus of the postmortem attendees
- Assign remediation work to the appropriate engineering teams with clear and appropriate SLAs defined
High severity incident analysis
Once a high severity incident is resolved, the Incident Manager schedules a postmortem meeting that includes:
- All team members who participated in the incident response
- Engineers from teams whose products or services were affected by the incident
- Teams who have ownership or a vested interest such as:
- Zendesk Customer Service
- Product teams
- Leaders who own affected products, services and areas of support
Every effort is made to hold the postmortem meeting within 72 hours of incident resolution, understanding that the timing of the meeting will depend on the complexity of the root cause and availability of team members across geographical regions.
After the postmortem is scheduled, the engineering owner documents the root cause analysis and creates the Incident Document based on the following categories:
- Incident Overview
- Customer Impact
- Technical Description
- Root Cause and Service Information
- Incident Details and Timings
The Incident Document guides the incident postmortem and captures any remediation work that is identified to fully resolve the underlying issues that caused the incident.
There is an additional analysis phase conducted for severity 0-3 incidents known as Root Cause Analysis. This analysis gives the Engineering team a chance to understand and document the incident and define the work needed to fully fix the issues. This information is captured in the Incident Document.
Zendesk Incident Root Cause Analysis Process
Low severity incident analysis
Low severity incidents go through a leaner root cause and reporting phase than high severity incidents. A formal incident postmortem meeting is not held for low severity incidents unless the Product Engineering owner requests one, but the Product Engineering owner still creates an Incident Document.
Root causes are identified, classified and shared with Engineering teams, and remediation items are added to the Product Engineering team backlog with SLAs. As with higher severity incidents, Zendesk seeks to learn and improve our engineering processes as a result of thoroughly investigating low severity incidents.
Because Severity 3 incidents have a minor impact on customers, Zendesk Customer Support shares the issue status and identified remediations, through a Zendesk ticket, with affected customers who reached out about the incident.
Severity 4 incidents by definition do not have direct customer impact. Post incident analysis is not communicated to customers, but the root causes are identified and remediations are addressed internally using the processes and procedures described above.
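The severity-dependent handling described in the two analysis sections above can be summarized in a short Python sketch. This is an illustration only, not Zendesk tooling: the names `PostIncidentPlan` and `plan_for` are hypothetical, and the sketch assumes that severities 0 through 2 count as "high severity", which this article does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostIncidentPlan:
    formal_postmortem_meeting: bool
    incident_document: bool
    customer_communication: str


def plan_for(severity: int, meeting_requested: bool = False) -> PostIncidentPlan:
    """Return the post-incident steps for a severity level (0 to 4).

    Assumption: severities 0-2 are treated as high severity; the article
    only describes severities 3 and 4 explicitly as low severity.
    """
    if not 0 <= severity <= 4:
        raise ValueError("severity must be between 0 and 4")
    if severity <= 2:
        # High severity: a postmortem meeting is scheduled and the
        # postmortem is shared publicly.
        return PostIncidentPlan(True, True, "public postmortem in the Help Center")
    if severity == 3:
        # Low severity: a meeting is held only if the Product Engineering
        # owner requests it; affected customers who reached out are
        # updated through their ticket.
        return PostIncidentPlan(
            meeting_requested, True, "ticket update to customers who reached out"
        )
    # Severity 4: no direct customer impact, so nothing is communicated.
    return PostIncidentPlan(meeting_requested, True, "none")
```

For example, `plan_for(3)` describes an incident with no formal meeting, an Incident Document, and a ticket update to the customers who reached out.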
Assigning remediation items
In order to ensure remediation items are completed, the Product Engineering team reviews the validated remediation items in the postmortem and performs the following actions:
- Classify remediations as Preventive or General:
  - Preventive items are ones that would actively prevent a recurrence of the incident
  - General items are not solely preventive on their own but would resolve a core part of the incident’s circumstances
- Prioritize the remediations to set the response SLAs. This exercise goes through the following activities:
  - Identify the engineering teams responsible for working the remediation item
  - Link the remediation item to the identified root cause that it addresses
  - Add the remediation item to the work backlog of the responsible engineering team
  - Add the remediation item to the engineering SLA report to track SLA achievement
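The classification and tracking steps above can be pictured as a small data model. The following Python sketch is an assumption made for illustration and does not describe Zendesk's internal systems: the class names, field names, and example values are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class RemediationType(Enum):
    PREVENTIVE = "preventive"  # would actively prevent a recurrence
    GENERAL = "general"        # resolves a core part of the circumstances


@dataclass
class RemediationItem:
    title: str
    kind: RemediationType
    root_cause: str   # the identified root cause this item addresses
    owning_team: str  # engineering team responsible for the work


@dataclass
class RemediationTracker:
    # Work backlog per engineering team (hypothetical structure).
    backlogs: dict[str, list[RemediationItem]] = field(default_factory=dict)
    # Items tracked for SLA achievement.
    sla_report: list[RemediationItem] = field(default_factory=list)

    def assign(self, item: RemediationItem) -> None:
        """Add the item to its team's backlog and to the SLA report."""
        self.backlogs.setdefault(item.owning_team, []).append(item)
        self.sla_report.append(item)


# Usage with made-up values:
tracker = RemediationTracker()
example_item = RemediationItem(
    title="Add capacity alert",
    kind=RemediationType.PREVENTIVE,
    root_cause="Capacity exhausted",
    owning_team="example-team",
)
tracker.assign(example_item)
```

Keeping the root cause on each item mirrors the linking step, so a reviewer can trace every backlog entry back to the cause it addresses.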
Below is a chart that Product Engineering teams use to determine when a remediation is prioritized and planned for their work effort.
Zendesk Remediation Item Priority SLA
The Zendesk Customer Support team members attending the postmortem create the customer-facing descriptions of the incident, its root causes, and any remediations identified. These descriptions are posted to the Help Center article associated with the incident.
Service Availability Incident Example Continued
Here’s how an incident postmortem was conducted for this May 2020 service availability incident.
Four business days after the incident occurred, the Incident Response team and Engineers gathered to review the incident, collaborate on the root causes, and create or update the remediation items. All remediation items were agreed to by consensus of the meeting attendees.
Each person involved in the incident played a role in the postmortem:
The details reviewed and discussed in the meeting included:
To produce a thorough analysis and concrete actions for the Engineering team, all team members provided input to recount the incident and develop remediation tasks. Once all questions or issues were addressed by the Incident Response team, the postmortem was considered complete.
At the end of the internal postmortem meeting, the Zendesk Customer Support lead responsible for the public-facing postmortem was asked if she had any questions or needed any additional information from the team to create the public documentation. She had no further questions and added the postmortem information below to the public service incident article in the Service Notifications section of the Help Center.
Public Postmortem Information for the Service Availability VM Incident
Three important outcomes of this incident postmortem that have improved Zendesk products and services were:
- The root causes of the incident were identified and will be considered by all Zendesk product teams in future development
- The remediations were identified and assigned to engineering teams with SLAs
- The public postmortem was published by Zendesk Customer Support to the Help Center and was sent to affected customers who submitted tickets
Closing out a service incident
As a best practice, Zendesk closes any open tickets with customers to make sure everything is properly documented and communicated for the incident.
All completed service incidents are summarized in a weekly service incident digest report, which is shared widely across Zendesk. Incident descriptions, customer impact, and important remediations are included in this report and also appear in a bi-weekly Operations Review report that is shared with Zendesk’s Executive team.
After postmortem information is published to the Help Center and open tickets are updated with the results from the postmortem, the analysis and reporting phase for the service incident is considered complete. Zendesk Customer Support links those tickets to the service incident and they are marked as closed.