Incident Management part 4: Post resolution incident analysis and reporting

This is part 4 of the overview of incident management at Zendesk. This guide contains the following parts:

Part 1: How Zendesk service issues become service incidents
Part 2: How Zendesk manages service incidents
Part 3: Monitoring a public Zendesk service incident
Part 4: Post resolution incident analysis and reporting (this article)

In this article, part 4, you'll learn how the incident response team conducts a retrospective that includes root cause analysis and remediation of service incidents and then assigns remediation items to the engineering team(s) that have ownership.

By conducting these activities, Zendesk Customer Support can share incident details and next steps with affected customers.

This article contains the following sections:

Conducting a service incident retrospective
Assigning remediation items
Closing out a service incident

Conducting a service incident retrospective

Zendesk conducts a reflective exercise with all team members involved with the incident to examine and document the causes of the incident, the incident’s impact on customers, and the actions taken to mitigate or resolve it. The team reviews the identified root cause(s) and the follow-up actions that will prevent the incident from recurring. This exercise is known as a service incident internal retrospective. Incident retrospectives are shared publicly only for high severity incidents.

To ensure transparency and inclusion for all Zendesk teams, a Zendesk internal retrospective calendar is available so that any team can attend the internal retrospective meeting and get more information regarding service incidents and root causes. Outcomes of incidents are shared with all engineering teams, and significant incident outcomes are highlighted and reviewed in the Zendesk weekly engineering meeting.

There are four main activities performed in a service incident retrospective:

Review the incident details contained in the Incident Document to anchor and orient the participants to the incident
Review and validate the Root Cause Analysis findings contained in the Incident Document
Identify and categorize any remediation work needed for Zendesk engineering teams to fully address the root causes that led to the service incident. All remediation items are agreed to by consensus of the retrospective attendees
Assign remediation work to the appropriate engineering teams with clear and appropriate SLAs defined

High severity incident analysis

Once a high severity incident is resolved, the Incident Manager schedules a retrospective meeting that includes:

All team members who participated in the incident response
Engineers from teams whose products or services were affected by the incident
Teams who have ownership or a vested interest, such as:
- Zendesk Customer Support
- Product teams
- Leaders who own affected products, services and areas of support

Every effort is made to hold the incident retrospective meeting within 72 hours of incident resolution, understanding that the timing of the meeting will depend on the complexity of the root cause and availability of team members across geographical regions.

After the incident retrospective is scheduled, the engineering owner documents the root cause analysis and creates the Incident Document based on the following categories:

Incident Overview
Customer Impact
Technical Description
Root Cause and Service Information
Incident Details and Timings
Remediations

The Incident Document guides the incident retrospective and captures any remediation work that is identified to fully resolve the underlying issues that caused the incident.
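As an illustration only, the six categories listed above could be modeled as a simple record so that empty sections are easy to spot before the retrospective. This is a minimal sketch: the class name, field names, and `missing_sections` helper are hypothetical and do not describe Zendesk's actual tooling.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class IncidentDocument:
    """Hypothetical record mirroring the six Incident Document categories."""
    incident_overview: str = ""
    customer_impact: str = ""
    technical_description: str = ""
    root_cause_and_service_information: str = ""
    incident_details_and_timings: str = ""
    remediations: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of sections that have not been filled in yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Example: a partially completed document (sample text is illustrative).
doc = IncidentDocument(
    incident_overview="Service availability incident on one pod",
    customer_impact="Timeout errors in affected applications",
)
print(doc.missing_sections())
```

A check like this would let the engineering owner confirm that every category has content before the document is used to guide the meeting.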

There is an additional analysis phase conducted for severity 0-3 incidents known as Root Cause Analysis. This analysis gives the Engineering team a chance to understand and document the incident and define the work needed to fully fix the issues. This information is captured in the Incident Document.

Zendesk Incident Root Cause Analysis Process

Low severity incident analysis

Low severity incidents go through a leaner root cause and reporting phase than high severity incidents. While a formal incident retrospective meeting is not completed (unless requested by the Product Engineering owner) for low severity incidents, an Incident Document is created by the Product Engineering owner.

Root causes are identified, classified and shared with Engineering teams, and remediation items are added to the Product Engineering team backlog with SLAs. As with higher severity incidents, Zendesk seeks to learn and improve our engineering processes as a result of thoroughly investigating low severity incidents.

Because Severity 3 incidents have a minor impact on customers, Zendesk Customer Support shares the issue status and identified remediations through a Zendesk ticket with affected customers who reached out about the incident.

Severity 4 incidents, by definition, do not have direct customer impact. Post-incident analysis is not communicated to customers, but the root causes are identified and remediations are addressed internally using the processes and procedures described above.

Assigning remediation items

In order to ensure remediation items are completed, the Product Engineering team reviews the validated remediation items in the retrospective and performs the following actions:

Classify remediations as Preventive or General:
- Preventive items are ones that would actively prevent a recurrence of the incident
- General items are not solely preventive on their own but would resolve a core part of the incident’s circumstances

Prioritize the remediations to set the response SLAs. This exercise goes through the following activities:
- Identify the engineering teams responsible for working the remediation item
- Link the remediation item to the identified root cause that it addresses
- Add the remediation item to the work backlog of the responsible engineering team
- Add the remediation item to the engineering SLA report to track SLA achievement
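The classification and assignment steps above can be sketched as a small data model. This is a hedged illustration, not Zendesk's implementation: the type names, the `assign` function, the team name, and the classifications given to the sample items are all assumptions. SLA targets are left out because they come from the priority chart, which is not reproduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class RemediationClass(Enum):
    PREVENTIVE = "Preventive"  # would actively prevent a recurrence
    GENERAL = "General"        # resolves a core part of the circumstances


@dataclass
class RemediationItem:
    title: str
    classification: RemediationClass
    owning_team: str  # engineering team responsible for the work
    root_cause: str   # identified root cause that this item addresses


def assign(items: List[RemediationItem],
           backlogs: Dict[str, List[RemediationItem]],
           sla_report: List[RemediationItem]) -> None:
    """Add each item to its team's backlog and to the SLA report."""
    for item in items:
        backlogs.setdefault(item.owning_team, []).append(item)
        sla_report.append(item)


# Sample items; team name and classifications are illustrative assumptions.
items = [
    RemediationItem(
        title="Enable automatic rollback of deployments",
        classification=RemediationClass.PREVENTIVE,
        owning_team="platform-engineering",
        root_cause="Containers not rebuilt after certificate change",
    ),
    RemediationItem(
        title="Document the deployment strategy",
        classification=RemediationClass.GENERAL,
        owning_team="platform-engineering",
        root_cause="Containers not rebuilt after certificate change",
    ),
]

backlogs: Dict[str, List[RemediationItem]] = {}
sla_report: List[RemediationItem] = []
assign(items, backlogs, sla_report)
print(len(backlogs["platform-engineering"]), len(sla_report))  # prints: 2 2
```

Keeping the root cause on each item preserves the link between a remediation and the cause it addresses, which is what allows SLA achievement to be tracked per incident.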

Below is a chart that Product Engineering teams use to determine when a remediation is prioritized and planned for their work effort.

Zendesk Remediation Item Priority SLA

The Zendesk Customer Support team attending the retrospective creates the customer-facing descriptions of the incident, its root causes, and any remediations identified. These descriptions are posted to the Service Notifications section of our help center.

Service Availability Incident Example Continued

Here’s how an incident retrospective was conducted for this incident.

Four business days after the incident occurred, the Incident Response team and Engineers gathered to review the incident, collaborate on the root causes, and create or update the remediation items. All remediation items were agreed to by consensus of the meeting attendees.

Each person involved in the incident played a role in the incident retrospective.

The details reviewed and discussed in the meeting included:

Area	Example
Timeline	20:02 UTC - New container versions deployed to host services with updated certificates
	20:08 UTC - Container connectivity warnings start to appear
	20:37 UTC - First evidence of services not being able to connect to the new containers, thus causing service delay/interruption
	20:57 UTC - Zendesk internal service stops processing requests, causing timeout errors in Support, Guide & Talk applications hosted on pod 17
	21:02 UTC - Cluster autoscaler starts to create new containers for services that cannot be reached
	21:07 UTC - Full provisioning of service containers that will work with existing service configurations complete
	21:49 UTC - Cleanup of unreachable containers complete
	22:07 UTC - Incident is fully resolved
Root Causes	After the security certificate service changed, containers were not all rebuilt to pick up the changes encoded in the script. Containers that were not redeployed did not reference the correct security certificate provider and were not trusted by other Zendesk services and containers
Influencing Factors	The deployment scripts were not updated to properly reference the new security certificate provider when creating new containers
	The new containers were deployed too quickly and widely to be able to adjust after failures started occurring
	There was no automated rollback capability
Remediations	Change how security certificate compliance is evaluated when new containers are built and deployed
	Add a different, more robust method for verifying certificates before launching new instances
	Document the deployment strategy for horizontally scaled infrastructure
	Enable automatic rollback of deployments if any alerts occur
	Research how platform engineering can rebuild their infrastructure components more frequently
	Discover how critical infrastructure can be made more distributed and fault tolerant
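The intervals between the timeline entries in the table above can be worked out directly from the timestamps. The short sketch below does that arithmetic. The dictionary keys and the `minutes_between` helper are hypothetical names, and the calculation assumes all entries fall on the same UTC day, which the ascending times suggest.

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline; assumed to be on the same day.
timeline = {
    "deployed": "20:02",
    "first_warnings": "20:08",
    "first_connection_failures": "20:37",
    "requests_stop": "20:57",
    "autoscaler_starts": "21:02",
    "provisioning_complete": "21:07",
    "cleanup_complete": "21:49",
    "resolved": "22:07",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes elapsed between two HH:MM times on one day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Deployment to first connectivity warnings.
print(minutes_between(timeline["deployed"], timeline["first_warnings"]))  # 6
# Requests stop processing until full resolution.
print(minutes_between(timeline["requests_stop"], timeline["resolved"]))   # 70
# Deployment to full resolution.
print(minutes_between(timeline["deployed"], timeline["resolved"]))        # 125
```

Figures like these show that warnings appeared six minutes after the deployment, well before requests stopped processing at 20:57 UTC.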

To ensure a thorough analysis that generated concrete actions for the Engineering team, all team members provided input to recount the incident and develop remediation tasks. Once all questions or issues were addressed by the Incident Response team, the incident retrospective was considered complete.

At the end of the internal retrospective meeting, the Zendesk Customer Support lead responsible for the public-facing incident retrospective was asked if she had any questions or needed any additional information from the team to create the public documentation. She had no further questions and added the retrospective information below to the Service Notifications section in our help center.

Public Retrospective Information for the Service Availability VM Incident

Three important outcomes of this incident retrospective that have improved Zendesk products and services were:

The root causes of the incident were identified and will be considered by all Zendesk product teams in future development

The remediations were identified and assigned to engineering teams with SLAs

The public retrospective was published by Zendesk Customer Support to the Help Center and was sent to affected customers who submitted tickets

Closing out a service incident

As a best practice, Zendesk closes any open tickets with customers to make sure everything is properly documented and communicated for the incident.

All completed service incidents are summarized in a weekly service incident digest report, which is shared widely across Zendesk. Incident descriptions, customer impact, and important remediations are included in this report and also appear in a bi-weekly Operations Review report that is shared with Zendesk’s Executive team.

After retrospective information is published to the Help Center and open tickets are updated with the results from the retrospective, the analysis and reporting phase for the service incident is considered complete. Zendesk Customer Support links those tickets to the service incident and marks them as closed.